Advances in Artificial Intelligence: 15th Conference of the Canadian Society for Computational Studies of Intelligence, AI 2002, Calgary, Canada, May 27-29, 2002: Proceedings
Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science Edited by J. G. Carbonell, and J. Siekmann
Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2338
Robin Cohen
Bruce Spencer (Eds.)
Series Editors
Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Preface
The AI conference series is the premier event sponsored by the Canadian Society for the Computational Studies of Intelligence / Société canadienne pour l’étude d’intelligence par ordinateur. Attendees enjoy our typically Canadian atmosphere – hospitable and stimulating. The Canadian AI conference showcases the excellent research work done by Canadians, their international colleagues, and others choosing to join us each spring. International participation is always high; this year almost 40% of the submitted papers were from non-Canadian researchers. We accepted 24 papers and 8 poster papers from 52 full-length papers submitted. We also accepted eight of ten abstracts submitted to the Graduate Student Symposium. All of these accepted papers appear in this volume.

The Canadian AI Conference is the oldest continuously held national AI conference in the world. (ECCAI’s predecessor, AISB, held meetings in 1974, but these have since become international.) Conferences have been held biennially since 1976, and annually since 2000. AI 2002 again joined its sister Canadian computer science conferences, Vision Interface and Graphics Interface, enriching the experience for all participants. The joint meeting allows us to stay informed about other areas, to make new contacts, and perhaps to investigate cross-disciplinary research. This year the conferences were held on the beautiful campus of the University of Calgary, and many participants took the opportunity to tour nearby Banff and the magnificent Rocky Mountains.

To mark the second quarter-century of the conference, we invited three of the founders of the society to give invited talks: Zenon Pylyshyn, Alan Mackworth, and Len Schubert. Their foresight and efforts, at that time and continuing until this day, mark a milestone in Canadian AI worth celebrating. Canadians are reputedly overly modest (although we boast about our Olympic gold medals!).
However, at AI 2002, we wished to applaud those who first recognized that Canadian AI researchers need a society to support them – to give them an identity, a community, and a voice. We are grateful to many: to the American Association for Artificial Intelligence and to the National Research Council Canada for supporting the Graduate Symposium, allowing many graduate students to attend and to display and present their work; to CSCSI’s president Bob Mercer for keeping the flame, and to its treasurer Howard Hamilton for tending it; to Camille Sinanan for coordinating the local arrangements for all three conferences; to Ali Ghorbani, the AI 2002 chair; to the program committee and the referees for their dedication to the vital task of assessing scientific content; to the authors for contributing the material that is the main attraction; again to our invited speakers; to Alfred Hofmann and Karin Henzold of Springer-Verlag for their assistance in preparing these proceedings; to the organizers of the VI and GI conferences for their collegiality while coordinating from a distance; and to the participants, for making all of the effort worthwhile.

May 2002
Robin Cohen, Bruce Spencer
Executive Committee
Conference Chair: Ali Ghorbani (UNB)
Program Co-chairs: Robin Cohen (Waterloo) and Bruce Spencer (UNB and NRC)
Program Committee
Sue Abu-Hakima (Amika Now!), Aijun An (York U.), Liliana Ardissono (U. Torino), Sabine Bergler (Concordia U.), Jennifer Chu-Carroll (IBM), Jim Delgrande (SFU), Chrysanne Di Marco (U. Waterloo), Toby Donaldson (Tech BC), René Elio (U. Alberta), Jim Greer (U. Saskatchewan), Randy Goebel (U. Alberta), Scott Goodwin (U. Windsor), Howard Hamilton (U. Regina), Peter Heeman (OGI), Rob Holte (U. Alberta), Froduald Kabanza (U. Windsor), Gerhard Lakemeyer (U. Aachen), Guy Lapalme (U. Montreal), Elliott Macklovitch (U. Montreal), Marzena Makuta (Microsoft), Richard Mann (U. Waterloo), Gord McCalla (U. Saskatchewan), Bob Mercer (U. Western Ontario), Evangelos Milios (Dalhousie U.), Guy Mineau (U. Laval), Eric Neufeld (U. Saskatchewan), David Poole (UBC), Fred Popowich (SFU), Jonathan Schaeffer (U. Alberta), Dale Schuurmans (U. Waterloo), Fei Song (U. Guelph, Ask Jeeves), Deb Stacey (U. Guelph), Suzanne Stevenson (U. Toronto), Stan Szpakowicz (U. Ottawa), Andre Trudel (Acadia U.), Paul Van Arragon (Mitra), Peter van Beek (U. Waterloo), Julita Vassileva (U. Saskatchewan), Eric Yu (U. Toronto), Jianna Zhang (U. Manitoba)
Reviewers
Aijun An, Mohamed Aoun-allah, Liliana Ardissono, Brad Bart, Sabine Bergler, Robert D. Cameron, Jennifer Chu-Carroll, James Delgrande, Chrysanne Di Marco, Toby Donaldson, René Elio, Michael Fleming, Randy Goebel, Scott Goodwin, Jim Greer, Howard Hamilton, Peter Heeman, Robert Holte, Michael C. Horsch, Jimmy Huang, Froduald Kabanza, Tomohiko Kimura, Frederick W. Kroon, Gerhard Lakemeyer, Guy Lapalme, Elliott Macklovitch, Gord McCalla, Robert Mercer, Evangelos Milios, Guy Mineau, Chris Pal, David Poole, Jonathan Schaeffer, Dale Schuurmans, Fei Song, Pascal Soucy, Deborah Stacey, Suzanne Stevenson, Stan Szpakowicz, Andre Trudel, Davide Turcato, Paul van Arragon, Peter van Beek, Julita Vassileva, Kenny Wong, Eric Yu, Jianna Zhang
Sponsoring Institutions
American Association for Artificial Intelligence
National Research Council Canada, Institute for Information Technology
Table of Contents
Agents – 1
Modeling Organizational Rules in the Multi-agent Systems Engineering Methodology
  Scott A. DeLoach
AERO: An Outsourced Approach to Exception Handling in Multi-agent Systems
  David Chen and Robin Cohen
A Learning Algorithm for Buying and Selling Agents in Electronic Marketplaces
  Thomas Tran and Robin Cohen

Search – 2
Generalized Arc Consistency with Application to MaxCSP
  Michael C. Horsch, William S. Havens, and Aditya K. Ghose
Two-Literal Logic Programs and Satisfiability Representation of Stable Models: A Comparison
  Guan-Shieng Huang, Xiumei Jia, Churn-Jung Liau, and Jia-Huai You
Using Communicative Acts to Plan the Cinematographic Structure of Animations
  Kevin Kennedy and Robert E. Mercer

Learning
Mining Incremental Association Rules with Generalized FP-Tree
  Christie I. Ezeife and Yue Su
Topic Discovery from Text Using Aggregation of Different Clustering Methods
  Hanan Ayad and Mohamed Kamel
Genetic Algorithms for Continuous Problems
  James R. Parker

Probability
On the Role of Contextual Weak Independence in Probabilistic Inference
  Cory J. Butz and Manon J. Sanscartier
A Structural Characterization of DAG-Isomorphic Dependency Models
  S. K. M. Wong, D. Wu, and T. Lin
Construction of a Non-redundant Cover for Conditional Independencies
  S. K. M. Wong, T. Lin, and D. Wu

Agents – 2
Using Inter-agent Trust Relationships for Efficient Coalition Formation
  Silvia Breban and Julita Vassileva
Using Agent Replication to Enhance Reliability and Availability of Multi-agent Systems
  Alan Fedoruk and Ralph Deters

Natural Language
An Efficient Compositional Semantics for Natural-Language Database Queries with Arbitrarily-Nested Quantification and Negation
  Richard Frost and Pierre Boulos
Text Summarization as Controlled Search
  Terry Copeck, Nathalie Japkowicz, and Stan Szpakowicz
QUANTUM: A Function-Based Question Answering System
  Luc Plamondon and Leila Kosseim
Generic and Query-Based Text Summarization Using Lexical Cohesion
  Yllias Chali

Poster Papers

Natural Language
A Lexical Functional Mapping Algorithm
  Tamer S. Mahdi and Robert E. Mercer
A Constructive Approach to Parsing with Neural Networks – The Hybrid Connectionist Parsing Method
  Christel Kemke
Extraction of Text Phrases Using Hierarchical Grammar
  Jan Bakus, Mohamed Kamel, and Tom Carey

Learning
An Enhanced Genetic Algorithm Approach to the Channel Assignment Problem in Mobile Cellular Networks
  G. Grewal, T. Wilson, and C. Nell
RFCT: An Association-Based Causality Miner
  Kamran Karimi and Howard J. Hamilton
Retrieval of Short Documents from Discussion Forums
  Fulai Wang and Jim Greer

Probability
Application of Bayesian Networks to Shopping Assistance
  Yang Xiang, Chenwen Ye, and Deborah Ann Stacey

Graduate Student Symposium
User Models: Customizing E-Commerce Websites to the Context of Use
  Rony Abi-Aad, Thiruvengadam Radhakrishnan, and Ahmed Seffah
Supporting the Needs of Mobile Home Care Workers: A Case Study for Saskatoon District Health System
  Golha Sharifi, Julita Vassileva, and Ralph Deters
Multi-agent System Architecture for Computer-Based Tutoring Systems
  Elhadi Shakshuki and P. Kajonpotisuwan
Word Prediction Evaluation Measures with Performance Benchmarking
  Alfred I. Renaud
Relaxed Unification – Proposal
  Tony Abou-Assaleh and Nick Cercone
A Learning Algorithm for Buying and Selling Agents in Electronic Marketplaces
  Thomas Tran and Robin Cohen
cbCPT: Knowledge Engineering Support for CPTs in Bayesian Networks
  Diego Zapata-Rivera
Query-Less Retrieval of Interesting Postings in a WebForum
  Laxmikanta Mishra

Author Index
Modeling Organizational Rules in the Multi-agent Systems Engineering Methodology
Scott A. DeLoach
Department of Computing and Information Sciences, Kansas State University
234 Nichols Hall, Manhattan, KS 66506
[email protected]
Abstract. Recently, two advances in agent-oriented software engineering have had a significant impact: the identification of interaction and coordination as the central focus of multi-agent systems design and the realization that the multi-agent organization is distinct from the agents that populate the system. This paper presents detailed guidance on how to integrate organizational rules into existing multi-agent methodologies. Specifically, we look at the Multi-agent Systems Engineering models to investigate how to integrate the existing abstractions of goals, roles, tasks, agents, and conversations with organizational rules and tasks. We then discuss how designs can be implemented using advanced as well as traditional coordination models.
1 Introduction
Over the last few years, two conceptual advances in agent-oriented software engineering have had a significant impact on our approach toward building multi-agent systems. The first of these was the identification of interaction and coordination as the central focus of multi-agent systems design. That is, interaction and coordination play a central role in the analysis and design of multi-agent systems and make the multi-agent approach significantly different from other approaches toward building distributed or intelligent systems. This realization led to several new methodologies for building multi-agent systems that focused on the interaction between agents as the critical design aspect. Several agent-oriented methodologies fit this form, including MaSE [3], Gaia [10], and MESSAGE [7].

The second, more recent advancement is the separation of the agents populating a system from the system organization [11]. While agents play roles within the organization, they do not constitute the organization. The organization itself is part of the agent’s environment and defines the social setting in which the agent must exist. An organization includes organizational structures as well as organizational rules, which define the requirements for the creation and operation of the system. These rules include constraints on agent behavior as well as on agent interactions. Agents and organizations have separate responsibilities; the organization, not the agents, should be responsible for setting and enforcing the organizational rules.

R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 1-15, 2002. Springer-Verlag Berlin Heidelberg 2002
Organizational design has many advantages over traditional multi-agent systems design methods. First, it defines a clean separation between the agent and the organization in which the agent works, which in turn simplifies each design. In traditional agent-oriented approaches, the rules that govern interaction must be incorporated into the agents themselves, thus intertwining the organizational design with the various agent designs. Second, separating the organization from the agent allows the developer to build a separate organizational structure that can enforce the organizational rules. This is especially critical in open systems where we do not know the intent of the agents working within the system.

While these advances are rather recent, there have been some discussions on how to incorporate them into existing multi-agent systems methodologies. For instance, there is a proposal to modify the Gaia multi-agent systems methodology to incorporate the notion of social laws [12]. Other approaches view the organization as a separate institutional agent [9]. However, these proposals have been made at a high level and do not provide concrete guidance on how to use existing analysis and design abstractions with advanced coordination models and organizational concepts. Also, the advent of more powerful coordination models, such as hybrid coordination media, has allowed us to imagine new ways of implementing organizational rules. With these advanced models, we can now embed organizational rules in the coordination media instead of implementing them internally in the individual agents [1].

The goal of this paper is to present more detailed guidance on how to integrate organizational rules into existing multi-agent methodologies. Specifically, we will look at the Multi-agent Systems Engineering (MaSE) analysis and design models to investigate how to integrate the existing abstractions of goals, roles, tasks, agents, and conversations with organizational rules. 
We will also briefly take a look at how we can use advanced coordination models to implement multi-agent systems that separate agents from the organizational rules that govern them. We believe that extending existing conversation-based multi-agent analysis and design approaches with organizational rules is a major step toward building coherent, yet adaptive multi-agent systems in a disciplined fashion. While one might be tempted to simply throw out the concept of conversations altogether in favor of some of the more powerful models being proposed, we resist that urge for two basic reasons. First, conversation-based approaches are widely understood and provide an easily understandable metaphor for agent-to-agent communication. Second, conversation-based approaches have been shown to be verifiable and to give designers some measure of system coherence [5]. Using the full power of these coordination models without restraint could lead to multi-agent system designs that are not understandable, verifiable, or coherent.

In Section 2, we discuss how to model organizational rules in MaSE. In Section 2.1, we look at the analysis phase where we add the notion of organizational rules to the existing MaSE analysis models. In Section 2.1.4 we show how to map the various analysis artifacts, including organizational rules, into an enhanced design model that explicitly models the organization through the notion of organizationally based tasks. Finally, in Section 3 we show how these organizational tasks might be implemented. We end with a discussion of our results and conclusions in Section 4.
2 Modeling Organizational Rules in MaSE
In this section we show how we have extended the MaSE analysis and design phases to take advantage of the concept of organizational rules. In the analysis phase, we add a new model, the organizational model, to capture the organizational rules themselves, while in the design phase, we introduce the concept of organizationally-based tasks to carry out specific tasks that are part of the organization and do not belong to a specific agent. These tasks are often used to implement and enforce the organizational rules defined during analysis.

Throughout this paper, we will use the conference management example as defined in [11]. The conference management system is an open multi-agent system supporting the management of variously sized international conferences that require the coordination of several individuals and groups. There are five distinct phases in which the system must operate: submission, review, decision, and final paper collection. During the submission phase, authors should be notified of paper receipt and given a paper submission number. After the deadline for submissions has passed, the program committee (PC) has to review the papers by either contacting referees and asking them to review a number of the papers, or reviewing them themselves. After the reviews are complete, a decision on accepting or rejecting each paper must be made. After the decisions are made, authors are notified of the decisions and are asked to produce a final version of their paper if it was accepted. Finally, all final copies are collected and printed in the conference proceedings. The conference management system consists of an organization whose membership changes during each stage of the process (authors, reviewers, decision makers, review collectors, etc.). 
Also, since each agent is associated with a particular person, it is not impossible to imagine that the agents could be coerced into displaying opportunistic, and somewhat unattractive, behaviors that would benefit their owner to the detriment of the system as a whole. Such behaviors could include reviewing one's own paper or allocating work unfairly among reviewers.

2.1 The Analysis Phase

The purpose of the MaSE analysis phase is to produce a set of roles whose tasks describe what the system has to do to meet its overall requirements. A role describes an entity that performs some function within the system. In MaSE, each role is responsible for achieving, or helping to achieve, specific system goals or sub-goals. Because roles are goal-driven, we also chose to abstract the requirements into a set of goals that can be assigned to the individual roles. Our approach is similar to the notions used in KAOS [6]. The overall approach in the MaSE analysis phase is fairly simple: define the system goals from a set of functional requirements and then define the roles necessary to meet those goals. While a direct mapping from goals to roles is possible, MaSE suggests the use of use cases to help validate the system goals and derive an initial set of roles. As stated above, the ultimate objective of the analysis phase is to transform the goals and use cases into roles and their associated tasks since they are forms more suitable for designing multi-agent systems. Roles form the foundation for agent classes and represent system goals during the design phase, thus
the system goals are carried into the system design. To support organizational rules, the MaSE analysis phase was extended with an explicit organizational model, which is developed as the last step in the analysis phase and is defined using concepts from the role and ontology models.

2.1.1 Role Model
Due to space limitations, we will skip the goal and use case analysis for the conference system example and jump right to the role model. The MaSE role model depicts the relationships between the roles in the conference management system, as shown in Fig. 1. In Fig. 1, a box denotes each role while a directed arrow represents a protocol between roles, with the arrows pointing from the initiator to the responder. Notice that while we referred to the PC chair and PC members in the problem description, we have intentionally abstracted out the roles played by those typical positions into partitioning, assigning reviews, reviewing papers, collecting reviews, and making the final decision. As we will see later, this provides significant flexibility in the design phase. The system starts by having authors submit papers to a paper database (PaperDB) role, which is responsible for collecting the papers, along with their abstracts, and providing copies to reviewers when requested. Once the deadline for submissions has passed, the person responsible for partitioning the entire set of papers into groups to be reviewed (the Partitioner role) asks the PaperDB role to provide it the abstracts of all papers. The Partitioner partitions the papers and assigns them to a person (the Assigner) who is responsible for finding n reviewers for each paper. Once assigned a paper to review, a Reviewer requests the actual paper from the PaperDB, prepares a review, and submits the review to the Collector. Once all (or enough) of the reviews are complete, the Decision Maker determines which papers should be accepted and notifies the authors.
[Figure: role model diagram with the roles Partitioner, Assigner, Reviewer, Collector, Decision Maker, PaperDB, and Author, connected by the protocols make assignments, review papers, get reviews, submit review, inform authors, retrieve paper, retrieve abstracts, and submit paper.]
Fig. 1. Role Model for Conference Management System
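As a rough illustration, the role model of Fig. 1 can be written down as a list of protocol triples. The sketch below is not part of MaSE; the tuple layout, the helper names `roles` and `initiated_by`, and the direction chosen for the get reviews protocol (Decision Maker as initiator) are assumptions made for this example.

```python
# Hypothetical encoding of the Fig. 1 role model as
# (protocol name, initiator role, responder role) triples.
# Directions follow the prose where it states them; the direction of
# "get reviews" is an assumption, since the text does not spell it out.
PROTOCOLS = [
    ("submit paper", "Author", "PaperDB"),
    ("retrieve abstracts", "Partitioner", "PaperDB"),
    ("make assignments", "Partitioner", "Assigner"),
    ("review papers", "Assigner", "Reviewer"),
    ("retrieve paper", "Reviewer", "PaperDB"),
    ("submit review", "Reviewer", "Collector"),
    ("get reviews", "DecisionMaker", "Collector"),
    ("inform authors", "DecisionMaker", "Author"),
]


def roles(protocols):
    """Return every role that initiates or responds to some protocol."""
    found = set()
    for _name, initiator, responder in protocols:
        found.add(initiator)
        found.add(responder)
    return found


def initiated_by(role, protocols):
    """Return the names of the protocols that the given role starts."""
    return [name for name, initiator, _ in protocols if initiator == role]
```

Listing the protocols this way makes it easy to check that the diagram and the prose agree, for example that the model contains exactly the roles named in the text.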
Thus, we have identified seven explicit roles. However, in MaSE, we do not stop at simply identifying the roles; we also identify the tasks that the roles must perform in accomplishing their goals. Therefore, a more detailed version of the conference management system role model is shown in Fig. 2. In MaSE, we have extended the traditional role model by adding the tasks (shown using ellipses attached to each role). Generally, each role performs a single task, whose definition is straightforward and documented in a concurrent task diagram (not discussed here due to space
limitations), which defines agent behaviour and interaction via finite state machines. However, some roles, such as the PaperDB or Reviewer roles, have multiple tasks. For instance, the PaperDB role has three tasks: Collect Papers, Distribute Papers, and Get Abstracts. While the tasks are related, they are distinct and are thus modelled separately. The Collect Papers task accepts papers and ensures they are in the right format and meet all the eligibility requirements. The Get Abstracts task extracts the abstract from submitted papers and sends them to a Partitioner. The Distribute Papers task simply distributes accepted papers to the appropriate Reviewers when requested.
[Figure: the role model of Fig. 1 with tasks attached to each role, including Partition Papers, Assign to Reviewers, Review Paper, Select Papers, Collect Papers, Get Abstracts, Distrib Papers, Submit Paper, and Write Paper.]
Fig. 2. Expanded MaSE role model
2.1.2 Ontology Model
The next step in the MaSE analysis phase is to develop an Ontology Model, which defines the data types and their relationships within the system [4]. Fig. 3 shows an ontology model for the conference review system. The ontology is focused around the central data type, a paper, each with an associated abstract and a set of reviews. Given the ontology, we can talk about the reviews a paper has received, paperReview(p), or a paper’s abstract, paperAbstract(p), etc. There are also constraints placed on the data via the ontology. For instance, each abstract must have exactly one paper and each paper must have exactly one abstract. Also, a review can only exist on a single paper, while a paper may have any number of reviews on it (including none). Thus several organizational constraints can be defined in the ontology itself. Using the ontology model, we can extract a number of functions to describe the data in our system. The functions and their resulting types for the conference management system are shown in Table 1. These functions can be used in conjunction with protocol functions to describe many relationships, as we will see in the next section.
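[Figure: ontology diagram. A Paper (attribute author : String) is linked to exactly one Abstract (paperAbstract 1, wholePaper 1) and to zero or more Reviews (paperReview 0..*), each of which refers to exactly one Paper (reviewedPaper 1).]
Fig. 3. Conference Management Ontology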
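The ontology and the functions derived from it can be sketched as plain data classes. This is only an illustrative reading of Fig. 3: the function names paperAbstract, wholePaper, paperReview, and reviewedPaper come from the figure, while the class layout, the `text` fields, and the helpers `make_paper` and `add_review` are assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# eq=False keeps identity-based equality, which avoids infinite recursion
# when comparing objects that refer back to each other.
@dataclass(eq=False)
class Abstract:
    text: str
    paper: Optional["Paper"] = None  # back-link, filled in by make_paper


@dataclass(eq=False)
class Review:
    text: str
    paper: "Paper"  # a review exists on exactly one paper


@dataclass(eq=False)
class Paper:
    author: str
    abstract: Abstract  # exactly one abstract per paper
    reviews: List[Review] = field(default_factory=list)  # zero or more


def make_paper(author, abstract_text):
    """Hypothetical helper: create a paper and its abstract, linked both ways."""
    abstract = Abstract(abstract_text)
    paper = Paper(author, abstract)
    abstract.paper = paper
    return paper


def add_review(paper, text):
    """Hypothetical helper: attach a new review to a paper."""
    review = Review(text, paper)
    paper.reviews.append(review)
    return review


# Functions named in the ontology figure.
def paperAbstract(p):
    return p.abstract


def wholePaper(a):
    return a.paper


def paperReview(p):
    return list(p.reviews)


def reviewedPaper(r):
    return r.paper
```

The multiplicity constraints of the ontology show up as types here: a single `abstract` field for the one-to-one link, and a list for the zero-or-more reviews.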
Table 1. Functions Derived from Ontology
In our previous treatments of MaSE, we would go to the design phase at this point. However, this is precisely the point at which we can effectively begin to identify organizational rules. By definition, organizational rules define constraints on agent behavior and their interactions. At the analysis level, this equates to restrictions on the roles an agent may play or how an agent may interact with other agents. To state these rules in a formal manner, we must have a language based on analysis artifacts. This language is defined by the role model, the ontology model, and a set of meta-predicates. We can use the protocols and roles defined in the Role Model to describe how the system will operate, which will be very useful when defining organizational rules. For instance, we can refer to an agent playing a particular role. We annotate this using a data-type-like notation; for instance, r:Reviewer states that agent r is of type (i.e., plays the role of a) Reviewer. Thus if we wanted to state that the agent making final decisions cannot be an author of any papers for the conference, we could say

∀ a:Author, d:DecisionMaker  d ≠ a
Another way to state the same requirement would be through the use of a meta-predicate Plays, which states that a particular agent plays a particular role. Therefore, we could state the same requirement as

∀ a:Agent  ¬(Plays(a, Author) ∧ Plays(a, DecisionMaker))
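To make the Plays meta-predicate concrete, the rule above can be checked against a role assignment. The following is a minimal sketch, assuming that the assignment is available as a dictionary from agent names to sets of role names; the dictionary layout and the function names are hypothetical and are not part of MaSE.

```python
def plays(assignment, agent, role):
    """Plays(agent, role): true if the agent currently plays the role."""
    return role in assignment.get(agent, set())


def author_decider_violations(assignment):
    """Return the agents that play both Author and DecisionMaker.

    An empty result means the rule
    forall a:Agent not(Plays(a, Author) and Plays(a, DecisionMaker))
    holds for this assignment.
    """
    return sorted(
        agent
        for agent in assignment
        if plays(assignment, agent, "Author")
        and plays(assignment, agent, "DecisionMaker")
    )
```

For example, an assignment in which one agent holds both roles would be reported, while agents that hold only one of the two roles would not.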
The use of meta-predicates can be useful in stating requirements. For instance, if we want all agents in the system to be authors, we can simply state ∀ a:Agent Plays(a, Author), which is simpler than using the data type notation. We will also need to refer to the relationships between agents (or roles) in the system. Since the only relationships we have defined in MaSE are via protocols, we use protocol instances to specify relationships. We refer to a protocol between two agents as protocolName(initiator, responder, data), which states that a protocol exists between two roles, initiator and responder, and concerns a particular piece of data. The initiator and responder must be capable of playing the appropriate roles and the data must refer to data passed between roles via the protocol. Thus the expression reviewPapers(a, r, p) states that a protocol named reviewPapers exists between the roles a and r (involving a paper, p), which must be capable of playing the Assigner and Reviewer roles respectively. Thus if we wanted to state that a Reviewer can only review papers for one Assigner, we could make the following rule.

∀ a1, a2:Assigner, r:Reviewer, p1, p2:Paper  reviewPapers(a1, r, p1) ∧ reviewPapers(a2, r, p2) ⇒ a1 = a2
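Protocol instances of the form protocolName(initiator, responder, data) map naturally onto tuples, and the single-Assigner rule then becomes a simple scan. The sketch below assumes the set of reviewPapers instances is known; the `ReviewPapers` tuple type and the function name are illustrative assumptions.

```python
from collections import namedtuple

# Hypothetical record for one reviewPapers(initiator, responder, data) instance.
ReviewPapers = namedtuple("ReviewPapers", "initiator responder data")


def single_assigner(instances):
    """Check that every Reviewer receives papers from only one Assigner.

    Mirrors: reviewPapers(a1, r, p1) and reviewPapers(a2, r, p2) => a1 = a2.
    """
    assigner_of = {}
    for inst in instances:
        # Remember the first Assigner seen for this Reviewer.
        first = assigner_of.setdefault(inst.responder, inst.initiator)
        if first != inst.initiator:
            return False
    return True
```

A Reviewer may still appear in several instances, as long as all of them share the same initiator.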
Although we can state some requirements using only concepts from the role model, there are other times when we must relate roles and their relationships based on particular data in the system. For instance, in the conference management system we are interested in the relationships between roles based on the papers they submit, review, or collect. Thus we must be able to talk about the data in the system as well, which is defined by the ontology model. In the original paper describing the conference management system in terms of organizational rules [11], the authors defined seven organizational rules. While the authors stated the rules using a formal notation, there was no real definition of how the rules mapped to the artifacts of their analysis and design. Here we will redefine them using the notation presented above based on the role and ontology models. The rules as originally presented are shown below using the temporal operators as defined in Table 2.

∀p : #(reviewer(p)) ≥ 3
∀i, p : Plays(i, reviewer(p)) ⇒ ○□¬Plays(i, reviewer(p))
∀i, p : Plays(i, author(p)) ⇒ □¬Plays(i, reviewer(p))
∀i, p : Plays(i, author(p)) ⇒ □¬Plays(i, collector(p))
∀i, p : participate(i, receivePaper(p)) ⇒ ◇initiate(i, submitReview(p))
∀i, p : participate(i, receivePaper(p)) B initiate(i, submitReview(p))
∀p : [submittedReviews(p) > 2] B initiate(chair, decision(p))

The first rule states that there must be at least three reviewers for each paper (# is cardinality) while rule two keeps a reviewer from reviewing the same paper more than once. Rules three and four attempt to limit selfish agent behaviour by ensuring that a paper author does not review or collect reviews of his or her own paper. The last three rules describe appropriate system operation. Rule five states that if a paper is received, it should eventually be reviewed. Rule six requires that a paper must actually be received before a review can be submitted on it while rule seven requires that there be more than two reviews before a paper can be accepted or rejected.

Table 2. Temporal Operators

○ϕ       ϕ is true next
□ϕ       ϕ is always true
◇ϕ       ϕ is eventually true
ϕ B φ    ϕ is true before φ is true
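The four temporal operators can be given an executable reading over a finite trace of system states. This is a sketch under stated assumptions, not the semantics used in [11]: a trace is taken to be a list of sets of atomic facts, all operators are evaluated on the finite trace only, and "before" is read as "every occurrence of the second formula is preceded by an earlier occurrence of the first". All function names are hypothetical.

```python
def atom(name):
    """Formula that holds in a state when the named fact is present."""
    return lambda trace, i: name in trace[i]


def negate(phi):
    """Logical negation of a formula."""
    return lambda trace, i: not phi(trace, i)


def next_(phi):
    """phi is true next: phi holds in the following state, if there is one."""
    return lambda trace, i: i + 1 < len(trace) and phi(trace, i + 1)


def always(phi):
    """phi is always true: phi holds in every state from position i on."""
    return lambda trace, i: all(phi(trace, j) for j in range(i, len(trace)))


def eventually(phi):
    """phi is eventually true: phi holds in some state from position i on."""
    return lambda trace, i: any(phi(trace, j) for j in range(i, len(trace)))


def before(phi, psi):
    """phi B psi: psi never holds unless phi held in a strictly earlier state."""
    def check(trace, i):
        seen_phi = False
        for j in range(i, len(trace)):
            if psi(trace, j) and not seen_phi:
                return False
            seen_phi = seen_phi or phi(trace, j)
        return True
    return check
```

With these combinators, rule six for one agent and paper could be written as `before(atom("receivePaper"), atom("submitReview"))` and evaluated at position 0 of a recorded trace.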
The first organizational rule states that each paper should have at least three reviewers. While we might be tempted to use the ontology model to say that each paper should have three or more reviews, this does not adequately capture the requirement. What we want to state is that three agents, playing the part of reviewers, should be assigned to each paper, which requires more knowledge than is in the ontology. It requires that we combine relationships and data definitions from the ontology with relationships (defined by protocols) defined in the role model. What we
8
Scott A. DeLoach
need to say is that for a given paper, p, there must be at least three reviewers assigned. Since the review assignment process is accomplished via the reviewPapers protocol between the Assigner role and the Reviewer role, there must be three instances of that protocol for paper p. Thus we can state the requirement as ∀ p:Paper, a:Assigner, r:Reviewer #{r | reviewPapers(a,r,p)} ≥ 3
The second rule keeps a reviewer from reviewing the same paper more than once. While this appears to be subsumed by our first rule, in fact it is not. Our first rule states that we must have three unique reviewers, but it does not stop them from submitting multiple reviews on the same paper. To accomplish this, we must limit the number of submitReview protocols that can exist between the Reviewer role and any Collector roles for a given paper. This is formalized as

∀ r1, r2:Review, r:Reviewer, c1, c2:Collector
    submitReview(r,c1,r1) ⇒ ○□(¬submitReview(r,c2,r2) ∨ reviewedPaper(r1) ≠ reviewedPaper(r2))
The next two rules (three and four) limit selfish agent behavior by ensuring that a paper author does not review or collect reviews of his or her own paper. The first of these rules states that an author may not review his or her own paper while the second does not let the author act as a collector of the reviews on his or her paper. There are two approaches to modeling an author. As defined in [11], we could assume that the author is the one who submits the paper and identify the author as the role that submits the paper to the PaperDB role via the submitPaper protocol. The second approach would be to use the author attribute of the paper object and compare it to the reviewer. This would require the ability to identify the name of the Reviewer role, which would require an extension to the MaSE role model. Therefore, we will use the first approach and define the third rule as

∀ a:Author, d:PaperDB, p:Paper, s:Assigner, r:Reviewer, c:Collector, r1:Review
    submitPaper(a,d,p) ⇒ ¬(submitReview(r,c,r1) ∧ a = r ∧ r1 = paperReview(p))
Likewise, the fourth rule ensures the author does not participate as a collector.

∀ a:Author, d:PaperDB, p:Paper, r:Reviewer, c:Collector, r1:Review
    submitPaper(a,d,p) ⇒ ¬(submitReview(r,c,r1) ∧ a = c ∧ r1 = paperReview(p))
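Rules three and four can be checked against recorded protocol instances by identifying the author with whoever submitted the paper, as in the first modeling approach. The sketch below assumes each submitReview record already names the paper it reviews (standing in for paperReview); the record types and the function name are hypothetical.

```python
from typing import NamedTuple


class SubmitPaper(NamedTuple):
    author: str
    db: str
    paper: str


class SubmitReview(NamedTuple):
    reviewer: str
    collector: str
    paper: str  # the paper the submitted review is about (assumed to be known)


def author_conflicts(submissions, reviews):
    """Return (rule, agent, paper) for each violation of rule three or four."""
    authors = {}
    for s in submissions:
        authors.setdefault(s.paper, set()).add(s.author)
    found = []
    for r in reviews:
        paper_authors = authors.get(r.paper, set())
        if r.reviewer in paper_authors:   # rule three: author reviews own paper
            found.append((3, r.reviewer, r.paper))
        if r.collector in paper_authors:  # rule four: author collects own reviews
            found.append((4, r.collector, r.paper))
    return found


submissions = [SubmitPaper("alice", "db", "p1"), SubmitPaper("bob", "db", "p2")]
reviews = [
    SubmitReview("carol", "dave", "p1"),
    SubmitReview("alice", "dave", "p1"),
    SubmitReview("carol", "bob", "p2"),
]
conflicts = author_conflicts(submissions, reviews)
```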
Finally, the last three rules define the way in which the system should operate. Rule five simply requires that if a paper is submitted via the submitPaper protocol, a review should eventually be submitted to a collector via the submitReview protocol. This rule is stated straightforwardly using the appropriate temporal operator.

∀ a:Author, d:PaperDB, p:Paper, r:Reviewer, c:Collector, r1:Review
    submitPaper(a,d,p) ⇒ ◇(submitReview(r,c,r1) ∧ r1 = paperReview(p))
Rule six, requiring that the paper be submitted before it can be reviewed, can be defined as
Modeling Organizational Rules in the Multi-agent Systems Engineering Methodology
Finally, the last rule, requiring at least two submitted reviews per paper before a decision can be rendered, can be encoded as

∀ r:Reviewer, c:Collector, r1:Review, m:DecisionMaker, a:Author, p:Paper
    #{r1 | submitReview(r,c,r1) ∧ r1 = paperReview(p)} ≥ 2 B (informAuthor(m,a,p))
During the analysis phase, these organizational rules are collected and defined in terms of the ontology and role model; however, they are integrated into the overall system design in the next stage. It is at this point that the designer must decide how to monitor or enforce these rules. As we will see, the rules can be assigned to a particular agent in the design or they can be implemented via conversational, monitoring, or enforcement tasks as organizational tasks.

2.1.4 The Design Phase
The initial step in the MaSE design phase is to define agents from the roles defined in the analysis phase. The product of this phase is an Agent Class Diagram, as shown in Fig. 4, which depicts the overall agent system organization defined by agent classes and conversations between them. An agent class is a template for a type of agent in the system and is analogous to an object class in object-orientation while an agent is an instance of an agent class. During this step, agent classes are defined in terms of the roles they will play and the conversations in which they must participate. In the diagram, each box denotes an agent type (with the roles it plays listed under its name) while directed arrows represent conversations between agent types, with semantics similar to role model protocols.

Fig. 4. Agent Class Diagram (figure not reproduced; labels: Business with role Seller, Consumer with role Buyer, conversation auctionProduct)
In this paper we extend the Agent Class Diagram with organizationally based tasks, a new concept that allows us to model aspects of the organization independently of the agents. Organizationally based tasks are tasks that are assigned to the organization (as opposed to a particular agent) and can be used to implement social tasks, monitor system and individual agent behavior, and enforce organizational and security rules. An example of an organizationally based task is shown in Fig. 5. The Seller and Buyer boxes are agents while the rounded rectangle denotes the organization. The ellipse in the organization box is an organizationally based task, Auction, which was derived from a task belonging to a role in the role model. In the initial step of the design phase, the designer determines the roles each agent type will play as well as which roles (and tasks) will be relegated to the organization. The designer may also create new organizationally based tasks to implement and enforce the organizational rules defined during the analysis phase.
In the remainder of this section, we take our analysis of the conference management system, including the organizational rules, and show how it can be developed into a number of different designs using organizationally-based tasks in conjunction with conventional MaSE Agent Class Diagrams. The goal here is to show a number of different options that are available with the notion of organizationally based tasks, not to advocate a particular approach as being necessarily better in all instances.

Fig. 5. Organizationally-Based Task (figure not reproduced; labels: Business with role Seller, Consumer with role Buyer, conversations sellProduct and buyProduct, task Auction inside the Organization)
2.1.5 Design 1 – Traditional
Traditional multi-agent design approaches, as advocated in [3], might result in the design shown in Fig. 6. In this design, various roles are combined into agents. For instance, the PC Chair agent plays the Partitioner, Collector, and Decision Maker roles while the PC Member agent plays both the Assigner and Reviewer roles. Outside of author agents, the only other agent is the DB agent, which provides an interface to the database containing the papers, abstracts, author information, etc.

Fig. 6. Traditional design (figure not reproduced; agent and role labels: PCChair, Partitioner, Collector, Decision Maker, PCMember, Assigner, Reviewer, DB, PaperDB, Author; conversations: retrieve abstracts, collect reviews, make assignments, inform authors, retrieve paper, submit paper)
Unfortunately, the traditional multi-agent design described above does not provide the separation of agent tasks from social, or organizational, tasks, which is desirable for extensible, open multi-agent systems [2]. To ensure the organizational rules are enforced, we must interweave the organizational rules into the individual agents themselves. For example, the only place we can check to ensure that at least two reviews were completed before the decision to accept or reject a paper was made (rule 7) is in the PC Chair agent itself. This forces us to rely on self-policing agents, which, if we assume the possibility of self-interested agents, is a less than desirable approach to ensuring the enforcement of organizational rules.
2.1.6 Design 2 – Assigning Tasks to the Organization
As advocated by some [2], the appropriate place to monitor and enforce organizational rules is in the organization itself. Thus, using the same analysis, we have created a new design that uses organization-based tasks to implement the PaperDB and Collector roles. Fig. 7 shows the details of this new design. Notice that the tasks of the PaperDB and the Collector roles have been assigned to the organization. In effect, their tasks become part of the organization as organizationally based tasks.

Fig. 7. Design with explicit tasks (figure not reproduced; agent and role labels: PCChair, Partitioner, DecisionMaker, PCMember, Assigner, Reviewer, Author; conversations: get reviews, inform authors, make assignments, review papers, retrieve abstracts, retrieve paper, collect reviews, submit paper; tasks inside the Organization: GetAbstracts, DistribPapers, CollectReviews, CollectPapers)
By being part of the organization, the Get Abstracts, Distribute Papers, Collect Papers, and Collect Reviews tasks can more easily support the conference management organizational rules. This is because the information collected and used by these tasks can easily be shared through a common database. For instance, the Distribute Papers task can enforce rule 3 (an author cannot review his or her own paper) by simply checking the reviewer against the paper author. Likewise, the Collect Reviews task can monitor rule 5 (if a reviewer receives a paper, he or she must eventually submit a review) and send warnings if reviews are not submitted in a timely fashion. The same task can also enforce rule 6 (the paper must be received by a reviewer before the review is submitted) by not accepting reviews until the paper has actually been requested, as well as rule 7 (there must be at least two reviews before the chair can make a decision) by only sending reviews once there are at least two of them. This design approach also allows the organizational rules to be updated without necessarily affecting the individual agent designs.

2.1.7 Design 3 – Designing New Organizational Roles
A third design that does not assign tasks from the role model to the organization is shown in Fig. 8. However, we still use organizationally based tasks to monitor and enforce the organizational rules presented above. We do this by creating new tasks in the design to implement the organizational rules. For instance, in Fig. 8, there are three organizational tasks (Monitor Num Reviews, Monitor Decision, and Monitor Reviewers) that did not exist in the role model, but were added by the designer to monitor/enforce organizational rules 2, 3, and 7. The dashed lines between the tasks and the conversations denote that the tasks monitor those conversations by executing when the conversations are started. These tasks may simply monitor the communication between agents and either display or log the information of interest. For instance, the Monitor Decision task might monitor the inform authors conversations and log only those decisions that are made without the required number of reviews being made. Note that the Monitor Decision task would have access to this information via tuples shared by the Monitor Num Reviews task.

Fig. 8. Design with monitoring/conversational tasks (figure not reproduced; tasks inside the Organization, with the rules they address: Monitor Num Reviews (7), Monitor Decision (7), Monitor Reviewers (2, 3); agent and role labels: PCChair, Partitioner, Collector, Decision Maker, PCMember, Assigner, Reviewer, Author, DB, PaperDB; conversations: inform authors, collect reviews, make assignments, retrieve abstracts, review papers, retrieve paper, submit paper)
A task that simply monitors a conversation is shown in Fig. 9. In modelling monitoring tasks, we assume that the task receives a message before the agent on the other end of the conversation and must forward the message before the intended recipient can receive it. In Fig. 9a this is shown by the receive event that initiates the transition from the start state. Once the message is received, the Monitor Decision task validates it (in this case, that the paper has had at least two reviews) and, if valid, passes the message along to the intended recipient.

We can use the same basic design as shown in Fig. 8 but use tasks that do more than just monitor the conversations; they may actually interrupt the conversation or modify the data being passed between agents, thus providing correction, either directly or indirectly, to the offending agents. For example, Fig. 9b defines a task that intercepts the notice message being sent to an author; if the correct number of reviews has not been accomplished, the task sends the PC Chair a message stating that the decision was invalid instead of forwarding the notice message on to the author.

Of course, a task that communicates directly with agents in a conversation forces the agents involved to be able to handle additional communication. Thus, the original inform authors conversation (from the viewpoint of the PC Chair) must be modified to work with this type of task. Specifically, the PC Chair’s side of the conversation must be able to handle an invalidDecision message from the organization. Thus, in Fig. 10, we have modified the conversation to accept the invalidDecision after sending the original notice. This is an example of the strength of using a conversation-based design approach. Using conversations, it is possible to trace the sequence of possible messages through the system and thus automatically verify that all conversations and tasks are consistent and do not cause unwanted side effects such as deadlock.

(Figure residue; transition label: receive(notice(accept, paper), pcchair, author))
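The intercepting behaviour described for Fig. 9b can be sketched as a small task object that sits between the PC Chair and the author. This is an illustration only, not the task diagram from the paper: the `Message` fields, the class name, and the `record_review` hook are hypothetical, while the message names notice and invalidDecision and the two-review threshold come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    name: str      # e.g. "notice" or "invalidDecision"
    sender: str
    receiver: str
    paper: str


@dataclass
class MonitorDecision:
    required_reviews: int = 2
    review_counts: dict = field(default_factory=dict)  # paper -> reviews seen

    def record_review(self, paper: str) -> None:
        """Called whenever a review for `paper` is observed (assumed hook)."""
        self.review_counts[paper] = self.review_counts.get(paper, 0) + 1

    def intercept(self, msg: Message) -> Message:
        """Return the message that should actually be delivered."""
        if msg.name != "notice":
            return msg  # only decision notices are validated
        if self.review_counts.get(msg.paper, 0) >= self.required_reviews:
            return msg  # valid decision: forward unchanged to the author
        # Invalid decision: answer the sender instead of forwarding the notice.
        return Message("invalidDecision", "organization", msg.sender, msg.paper)


task = MonitorDecision()
notice = Message("notice", "pcchair", "author", "p1")
rejected = task.intercept(notice)   # no reviews recorded yet
task.record_review("p1")
task.record_review("p1")
forwarded = task.intercept(notice)  # two reviews recorded
```

A task that only monitors, as in Fig. 9a, would log the invalid decision rather than replace the message.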
Ideally, organization-based tasks would be implemented using a coordination model that has equivalent structures, such as hybrid coordination media. Hybrid coordination models are data-centered coordination models that include (1) a logically centralized repository where the agents read and write data and (2) a set of reactions, which are functions that react to, and can read and modify, data in the data store [1, 8]. In a hybrid coordination medium, the medium itself has the ability to see the communication between agents and perform tasks in reaction to those communications. Thus we could easily model organizationally based tasks as reactions in hybrid coordination media. For role model tasks that are assigned to the organization, the hybrid model is ideal since the reaction is not under control of an individual agent, but is part of the organization itself and thus is started at system initialisation. Such tasks may include controlling the introduction of new agents in the system. Tasks that intercept messages and forward them on if they are valid, as well as those that just monitor messages, are also easily implemented in hybrid models. The ability of reactions to read all data in the data store allows them to monitor messages and take action when necessary. For example, if authors are required to submit papers in PDF format, we could enforce this rule via a reaction that would automatically convert non-PDF formats to PDF; the reaction would simply extract any non-conforming papers and replace them with the appropriate PDF version.

While useful, such advanced coordination models are not required to take advantage of an organizational design approach. While it might be less efficient, these designs could also be implemented using a more traditional message-oriented middleware component. One approach would be to build an “organization” agent (or agents) that would handle all the organization tasks that would normally be assigned to reactions in a hybrid coordination medium. Using this approach, all critical communications can be routed through organizationally based tasks to ensure the organizational rules are adhered to. Whether using a hybrid coordination medium or organizational agents, the advantages of separating organizational tasks and rules from the agents would remain.
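The idea of a data store with reactions can be illustrated with a toy in-memory store. This is a sketch, not the API of MARS or of tuple centres: the `TupleStore` class, the three-field tuple layout, and the `enforce_pdf` reaction are all hypothetical, and the format change stands in for a real document conversion.

```python
class TupleStore:
    """Toy data store whose reactions run on every write (hypothetical API)."""

    def __init__(self):
        self.tuples = []
        self.reactions = []

    def add_reaction(self, reaction):
        self.reactions.append(reaction)

    def write(self, tup):
        # Each reaction may read the whole store and replace the incoming tuple.
        for reaction in self.reactions:
            tup = reaction(self, tup)
        self.tuples.append(tup)


def enforce_pdf(store, tup):
    """Reaction for the PDF rule; tuples are assumed to be (kind, id, format)."""
    kind, paper_id, fmt = tup
    if kind == "paper" and fmt != "pdf":
        return (kind, paper_id, "pdf")  # placeholder for an actual conversion
    return tup


store = TupleStore()
store.add_reaction(enforce_pdf)
store.write(("paper", "p1", "ps"))
store.write(("paper", "p2", "pdf"))
```

Because the reaction belongs to the store rather than to any agent, the rule can be changed without touching the agents that write papers.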
4 Results and Conclusions
The goal of this paper was to present our approach toward integrating organizational rules within the MaSE methodology. To accomplish our goal, we extended the MaSE analysis phase with an explicit organizational model, which defines organizational constraints based on concepts defined in the role and ontology models. In the design phase, we extended the MaSE Agent Class Diagram with an explicit organization artifact, which contains its own organizationally based tasks. We also showed various approaches toward integrating the organizational rules defined in the analysis model. We also discussed various approaches to implementing organizational tasks including both hybrid coordination media as well as traditional message passing media. While we originally developed MaSE to design closed multi-agent systems, the incorporation of organizational rules moves it toward being useful for the analysis and design of open systems as well. While MaSE still requires specific coordination protocols, designers no longer have to rely on incorporating organizational rules into the agents themselves. The concept of organizational tasks provides a mechanism to allow agents to enter the system, monitor their behavior, and ensure compliance with organizational rules and protocols.
References

1. Cabri, G., Leonardi, L., and Zambonelli, F. Implementing Agent Auctions using MARS. Technical Report MOSAICO/MO/98/001.
2. Ciancarini, P., Omicini, A., and Zambonelli, F. Multi-agent System Engineering: the Coordination Viewpoint. Intelligent Agents VI. Agent Theories, Architectures, and Languages, 6th International Workshop (ATAL'99), Orlando (FL), May 1999, Proceedings. LNAI 1757, Springer-Verlag, 2000.
3. DeLoach, S.A., Wood, M.F., and Sparkman, C.H. Multi-agent Systems Engineering. The International Journal of Software Engineering and Knowledge Engineering, Volume 11, no. 3, June 2001.
4. Dileo, J.M. Ontological Engineering and Mapping in Multi-agent Systems Development. MS thesis, AFIT/GCS/ENG/02M-03. School of Engineering, Air Force Institute of Technology, Wright Patterson Air Force Base, OH, 2002.
5. Lacey, T.H., and DeLoach, S.A. Automatic Verification of Multi-agent Conversations. In Proceedings of the Eleventh Annual Midwest Artificial Intelligence and Cognitive Science Conference, pp. 93-100, AAAI Press, Fayetteville, Arkansas, April 2000.
6. Letier, E. Reasoning about Agents in Goal-Oriented Requirements Engineering. PhD thesis, Université Catholique de Louvain, Dépt. Ingénierie Informatique, Louvain-la-Neuve, Belgium, May 2001.
7. MESSAGE: Methodology for Engineering Systems of Software Agents. Deliverable 1. Initial Methodology. July 2000. EURESCOM Project P907-GI.
8. Omicini, A., Denti, E. From Tuple Spaces to Tuple Centres. Science of Computer Programming 41(3). Elsevier Science B. V., November 2001.
9. Wagner, G. Agent-Oriented Analysis and Design of Organizational Information Systems. Proceedings of the 4th IEEE International Baltic Workshop on Databases and Information Systems, Vilnius, Lithuania, May 2000.
10. Wooldridge, M., Jennings, N.R., and Kinny, D. The Gaia Methodology for Agent-Oriented Analysis and Design. Journal of Autonomous Agents and Multi-Agent Systems. Volume 3(3), 2000.
11. Zambonelli, F., Jennings, N.R., and Wooldridge, M.J. Organisational Rules as an Abstraction for the Analysis and Design of Multi-Agent Systems. International Journal of Software Engineering and Knowledge Engineering. Volume 11, Number 3, June 2001. Pages 303-328.
12. Zambonelli, F., Jennings, N.R., Omicini, A., and Wooldridge, M.J. Agent-Oriented Software Engineering for Internet Applications. Coordination of Internet Agents: Models, Technologies, and Applications, Chapter 13. Springer-Verlag, March 2001.
AERO: An Outsourced Approach to Exception Handling in Multi-agent Systems

David Chen and Robin Cohen
Department of Computer Science, University of Waterloo
Waterloo, Ontario, N2L 3G1, Canada
{dhchen, rcohen}@uwaterloo.ca
Abstract. In this paper, we propose an outsourced approach to exception handling in agent systems. Our AERO (Agents for Exception Recovery Outsourcing) system includes two types of mobile agents - consultants and specialists. When an agent system calls in AERO to help it with its exception handling, a consultant migrates to the host system and in turn calls in a number of specialists, who also migrate to the host system to handle the required types of exceptions. The AERO system can service multiple host systems at the same time. We sketch how the AERO system works to effectively handle the exceptions in an agent system and discuss the value of using an outsourced system with specialists dedicated to addressing each exception type. This work is important because it addresses the problem of how a collection of exception handling agents can work in collaboration to monitor problem-solving agents in agent systems and provides insight into how agent systems can be reliable and continue to operate during exceptional circumstances.
1 Introduction
An agent is a software program that acts on behalf of a user, performing tasks autonomously [1]. One interesting research challenge is to design a system of agents (referred to here as an agent system) to work together cooperatively on completing a particular task. Most agent systems to date have been designed to be closed, well-controlled environments running carefully designed agents in a static environment. However, as agent research and development moves toward open, heterogeneous programming and environments, exceptions are likely to occur more often, requiring complex exception handling. Klein and Dellarocas [2] state:

A critical challenge to creating effective agent-based systems is to allow agents to function effectively even when... the operating environment is complex, dynamic, and error-prone ... In such domains, we can expect to utilize a highly diverse set of agents... New tasks, agents and other resources can be expected to appear and disappear in unpredictable ways. Communication channels can fail or be compromised, agents can ‘die’ (break down) or make mistakes, inadequate resource allocations, unanticipated agent inter-dependencies can lead to systemic problems like multi-agent conflicts, ‘circular-wait’ deadlocks, and so on. All of these departures from ‘ideal’ collaborative behavior can be called exceptions. The result of inadequate exception handling is the potential for systemic problems such as clogged networks, wasted resources, poor performance, system shutdowns, and security vulnerabilities.

R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 16–30, 2002. © Springer-Verlag Berlin Heidelberg 2002

As indicated above, there is a real need to address exceptions in agent systems. In this paper we propose that exception handling can be viewed as an outsourcing service provided by a collection of collaborative mobile agents. We see this approach as addressing some of the limitations of current “agent-centric” approaches to exception handling in agent systems, as well as the new “system-centric” approach currently researched by Klein and Dellarocas [2].

We believe that there are some inherent benefits of an outsourced approach to exception handling. With an outsourced approach, an agent system can initiate and terminate exception handling whenever needed. This is particularly useful when an agent system only wants to address certain types of exceptions at any point in time. An outsourced exception handling service also has the ability to employ specialists that are dedicated to handling certain types of exceptions. In designing an outsourced exception handling service for agent systems, a number of decisions have to be made.
These decisions include: (i) deciding how to organize the exception handling service into particular agents with particular roles; (ii) deciding when and how some of these agents will migrate to an agent system in order to provide exception handling; (iii) determining what form of messaging will be used for communication between these agents; (iv) deciding what form of interface will exist between the agent system and the exception handling service; and (v) determining how to keep track of and manage all of the service’s exception handling agents. Our efforts to make all these decisions have resulted in the AERO (Agents for Exception Recovery Outsourcing) model. In this paper, we focus on discussing the first two design decisions.

We have also implemented the AERO model, both to show the feasibility of this particular design of outsourced exception handling and to provide the basis for experimentation to determine the value of an outsourced approach to exception handling, in comparison with other approaches. The implementation is done in the Java environment using the Aglets SDK (Software Development Kit) [3], a tool which is particularly well-suited to designing systems with mobile agents.

In order to see AERO in action, we have introduced a simulation of a system of problem solving agents - the agent system which enlists the services of AERO. This environment is called a Workplace and is modelled as a set of agents working together in a hierarchical task structure, coordinated by a central unit. Within the Workplace it is possible to vary certain factors such as the number of workers and the time taken per task in order to conduct a series of experiments. The output of these experiments is a set of statistics including time for completion and cost of work, used to measure the effectiveness of the exception handling with and without AERO.

We will briefly present the results of our implementation of AERO, for one class of exceptions referred to as “timeout dysfunction.” This exception arises when one agent incorrectly assumes that another is dead and thereby delays its overall time of completion, leading to a possible chain of misinterpretations by other agents in the system. More details on this exception type are presented in Section 3.
2 The AERO Model
AERO is a system housing exception-handling agents that can travel to a host system and provide service to the system. An AERO system is customized, particularly by running management agents, to allow other agent systems to make service requests and to allow the entire AERO system to act as one coherent entity. To coordinate activities within AERO and to provide services to external agent systems, we have introduced the following basic roles in the AERO system: Manager, Consultant, Specialist, and Liaison.

Every AERO system has a manager that accepts incoming service requests, dispatches consultants and specialists, corresponds with on-site consultants for information on new services, and receives the consultants and specialists when they return from a dispatch, transferring any new knowledge they may have acquired while providing service.

A liaison is an interfacing agent that allows a host agent system operator to request and configure services from an AERO system without knowing the details such as the address of the AERO system, the communication mechanisms, the actual agents providing the desired services and so on. A liaison is the only AERO agent that resides on the host system and is started by the host system.

A consultant is both the representative of the AERO system and the advisor that works closely with a host system to derive opinions about the host system, carry out requests of the host system, and act as a contact point for both the host system and the AERO system to reach each other. When the liaison first contacts the manager on AERO, the manager dispatches the consultant to the host system to engage in a detailed discussion with the liaison to determine the desired services.

A specialist is a worker agent that actually provides exception handling, such as detection and recovery. A specialist is first created on the AERO system and is registered with the manager, and therefore the manager is aware of this specialist. While a consultant is abroad, it contacts the manager to learn more about the available specialists and the services they provide. The consultant wishing to solicit services from specialists would first contact the manager and request that the manager dispatch the specialists that can provide the desired services¹. The manager dispatches the required specialists to the host system. The manager also prepares the specialists with the on-site consultant’s contact information so that once they arrive on-site the specialists will know how to reach the consultant. During the lifespan of a specialist while providing service in a host system, it is under the complete supervision of the consultant that is also on-site. For example, a consultant would decide whether a specialist should start or stop its service, terminate the service and return to AERO, or instruct a replacement specialist before it returns home.

The key focus in designing the AERO (Agents for Exception Recovery Outsourcing) system can be summarized as follows: to provide a systematic mechanism allowing AERO agents to migrate to remote systems, support on-site exception-handling service and allow a real-time service upgrade. Since this is an outsourcing approach, it is crucial to demonstrate that a service provider system such as AERO (located remotely) can dispatch service agents to a host system and provide service on site. It is also important to see that AERO is able to ‘upgrade’ the services it provides to the host system when new services become available. The AERO infrastructure consists of the agents and their roles in the system and their messaging mechanisms and protocols to collaborate and deliver service. Figure 1 presents a diagram of the specialists (S) migrating to the remote host system from AERO as a result of the consultant (C) soliciting services from these specialists.
Fig. 1. AERO agents working together
¹ One way of dispatching a specialist is to clone the original specialist agent and then send the clone to the remote system. Another way may be to explicitly request the specialist to create a functionally-equivalent agent.
2.1 Host to AERO Processing Algorithm
The steps below demonstrate a host system calling AERO and enlisting the exception handling service. The various AERO agents work together in the following way.

– Liaison informs Manager that it requires an exception handling service
– Manager dispatches Consultant to the host system
– Consultant greets Liaison and indicates its arrival
– Liaison informs Consultant of the list of required services
– Consultant requests from Manager a list of Specialists each capable of providing some services required
– Manager provides to Consultant the list of Specialists
– Consultant indicates to Manager the desired Specialists
– Manager dispatches the selected Specialists to the host system
– Specialists greet Consultant and indicate their arrival
– Consultant informs Liaison that the services can be provided
– Liaison gives Consultant the go-ahead
– Consultant gives the Specialists the go-ahead
The steps below demonstrate a host system ending the AERO service and the AERO agents leaving the host system to return to AERO.

– Liaison informs Consultant to stop providing exception handling service
– Consultant informs Specialists
– Specialists migrate back and inform Manager of their return
– Consultant migrates back and informs Manager of its return
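The enlistment steps listed above can be traced with a short simulation that records who tells whom what. This is a Python sketch for illustration, not the authors' Java and Aglets implementation: the function name, the registry layout, the rule for choosing specialists, and the error raised when a service cannot be covered are all assumptions, and "deadlock" is a hypothetical service name.

```python
def enlist(required, registry):
    """Trace the twelve enlistment steps.

    required: list of service names requested by the host system.
    registry: specialist name -> set of services it offers (assumed layout).
    Returns (log, selected), where log holds (sender, receiver, content).
    """
    log = []

    def step(sender, receiver, content):
        log.append((sender, receiver, content))

    step("Liaison", "Manager", "requires exception handling service")
    step("Manager", "Consultant", "dispatch to host system")
    step("Consultant", "Liaison", "arrived")
    step("Liaison", "Consultant", "required services: " + ", ".join(required))
    step("Consultant", "Manager", "request list of capable Specialists")
    capable = sorted(n for n, offered in registry.items()
                     if offered & set(required))
    step("Manager", "Consultant", "capable Specialists: " + ", ".join(capable))

    # Assumed selection rule: reuse an already chosen specialist if it covers
    # the service, otherwise take the first capable one.
    selected = []
    for service in required:
        providers = [n for n in capable if service in registry[n]]
        if not providers:
            raise ValueError(f"no Specialist offers {service!r}")  # assumption
        if not any(p in selected for p in providers):
            selected.append(providers[0])

    step("Consultant", "Manager", "desired Specialists: " + ", ".join(selected))
    step("Manager", "Specialists", "dispatch to host system")
    step("Specialists", "Consultant", "arrived")
    step("Consultant", "Liaison", "services can be provided")
    step("Liaison", "Consultant", "go-ahead")
    step("Consultant", "Specialists", "go-ahead")
    return log, selected


registry = {
    "timeout-specialist": {"timeout dysfunction"},
    "deadlock-specialist": {"deadlock"},
}
log, selected = enlist(["timeout dysfunction"], registry)
```

The shutdown sequence could be traced the same way with four further steps.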
There are some key features of our design of AERO and the roles of its agents. Specialists are dedicated to one or more exceptions, compared with the agent-centric approach, where each agent must manage all possible exception types. Since specialists are also able to perform in different hosts, they have the potential to learn and to become expert in their restricted range of exception handling. Including a manager agent allows AERO to service multiple host systems; the manager coordinates all consultants and specialists and knows of their whereabouts. The consultant is a constant AERO representative in the host system. It can therefore serve as a coordinator for dispatched specialists. The liaison is a conduit between the host system and AERO to request exception handling. It allows the host system to know the service being provided.

2.2 Messaging Protocol
AERO agents communicate with each other through a flexible messaging protocol. An agent wishing to communicate with another agent sends it an AERO message. An AERO message is, quite simply, a named list: a data structure to which any number of name-data pairs can be appended to accompany a message transmission.
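As a rough illustration of such a named list, the sketch below models a message as a name plus an open-ended collection of name-data pairs. The class name `NamedListMessage`, its methods, and the example message name and fields are all hypothetical; they are not the definition used in the AERO implementation.

```python
class NamedListMessage:
    """A message name plus any number of appended name-data pairs."""

    def __init__(self, name):
        self.name = name
        self.pairs = {}  # insertion-ordered mapping: pair name -> data

    def put(self, pair_name, data):
        """Append (or overwrite) one name-data pair; returns self for chaining."""
        self.pairs[pair_name] = data
        return self

    def get(self, pair_name, default=None):
        """Look up the data stored under pair_name."""
        return self.pairs.get(pair_name, default)


# Hypothetical message name and fields, used only for illustration.
msg = (NamedListMessage("example-request")
       .put("host", "workplace-1")
       .put("services", ["timeout-dysfunction"]))
```

Because the pairs are open-ended, a sender can attach extra data without changing the message format that receivers already understand.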
There are two general reply types for messages: ok and error. We then support general message types which apply to all AERO agents. These include: identification messages (get-type, get-name, get-version, get-info), dispatching messages (go, return, return-to-manager), and state messages (arrived-abroad, arrived-home, arrived-home-to-manager). There are also liaison-bound messages (greet), manager-bound messages (send-consultant, list-specialists, send-specialists, returned), consultant-bound messages (update-service, start-service, stop-service, end-service, new-specialist) and specialist-bound messages (start-service, stop-service, end-service). The message types are then used by the various AERO agents to communicate. For example, to start the service, the liaison would send a “start-service” message to the consultant, and the consultant would in turn send a “start-service” message to the specialists. We forgo a more detailed description of the messaging protocols of AERO here; more details can be found in [4].
2.3 Exception Handling by Specialists
Specialists are the AERO agents that migrate to a host system in order to perform exception handling. Below we discuss in more detail the information from a host system used by specialists in order to do exception handling. Note that our focus in this research is on providing an architecture to enable exception handling rather than developing new strategies for handling exceptions, so we will limit our discussion below to demonstrating how exception handling can be accommodated within the AERO framework. Exception handling consists of both detecting an exception and repairing that exception. How exception handling is done may vary depending on the type of exception being handled and depending on the architecture (interaction protocol) of the host system. The following questions need to be answered in order to specify the exception handling of a specialist: (i) what information is made available to the specialist (ii) where that information is stored (iii) how that information is used by a specialist to detect an exception (iv) how an exception is actually repaired. In the context of a host system of collaborative problem-solving agents, calls need to be made and acknowledged between collaborating agents. This calling information is useful to detect exceptions. It is also often standard for a host system to track the status of its agents, recording whether they are actively working on tasks or not. This information about the status of agents is also generally useful to detect exceptions. Who provides the information needed for exception handling depends on the architecture used for the host system. For instance, an architecture that includes a middle agent, such as a matchmaker [5], would typically have that agent tracking the status of all agents and of all calls within the system. In a more de-centralized architecture, the individual agents themselves might be responsible for recording this information in a centralized location. In fact, there
might be an incentive for individual agents to do this, in order to facilitate exception handling. There are many different solutions for where to store this information in order to allow specialists to perform exception handling, all of which could be supported within the AERO architecture. For example, a host system could provide a repository of information (see footnote 2). Another strategy would be for every specialist to maintain its own repository of information. Yet another option would be to have one specialist specializing in monitoring and providing a repository. How the information is used by a specialist to detect an exception is very much dependent on the type of exception that specialist is handling. This determines what information is drawn out of the repository to be analyzed. For example, a specialist trying to detect agent death could make use of the status information that certain agents are not actively working on tasks. In order to repair an exception, there are various options. First, the specialist that detects the exception can inform the agent(s) impacted by the exception, who are then responsible for carrying out the repair. For instance, if a specialist, knowing that a call is in place, detects that the callee has died, it can advise the caller (who may then find a replacement for the callee). A second option is for the specialist to carry out the repair itself, informing all impacted agents about what it has done. In the next section we outline our implementation for a specific type of exception, timeout dysfunction.
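The agent death example can be sketched in a few lines. This is a minimal illustration, assuming a repository that exposes the calls currently in place and a per-agent flag saying whether the agent is actively working; the function names and data layout are hypothetical. It uses the first repair option (advising the caller), and treats "not actively working" only as grounds for suspicion, since an idle callee is not necessarily dead.

```python
def suspected_dead_callees(calls, working):
    """Return (caller, callee) pairs whose callee is not actively working.

    calls   -- iterable of (caller, callee) pairs for calls currently in place
    working -- dict: agent name -> True if the agent is actively working
    """
    return [(caller, callee) for caller, callee in calls
            if not working.get(callee, False)]


def advise_callers(pairs):
    """First repair option: tell each caller about its suspect callee."""
    return [(caller, f"callee {callee} may have died; consider a replacement")
            for caller, callee in pairs]


calls = [("w1", "w2"), ("w1", "w3"), ("w2", "w4")]
working = {"w1": True, "w2": True, "w3": False, "w4": True}
suspects = suspected_dead_callees(calls, working)
notices = advise_callers(suspects)
```

Under the second repair option, the specialist would instead choose the replacement itself and then inform both the caller and the replacement.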
3 Implementation
We have implemented the AERO system using the IBM Aglets Software Development Kit [3]. We have also created a Workplace environment in which we run problem-solving agents and, at run time, call in our AERO exception handling agents to handle exceptions in that environment. The Workplace environment simulates a marketplace of workers using a simple matchmaking agent-interaction protocol.
Aglets are software agents that are implemented in Java, are instantiated as Java objects, and are perceived to have their own threads of execution. Aglets are mobile: they can move from one host on the network to another. That is, an aglet that executes on one host can halt execution, dispatch to a remote host, and start executing again. When the aglet moves, it takes along with it its program code as well as the states of all the objects it has access to. A built-in security mechanism makes it safe to host untrusted aglets (see Aglets Specification Documentation [3] for more details). All aglet objects have a message queue object as well. Therefore, aglets are message-driven active objects. The Aglets Application Programming Interface (API) allows multiple messages to be handled concurrently, which is ideal for designing agents like AERO’s manager. AgletRuntime provides AgletContext APIs to allow an aglet to communicate with aglets on a remote host transparently, as if it were communicating with a local aglet, requiring only that you have obtained the proxy object (AgletProxy) for the aglet you wish to communicate with.
Footnote 2: There are different ways in which this repository can be provided. A blackboard-style repository, which is only viewed when visited, may not be reasonable in a very dynamic environment. In this case, it may be more useful for an agent requiring knowledge to subscribe to a service which informs the agent whenever new information arrives.
The Tahiti server hosts Aglet agents and is part of the Aglets SDK package. Tahiti has a graphical user interface that allows users to configure and manage the agents on the server. The AERO system and the Workplace system we have implemented each run on a separate Tahiti server, allowing each server to behave like an AERO or Workplace host system.
For our implementation, we have created a Java package named aero. We have then implemented the common messaging object AEROMsg to encapsulate our AERO message definition format. Next, we created a common aglet class called aero.AEROAgent that is the base class for all AERO agents. The core AERO agents that extend aero.AEROAgent are aero.Manager, aero.Liaison, aero.Consultant and aero.Specialist. We have also implemented specialists, which extend aero.Specialist, to handle specific types of exceptions. In this section, we discuss the specialist aero.S_TimeoutDysfunction, which is applied to the Workplace environment at run time to provide exception handling.
As discussed earlier, Liaison is the agent that an agent system can use to reach and interact with an AERO system, to request the appropriate exception handling service. We have implemented Liaison as a subclass of AEROAgent, packaged separately. Liaison’s interface window is divided into two lists and a row of buttons. The first list box shows the available specialists. The second list box shows the specialists called in to provide the services. A user would use the buttons to control the AERO service provided to the local agent system.
The buttons included are: Call (to dispatch a messenger agent to the AERO system, to establish communication with the Manager; this causes a Consultant agent to be dispatched, eventually inactivating the Call button and activating the other buttons); Start (to allow Liaison to instruct Consultant to begin handling exceptions; this causes Specialists to begin); Add (to allow Liaison to move a selected specialist from the available list (left) to the provided list (right)); Del (to allow Liaison to move a selected specialist from the provided list back to the available list); Update (works in conjunction with an Add or Del operation from the Liaison, to signal to the Consultant that a specialist is to be dispatched or returned); and End (to signal to Consultant that it and its specialists should return to the home system). Figure 2 displays a Tahiti server showing a snapshot of the agents running on the server. The Liaison has been called into service and has requested that two specialists be dispatched.
Fig. 2. Liaison Agent Window
To show AERO in action in a controlled environment, we have created the Workplace system to simulate a host system of problem-solving agents. The Workplace system consists of Worker agents that carry out tasks. Worker agents are also able to delegate subtasks to other Worker agents, resulting in a hierarchical problem-solving structure. A unit of work is represented using the WorkInfo class. Work is defined to be either time a worker is required to spend, partial work to be delegated, or both. For instance, a WorkInfo object may describe that a worker should perform an indicated task, delegate portions of the task to workers, collect results and return. We have created the Controller class to allow configuring of initial WorkInfo objects that instruct a chain of workers to perform work. The Controller class is a special worker that creates WorkInfo objects based on a work configuration panel, delegates the work created to one worker, and receives the result that the first worker collects from other workers. The configuration panel allows specification of work information such as whether the worker pool is dynamic or static, the size of the static worker pool, how much time is spent on tasks, how work is distributed, etc.
Figure 3 shows a screenshot of the Controller’s configuration and monitor panel. The buttons at the top of the screen allow a user to start and stop a simulation, close the controller interface and exit the WorkForce, restart the WorkForce, and reset the statistics. The rest of the controller panel is divided into configuration fields on the left-hand side and statistics monitor fields on the right. According to Figure 3, for example, the initial workload is configured so that the first worker will create a working tree of 3 levels (Depth) with each level having 3 workers (Breadth). Each worker in this simulation will need 5 clock ticks (Work) to complete its work. There will be 1 working tree per run (Concurrent). Each worker will wait at most 10 times (Max Retries) for its sub-workers. Each worker will wait at most 10 clock ticks (Retry Period) each time. Finally, each worker (except the initial worker) has a 15% chance of dying (Failure Ratio) while working and thus not returning to the parent worker.
In addition, the control panel indicates that initially there will be no workers in the workplace (Apply WorkForce) and that the WorkForce is a dynamic pool rather than a static pool (Toggle WorkForce). Finally, the control panel can be set to automatically restart the simulation after a run is completed (CallAgain). The right side of Figure 3 tracks statistics during a run. This includes tracking the current clock tick (Clock) and the total clock ticks elapsed during the last run (Last Clock). The panel also tracks the total cost (in clock ticks) of completing the work during the last run (Last Cost). Total cost is the total worker-ticks (analogous to man-hours) and is collected by having every worker sum its own cost and its sub-workers’ total costs and report the result to its parent worker. The panel also tracks the currently engaged workers (ActiveWorkers), dead workers (DeadWorkers), total workers in the workplace (TotalWorkers), and workers without assignments (FreeWorkers). Finally, the panel tracks current calls taking place (ActiveCalls), calls replaced during the current simulation run (ReplacedCalls) and completed calls (FinishedCalls).
Fig. 3. Controller Panel During Simulation
3.1 The Timeout Dysfunction Exception
Timeout Dysfunction is the name we give to a particular kind of exception which arises when agent death exceptions are being handled by individual problem-solving agents (or workers) through the agent-centric approach. With agent death, a worker, while working, ceases to operate. Under the agent-centric approach, a delegating agent typically concludes that a delegated agent is dead when it does not return within an expected time duration. Timeout dysfunction arises when a problem-solving agent determines that another agent is dead when, in reality, that agent’s return is merely delayed, possibly because the delayed agent is itself trying to handle the death of another agent. This is an example of the shortcomings of the agent-centric approach, where exceptions are handled without being able to acquire the system-wide knowledge needed to make more informed exception handling decisions. In this case, the delegating agents are not able to observe the delays that may have occurred somewhere down the delegation chain and that caused the delegated agents to fail to return in time.
Timeout dysfunction is important to address. A worker that presumes delegated workers are dead may delay its overall time of completion. In addition, an “unzippering effect” may arise in the calling chain: while a worker delays its time of completion, it may itself be presumed dead, causing further disruptions throughout the hierarchically structured problem-solving environment. Given this observation, it is interesting to see how AERO can help the worker agents achieve better exception handling results. A specialist can be assigned to observe the activities of all workers in the Workplace and help the workers determine deaths more accurately. Although our overall focus is to have AERO take the exception handling burden away from the workers, it is interesting to see how AERO can assist with, rather than take over, the exception handling in an agent system.
We have created the S_TimeoutDysfunction class, which implements the Timeout Dysfunction exception handling specialist. The S_TimeoutDysfunction class works in the simulated agent system environment called the Workplace System. Tasks (or workload) in the system are delegated through a matchmaker called the WorkForce. The WorkForce tracks the problem-solving agents (workers) involved in delegation relationships (calls) and makes available all worker and call statuses. The WorkForce broadcasts worker and call status changes to agents in the system that have subscribed to the status broadcast, in the form of an observer-observation relationship. The S_TimeoutDysfunction agent records the status of all calls in the system. The calls are organized into a call tree in which the S_TimeoutDysfunction agent can identify potential local timeout dysfunctions by observing call reassignments.
Strictly speaking, the S_TimeoutDysfunction agent does not repair exceptions, since the true exceptions are agent deaths and they are handled by the workers themselves. The specialist handles the dysfunction caused by the workers attempting to handle agent deaths. Therefore, the specialist helps the system to “avoid” potential dysfunctions by “anticipating” them. In a broader sense, however, the specialist can be viewed as designed to repair a dysfunction exception.
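The call tree idea can be illustrated with a small sketch. It assumes that each call is reported as a (caller, callee) pair and that a reassignment is reported as a caller swapping one callee for another; the class `CallTree` and its methods are hypothetical names, not the S_TimeoutDysfunction code. When a worker replaces a call, the sketch returns the agent that is waiting on that worker, which is the agent at risk of wrongly declaring it dead.

```python
class CallTree:
    """Tracks who is waiting on whom, as a callee -> caller mapping."""

    def __init__(self):
        self.caller_of = {}

    def record_call(self, caller, callee):
        """Register that `caller` has delegated work to `callee`."""
        self.caller_of[callee] = caller

    def record_replacement(self, caller, old_callee, new_callee):
        """Register that `caller` gave up on old_callee and re-delegated.

        Returns the agent waiting on `caller` (None if `caller` is the root).
        That agent may need to wait longer, because `caller` will be late.
        """
        self.caller_of.pop(old_callee, None)
        self.caller_of[new_callee] = caller
        return self.caller_of.get(caller)


tree = CallTree()
tree.record_call("A", "B")   # A delegates to B
tree.record_call("B", "C")   # B delegates to C
at_risk = tree.record_replacement("B", "C", "D")  # B replaces C with D
root_case = tree.record_replacement("A", "B", "E")  # A has no caller
```

Here `at_risk` is `"A"`: A is waiting on B, and B has just restarted part of its work, so A is the worker that should be told not to give up on B yet.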
4 Experimentation
We have implemented AERO assisting a Workplace environment of agents. In this paper, we focus on presenting an example where the Liaison agent requests that Timeout Dysfunction exception handling be addressed. As a first run, we show the Workplace operating without AERO’s assistance. As a second run, we have the Workplace call in AERO for assistance. Of the various statistics produced in a run, the clock counts can be considered the most significant figures when comparing runs. The clock measures the end-to-end execution time (in clock ticks) for a run. In other words, it is a measure of how long a simulation has taken to complete. The shorter the time to finish a run, the better the exception handling.
Cost is another way of comparing the different runs. The total cost is calculated by adding each worker’s workload (the Work input parameter, in clock ticks) and the time (in clock ticks) the worker has spent waiting for all its sub-workers to return from a task delegation. Each worker passes its comprehensive cost to its parent worker. A parent worker in turn adds the costs of all its sub-workers to its own before passing it up to its parent worker. As with the clock count, the smaller the cost for a run, the better the exception handling for that run. Depending on how cost is perceived, orphaned workers can be thought of as generating cost as well, especially in a non-growing work force where the number of workers is fixed and it is costly to have orphaned workers (and their sub-workers) waste their time on work that is no longer relevant. Currently, our cost statistic does not account for the cost of the orphaned workers. If it did, the difference in cost between a run with AERO and a run without AERO would be greater when exposed to the timeout dysfunction exception. The TimeoutDysfunction specialist improves the overall system performance by attempting to prevent live workers from being declared dead, which in turn prevents their sub-workers from becoming orphans. The specialist does that by monitoring the statuses of each worker’s sub-workers. If a sub-worker has replaced a call, it will need more time from its parent worker so that it can complete the assigned task, rather than being replaced itself. Based on this approach, we could expect the clock and cost statistics to improve when AERO is involved in the exception handling. Figure 4 shows that this is the case. We conducted several runs with and without AERO for handling Timeout Dysfunction exceptions. For future research, we are planning a more in-depth analysis of the results of these runs.
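The cost roll-up described above is a recursive sum over the working tree. The sketch below is a minimal illustration; the `Worker` record and its field names are hypothetical, and, like the statistic described in the text, it ignores orphaned workers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Worker:
    work: int                  # own workload, in clock ticks
    waited: int = 0            # ticks spent waiting for sub-workers
    subs: List["Worker"] = field(default_factory=list)


def total_cost(worker: Worker) -> int:
    """Own work plus own waiting time plus the total cost of every sub-worker."""
    return worker.work + worker.waited + sum(total_cost(s) for s in worker.subs)


# A parent that works 5 ticks, waits 10 ticks, and has two leaf sub-workers.
leaf_a = Worker(work=5)
leaf_b = Worker(work=5)
parent = Worker(work=5, waited=10, subs=[leaf_a, leaf_b])
cost = total_cost(parent)
```

For this small tree the parent reports 5 + 10 + 5 + 5 = 25 worker-ticks. Any extra waiting caused by a wrongly presumed death increases `waited` somewhere in the tree and therefore the total.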
Although there are many input combinations to configure the Workplace system, we have found that some parameters create more stable environments than others. For instance, a simulation with 10 worker agents, one depending on another, each with a failure rate of 50%, would have a difficult time running to completion (since half of the workers continue to crash and any one death in the chain would disrupt the total completed work). We have also found that having a shallow network (e.g. the first worker delegates work to five other workers, who in turn delegate to none) and a low failure rate (e.g. 1%) produces a fairly short execution time for the experiment.
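A back-of-the-envelope calculation shows why the first configuration is so unstable. The sketch below assumes that workers fail independently and looks only at a single attempt, ignoring retries and replacements; the function name is hypothetical and the numbers are illustrative, not measured results.

```python
def p_no_death(n_workers: int, failure_ratio: float) -> float:
    """Probability that none of n independent workers dies in one attempt."""
    return (1.0 - failure_ratio) ** n_workers


deep_chain = p_no_death(10, 0.50)   # ten workers in a chain, 50% failure each
shallow = p_no_death(5, 0.01)       # five delegated workers, 1% failure each
```

Under these assumptions the deep chain gets through an attempt without any death about once in a thousand tries (0.5 to the power 10), whereas the shallow, low-failure network does so roughly 95% of the time.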
5 Discussion
Input parameters: Depth 3, Breadth 3, Work 5, Concurrent 1, Max Retries 10, Retry Period 10, Failure Ratio 0.15

Statistic         Without AERO   With AERO
Clock                     3315         707
Last Clock                3315         707
Last Cost                 3570        1150
Average Cost              3570        1150
Maximum Cost              3570        1150
Active Workers             103          15
Dead Workers                98          37
Total Workers              215          74
Free Workers                14          22
Active Calls               262          63
Replaced Calls             103          41
Finished Calls             409         212

Fig. 4. Handling Timeout Dysfunction with and without AERO

There are three main contributions which arise from the development of the AERO outsourced exception handling framework for agent systems. First of all, there is benefit simply in specifying how an outsourced exception handling service would work, deciding on the various roles for agents and the necessary communication and coordination between the various agents. The model presented in this paper argues for the benefit of various key agents in the overall architecture - the Liaison, the Manager, the Consultant and the Specialists. The
roles of these agents are specified, together with the processing algorithm for enlisting AERO’s services. This model therefore demonstrates how it is feasible to construct an outsourced exception handling service. A second main contribution is the implementation of the AERO model. We have gone beyond the specification of AERO to a working program which implements AERO’s exception handling service, using the Aglets software. In addition, we have developed an implementation of a system of problem solving agents called the Workplace, as a test bed for AERO exception handling. This implementation therefore provides for experimentation with the AERO model, showing how it works in conjunction with different configurations of workers and with different variations on the extent of failure within the society. We have presented some of our own results of runs of the AERO system in the Workplace environment. Further experimentation and analysis can also be done in the future, because of the existing implementation. It can be argued that the demonstration of AERO shows a winning edge over the agent-centric (survivalist) approach as well as an added bonus to the system-centric (citizen) approach. There are general problems associated with an agent-centric approach to exception handling in agent systems. It is difficult to require every agent to look out for every type of exception and it is especially difficult to detect exceptions that arise as a result of failures within a separate subgroup of agents in the system. In comparing the outsourced approach to exception handling of AERO to the system-centric approach of [6,7] it would appear that most of the benefits are non-functional ones. Since consultants and specialists become part of the host system, their exception handling is done in-house, just as in the systemcentric approach. It would seem that many of the benefits that the AERO agents offer are the same ones that can be offered with the system-centric approach. 
Having these agents as part of an outsourced service offers a way of isolating and improving various services that the agent systems may require. One example is providing Runtime Service Update - upgrading the quality of the exception handling service on the fly. In a system-centric approach, in order to perform upgrades, it would be necessary to stop the in-house service or the agent system
altogether, replace the exception handling code manually, and restart the service or the agent system. On the other hand, in an outsourced service, such as AERO, run-time service upgrade becomes a natural part of the overall service which is provided. Because exception handling agents migrate into the host system in a seamless fashion, the same mechanism can be used to allow agents with improved exception handling abilities to be brought into the host system as replacements. An outsourced approach to exception handling offers an additional degree of separation from the agent system and an external view of its operations (including any inter-system relationships.) Since specialists in AERO are focused on exception types, they are able to learn how to handle their assigned exception types well, making use of any experience gained by servicing multiple host systems. In addition, specialists are able to discover any chain effects of exceptions across agents, because they are not purely focused on one agent. The way specialists are designed in AERO, it is possible for them to communicate with problem solving agents directly, in cases where time is of the essence (rather than waiting indefinitely), making the exception handling quite explicit. In other cases, the specialists can simply be working in the background. AERO is designed to accommodate both scenarios. This provides for a very flexible exception handling environment. Since specialists are able to service different host systems, it is possible for them to learn about how to perform exception handling and to improve their exception handling methods, over time. For example, specialists may learn strategies for how to replace worker agents that are dead (i.e. which workers to use as replacements). These strategies could then be used when the same specialists are working in other host systems with a similar configuration of workers or for host systems that are working on similar kinds of tasks. 
Specialists may also learn over time how long to wait before assuming that an agent death has occurred in a system. It is also possible for specialists to return to the same host system in order to do exception handling. In this case, the specialists may have learned about the reputation of individual agents within the system and this may result in more effective decisions about which agents to propose as replacements in the case of agent death. For future work, it would be worthwhile to explore in more detail some of the potential for specialists to learn as they visit new host systems. AERO is designed to be able to work with a variety of host systems and to service multiple systems simultaneously. In allowing AERO to work with multiple host systems, it is possible for AERO to establish a reputation as a dependable exception handling service and so continue to be requested by more host systems. When two or more systems enlist AERO’s service, we can see AERO playing a role in detecting and handling exceptions involving agents from different agent systems. There may also be a need for specialists in each of the agent systems to work together. In addition, there may be some advantage to having an outsourced exception handling service, such as AERO, assist with evaluating the trustworthiness of
each agent in a system or to act as a third party observer. Another possible role for AERO would be to act as a shelter for threatened agents, providing a “safe” location for escape until malicious agents have been located and repaired.
6 Conclusions
This paper has presented AERO, a model for outsourced exception handling in agent systems. AERO is described in detail, including a discussion of the AERO agents and their roles, the communication and coordination that exists between AERO agents and between AERO and the host system which enlists its services and how the process of exception handling is managed within the AERO architecture. AERO is a service which can be called in by multiple host systems. The design of AERO is therefore intentionally general enough to allow it to be of use in a variety of environments. In addition, we have examined some of the ways in which AERO can be of particular assistance, given that it may be servicing multiple systems simultaneously. Since exceptions are in fact a real possibility in many agent systems, there is a need to develop methods for addressing these exceptions, in order to maintain the efficiency of these systems and to enable these systems to be better trusted by their users. The development of AERO can therefore be seen as an important part of the overall effort to design reliable agent systems.
References
1. Wooldridge, M., Jennings, N.: Intelligent agents: theory and practice. Knowledge Engineering Review 10(2) (1995) 115–152
2. Klein, M., Dellarocas, C.: Exception Handling in Agent Systems. In: Proceedings of the Third International Conference on Autonomous Agents, Seattle, WA (1999)
3. IBM: Aglets. Web page: http://www.trl.ibm.com/aglets/ (1998)
4. Chen, D. H.: AERO: An Outsourcing Approach to Exception Handling in Multi-Agent Systems. M. Math Thesis, Department of Computer Science, University of Waterloo (2001)
5. Sycara, K., Decker, K., Pannu, A., Williamson, M.: Behaviors for information agents. In: Proceedings of Agents 1997 conference. (1997)
6. Klein, M.: An Exception Handling Approach to Enhancing Consistency, Completeness and Correctness in Collaborative Requirements Capture. Concurrent Engineering Research and Applications (1997)
7. Klein, M., Dellarocas, C.: Domain-Independent Exception Handling Services That Increase Robustness in Open Multi-Agent Systems. Working Paper ASES-WP2000-02, Center for Coordination Science, MIT (2000) http://ccs.mit.edu/ases.
A Learning Algorithm for Buying and Selling Agents in Electronic Marketplaces
Thomas Tran and Robin Cohen
Department of Computer Science, University of Waterloo
Waterloo, ON, N2L 3G1, Canada
{tt5tran, rcohen}@math.uwaterloo.ca
Abstract. In this paper, we propose a reputation oriented reinforcement learning algorithm for buying and selling agents in electronic market environments. We take into account the fact that multiple selling agents may offer the same good with different qualities. In our approach, buying agents learn to avoid the risk of purchasing low quality goods and to maximize their expected value of goods by dynamically maintaining sets of reputable sellers. Selling agents learn to maximize their expected profits by adjusting product prices and by optionally altering the quality of their goods. As detailed in the paper, we believe that our proposed strategy leads to improved performance for buyers and sellers, reduced communication load, and robust systems.
R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 31–43, 2002. © Springer-Verlag Berlin Heidelberg 2002
1 Introduction
The problem of how to design personal, intelligent agents for e-commerce applications is a subject of increasing interest from both the academic and industrial research communities [1,3,15]. Since a multi-agent electronic market environment is, by its very nature, open (agents can enter or leave the environment at will), dynamic (information such as prices, product quality etc. may be altered), and unpredictable (agents lack perfect knowledge of one another), it is very important that participant agents are equipped with effective and feasible learning algorithms to accomplish their delegated tasks or achieve their delegated goals. In this paper, we propose a reinforcement learning and reputation based algorithm for buying and selling agents in electronic market environments. We model the agent environment as an open marketplace which is populated with economic agents. The nature of an open marketplace allows economic agents, which we classify as buyers and sellers, to freely enter or leave the market. Buyers and sellers are self-interested agents whose goal is to maximize their own benefit. Buying and selling prices are determined by individual buyers and sellers respectively, based on their aggregate past experiences. Our market environment is rooted in an information delivery infrastructure such as the Internet, which provides agents with virtually direct and free access to all other agents. The process of buying and selling products is realized via a contract-net like mechanism [2,11], which consists of three elementary phases: (i) A buyer announces its desire for a good. (ii) Sellers submit bids for delivering
such goods. (iii) The buyer evaluates the submitted bids and selects a suitable seller. The buyer then pays the chosen seller and receives the good from that seller. Thus, the buying and selling process can be viewed as an auction where a seller is said to be “winning the auction” if it is able to sell its good to the buyer. We assume that the quality of a good offered by different sellers may not be the same, and a seller may alter the quality of its goods. We also assume that a buyer can examine the quality of the good it purchases only after it receives that good from the selected seller. Each buyer has some way to evaluate the good it purchases, based on the price and the quality of the good received. Thus, in our market environment a buyer tries to find those sellers whose goods best meet its expected value of goods, while a seller tries to maximize its expected profit by setting suitable prices for and providing more customized value to its goods, in order to satisfy the buyers’ needs. Reinforcement learning has been studied for various multi-agent problems [5,6,7,9,10,14]. However, the agents and environments in these works are not directly modeled as economic agents and market environments. [12] does apply reinforcement learning in market environments for buying and selling agents, but does not use reputation as a means to protect buyers from purchasing low quality goods. Moreover, selling agents in [12] do not consider altering the quality of their products while learning to maximize their profits. In our proposed learning algorithm, buyers are designed to be reputation-oriented to avoid the risk of purchasing unsatisfactory quality goods. They each dynamically maintain a set of sellers with good reputation, and learn to maximize their expected product values by selecting appropriate sellers among those reputable sellers.
Sellers in our approach learn to maximize their expected profits by not only adjusting product prices but also by optionally altering the quality of their products. As discussed in detail later, we believe that the proposed algorithm will result in improved performance for buyers, better satisfaction for both buyers and sellers, reduced communication load, and more robust systems. The paper is organized as follows: The next section, section 2, introduces our proposed learning algorithm for buyers and sellers, respectively. Section 3 discusses the possible advantages of the proposed algorithm. Section 4 remarks on related work. Section 5 provides some future research directions. Finally, section 6 concludes the paper.
2 The Proposed Learning Algorithm
In this section we propose a reinforcement learning and reputation based algorithm for buyers and sellers, respectively. The algorithm is aimed at maximizing the expected values of goods for buyers, and maximizing the expected profits for sellers. Note that it is quite possible for both a seller s and a buyer b to be “winning” in a business transaction. This happens when seller s could choose a price p to sell good g to buyer b that maximized its expected profit, and buyer b decided that purchasing good g at price p from seller s would maximize its expected value of goods. We also provide a numerical example to illustrate how the algorithm works.

2.1 Buying Algorithm
Consider the scenario where a buyer $b$ makes an announcement of its desire for some good $g$. Let $G$ be the set of goods, $P$ be the set of prices, and $S$ be the set of all sellers in the marketplace. $G$, $P$, and $S$ are finite sets. Let $S_r^b$ be the set of sellers with good reputation to buyer $b$; that is, $S_r^b$ contains the sellers that have served $b$ well in the past and are therefore trusted by $b$. Hence, $S_r^b \subseteq S$ and $S_r^b$ is initially empty. To measure the reputation of a seller $s \in S$, buyer $b$ uses a real-valued function $r^b : S \to (-1, 1)$, which is called the reputation function of $b$. Initially, buyer $b$ sets $r^b(s) = 0$ for all $s \in S$. Thus, the set $S_r^b$ consists of all sellers $s$ with $r^b(s) \ge \Theta > 0$, where $\Theta$ is the reputation threshold determined by $b$, i.e., $S_r^b = \{s \in S \mid r^b(s) \ge \Theta > 0\} \subseteq S$.

Buyer $b$ estimates the expected value of the goods it purchases using the expected value function $f^b : G \times P \times S \to \mathbb{R}$. Hence, the real number $f^b(g, p, s)$ represents buyer $b$’s expected value of buying good $g$ at price $p$ from seller $s$. Since a seller may alter the quality of its goods, buyer $b$ puts more trust in the sellers with good reputation. Thus, it chooses among the reputable sellers in $S_r^b$ a seller $\hat{s}$ that offers good $g$ at price $p$ with maximum expected value:

$$\hat{s} = \arg\max_{s \in S_r^b} f^b(g, p, s), \tag{1}$$

where arg is an operator such that $\arg f^b(g, p, s)$ returns $s$. If no sellers in $S_r^b$ submit bids for delivering $g$ (or if $S_r^b = \emptyset$), then buyer $b$ will have to choose a seller $\hat{s}$ from the set of non-reputable sellers:

$$\hat{s} = \arg\max_{s \in (S - S_r^b)} f^b(g, p, s). \tag{2}$$
In addition, with a small probability $\rho$, buyer $b$ chooses to explore (rather than exploit) the marketplace by randomly selecting a seller $\hat{s}$ from the set of all sellers. This gives buyer $b$ an opportunity to discover new reputable sellers. Initially, the value of $\rho$ should be set to 1, then decreased over time to some fixed minimum value determined by the buyer.

After paying seller $\hat{s}$ and receiving good $g$, buyer $b$ can examine the quality $q \in Q$ of good $g$, where $Q$ is a finite set of real values representing product qualities. It then calculates the true value of good $g$ using the function $v^b : P \times Q \to \mathbb{R}$. For instance, if $p = q$ and buyer $b$ prefers the sellers that offer goods with higher quality, it may set $v^b(p, q) = cq - p$, where $c$ is a constant greater than 1.
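The selection rule of equations (1) and (2), together with the exploration step, can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function name `choose_seller`, the dictionary-based representation of bids and of the expected value function, the default value of 0.0 for unseen triples, and the restriction of exploration to sellers that actually submitted a bid are all assumptions made for the example.

```python
import random


def choose_seller(good, bids, f, reputable, rho, rng=random):
    """Pick a seller for `good` in the spirit of equations (1) and (2).

    bids      -- dict mapping each bidding seller to its offered price
    f         -- dict mapping (good, price, seller) to the buyer's expected value
    reputable -- set of sellers currently considered reputable by the buyer
    rho       -- exploration probability in [0, 1]
    Unseen (good, price, seller) triples default to 0.0 (an assumption).
    """
    if not bids:
        return None
    # Exploration: with probability rho pick a bidding seller at random
    # (assumption: only sellers that submitted a bid can be selected).
    if rng.random() < rho:
        return rng.choice(sorted(bids))

    def value(seller):
        return f.get((good, bids[seller], seller), 0.0)

    # Equation (1): prefer reputable sellers that submitted a bid.
    candidates = [s for s in bids if s in reputable]
    # Equation (2): otherwise fall back to the non-reputable bidders.
    if not candidates:
        candidates = list(bids)
    return max(candidates, key=value)
```

With `rho` set to 0 the buyer purely exploits its current estimates; with `rho` set to 1 every choice is random, which matches the suggested starting point before `rho` is decreased.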
The expected value function $f^b$ is now incrementally learned in a reinforcement learning framework:

$$\Delta = v^b(p, q) - f^b(g, p, \hat{s}), \tag{3}$$

$$f^b(g, p, \hat{s}) \leftarrow f^b(g, p, \hat{s}) + \alpha\Delta, \tag{4}$$

where $\alpha$ is called the learning rate ($0 \le \alpha \le 1$). The learning rate should be initially set to a starting value of 1 and, similar to $\rho$, be reduced over time to a fixed minimum value chosen depending on individual buyers. Thus, if $v^b(p, q) \ge f^b(g, p, \hat{s})$ (i.e., if $\Delta \ge 0$) then $f^b(g, p, \hat{s})$ is updated with the same or a greater value than before. This means that seller $\hat{s}$ has a chance to be chosen by buyer $b$ again if it continues offering good $g$ at price $p$ in the next auction. Conversely, if $\Delta < 0$ then $f^b(g, p, \hat{s})$ is updated with a smaller value than before. So, seller $\hat{s}$ may not be selected by buyer $b$ in the next auction if it continues selling good $g$ at price $p$.

In addition to updating the expected value function, the reputation rating $r^b(\hat{s})$ of seller $\hat{s}$ also needs to be updated. We use a reputation updating scheme motivated by that proposed in [16], as follows: If $\Delta \ge 0$, that is, if seller $\hat{s}$ offers good $g$ with value greater than or equal to the value expected by buyer $b$, then its reputation rating $r^b(\hat{s})$ is increased by

$$r^b(\hat{s}) \leftarrow \begin{cases} r^b(\hat{s}) + \mu(1 - r^b(\hat{s})) & \text{if } r^b(\hat{s}) \ge 0, \\ r^b(\hat{s}) + \mu(1 + r^b(\hat{s})) & \text{if } r^b(\hat{s}) < 0, \end{cases} \tag{5}$$

where $\mu$ is a positive constant called the cooperation factor¹ ($0 < \mu < 1$). Otherwise, if $\Delta < 0$, that is, if seller $\hat{s}$ sells good $g$ with value less than that expected by buyer $b$, then its reputation rating $r^b(\hat{s})$ is decreased by

$$r^b(\hat{s}) \leftarrow \begin{cases} r^b(\hat{s}) + \nu(1 - r^b(\hat{s})) & \text{if } r^b(\hat{s}) \ge 0, \\ r^b(\hat{s}) + \nu(1 + r^b(\hat{s})) & \text{if } r^b(\hat{s}) < 0, \end{cases} \tag{6}$$

where $\nu$ is a negative constant called the non-cooperation factor ($-1 < \nu < 0$). To protect itself from dishonest sellers, buyer $b$ may require that $|\nu| > |\mu|$. This implements the traditional idea that reputation should be difficult to build up, but easy to tear down.
The set of reputable sellers to buyer $b$ now needs to be updated based on the new reputation rating $r^b(\hat{s})$, as in one of the following two cases:

– If $(\hat{s} \in S_r^b)$ and $(r^b(\hat{s}) < \Theta)$ then buyer $b$ no longer considers $\hat{s}$ as a reputable seller, i.e.,

$$S_r^b \leftarrow S_r^b - \{\hat{s}\}. \tag{7}$$

– If $(\hat{s} \notin S_r^b)$ and $(r^b(\hat{s}) \ge \Theta)$ then buyer $b$ now considers $\hat{s}$ as a seller with good reputation, i.e.,

$$S_r^b \leftarrow S_r^b \cup \{\hat{s}\}. \tag{8}$$

Let us now look at the sellers’ algorithm.

¹ Buyer $b$ will consider seller $\hat{s}$ as being cooperative if the good $\hat{s}$ sells to $b$ has value greater than or equal to that expected by $b$.
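The buyer-side updates of equations (3) to (8) can be collected into one short Python sketch. The function name `update_buyer` and the dictionary and set containers are hypothetical choices made for illustration; the default parameter values are the ones used in the numerical example of Section 2.3, and the default of 0.0 for an unseen expected value is an assumption.

```python
def update_buyer(f, r, reputable, good, price, seller, true_value,
                 alpha=0.8, mu=0.2, nu=-0.4, theta=0.4):
    """Apply equations (3)-(8) in place after one purchase.

    f          -- dict: (good, price, seller) -> expected value
    r          -- dict: seller -> reputation rating in (-1, 1)
    reputable  -- set of sellers whose rating is at least theta
    true_value -- v^b(p, q), computed after inspecting the delivered good
    Returns delta, the difference between true and expected value.
    """
    key = (good, price, seller)
    old = f.get(key, 0.0)                 # assumption: unseen triples start at 0
    delta = true_value - old              # equation (3)
    f[key] = old + alpha * delta          # equation (4)

    rating = r.get(seller, 0.0)           # ratings start at 0, as in the text
    factor = mu if delta >= 0 else nu     # equation (5) if cooperative, else (6)
    if rating >= 0:
        rating += factor * (1 - rating)
    else:
        rating += factor * (1 + rating)
    r[seller] = rating

    if rating >= theta:                   # equation (8)
        reputable.add(seller)
    else:                                 # equation (7)
        reputable.discard(seller)
    return delta
```

Starting from a rating of 0, one satisfying purchase followed by one disappointing purchase leaves the seller with a negative rating under these defaults, which illustrates the effect of choosing $|\nu| > |\mu|$.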
2.2 Selling Algorithm
Consider the scenario where a seller $s \in S$ has to decide on the price to sell some good $g$ to a buyer $b$. Let $B$ be the (finite) set of buyers in the marketplace and let function $h^s : G \times P \times B \to \mathbb{R}$ estimate the expected profit for seller $s$. Thus, the real number $h^s(g, p, b)$ represents the expected profit for seller $s$ if it sells good $g$ at price $p$ to buyer $b$. Let $c^s(g, b)$ be the cost of seller $s$ to produce good $g$ for buyer $b$. Note that seller $s$ may produce various versions of good $g$, which are tailored to meet the needs of different buyers. Seller $s$ will choose a price $\hat{p}$ greater than or equal to cost $c^s(g, b)$ to sell good $g$ to buyer $b$ such that its expected profit is maximized:

$$\hat{p} = \arg\max_{p \in P,\; p \ge c^s(g, b)} h^s(g, p, b), \tag{9}$$
where in this case arg is an operator such that $\arg h^s(g, p, b)$ returns $p$. The expected profit function $h^s$ is learned incrementally using reinforcement learning:

$$h^s(g, p, b) \leftarrow h^s(g, p, b) + \alpha(\mathit{Profit}^s(g, p, b) - h^s(g, p, b)), \tag{10}$$

where $\mathit{Profit}^s(g, p, b)$ is the actual profit of seller $s$ if it sells good $g$ at price $p$ to buyer $b$. Function $\mathit{Profit}^s(g, p, b)$ is defined as follows:

$$\mathit{Profit}^s(g, p, b) = \begin{cases} p - c^s(g, b) & \text{if seller } s \text{ wins the auction,} \\ 0 & \text{otherwise.} \end{cases} \tag{11}$$

Thus, if seller $s$ does not win the auction then $(\mathit{Profit}^s(g, p, b) - h^s(g, p, b))$ is negative (provided $h^s(g, p, b) > 0$), and by (10), $h^s(g, p, b)$ is updated with a smaller value than before. This reduces the chance that price $\hat{p}$ will be chosen again to sell good $g$ to buyer $b$ in future auctions. Conversely, if seller $s$ wins the auction then price $\hat{p}$ will probably be re-selected in future auctions.

If seller $s$ once succeeded in selling good $g$ to buyer $b$, but subsequently fails for a number of auctions, say for $m$ consecutive auctions (where $m$ is a constant specific to seller $s$), then this may be not only because $s$ has set too high a price for good $g$, but probably also because the quality of $g$ does not meet buyer $b$’s expectation. Thus, in addition to lowering the price via equation (10), seller $s$ may optionally add more value (quality) to $g$ by increasing its production cost²:

$$c^s(g, b) \leftarrow (1 + \mathit{Inc})\,c^s(g, b), \tag{12}$$

where $\mathit{Inc}$ is a constant specific to seller $s$ called the quality increasing factor. In contrast, if seller $s$ is successful in selling good $g$ to buyer $b$ for $n$ consecutive auctions, it may optionally reduce the quality of good $g$, and thus try to further increase its future profit:

$$c^s(g, b) \leftarrow (1 - \mathit{Dec})\,c^s(g, b), \tag{13}$$

where $\mathit{Dec}$ is a constant specific to seller $s$ called the quality decreasing factor.

² This supports the common idea that high quality goods cost more to produce.
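The seller-side rules of equations (9) to (13) can be sketched in the same style. The function names, the dictionary representation of $h^s$, and the default of 0.0 for unseen entries are assumptions for illustration. The sketch also simplifies the text in one respect: it triggers the cost increase after `m` consecutive losses without checking that the seller had succeeded at least once before.

```python
def choose_price(prices, h, good, buyer, cost):
    """Equation (9): the price >= cost with the highest expected profit."""
    feasible = [p for p in prices if p >= cost]
    if not feasible:
        return None
    return max(feasible, key=lambda p: h.get((good, p, buyer), 0.0))


def update_profit(h, good, price, buyer, cost, won, alpha=0.8):
    """Equations (10) and (11): move h toward the realised profit."""
    profit = price - cost if won else 0.0          # equation (11)
    key = (good, price, buyer)
    old = h.get(key, 0.0)                          # assumption: default 0
    h[key] = old + alpha * (profit - old)          # equation (10)
    return h[key]


def adjust_cost(cost, losses_in_a_row, wins_in_a_row, m, n, inc=0.1, dec=0.1):
    """Equations (12) and (13): optionally change quality via production cost."""
    if losses_in_a_row >= m:
        return (1 + inc) * cost                    # equation (12)
    if wins_in_a_row >= n:
        return (1 - dec) * cost                    # equation (13)
    return cost
```

Both adjustments are described as optional in the text, so a seller policy could equally ignore `adjust_cost` and rely on price learning alone.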
2.3 An Example
This subsection provides a numerical example illustrating the proposed algorithm for buyers and sellers, respectively.

Buying Situation

Consider a simple buying situation where a buyer $b$ announces its need of some good $g$. Suppose that there are 6 sellers in the marketplace, i.e., $S = \{s_i \mid i = 1..6\}$, and that the set of sellers with good reputation to $b$ is $S_r^b = \{s_j \mid j = 1..3\} \subset S$. Furthermore, suppose $v^b(p, q) = 2.5q - p$, $\alpha = 0.8$, $\mu = 0.2$, $\nu = -0.4$, $\Theta = 0.4$, and the reputation ratings $r^b(s_i)$ are given as follows:

Table 1. Reputation ratings of different sellers to buyer b

s_i         s1    s2    s3    s4    s5    s6
r^b(s_i)    0.40  0.45  0.50  0.30  0.25  0.20

After $b$’s announcement of its desire for $g$, the sellers bid with the following prices to deliver $g$ to $b$:

Table 2. Prices offered by different sellers for good g

s_i         s1    s2    s3    s4    s5    s6
p           4     5     4.5   4     5     3.5

Assume that $b$’s expected values of buying $g$ at various prices from different sellers are

Table 3. Buyer b’s expected value of buying good g at various prices from different sellers

s_i             s1    s2    s3    s4    s5    s6
p               4     5     4.5   4     5     3.5
f^b(g, p, s_i)  6.15  7.25  6.65  5.50  5.75  5.20
Then, by equation (1), $b$ buys $g$ from $s_2$ at price $p = 5$ with

$$f^b(g, p, s_2) = 7.25 = \max_{s \in S_r^b} f^b(g, p, s).$$

Suppose $b$ examines the quality $q$ of good $g$ and finds that $q = 5$. It then calculates the true value of $g$:

$$v^b(p, q) = 2.5q - p = 2.5(5) - 5 = 7.50.$$

Buyer $b$ now updates its expected value function using equations (3) and (4):

$$\Delta = v^b(p, q) - f^b(g, p, s_2) = 7.50 - 7.25 = 0.25,$$

and

$$f^b(g, p, s_2) \leftarrow f^b(g, p, s_2) + \alpha\Delta \leftarrow 7.25 + (0.80)(0.25) = 7.45.$$

Finally, $b$ updates the reputation rating $r^b(s_2)$ using equation (5):

$$r^b(s_2) \leftarrow r^b(s_2) + \mu(1 - r^b(s_2)) \leftarrow 0.45 + (0.20)(1 - 0.45) = 0.56.$$

Thus, by providing good $g$ with high value, seller $s_2$ has improved its reputation to buyer $b$ and remained in the set $S_r^b$ of reputable sellers to $b$.

Selling Situation

Consider how a seller in the marketplace above, say seller $s_4$, behaves according to the proposed selling algorithm. Suppose $c^{s_4}(g, b) = 2.5$, $\alpha = 0.8$, and $\mathit{Inc} = \mathit{Dec} = 0.1$. Upon receiving buyer $b$’s announcement of its desire for good $g$, $s_4$ has to decide on the price to sell $g$ to $b$. Assume that $s_4$’s expected profits to sell $g$ to $b$ at various prices are

Table 4. Expected profits of seller s4 in selling good g to buyer b at different prices

p                2.5   2.75  3.0   3.25  3.5   3.75  4.0   4.25  4.5
h^{s4}(g, p, b)  0.00  0.25  0.50  0.75  1.00  1.25  1.50  0.00  0.00

Table 4 indicates that $s_4$ does not expect to be able to sell $g$ to $b$ at price $p \ge 4.25$. By equation (9), $s_4$ chooses price $\hat{p} = 4$ to sell $g$ to $b$:

$$\hat{p} = \arg\max_{p \in P,\; p \ge c^{s_4}(g, b)} h^{s_4}(g, p, b) = 4.$$
Since $b$ chooses to buy $g$ from another seller, namely $s_2$, the actual profit of $s_4$ is zero, i.e., $\mathit{Profit}^{s_4}(g, \hat{p}, b) = 0$. Hence, $s_4$ updates its expected profit using equation (10) as follows:

$$h^{s_4}(g, \hat{p}, b) \leftarrow h^{s_4}(g, \hat{p}, b) + \alpha(\mathit{Profit}^{s_4}(g, \hat{p}, b) - h^{s_4}(g, \hat{p}, b)) \leftarrow 1.50 + (0.80)(0 - 1.50) = 0.30.$$

Thus, according to equation (9), it is unlikely that price $\hat{p} = 4$ will be chosen again to sell good $g$ to buyer $b$ in future auctions. Assume that $s_4$ has failed to sell $g$ to $b$ for a number of auctions. It therefore decides to add more quality to $g$ by increasing the production cost using equation (12):

$$c^{s_4}(g, b) \leftarrow (1 + \mathit{Inc})\,c^{s_4}(g, b) \leftarrow (1 + 0.10)(2.5) = 2.75.$$

By doing so, seller $s_4$ hopes that good $g$ may now meet buyer $b$’s quality expectation and that it will be able to sell $g$ to $b$ in future auctions.
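The arithmetic of this example can be re-traced with a few lines of Python. The data are taken from Tables 1 to 4 and the parameter values stated above; the variable names and the dictionary layout are only illustrative choices.

```python
# Data from Tables 1-4 of the example.
prices = {"s1": 4.0, "s2": 5.0, "s3": 4.5, "s4": 4.0, "s5": 5.0, "s6": 3.5}
expected = {"s1": 6.15, "s2": 7.25, "s3": 6.65, "s4": 5.50, "s5": 5.75, "s6": 5.20}
rating = {"s1": 0.40, "s2": 0.45, "s3": 0.50, "s4": 0.30, "s5": 0.25, "s6": 0.20}
alpha, mu, theta, inc = 0.8, 0.2, 0.4, 0.1

# Buying situation.
reputable = {s for s, r in rating.items() if r >= theta}
chosen = max(reputable, key=expected.get)                # equation (1)
q = 5.0
true_value = 2.5 * q - prices[chosen]                    # v^b(p, q) = 2.5q - p
delta = true_value - expected[chosen]                    # equation (3)
new_expected = expected[chosen] + alpha * delta          # equation (4)
new_rating = rating[chosen] + mu * (1 - rating[chosen])  # equation (5), rating >= 0

# Selling situation for seller s4, which loses the auction to s2.
cost = 2.5
h = {2.5: 0.00, 2.75: 0.25, 3.0: 0.50, 3.25: 0.75, 3.5: 1.00,
     3.75: 1.25, 4.0: 1.50, 4.25: 0.00, 4.5: 0.00}
p_hat = max((p for p in h if p >= cost), key=h.get)      # equation (9)
new_h = h[p_hat] + alpha * (0.0 - h[p_hat])              # equation (10)
new_cost = (1 + inc) * cost                              # equation (12)

print(chosen, round(true_value, 2), round(delta, 2),
      round(new_expected, 2), round(new_rating, 2))
print(p_hat, round(new_h, 2), round(new_cost, 2))
```

Up to floating-point rounding, the computed quantities agree with the values derived in the text: seller s2 is chosen, the expected value becomes 7.45, the rating of s2 becomes 0.56, and for s4 the expected profit at price 4 drops to 0.30 while the production cost rises to 2.75.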
3 Discussion
Work in the area of software agents has been focusing on how agents should cooperate to provide valuable services to one another. However, answering the question of why agents should cooperate with one another at all is also of equal importance [4]. We believe that modeling an agent environment as a marketplace, where agents are motivated by economic incentives to provide goods and services to each other, is a practical and feasible approach. It is possible in the marketplace that some sellers may try to “cheat” by delivering some good with reasonable quality, followed by low quality goods. Buyers in our approach use reputation as a means to protect themselves from those dishonest sellers: They each dynamically maintain a set of reputable sellers and consider choosing suitable sellers from this set first. This strategy reduces a buyer’s risk of purchasing low quality goods, and therefore brings better satisfaction to the buyer. Since a buyer’s set of reputable sellers is certainly a lot smaller (in terms of cardinality) than the set of all sellers in the market, the proposed buying algorithm reduces computational cost, and accordingly results in improved performance for the buyer (compared to the case where the buyer has to consider all possible sellers). This is especially important in those application domains where the buyer is required to calculate a suitable seller within a constrained time frame. For instance, if the buyer serves some user as a personal assistant, then it must respond to the user within an allowable time period. We note that various buyers may use different reputation thresholds (thus, resulting in dissimilar sets of reputable sellers) as well as different learning rates. In our proposed buying algorithm, a buyer selects a seller based on its own past experience and doesn’t communicate with other buyers for its decision. We
believe that this type of learning has certain advantages: Buyers can act independently and autonomously without being affected by communication delays (due to other buyers being busy), the failure of some key buyer (whose buying policy influences other buyers), or unreliable information (the information received from other buyers may not be reliable). The resultant system, therefore, should be more robust [10].

The underlying mechanism that allows agents to do business with one another in our marketplace is actually a form of the contract-net protocol [2,11], where buyers announce their desire for goods to all sellers via multicast or possibly broadcast. This works well in small and moderate-sized environments; however, as the problem size (i.e., the number of communicating agents and the number of desired goods) increases, this may run into difficulties due to slow and expensive communication. The proposed buying algorithm provides a potential solution to this problem: A buyer may just send announcements of its desire for goods to its reputable sellers instead of all sellers, thus reducing the communication load and increasing the overall system performance.

Since the marketplace is open and sellers are continuously learning to improve their profits, some new, good sellers may have entered the market, and/or some non-reputable sellers may have reasonably adjusted their prices and greatly improved the quality of their products, and thus should be considered as reputable sellers. The proposed buying strategy accounts for this possibility by letting a buyer b explore the marketplace with probability ρ to discover new reputable sellers.

The proposed selling strategy is suitable for sellers in market environments where a seller can only sell its products and gain profit by winning auctions.
There are two important reasons why a seller may not be able to win an auction in our market environment: (i) It may set the price too high, and (ii) the quality of its product may be under the buyer’s expectation level. Our proposed selling algorithm considers both of these factors by allowing the seller to not only adjust the price (equation (10)), but also optionally add more quality to its product (equation (12)). Various sellers may have different policies for adjusting prices and altering the quality of their products. This is reflected by the way they choose their learning rates and how they increase/decrease their production costs, e.g., using linear functions as in equations (12) and (13), or using more sophisticated functions.
4 Related Work
Reinforcement learning has been studied in various multi-agent problems such as pursuit games [7], soccer [5], the prisoner’s dilemma game [9], and coordination games [10]. However, the agents and environments studied in these works are not economic agents and market environments. The reinforcement learning based algorithm proposed in this paper is, in contrast, aimed at application domains where agents are economically motivated and act in open market environments.
In addition, our work contrasts with other efforts to assist users in buying and selling goods in electronic marketplaces. A number of agent models for electronic market environments have been proposed. Jango [3] is a shopping agent that assists customers in getting product information. Given a specific product by a customer, Jango simultaneously queries multiple online merchants (from a list maintained by NetBot, Inc.) for the product availability, price, and important product features. Jango then displays the query results to the customer. Although Jango provides customers with useful information for merchant comparison, at least three shortcomings may be identified: (i) The task of analyzing the resultant information and selecting appropriate merchants is completely left for customers. (ii) The algorithm underlying its operation does not consider product quality, which is of great importance for the merchant selection task. (iii) Jango is not equipped with any learning capability to help customers choose more and more appropriate merchants. Another interesting agent model is Kasbah [1], designed by the MIT Media Lab. Kasbah is a multi-agent electronic marketplace where selling and buying agents can negotiate with one another to find the “best possible deal” for their users. The main advantage of Kasbah is that its agents are autonomous in making decisions, thus freeing users from having to find and negotiate with buyers and sellers. However, as admitted in [1], Kasbah’s agents are not very smart as they do not make use of any AI learning techniques. Vidal and Durfee [12] address the problem of how buying and selling agents should behave in an information economy such as the University of Michigan Digital Library. They divide agents into classes corresponding to the agents’ capabilities of modeling other agents: Zero-level agents are the agents that learn from the observations they make about their environment, and from any environmental rewards they receive. 
One-level agents are those agents that model other agents as zero-level agents. Two-level agents are those that model other agents as one-level agents. Higher level agents are recursively defined in the same manner. It should be intuitive that the agents with more complete models of others will always do better. However, because of the computational costs associated with maintaining deeper (i.e., more complex) models, there should be a level at which the gains and the costs of having deeper models balance out for each agent. The main problem addressed in [12] is to answer the question of when an agent benefits from having deeper models of others. The work in [12] motivates and serves as a starting point for our work. Nevertheless, we believe that in a market environment, reputation of sellers is an important factor that buyers can exploit to avoid interaction with dishonest sellers, therefore reducing the risk of purchasing low quality goods. On the other hand, we think that sellers may increase their sales (and hence their profits) by not only adjusting the prices of their goods, but also by tailoring their goods to meet the buyers’ specific needs. Thus, instead of having agents maintain recursive models of others and dealing with the associated computational costs, we consider taking a new approach: We would like to use a reputation mechanism as a means of shielding buyers from being “cheated” (by malicious sellers), and
to give sellers the option of altering the quality of their goods to satisfy the buyers’ needs.
5 Future Research Directions
For the next step, we would like to experimentally confirm the possible advantages of the proposed algorithm. In particular, we are interested in answering at least the following three questions: (i) When a buyer uses a reputation mechanism, can it achieve a better level of satisfaction? (ii) How much better can a buyer perform (in terms of computational time) if it uses a reputation mechanism? (iii) If a seller considers improving the quality of its products, will it have more chances to win an auction? To provide answers to these questions, we plan to perform a simulation to evaluate and compare the proposed algorithm with a simplified version where buyers do not use a reputation mechanism and sellers do not consider altering the quality of their products. Also, we plan to experimentally consider a number of additional versions of the algorithm in order to clearly specify the circumstances under which a particular version is preferable. The additional versions that are of interest to us include

– Buyers (sellers) do not keep track of sellers’ (buyers’) behaviour³.
– Buyers keep track of sellers’ behaviour but sellers do not keep track of buyers’ behaviour.
– Buyers keep track of sellers’ behaviour, while sellers divide buyers into groups and keep track of groups of buyers’ behaviour.

For further research, it would also be possible to investigate more sophisticated algorithms for agents in electronic markets that allow agents to cooperate with other agents and/or take advantage of their knowledge about other agents to maximize their local utility. One specific case to consider is allowing buyers in the marketplace to form neighborhoods such that within a neighborhood they inform one another of their knowledge about sellers. These buyers can then use their own knowledge combined with the informed knowledge to make decisions about which sellers to select.
We predict that this form of transferring knowledge will be especially beneficial to new buyers, who may be able to use the experience of existing buyers to make satisfactory decisions without having to undergo several trials to build up enough experience for themselves. Allowing agents to share knowledge with one another may necessitate the social dimension of current reputation models, i.e., the issue of how an agent should evaluate the reputation of another agent based on the ratings of the latter’s neighbors. The work of Yu and Singh [16] served as a motivation for our concept of reputation. It includes a mechanism for sharing information about reputation within neighborhoods and would therefore be a useful starting point for any future work which explores the use of advice from other buyers in the marketplace. 3
In our algorithm, a buyer (seller) keeps track of sellers’ (buyers’) behaviour by including variable s (b) in its expected value (expected profit) function.
One additional avenue for future work is to explore further the concept of reputation in multi-agent electronic marketplaces. We are interested in addressing questions such as (i) Is there a better way for buying agents to benefit from using a reputation mechanism? (ii) Can selling agents also make use of a reputation mechanism? (iii) What would be an efficient and suitable way to represent, manage, and use reputation in electronic marketplaces? This line of research may lead to an analysis of existing reputation models and the development of a new model. One useful starting point for this work is the research of Sabater and Sierra [8], which proposes that reputation be modeled as a weighted combination of different factors.
6 Conclusion
In this paper, we proposed a feasible, reinforcement learning and reputation based algorithm for buying and selling agents in market environments. According to this algorithm, buying agents learn to optimize their expected product values by selecting appropriate sellers to do business with among their reputable sellers. Selling agents also learn to maximize their expected profits by both adjusting product prices and optionally altering the quality of their products. We argued that the proposed algorithm may lead to improved performance for buying agents, a higher level of satisfaction for both buying and selling agents, reduced communication load, and more robust systems. This work therefore demonstrates that reputation mechanisms can be used in combination with reinforcement learning techniques to design intelligent learning agents that participate in market environments. Our future research aims to provide a set of feasible learning algorithms together with a clear characterization of different situations under which a particular algorithm is preferable. Such a characterization will address several important questions, such as under which circumstances buying agents should make use of a reputation mechanism, under what conditions agents may not need to track the behaviour of other agents, and under what situations buying agents should exchange their knowledge about selling agents. By accomplishing this objective, we hope to provide some general guidelines for AI-systems designers in building effective economic agents and desirable market environments.
References

1. A. Chavez and P. Maes. Kasbah: An Agent Marketplace for Buying and Selling Goods. In Proceedings of the First International Conference on the Practical Application of Intelligent Agents and Multi-Agent Technology, 1996.
2. R. Davis and R. G. Smith. Negotiation as a Metaphor for Distributed Problem Solving. In Artificial Intelligence, 20(1): 63-109, January 1983.
3. R. B. Doorenbos, O. Etzioni, and D. Weld. A Scalable Comparison-Shopping Agent for the World Wide Web. In Proceedings of the First International Conference on Autonomous Agents, pages 39-48, February 1997.
4. J. O. Kephart. Economic Incentives for Information Agents. In M. Klusch and L. Kerschberg, editors, Cooperative Information Agents IV, Lecture Notes in Artificial Intelligence, Vol. 1860, pages 72-82. Springer-Verlag, Berlin, 2000.
5. M. L. Littman. Markov Games as a Framework for Multi-Agent Reinforcement Learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157-163, 1994.
6. Y. Nagayuki, S. Ishii, and K. Doya. Multi-Agent Reinforcement Learning: An Approach Based on the Other Agent’s Internal Model. In Proceedings of the Fourth International Conference on Multi-Agent Systems, pages 215-221, 2000.
7. N. Ono and K. Fukumoto. Multi-Agent Reinforcement Learning: A Modular Approach. In Proceedings of the Second International Conference on Multi-Agent Systems, pages 252-258, 1996.
8. J. Sabater and C. Sierra. REGRET: A Reputation Model for Gregarious Societies. In Papers from the Fifth International Conference on Autonomous Agents Workshop on Deception, Fraud and Trust in Agent Societies, pages 61-69, 2001.
9. T. W. Sandholm and R. H. Crites. Multi-Agent Reinforcement in the Iterated Prisoner’s Dilemma. In Biosystems, 37: 147-166, 1995.
10. S. Sen, M. Sekaran, and J. Hale. Learning to Coordinate without Sharing Information. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pages 426-431, 1994.
11. R. G. Smith. The Contract Net Protocol: High Level Communication and Control in a Distributed Problem Solver. In IEEE Transactions on Computers, C-29(12): 1104-1113, December 1980.
12. J. M. Vidal and E. H. Durfee. The Impact of Nested Agent Models in an Information Economy. In Proceedings of the Second International Conference on Multi-Agent Systems, pages 377-384, 1996.
13. G. Weiss, editor. Multi-Agent Systems: A Modern Approach to Distributed Artificial Intelligence. The MIT Press, 2000.
14. G. Weiss. Learning to Coordinate Actions in Multi-Agent Systems. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 311-316, 1993.
15. P. R. Wurman, M. P. Wellman, and W. E. Walsh. The Michigan Internet AuctionBot: A Configurable Auction Server for Humans and Software Agents. In Proceedings of the Second International Conference on Autonomous Agents, pages 301-308, 1998.
16. B. Yu and M. P. Singh. A Social Mechanism of Reputation Management in Electronic Communities. In M. Klusch and L. Kerschberg, editors, Cooperative Information Agents IV, Lecture Notes in Artificial Intelligence, Vol. 1860, pages 154-165. Springer-Verlag, Berlin, 2000.
Grid-Based Path-Finding

Peter Yap
Department of Computing Science, University of Alberta
Edmonton, Canada T6G 2E8
[email protected]
Abstract. Path-finding is an important problem for many applications, including network traffic, robot planning, military simulations, and computer games. Typically, a grid is superimposed over a region, and a graph search is used to find the optimal (minimal cost) path. The most common scenario is to use a grid of tiles and to search using A*. This paper discusses the tradeoffs for different grid representations and grid search algorithms. Grid representations discussed are 4-way tiles, 8-way tiles, and hexes. This paper introduces texes as an efficient representation of hexes. The search algorithms used are A* and iterative deepening A* (IDA*). Application-dependent properties dictate which grid representation and search algorithm will yield the best results.
1 Introduction
Commercial games were a $9 billion (US) industry in 1999, and the rapid rate of growth has not abated [10]. In the past, better computer graphics have been the major technological sales feature of games. With faster processors, larger memories, and better graphics cards, this has almost reached a saturation point. The perceived need for better graphics has been replaced by the demand for a more realistic gaming experience. All the major computer games companies are making big commitments to artificial intelligence [3].

Path-finding is an important problem for many applications, including transportation routing, robot planning, military simulations, and computer games. Path-finding involves analyzing a map to find the “best” cost of traveling from one point to another. Best can be a multi-valued function and use such criteria as the shortest path, least-cost path, safest path, etc. For many computer games this is an expensive calculation, made more difficult by the limited percentage of cycles that are devoted to AI processing.

Typically, a grid is superimposed over a region, and a graph search is used to find the best path. Most game programs conduct path-finding on a (rectangular) tile grid (e.g., The Sims, Age of Empires, Alpha Centauri, and Baldur’s Gate). Each tile has a positive weight that is associated with the cost to travel into that tile. The path-finding algorithm usually used is A* [2]. A few games use IDA* (Iterative Deepening A*) [4], which avoids A*’s memory overhead usually at the cost of a slower search. It is worth noting that the commercial computer games industry “discovered” A* in 1996 [9].

R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 44–55, 2002. © Springer-Verlag Berlin Heidelberg 2002
Grid-Based Path-Finding
Path-finding in computer games may be conceptually easy, but for many game domains it is difficult to do well [1]. Real-time constraints limit the resources (both time and space) that can be used for path-finding. One solution is to reduce the granularity of the grid, resulting in a smaller search space. This gives a coarser representation, which is often discernible to the user (characters may follow contorted paths). Another solution is to cheat and have the characters move in unrealistic ways (e.g., teleporting). Of course, a third solution is to get a faster processor. Regardless, the demands for realism in games will always result in more detailed domain terrains, resulting in a finer grid and a larger search space.

Most game programs decompose a terrain into a set of squares or tiles. Traditionally, one is allowed to move in the four compass directions on a tile. However, it is possible to also include the four diagonal directions (so eight directions in total). We call the latter an octile grid and the former a tile grid. Once the optimal path is found under the chosen grid, smoothing is done on this “grid-optimal” path to make it look more realistic [8].

This paper presents several new path-finding results. Grid representations discussed are tiles, octiles and the oft-overlooked hexes (for historical reasons, usually only seen in war strategy games). This paper introduces texes as an efficient representation of hexes. The search algorithms used are A* and iterative deepening A* (IDA*). Application-dependent properties dictate which grid representation and search algorithm will yield the best results. This work provides insights into different representations and their performance trade-offs. The theoretical and empirical analyses show the potential for major performance improvements to grid-based path-finding algorithms.
2 Path-Finding in Practice
Many commercial games exhibit path-finding problems. Here we highlight a few examples that we are personally familiar with. It is not our intent to make negative remarks about these products, only to illustrate that there is a serious problem and that it is widespread. Consider Blizzard’s successful multi-player game Diablo II. To be very brief, the player basically runs around and kills hordes of demonic minions... over and over again. To finish the game, the player usually exterminates a few thousand minions. The game involves quite a lot of path-finding, since each of these minions either chases the player, or (less commonly) runs away from the player. In the meantime, the player is rapidly clicking on the screen in an effort to either chase the minion, or (more commonly) to run away from the minion and company. All this frantic running and chasing requires path-finding computations. To complicate the matter, the player is allowed to hire NPCs (non-player characters, called hirelings) or play with other humans in an effort to kill more minions. This significantly adds to the complexity of the game in terms of path-finding. Consider the scenario whereby a party of human players with hirelings is attacked by a very large horde of minions. From the path-finding point of view,
Peter Yap
this is a complicated situation. Being very sensible, the human players all independently run in different directions to escape the horde. In this state, path-finding is done on each fleeing player by interpreting the player’s mouse clicks, path-finding must be done on each minion so that they give chase to the human players, and path-finding must be done on each hireling so that they flee with their respective employers. On a slow computer or in a network game, this computationally-intensive scenario reduces the game to a slide show, often leading to a player’s untimely death, since the minions still attack even while the game appears “frozen” to the human players.

One of the solutions applied by the game programmers was to magically teleport the hireling close to the player instead of calculating a path to move the hireling to the player. This solution is not satisfactory; sometimes the hireling is teleported close to the player, but inside a large group of minions. The most serious problem with teleportation is that it detracts from the fun and realism of the game.

As a second case, consider The Sims. Here the player controls a family in a household environment. Often, the house is cluttered with obstacles like furniture, making path-finding slightly tricky. Typical path-finding problems involve deadlocks when two Sims are trying to occupy the same space at the same time. A common situation is when a Sim has just finished using the bathroom and is trying to leave through the bathroom door. Simultaneously, another Sim desperately needs to go and rushes towards the bathroom. The two Sims collide at the bathroom door and a deadlock ensues. Often the player must personally resolve the issue. This situation could be avoided with better path-finding (and if the Sims could learn simple courtesy).

These case studies are representative of the difficulties encountered in many commercial games. Clearly, these problems must be resolved if we are to realize John Laird’s vision of creating “human-level AI” in these characters [6].
3 Search Algorithms
A* is the classic artificial intelligence optimization search algorithm. It uses a best-first search strategy, exploring the most likely candidate while eliminating provably inferior solutions. Its effectiveness is based on having a good heuristic estimator, H, on the remaining distance from the current state to a goal state. If the heuristic is admissible (does not overestimate), then an optimal answer is guaranteed. On a grid, A* can be shown to explore a search space that is proportional to $D^2$, where D is the distance to a goal state [7].

Iterative-deepening A* (IDA*) is a memory-efficient version of A*. It eliminates the open and closed lists by spending additional search time to save space. For many applications, space is the limiting factor and thus IDA* is preferred. However, since IDA* iterates and repeatedly explores paths, this may result in a horribly inefficient search that is still asymptotically optimal (e.g., DNA sequence alignment). The speed of an IDA* search depends on the number of nodes it needs to examine. Analysis has shown that the number of nodes to be searched is proportional to $O(b^{D-H})$ [5], where b is the average branching factor and H is the effect of the heuristic. Intuitively, this is because IDA* checks every path of length D, and at each depth, each node branches into b more nodes.
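The iterative-deepening scheme described above can be sketched in a few lines of C. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: the 5×5 map `demo_map`, the unit step cost, and the names `ida_star`, `dfs`, and `manhattan` are all hypothetical. Each iteration is a depth-first search bounded by f = g + h, and the bound grows to the smallest f value that exceeded it.

```c
#include <stdlib.h>

#define ROWS 5
#define COLS 5
#define FOUND (-1)
#define NO_PATH 1000000   /* larger than any f value on this grid */

/* Hypothetical demo map: 0 = free tile, 1 = obstacle. */
static const int demo_map[ROWS][COLS] = {
    {0, 0, 0, 0, 0},
    {0, 1, 1, 1, 0},
    {0, 0, 0, 1, 0},
    {1, 1, 0, 1, 0},
    {0, 0, 0, 0, 0},
};

/* Manhattan distance: admissible on a 4-way grid with unit step cost. */
static int manhattan(int r, int c, int gr, int gc) {
    return abs(r - gr) + abs(c - gc);
}

/* Depth-first search limited to nodes with f = g + h <= bound.
 * Returns FOUND, or the smallest f value that exceeded the bound. */
static int dfs(int r, int c, int gr, int gc, int g, int bound,
               int prev_dr, int prev_dc) {
    static const int dr[4] = {-1, 0, 1, 0};
    static const int dc[4] = {0, 1, 0, -1};
    int f = g + manhattan(r, c, gr, gc);
    int min_exceeded = NO_PATH;
    if (f > bound) return f;
    if (r == gr && c == gc) return FOUND;
    for (int i = 0; i < 4; i++) {
        /* Never undo the previous move: three successors, not four. */
        if (dr[i] == -prev_dr && dc[i] == -prev_dc) continue;
        int nr = r + dr[i], nc = c + dc[i];
        if (nr < 0 || nr >= ROWS || nc < 0 || nc >= COLS) continue;
        if (demo_map[nr][nc]) continue;
        int t = dfs(nr, nc, gr, gc, g + 1, bound, dr[i], dc[i]);
        if (t == FOUND) return FOUND;
        if (t < min_exceeded) min_exceeded = t;
    }
    return min_exceeded;
}

/* Returns the optimal path length, or -1 if no path was found.
 * No open or closed list is kept; only the recursion stack is used. */
int ida_star(int sr, int sc, int gr, int gc) {
    int bound = manhattan(sr, sc, gr, gc);
    for (;;) {
        int t = dfs(sr, sc, gr, gc, 0, bound, 0, 0);
        if (t == FOUND) return bound;
        if (t > ROWS * COLS) return -1;   /* no simple path is this long */
        bound = t;
    }
}
```

On this map, `ida_star(2, 2, 2, 4)` needs three iterations (bounds 2, 4, and 6) because the direct route is blocked, which illustrates the repeated re-exploration that IDA* accepts in exchange for its small memory footprint. The unreachable-goal cutoff is deliberately crude and can be very slow.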
4 Tiles, Octiles, and Hexes
In this section an analysis is presented of the cost of path-finding with IDA* using tiles (4 degrees of movement), hexes (6), and octiles (8). We assume nonnegatively weighted nodes.

A tile has four adjacent nodes (b = 4). Hence a path-finding search has to consider the four adjacent tiles to explore. Since one never backtracks along an optimal path, it is not necessary to consider the direction that the search just came from (i.e., it is not optimal to undo the previously made move). Hence, b = 3 (except for the start tile), and the number of nodes that need to be searched to find a solution at depth D for IDA* is proportional to $O(3^{D-H})$.

Now consider a hex grid with six degrees of movement. Using a similar argument, one might deduce that the branching factor of a hex grid is five. However, we can do better and reduce the branching factor to three. Assume that a hexagonal tile’s neighbors are in the compass directions N, NE, SE, S, SW, and NW (see Figure 1). Consider moving in direction N from tile1 to tile2. What is the branching factor now at tile2? Moving back to tile1 does not have to be considered (backtracking). SE and SW also do not need to be considered, since if they were on the optimal path, one would move from tile1 in directions NE and NW, respectively, instead of going to tile2. In summary, at each non-root hex, we need only examine three hexes and hence there is a branching factor of three (b = 3).
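The pruning argument for hexes can be written as a small successor-selection routine. This is a minimal sketch, assuming the six directions are numbered clockwise starting at N; the function name `hex_successor_dirs` and the numbering are illustrative choices, not part of the paper.

```c
/* Directions numbered clockwise: 0 = N, 1 = NE, 2 = SE, 3 = S, 4 = SW, 5 = NW. */
enum { HEX_DIRS = 6 };

/* Fills out[] with the directions worth expanding after arriving via
 * last_dir; pass -1 at the start hex. Returns how many were written. */
int hex_successor_dirs(int last_dir, int out[HEX_DIRS]) {
    int n = 0;
    if (last_dir < 0) {                 /* start hex: all six neighbors */
        for (int d = 0; d < HEX_DIRS; d++) out[n++] = d;
        return n;
    }
    /* Keep straight ahead and the two 60-degree turns. The reverse move
     * and the two 120-degree turns are dropped, as argued in the text. */
    out[n++] = (last_dir + HEX_DIRS - 1) % HEX_DIRS;
    out[n++] = last_dir;
    out[n++] = (last_dir + 1) % HEX_DIRS;
    return n;
}
```

After a move N (direction 0) the routine yields NW, N, and NE, matching the example of moving from tile1 to tile2.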
[Figure 1: a hex tile1 and its northern neighbor tile2, with the directions N, NE, SE, SW, and NW marked around tile2.]
Fig. 1. Branching factor of the hexagonal grid

The branching factor of both the tile grid and the hex grid is three. For comparison purposes, the area of the hex is made to be the same as that of a tile. Given the same distance, on average, a path represented by the hex grid is shorter than the path represented by the tile grid (see Figure 2, where the direct path from start to goal is given in bold, and an optimal path following a grid topology is given in regular font).

[Figure 2: three panels, “A Tile Path”, “A Hex Path”, and “An Octile Path”, each showing a START node and a GOAL node.]
Fig. 2. Optimal paths on different grids

It follows that because the hex path is shorter, one doesn’t need to search as deep (i.e., it requires fewer steps to reach the goal node and hence D is smaller). It can be mathematically shown that given the same distance, if a tile grid searches with depth D then a hex grid will search with depth 0.81D on average [11]. Combining the branching factors and the depths of both grids, it follows that if the tile grid searches through $O(3^{D-H})$ tiles in a search, then the hex grid searches through $O(3^{0.81D-H}) \approx O(2.42^{D-H})$. This result shows that a hex grid is exponentially faster than a tile grid for an IDA* search.

Now consider the octile grid, which has eight degrees of movement. Using similar arguments to that given above, we can deduce that the branching factor of an octile grid is five. However, a closer inspection shows that this is too high. With some enhancements, one can reduce an octile search to have an asymptotic branching factor of roughly 4.2 (5 for diagonal movements and 3 for non-diagonal movements). One can also mathematically show that if the tile grid is searched for D depth, then the octile grid is searched for $D/\sqrt{2}$ depth on average (see Figure 2). Intuitively, the depth for the octile should be less than that of a tile because one diagonal octile move is equal to two tile moves. Hence an octile grid searches $O(4.2^{D/\sqrt{2}}) \approx O(2.77^{D})$ [11].

In terms of IDA* search speed, hexes are better than octiles, and octiles are better than tiles. For A*, the asymptotic search speed is indifferent to the choice of grid (although, of course, D will differ).
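As a worked check of the effective branching factors quoted above, the bases can be evaluated directly. The values below are rounded, and the inputs (0.81, 4.2) are themselves rounded in the text, so a difference in the last digit from the quoted figures is to be expected.

```latex
\[
3^{0.81} = e^{0.81 \ln 3} \approx e^{0.890} \approx 2.43,
\qquad
4.2^{1/\sqrt{2}} = e^{\ln 4.2 / \sqrt{2}} \approx e^{1.015} \approx 2.76 .
\]
```

Both values lie well below the tile grid's base of 3, which is the sense in which the hex and octile searches grow more slowly with D.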
5 Introducing the Tex Grid
In addition to the exponential search advantage that the hex grid enjoys over the tile grid, hexes have very nice geometric properties. For example, each hex is perpendicular and equidistant to each adjacent hex. Furthermore, a hex shares exactly one side with each adjacent hex. These hex properties provide a better topological representation of the path-finding problem for computer games. Consider Figure 3 where a unit wishes to move diagonally. The search needs to check if the two obstacles that pinch the direction of movement are connected (like a mountain) or not connected (like a canyon) (top row of the figure). The middle row shows a possibly ambiguous tile representation of the two scenarios. Although this ambiguity can be resolved with some extra work, it can be entirely avoided by using hexes or texes (bottom row). For these reasons, it is common to see hexes used in war strategy games.

Unfortunately, because of the regular hexagon’s shape, the hex grid is harder to implement. We introduce the tex grid (a tiled hex), which is topologically equivalent to a hex grid but uses tiles. One can imagine a tex grid as a tile grid such that the odd columns are moved up by half the height of a tile (see Figure 3 or Figure 7). A brick wall is another example of a tex grid. Tex grids are more manageable and representative than hex grids since space is represented as rectangles. Additionally, each tex is equidistant from, and shares exactly one side with, each adjacent tex. More importantly, texes have a branching factor of three. Theoretically, texes are only slightly slower than the hex grid on average: $O(3^{0.809D-H})$ instead of $O(3^{0.805D-H})$. Another obvious advantage that texes have over hexes is that every tex path is shorter than a tile path, whereas some hex paths are longer than some tile paths (although hex paths are shorter on average).
All in all, tex grids are exponentially faster than tiles (and slightly slower than the hex grid), produce smoother and shorter paths, and are easy to work with. The attributes for the choice of grid are summarized in Table 1.
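Because a tex grid is stored as an ordinary array of tiles, its six-way adjacency reduces to a parity test on the column. The following is a minimal sketch under stated assumptions: x is the column, y is the row, y grows upward, and odd columns are the ones shifted up by half a tile. The type `Tex` and the function `tex_neighbors` are hypothetical names introduced for illustration.

```c
typedef struct { int x, y; } Tex;

/* Writes the six neighbors of (x, y) into out[] and returns 6.
 * Bounds checking against the map is left to the caller. */
int tex_neighbors(int x, int y, Tex out[6]) {
    /* An odd (raised) column touches rows y and y+1 of the columns
     * beside it; an even column touches rows y and y-1. */
    int side = (x % 2 != 0) ? y + 1 : y - 1;
    out[0] = (Tex){x,     y + 1};   /* above, same column */
    out[1] = (Tex){x,     y - 1};   /* below, same column */
    out[2] = (Tex){x + 1, y};
    out[3] = (Tex){x + 1, side};
    out[4] = (Tex){x - 1, y};
    out[5] = (Tex){x - 1, side};
    return 6;
}
```

The four tile neighbors are always present; the two extra neighbors are the diagonal cells on the side toward which the adjacent columns are offset, so no hexagon geometry is needed.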
In path-finding search algorithms like A* or IDA*, we use a heuristic that estimates the cost of reaching the goal. This heuristic is generally the shortest distance between the current node and the goal node in the absence of obstacles. For the tile grid, this shortest distance heuristic is called the Manhattan distance. For the hex grid, we introduce the Vancouver distance. The Vancouver distance also works for the tex grid since it is topologically equivalent to the hex grid.

Given two nodes $(x_1, y_1)$ and $(x_2, y_2)$ under a hexagonal co-ordinate system, let

\[ x = x_1 - x_2, \qquad y = y_1 - y_2, \]

\[ \mathit{correction} = \begin{cases} x_1 \pmod 2, & \text{if } y_1 < y_2 \text{ and } x \text{ is odd} \\ x_2 \pmod 2, & \text{if } y_1 > y_2 \text{ and } x \text{ is odd} \\ 0, & \text{otherwise} \end{cases} \]

then the Vancouver distance, or the number of hexes between the two points, is

\[ \max\{0,\; y - (x/2)\} + x - \mathit{correction} \]
The above result follows from three observations. Firstly, the hexes (x, y) on or under the 30° diagonal are exactly x hexes away from the origin. Secondly, the hexes above this diagonal are exactly y − (x/2) away from the diagonal (see Figure 4). Thirdly, for two points such that one point is in an odd column and the other point in an even column, the heights of these two points will be different in the Cartesian plane even if they are the same on the hexagonal grid; as such, we must add a correction term to compensate. Using these facts, we can arbitrarily set one node to be the origin and use symmetry to calculate the Vancouver distance between two nodes.
[Figure 4: a hex grid in which each hex is labeled with its (x, y) co-ordinate and its distance to the origin (0,0).]
Fig. 4. Vancouver Distance under this hex co-ordinate system. Note that we arbitrarily arranged the hex grid so that the odd columns are above the even columns
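The number of hex steps between two nodes can also be computed by a different route than the closed form. The sketch below does not transcribe the Vancouver distance formula; instead it converts the offset co-ordinates to axial co-ordinates, a standard hex-grid technique that is not taken from the paper. It assumes the layout noted in the caption of Figure 4 (odd columns above even columns, y growing upward), and the names `hex_distance` and `floor_half` are illustrative.

```c
#include <stdlib.h>

/* Floor of x/2 that is also correct for negative x. */
static int floor_half(int x) {
    return (x >= 0) ? x / 2 : -((-x + 1) / 2);
}

/* Number of hex (or tex) steps between (x1, y1) and (x2, y2), with no
 * obstacles, in an offset layout where odd columns sit half a hex
 * above even columns. */
int hex_distance(int x1, int y1, int x2, int y2) {
    /* Axial co-ordinates: q is the column and r = y - floor(x / 2), so
     * that a step along the rising diagonal leaves r unchanged. */
    int dq = x2 - x1;
    int dr = (y2 - floor_half(x2)) - (y1 - floor_half(x1));
    int a = abs(dq), b = abs(dr), c = abs(dq + dr);
    int m = a > b ? a : b;
    return m > c ? m : c;
}
```

With the origin at (0,0), a hex on the diagonal such as (5,2) comes out x = 5 steps away, and (0,3) comes out 3 steps away, in line with the first two observations in the text.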
7 Empirical Results
This paper has shown the asymptotic theoretical results, but it is unclear how the various grid topologies behave for the grid sizes normally used in practice. This section contrasts the costs of IDA* searches on tile grids against comparable tex grids (tiles/octiles and texes are the same size).

There are two reasons why we empirically compare tiles and texes, rather than octiles or hexes. Firstly, comparing hexes and any rectangular grid will not be fair because they are of different shape and size even if their area is the same (see Figure 5); this becomes a problem if we were to compare an M × N hex grid with an M × N tiled grid, as the hex grid would be taller and thinner than the tiled grid by an irrational proportion. As such, we are left with comparing the tex grid versus the tile or octile grid. Although all these grids are rectangular, they all have different topologies (due to their different branching factors). A convenient surjective mapping exists from the tile grid to the tex grid: for every pair of adjacent tile nodes, there exists a corresponding pair of adjacent tex nodes. The corresponding injective mapping does not exist, because not every pair of adjacent hex nodes has a corresponding pair of adjacent tile nodes. However, we can find a corresponding pair of physically adjacent tile nodes (see Figure 6).
Fig. 5. A tile/octile and an overlapping hex. Both have the same area but have different dimensions

All test cases where the tex grid has an unfair advantage over the tile grid were removed. For example in Figure 7, the tex grid can’t be compared to the tile grid since the tile grid cannot reach A while the tex grid can. It is practically impossible to fairly compare these two topologically different grids; nevertheless, the results are presented below.

Table 2. Empirical Results

Size   Trials  Obstacles  TexPath/TilePath  TexNodes/TileNodes  TexTime/TileTime
10^2   10^6    0%         0.809             0.769               1.172
10^2   10^6    10%        0.809             0.067               0.157
10^2   10^6    20%        0.810             0.049               0.090
10^2   10^6    30%        0.826             0.021               0.048
20^2   10^6    0%         0.808             0.492               0.974
20^2   6000    10%        0.808             0.012               0.020
20^2   1000    20%        0.813             0.019               0.034
30^2   10^6    0%         0.811             0.152               0.285

The tile grid and its comparable tex grid are compared in numerous independent trials (see Table 2). In each trial, a fixed number of obstacles are randomly placed on the grid; after that the start and the goal are randomly placed. Additionally, a path exists between the start and the goal in every trial. A glance at the TexPath/TilePath column shows that it reaffirms the theoretical prediction of 0.81.
[Figure 6: a 3×3 tile grid and the corresponding tex grid, each with nodes a to i. Pairs adjacent to e in the tile grid: (b,e), (d,e), (f,e), (h,e). Pairs adjacent to e in the tex grid: (b,e), (d,e), (f,e), (h,e), (a,e), (c,e).]
Fig. 6. Nodes b, d, f, h are adjacent to node e in both the tile and tex grid. Nodes a and c are adjacent to e in the tex grid but not in the tile grid, but are physically adjacent. If we were to compare the octile grid and the tex grid using the same example, we would find that g and i are adjacent to e in the octile grid but neither adjacent nor physically adjacent in the tex grid
[Figure 7: a tile grid and the corresponding tex grid, each with numbered cells and marked start, goal, and A nodes.]
Fig. 7. A tile grid and the corresponding tex grid
Note that the fraction of the tex path length over the tile path length (TexPath/TilePath) grows as the number of obstacles increases; this is not surprising given that the presence of obstacles restricts the advantages of texes over tiles (obstacles reduce the directions of movement). The TexNodes/TileNodes and TexTime/TileTime columns clearly show that searches on tex grids examine fewer nodes and are faster. Note that the tex grid searches slower (but checks fewer nodes) than the tile grid when there are no obstacles; this is because the tex grid search checks, at the node level, that every move does not give an unfair advantage over the comparable tile grid search. Clearly, the hex grid search would be much faster if the need for fair comparisons is removed.

The general trend in Table 2 is that texes perform better than tiles when the grid size becomes larger or when the number of obstacles increases. This trend is most apparent in the 10x10 grids, whose numbers are more informative because the number of trials is large. In comparison, we only have a limited number of trials for the 20x20 grids since it takes an exponential amount of time to gather those trials.
8 Conclusion
Path-finding is an important issue in many application domains, including computer games. It is worth noting that the results of this paper apply not only to computer games, but to any type of path-finding on a grid. This paper introduces results that increase our understanding of the algorithms and data representations used:

1. Hexagonal grids provide a better topological representation of the underlying problem space. Each hex is equidistant and uniquely shares one side with each adjacent hex.
2. The tex grid retains the advantages of a hexagonal grid but is easier to implement.
3. While the choice of grid does not affect the asymptotic performance of A*, it does for IDA*.
4. It is mathematically proven and empirically shown that the hexagonal grid is superior to the conventional tile grid for IDA* searches. Furthermore, searching on a hex grid instead of a tile or octile grid will result in exponentially faster searches. It can also be proven that a hex grid is optimal in terms of search speed for all regular planar tessellations.

Hex grids provide a better topological representation than tile or octile grids. Moreover, for memory constrained domains that necessitate IDA*, the hex grid is the optimal grid choice. Finally, the implementation of hex grids is made easier with tex grids. Current research involves analyzing the performance of different grid topologies and search algorithms in BioWare’s Baldur’s Gate series of programs.
Acknowledgments

Thanks to Mark Brockington for his invaluable insights into the path-finding issues in BioWare’s products. I would like to express my deepest appreciation to Jonathan Schaeffer; this paper has changed and improved a lot as a consequence of his input. Financial support was provided by the Natural Sciences and Engineering Research Council of Canada (NSERC) and Alberta’s Informatics Circle of Research Excellence (iCORE).
References

1. K. Forbus, J. Mahoney, and K. Dill. How qualitative spatial reasoning can improve strategy game AIs. AAAI Spring Symposium on Artificial Intelligence and Interactive Entertainment, pages 35–40, 2001.
2. P. Hart, N. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybernet., 4(2):100–107, 1968.
3. S. Johnson. Wild Things. Wired, pages 78–83, 2002. March issue.
4. R. Korf. Depth-first iterative-deepening: An optimal admissible tree search. Artificial Intelligence, 27(1):97–109, 1985.
5. R. Korf, M. Reid, and S. Edelkamp. Time complexity of iterative-deepening A*. Artificial Intelligence, 129(2):199–218, 2001.
6. J. Laird and M. van Lent. Human-level AI’s killer application: Interactive computer games. In AAAI National Conference, pages 1171–1178, 2000.
7. J. Pearl. Heuristics: Intelligent search strategies. Addison-Wesley, 1984.
8. S. Rabin. A* Aesthetic Optimizations. Game Programming Gems. Charles River Media, pages 264–271, 2000.
9. B. Stout. Smart moves: Intelligent path-finding. Game Developer Magazine, (October):28–35, 1996.
10. D. Takahashi. Games get serious. Red Herring, (87):64–70, 2000. December 18 issue.
11. P. Yap. New Ideas in Pathfinding. PhD thesis, Department of Computing Science, University of Alberta. In preparation.
Transposition Table Driven Work Scheduling in Distributed Game-Tree Search

Akihiro Kishimoto and Jonathan Schaeffer
Department of Computing Science, University of Alberta, Edmonton, Canada T6G 2E8
{[email protected],[email protected]}
Abstract. MTD(f) is a new variant of the αβ algorithm that has become popular amongst practitioners. TDS is a new parallel search algorithm that has proven to be effective in the single-agent domain. This paper presents TDSAB, applying the ideas behind TDS parallelism to the MTD(f) algorithm. Results show that TDSAB gives comparable performance to that achieved by conventional parallel αβ algorithms. This result is very encouraging, given that traditional parallel αβ approaches appear to be exhausted, while TDSAB opens up new opportunities for further performance improvements.
1 Introduction
Many artificial intelligence applications require real-time responses. This can be achieved by a combination of software and hardware. On the software side, anytime algorithms have been developed to ensure that a quality response is available as a function of the resources consumed [3]. Clearly faster hardware enables more computations to be performed in a fixed amount of time, generally allowing for a better quality answer. Dual-processor machines and clusters of inexpensive processors are ubiquitous and are the de facto research computing platforms used today.

Single-agent domains and two-player games have been popular test-beds for experimenting with new ideas in sequential and parallel search. This work transfers naturally to many real-world problem domains, for example planning, path-finding, theorem proving, and DNA sequence alignment. There are many similarities in the approaches used to solve single-agent domains (A* and IDA*) and two-player games (αβ). Many of the sequential enhancements developed in one domain can be applied (with modifications) to the other.

Two recent developments have changed the way that researchers look at two-player search. First, MTD(f) has emerged as the new standard framework for the αβ algorithm preferred by practitioners [8]. Second, TDS is a new, powerful parallel search paradigm for distributed-memory hardware that has been applied to single-agent search [10,11]. Given that there is a new standard for sequential αβ search (MTD(f)) and a new standard for parallel single-agent search (TDS), the obvious question is what happens when both ideas are combined. This paper investigates the issues of using TDS to parallelize MTD(f).

R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 56–68, 2002. © Springer-Verlag Berlin Heidelberg 2002
In MTD(f), all searches are done with a so-called minimal window [α, α + 1]. Each search answers a binary question: is the result ≤ α or is it > α? At the root of the tree, a series of minimal window searches are performed until the result converges on the value of the search tree. MTD(f) has been shown to empirically out-perform other αβ variants. It has the nice property of searching using a single value, an important consideration in a parallel search.

TDS is an elegant idea that reverses the traditional view of parallel search. Instead of sending data to the work that needs it, TDS sends the work to the data. This simple reversal of the relationship between computation and data simplifies the parallelism, reduces the overhead of parallelism, and produces impressive results for single-agent search applications.

This paper introduces TDSAB, TDS parallelism adapted to αβ search (specifically, MTD(f)) [5]. This is the first implementation of TDS in the two-player domain, and the results are very encouraging. The speedups in two application domains (Awari, small branching factor; Amazons, large branching factor) average roughly 23 on a network of 64 workstations, a result that is comparable to what others have achieved using conventional parallel algorithms. Given that this is the first implementation of TDSAB, and that there are numerous ideas for further enhancing performance, this is a successful result.

Section 2 discusses sequential and parallel game-tree search algorithms. Section 3 introduces TDSAB, while Section 4 presents experimental data on its performance. Section 5 discusses future work on enhancing this algorithm.
2 Game-Tree Search
This section gives a quick survey of αβ searching. Good surveys of the literature are available for sequential [7] and parallel [1] search.

2.1 Sequential Search
For more than 30 years the αβ algorithm has been the most popular algorithm for two-player games. The algorithm eliminates provably irrelevant nodes from the search. Two bounds are maintained, α and β, representing the lower and upper bounds respectively on the minimax value of the search tree (the search window). The savings of αβ come from the observation that once the search proves that the score of the node is outside the search window, then further effort at that node is irrelevant.

A large number of enhancements have been added to αβ to (dramatically) improve the search efficiency. The most important of these is the transposition table, a large cache containing the results from previously searched sub-trees [13]. Its effectiveness is application dependent. For chess, for example, it is worth a factor of 10 in performance. Thus any high-performance implementation must have a transposition table.

MTD(f) is recognized as the most efficient variant of sequential αβ. Figure 1 shows that MTD(f) is just a sequence of minimal window αβ calls, searching node n to depth d. The initial search is centered around the value f (usually the value returned by the previous iteration in an iterative-deepening search). This result is then monotonically increased or decreased to the correct minimax value. The transposition table is critical to the performance of MTD(f), since the tree is repeatedly traversed, albeit with a different search window. The table prevents nodes that have been proven to be inferior from being searched repeatedly.
#define INF INT_MAX   /* from <limits.h>; stands for the ∞ of the figure */

int MTD(node_t n, int d, int f) {
    int s, b;
    int lowerbound = -INF, upperbound = INF;
    if (f == -INF) b = f + 1; else b = f;
    do {
        /* Minimal window search */
        s = AlphaBeta(n, d, b - 1, b);    /* (+) */
        if (s < b) upperbound = s; else lowerbound = s;
        /* Reset the bound */
        if (lowerbound == s) b = s + 1; else b = s;
    } while (lowerbound != upperbound);
    return s;   /* the minimax value once the two bounds meet */
}
Fig. 1. MTD(f )
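To see the convergence of the minimal window searches on a concrete case, the driver of Figure 1 can be run on a toy tree. This is a sketch under stated assumptions rather than the authors' code: the eight leaf scores in `leaf_score`, the complete binary tree shape, and the fail-soft `alpha_beta` routine are all made up for illustration, and no transposition table is used, so repeated passes re-search the tree from scratch.

```c
#include <limits.h>

#define INF INT_MAX
#define LEAVES 8

/* Hypothetical leaf scores of a complete binary tree of depth 3,
 * listed left to right. The root is a maximizing node. */
static const int leaf_score[LEAVES] = {3, 5, 6, 9, 1, 2, 0, 7};

/* Fail-soft alpha-beta: the returned value is a bound on the true
 * score whenever it falls outside the window (alpha, beta). */
static int alpha_beta(int index, int depth, int alpha, int beta,
                      int maximizing) {
    if (depth == 0) return leaf_score[index];
    int best = maximizing ? -INF : INF;
    for (int child = 0; child < 2; child++) {
        int v = alpha_beta(2 * index + child, depth - 1,
                           alpha, beta, !maximizing);
        if (maximizing) {
            if (v > best) best = v;
            if (best > alpha) alpha = best;
        } else {
            if (v < best) best = v;
            if (best < beta) beta = best;
        }
        if (alpha >= beta) break;   /* cutoff */
    }
    return best;
}

/* MTD(f) driver: repeated minimal-window searches starting from the
 * first guess f until the lower and upper bounds meet. */
int mtd(int depth, int f) {
    int s, b;
    int lowerbound = -INF, upperbound = INF;
    b = (f == -INF) ? f + 1 : f;
    do {
        s = alpha_beta(0, depth, b - 1, b, 1);
        if (s < b) upperbound = s; else lowerbound = s;
        b = (lowerbound == s) ? s + 1 : s;
    } while (lowerbound != upperbound);
    return s;
}
```

The minimax value of this tree is 5. Starting from a low guess such as f = 0, the bound climbs through successive searches; starting from a high guess such as f = 100, the first search returns an upper bound and the next one confirms it.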
2.2 Obstacles to Parallel Search
αβ has proven to be a notoriously difficult algorithm to get good parallel performance with. There are numerous obstacles to overcome:

1. Search overhead is the (usually) larger tree built by the parallel algorithm as compared to the sequential algorithm.
2. Synchronization overhead occurs when processors have to sit idle waiting for the results from other searches.
3. Communication overhead is the result of processors sending messages between each other.
4. Load balancing reflects how evenly the work has been distributed between the processors.

None of these overheads is independent of each other. For example, increased communication can reduce search overhead; reducing synchronization time usually increases search overhead. A high-performance αβ searcher needs to be finely tuned to choose the right mix of parameters to balance the performance trade-offs.
To minimize synchronization overhead, many algorithms use work stealing to offload work from a busy processor to an otherwise idle processor. When a processor is starved for work, it randomly chooses a processor and then “steals” a piece of work from its work queue. Although work stealing improves the load balancing, on distributed-memory machines this can have adverse impact because of the lack of information available in the local transposition table.

2.3 Parallel Search Algorithms
Numerous parallel αβ search algorithms have been proposed. The Young Brothers Wait Concept (YBWC) [4] is a well-known representative of this class of algorithms. There are many variants of YBWC, but they only differ in the implementation details (e.g., [6,14]). Highly optimized sequential αβ search algorithms usually do a good job of ordering moves from best to worst. A well-ordered αβ search tree has the property that if a cutoff is to occur at a node, then the first move considered has a high probability of achieving it. YBWC states that the left-most branch at a node has to be searched before the other branches at the node are searched. This observation reduces the search overhead (by ensuring that the move with the highest probability of causing a cutoff is investigated before initiating the parallelism) at the price of increasing the synchronization overhead (waiting for the first branch to return). Practical αβ search algorithms have transposition tables. When parallelizing αβ search with the work-stealing framework on distributed-memory machines, the efficient implementation of transposition tables becomes a serious problem. Since processors do not share memory, they cannot access the table entries of other processors without incurring communication overhead. There are three naive ways to implement transposition tables on distributedmemory machines. With local transposition tables, each processor has its own table. No table entries are shared among processors. Therefore, looking up and updating an entry can be done without communication. However, local transposition tables usually result in a large search overhead, because a processor may end up repeating a piece of work done by another processor. With partitioned transposition tables, each processor keeps a disjoint subset of the table entries. This can be seen as a large transposition table divided among all the processors (e.g., [4]). 
Let L be the total number of table entries with p processors, then each processor usually has Lp entries. When a processor P needs a table entry, it sends a message to ask the processor Q which keeps the corresponding entry to return the information to P (communication overhead). P has to wait for Q to send back the information on the table entry to P (synchronization overhead). When P updates a table entry, P sends a message to the corresponding processor to update the entry. Updating messages can be done asynchronously. Using replicated transposition tables results in each processor having a copy of the same transposition table. Looking up a table entry can be done by a local access, but updating an entry requires a broadcast to all the other processors
to update their tables with the new information (excessive communication overhead). Even if messages for updates can be sent asynchronously and multiple messages can be sent at a time by combining them as a single message, the communication overhead increases as the number of processors increases. As well, replicated tables have fewer aggregate entries than a partitioned table.

All three approaches may do redundant search in the case of a DAG (Directed Acyclic Graph). A search result is stored in the transposition table after having finished searching. If identical nodes are allocated to different processors, then duplicate search may occur, which increases the search overhead. Because the efficient implementation of transposition tables in a distributed environment is a challenging problem, researchers have been looking for better solutions [2,12].

ABDADA [14] does not use work-stealing, preferring a shared transposition table to control the parallel search. All the processors start searching the root node simultaneously. Each transposition table entry has a field for the number of processors entering a node, which is used to determine the order in which to search children of that node. ABDADA achieved better speedups than YBWC in chess and Othello on a shared memory machine. However, it is hard to implement ABDADA on distributed-memory machines because of the necessity of sharing the transposition tables.

2.4 TDS
Transposition-table Driven Scheduling (TDS) flips the idea of work-stealing to solve the transposition table problem [10,11]. While work-stealing moves the data to where the work is located, TDS moves the work to where the data is located. In TDS, transposition tables are partitioned over the processors like a partitioned transposition table. Whenever a node is expanded, its children are scattered to the processors (called home processors) which keep their transposition table entries. Once the work is sent to a processor, that processor accesses the appropriate transposition table information locally. All communication is asynchronous. Once a processor sends a piece of work, it can immediately work on another task. Processors periodically check to see if new work has arrived. The idea of TDS seems to be easily applied to two-player games. However, there are important differences between single-agent search (IDA*) and αβ that complicate the issue. First, αβ has a search window which IDA* does not have. This makes the implementation of TDS complicated because (a) the window may be narrowed after searching a node, and (b) a node reached through more than one path may be searched with different windows. Second, the order in which nodes are considered in αβ is much more important than in IDA*. In IDA*, the search order is managed using a stack to let TDS behave in a depth-first manner. For parallel αβ search, we need a more complicated scheme to let the left-most and shallowest nodes be searched first. Finally, IDA* does not have αβ-like pruning. The results of the children in IDA* are not reported back to its parent. On the other hand, αβ checks if a cutoff happens, after searching a branch of a node. When a cutoff occurs, we need a mechanism to not only receive
the scores reported from the children, but also to tell the other processors to stop searching other branches to avoid unnecessary search.

TDS has several important advantages:
1. Transposition table access always involves only local communication. All communication is asynchronous.
2. DAGs are not a problem, since identical positions are guaranteed to be assigned to the same processor.
3. Given that positions are mapped to random numbers to be used for transposition table processor assignments, statistically good load balancing happens for free.

Since TDS has proven to be so successful in single-agent search, the obvious question to ask is how it would fare in αβ search.
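The home-processor idea behind these advantages can be illustrated with a minimal Python sketch. It is not the authors' implementation: `tt_key` is a hypothetical stand-in for a transposition-table key (a real program would more likely use a Zobrist hash), and the modulo mapping in `home_processor` is an assumed partitioning rule.

```python
import hashlib


def tt_key(position: str) -> int:
    """Hypothetical transposition-table key: a 64-bit hash of the position."""
    digest = hashlib.sha256(position.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def home_processor(key: int, num_processors: int) -> int:
    """Assumed partitioning rule: the key modulo the number of processors."""
    return key % num_processors


def scatter(children, num_processors):
    """Group child positions by home processor, as a node expansion would."""
    outbox = {}
    for child in children:
        target = home_processor(tt_key(child), num_processors)
        outbox.setdefault(target, []).append(child)
    return outbox


# Identical positions reached by different paths land on the same processor.
print(scatter(["pos-a", "pos-b", "pos-a", "pos-c"], 4))
```

Because the destination depends only on the position, two occurrences of the same position always meet at one table entry, which is why DAGs cause no duplicate search under this scheme.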
3 TDSAB
This section presents a new parallel αβ algorithm, TDSAB (Transposition-table Driven Scheduling Alpha-Beta), combining parallel search (TDS) with MTD(f). Transposition tables are critical to the performance of MTD(f), and a TDS-like search addresses the problem of an efficient implementation of this data structure in a distributed-memory environment.

3.1 The TDSAB Algorithm
MTD(f) has an important advantage over classical αβ for parallel search; since all searches use a minimal window, the problem of disjoint and overlapping search windows will not occur (a serious problem with conventional parallel αβ implementations). The disadvantage is that for each iteration of MTD(f), there may be multiple calls to αβ, each of which incurs a synchronization point (at line “+” in Figure 1). Each call to αβ has the parallelism restricted to adhere to the YBWC restriction to reduce the search overhead. The distribution of nodes to processors is done as in TDS. Since TDSAB follows the TDS philosophy of moving work to the data, the issues explained in the last section have to be resolved. The following new techniques are used for TDSAB: – Search Order: The parallel search must preserve the good move ordering (best to worst) that is seen in sequential αβ. Our solution to this issue is similar to that used in APHID [2]. Each node is given a priority based on how “left-sided” it is. To compute the priority of a node, the path from the root to that node is considered. Each move along that path contributes a score based on whether the move is the left-most in the search tree, left-most in that sub-tree, or none of the above. These scores are added together to give a priority, and nodes are sorted to determine the order in which to consider work (see [5] for more details).
– Pruning: When searching the children of a node in parallel and a cutoff score is returned, further work at this node is not necessary; all outstanding work must be stopped. However, because in TDS all the descendants of a node are not always on the same processor, we have to consider an efficient way of tracking down this work (and any work that has been spawned by it) and terminating it. Cutoffs can be elegantly handled by the idea of giving each node a signature. Intuitively the signature for a node P is a function of the path traversed from the root node to P; hence every node has a unique signature. When a cutoff happens at a node P, TDSAB broadcasts the signature of P to all the processors. A processor receiving a cutoff signature examines its local priority queue and deletes all the nodes whose signatures have the signature of P as a prefix.

Figure 2 gives pseudo code for TDSAB. For simplicity, we just explain TDSAB without YBWC. The function ParallelMWS does one iteration of a minimal window search [α, α + 1] in parallel. The end of the search is checked by the function FinishedSearchingRoot, which can be implemented by broadcasting a message when the score for the root has been decided. The function RecvNode checks regularly if new information comes to a processor. RecvNode receives three kinds of information:
1. New Work: a processor has done a one-ply search and has spawned the children to be evaluated in parallel. The new work arrives, is assigned a priority, and is inserted in the priority queue. If the new piece of work is terminal or a small piece of work, then it is immediately searched locally (the cost of doing it in parallel outweighs the benefits) and sent to its parent node (lines marked “-” in the figure).
2. Cut off: a signature is received and used to remove work from the priority queue.
If a processor receives a signature, the function CutAllDescendants examines its local queue and discards all nodes with a matching signature prefix (see the pseudo-code at “*”).
3. Search Result: the minimax score of a node is being returned to its parent node.

If new information arrives at a processor, GetNode, GetSignature, and GetSearchResult get information on a node, signature, and score for a node respectively. GetLocalJob determines a node to be expanded from its local priority queue, and DeleteLocalJob deletes a node from the queue. We note that TDSAB keeps information on nodes being searched unlike the IDA* version of TDS. SendNode sends a node to the processor chosen by the function HomeProcessor, which returns the processor having a transposition entry to the node, decided by the transposition table key (TTKey). When receiving a search result, TDSAB has to consider two cases (StoreSearchResult). If a score proves a fail high (result > α), TDSAB does not need to search the rest of the branches. The fail-high score is saved in the transposition table (TransFailHighStore), the node is dequeued from the priority queue, and the score is sent to the processor having the parent of the node (SendScore).
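The signature-based cutoff can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's data structure: a signature is represented here as a tuple of move indices along the path from the root, and the local queue is a plain list of (priority, signature) pairs. The names `child_signature` and `cut_all_descendants` are hypothetical.

```python
def child_signature(parent_sig: tuple, move_index: int) -> tuple:
    """Assumed encoding: a child's signature extends its parent's path."""
    return parent_sig + (move_index,)


def cut_all_descendants(queue: list, cut_sig: tuple) -> list:
    """Drop every queued node whose signature has cut_sig as a prefix."""
    n = len(cut_sig)
    return [(prio, sig) for prio, sig in queue if sig[:n] != cut_sig]


root = ()
node = child_signature(child_signature(root, 0), 1)  # path: move 0, then move 1
queue = [
    (5, (0,)),
    (4, node),
    (3, child_signature(node, 2)),  # descendant of the cut node
    (2, (1, 0)),
    (1, (0, 2)),
]
print(cut_all_descendants(queue, node))
```

With tuples, the prefix test compares whole path elements, so a cutoff at path (0, 1) does not accidentally remove a sibling such as (0, 10).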
     int α;                  /* A search window is set to [α, α + 1]. */
     const int granularity;  /* Granularity depends on machines, networks, and so on. */

     void ParallelMWS() {
       int type, value;
       node_t p;
       signature_t signature;
       do {
         if (RecvNode(&type) == TRUE) {  /* Check if new information arrives. */
           switch (type) {
             case NEW_WORK:  /* New work is stored in its priority queue. */
               GetNode(&p); Enqueue(p); break;
(*)          case CUT_OFF:  /* Obsolete nodes are deleted from its priority queue. */
(*)            GetSignature(&signature); CutAllDescendants(signature); break;
(+)          case SEARCH_RESULT:  /* A search result is saved in the transposition table. */
(+)            GetSearchResult(&p, &value); StoreSearchResult(p, value); break;
           }
         }
         GetLocalJob(&p);
         if (p == FOUND) {
(-)        if (p == terminal || p.depth ≤ granularity) {
(-)          value = AlphaBeta(p, p.depth, α, α + 1);  /* Local search is done for small work. */
(-)          SendScore(p.parent, value); DeleteLocalJob(p);
           } else {  /* Do one-ply search in parallel. */
             for (int i = 0; i < p.num_of_children; i++) {
               p.child_node[i].depth = p.depth - 1;
               SendNode(p.child_node[i], HomeProcessor(TTKey(p.child_node[i])));
             }
           }
         }
       } while (!FinishedSearchingRoot());
     }

     void StoreSearchResult(node_t p, int value) {
       if (value > α) {  /* Fail high */
         TransFailHighStore(p, value);
(#)      SendScore(p.parent, value);
(#)      SendPruningMessage(p.signature); DeleteLocalJob(p);
       } else {  /* Fail low */
         p.score = MAX(p.score, value);
         p.num_received++;
         if (p.num_received == p.num_of_children) {
           /* All the scores for its children are received. */
           TransFailLowStore(p, p.score);
           SendScore(p.parent, p.score); DeleteLocalJob(p);
         }
       }
     }
Fig. 2. Simplified Pseudo-Code for TDSAB (without YBWC)
Only after a processor has completed searching a node is it discarded. Because searching the rest of the branches has already started, the processor broadcasts a signature to abort useless search, then deletes the node. When a fail low happens
(result ≤ α), a processor stores the maximum score of the branches. If all the branches of a node are searched, the fail-low score for the node is stored in the transposition table (TransFailLowStore), and the score is reported back to its parent.

3.2 Implementation Details
TDSAB has been implemented for the games of Awari and Amazons (see www.cs.ualberta.ca/~games). The African game of Awari is characterized by a low branching factor (less than 6) and an inexpensive evaluation function. Amazons is a recently-invented game that has grown in popularity since it seems to be intermediate in difficulty between chess and Go. It has a very large branching factor (2,176 at the start of the game) and an expensive evaluation function. These games have different properties that exhibit themselves by different characteristics of a parallel search.

For Amazons the YBWC strategy was modified. Because of the large branching factor the basic YBWC strategy distributes too many nodes to the processors, resulting in excessive search overhead. Therefore, if the first branch of a node does not cause a cutoff, a smaller number of children (P, where P is the number of processors) are searched in parallel at a time. If none of these branches causes a cutoff then the next P nodes are searched in parallel, and so on.

Two enhancements were added to our Awari implementation. First, all the children of the root node are searched in parallel. Although this is a search overhead versus synchronization overhead trade-off, it solves a serious problem for any domain with a small branching factor: insufficient work to keep the processors busy (starvation). Second, the search order of identical nodes has to be carefully handled in order to avoid deadlock. Assume that a processor has two identical nodes. If searching a node is always delayed until after the completion of the other node, a deadlock may occur. Figure 3 illustrates this problem. Suppose that B and B′ are identical nodes. If these nodes are searched in the following order, a deadlock will occur:
(1) A is expanded, and B and E are sent to their home processors.
(2) B is expanded, and C is sent. If the processor of B receives a node B′ identical to B, searching B′ is delayed until it receives a score for B.
(3) E is expanded, and D is sent.
(4) D is expanded, and B′ is sent. Searching B′ is done after finishing B.
(5) C is expanded, and D is sent.
In this case, B waits for the score for C, C waits for D, D waits for B′, and B′ waits for B. Therefore a cyclic wait has been created and a deadlock ensues.

To eliminate the possibility of deadlock, if two identical nodes are encountered and neither of the nodes has been searched yet, then TDSAB searches the shallower one first. When the search of a node n1 has already begun and an identical node n2 arrives whose search depth is equal to or deeper than that of n1, then n2 waits until n1's search completes. When a deeper search has already started, the shallower search of an identical node is also started. This strategy avoids a deadlock by preventing a shallower node from waiting for a deeper node to return its score, which happened in Figure 3. However, some nodes are searched more than once
[Figure: search graph with nodes A, B, E, C, D, and B′]
Fig. 3. Deadlock with Cycles
even if it does not cause a deadlock, when a deeper node is expanded before a shallower identical node.
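The modified YBWC strategy used for Amazons can be sketched as follows. This is a sequential Python illustration under assumptions, not the distributed implementation: `ybwc_batches` and `search_node` are hypothetical names, and `causes_cutoff` stands in for the result of actually searching a child.

```python
def ybwc_batches(children, num_processors):
    """Yield the eldest child alone, then groups of num_processors children."""
    if not children:
        return
    yield children[:1]
    rest = children[1:]
    for i in range(0, len(rest), num_processors):
        yield rest[i:i + num_processors]


def search_node(children, num_processors, causes_cutoff):
    """Count how many children are searched before a cutoff stops the node."""
    searched = 0
    for batch in ybwc_batches(children, num_processors):
        searched += len(batch)  # a batch is searched in parallel as a unit
        if any(causes_cutoff(child) for child in batch):
            break
    return searched


moves = list(range(10))
print(list(ybwc_batches(moves, 4)))
print(search_node(moves, 4, lambda m: m == 6))
```

The sketch makes the trade-off visible: a cutoff found in the middle of a batch still pays for the whole batch, so smaller batches reduce search overhead at the cost of more synchronization points.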
4 Experiments
Tables 1 and 2 show the experimental results for Awari and Amazons. All results were obtained using Pentium IIIs at 933 MHz, connected by a 100 Mb/s Ethernet. Each processor had its own 200 MB transposition table. The search depths were chosen so that the typical test position would take 1-2 minutes on 64 processors (i.e., the typical speed seen in tournaments). Awari, with its low branching factor and inexpensive evaluation function, can search 24-ply deep in roughly the time it takes to search Amazons (with its large branching factor and expensive evaluation function) 5-ply deep. To measure synchronization and communication overheads, we used different programs, which perform extra operations compared with those used to measure speedups and search overhead. Therefore, we note that the theoretical speedups calculated from these overheads do not always reflect the observed speedups in each game.
The Awari results can be compared to previous work using checkers, which has a similarly small branching factor. The TDSAB speedup of 21.8 on 64 processors easily beats the APHID speedup of 14.35 using comparable hardware [2]. Analysis of the overheads shows that synchronization is the major culprit. This is not surprising, given that there are 12 iterations (the program iterated in steps of two ply at a time), which contained an average of 3 synchronization points per iteration in the experiments. Figure 4 shows a graph of processor idle time (white space) for a typical search. The Y-axis is the processor number (0-31) and the X-axis is time. The vertical lines show where a synchronization point
Fig. 4. Awari (left) and Amazons (right) Idle Times
occurred. Clearly, the last few synchronization points resulted in a large amount of idle time, limiting the speedup. Amazons has only slightly better performance (23.5-fold speedup), which may seem surprising given the large branching factor (and, hence, no shortage of work to be done). The very large branching factor turns out to be a liability. At nodes where parallelism can be initiated, many pieces of work are generated, creating lots of concurrent activity (which is good). If a cutoff occurs, many of these pieces of work may have been unnecessary resulting in increased search overhead (which is bad). In this case, search overhead limits the performance, suggesting that the program should be more prudent than it currently is in initiating parallel work. Other parallel implementations have adopted a similar policy of searching subsets of the possible moves at a node, precisely to limit the impact of unexpected cutoffs (for example, [14]). In some sense, implementing a new parallel idea using MTD(f) was a mistake. When comparing results with previous work, one has to realize that two parameters have changed: the sequential algorithm (MTD(f) versus αβ) and the parallel algorithm (TDS versus something like YBWC). Thus, it is not obvious which component is responsible for any parallel inefficiencies. As the synchronization overhead for Awari demonstrates, MTD(f) has more synchronization points at the root and, hence, more synchronization overhead. Multigame is the only previous attempt to parallelize MTD(f) [9]. Multigame’s performance at checkers (21.54-fold speedup) is comparable to TDSAB’s result in Awari. In chess, Multigame achieved a 28.42-fold speedup using partitioned transposition tables; better than TDSAB’s results in Amazons. However, comparing these numbers is not fair. 
The Multigame results were obtained using slower machines (Pentium Pros at 200 MHz versus Pentium IIIs at 933 MHz), a faster network (Myrinet 1.2 Gb/s duplex network versus 100 Mb/s Ethernet), longer execution times (roughly 33% larger), and different games. Chess and checkers could have been used for our TDSAB implementations, allowing for a fairer comparison between our work and the existing literature. However, chess and checkers no longer seem to interest the research community. Both Awari and Amazons are the subject of active research efforts and thus are of greater interest.
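To put the reported speedups in perspective, one can compute parallel efficiency, taken here in its usual sense of speedup divided by the number of processors. The short Python calculation below is only illustrative; it assumes that the Amazons speedup, like the Awari one, was measured on 64 processors.

```python
def parallel_efficiency(speedup: float, num_processors: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / num_processors


# Speedups quoted in the text; 64 processors is assumed for both games.
for game, speedup in [("Awari", 21.8), ("Amazons", 23.5)]:
    print(f"{game}: {parallel_efficiency(speedup, 64):.1%}")
```

Under these assumptions both programs use roughly a third of the available processing power, which is consistent with the discussion of synchronization and search overheads above.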
5 Conclusions
The results of our work on TDSAB are both encouraging and discouraging. Clearly, the TDS framework offers important advantages for a high-performance search application, including asynchronous communication and effective use of memory. However, these advantages are partially offset by the increased synchronization overhead of MTD(f). The end result of this work is speedups comparable to what others have achieved. This is disappointing since, given the obvious advantages of transposition table driven scheduling, one would hope for a better result. On the other hand, this is the first attempt to apply TDS to the two-player domain, and undoubtedly improvements will appear.
There are numerous ideas yet to explore with TDSAB including: better priority queue node ordering, reducing MTD(f) synchronization, controlling the amount of parallelism initiated at a node, and speculative search. As well, a TDS implementation of αβ (not MTD(f)) would be useful. All these ideas are topics of current research.
Acknowledgments

Financial support was provided by the Natural Sciences and Engineering Research Council of Canada (NSERC) and Alberta’s Informatics Circle of Research Excellence (iCORE).
References

1. Mark Brockington. A taxonomy of parallel game-tree search algorithms. International Computer Chess Association Journal, 19(3):162–174, 1996.
2. Mark Brockington. Asynchronous Parallel Game-Tree Search. PhD thesis, Department of Computing Science, University of Alberta, 1998.
3. T. Dean and M. Boddy. An analysis of time-dependent planning. In AAAI National Conference, pages 49–54, 1988.
4. Rainer Feldmann. Game Tree Search on Massively Parallel Systems. PhD thesis, University of Paderborn, August 1993.
5. Akihiro Kishimoto. Transposition Table Driven Scheduling for Two-Player Games. Master’s thesis, Department of Computing Science, University of Alberta, 2002.
6. Bradley Kuszmaul. Synchronized MIMD Computing. PhD thesis, Massachusetts Institute of Technology, 1994.
7. Tony Marsland. Relative performance of alpha-beta implementations. In International Joint Conference on Artificial Intelligence, pages 763–766, 1983.
8. Aske Plaat, Jonathan Schaeffer, Wim Pijls, and Arie de Bruin. Best-first fixed-depth minimax algorithms. Artificial Intelligence, 87(1–2):1–38, 1996.
9. John Romein. Multigame - An Environment for Distributed Game-Tree Search. PhD thesis, Vrije Universiteit Amsterdam, 2001.
10. John Romein, Henri Bal, Jonathan Schaeffer, and Aske Plaat. A performance analysis of transposition-table-driven scheduling. IEEE Transactions on Parallel and Distributed Systems, 2001. To appear.
11. John Romein, Aske Plaat, Henri Bal, and Jonathan Schaeffer. Transposition table driven work scheduling in distributed search. AAAI National Conference, pages 725–731, 1999.
12. Jonathan Schaeffer. Distributed game-tree searching. Journal of Parallel and Distributed Computing, 6:90–114, 1989.
13. David Slate and Larry Atkin. CHESS 4.5 - The Northwestern University Chess Program. In Peter Frey, editor, Chess Skill in Man and Machine, pages 82–118. Springer-Verlag, 1977.
14. J.-C. Weill. The ABDADA distributed minmax-search algorithm. International Computer Chess Association Journal, 19(1):3–16, 1996.
Clue as a Testbed for Automated Theorem Proving

Eric Neufeld
Department of Computer Science, 57 Campus Drive, University of Saskatchewan, Saskatoon, Sk. S7K 5A9 (306-966-4887)
[email protected]
Abstract. Recent articles explore formalizations of the popular board game Clue. For several years, this game was used as a testbed for automated theorem proving exercises in the context of introductory and advanced AI classes at the University of Saskatchewan, by way of motivating the usefulness of Prolog Technology Theorem Provers (PTTPs) as game playing engines. Although the game has a simple axiomatization over a small universe, adapting a PTTP to play a full game of Clue is not trivial. This paper illustrates solutions to problems encountered during this exercise, including some interesting computational rules of inference.
R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 69-78, 2002. Springer-Verlag Berlin Heidelberg 2002

1 Background and Introduction

Clue is a popular commercial board game about solving a mystery. The game comes with a deck of three different types of cards: a set of suspects, a set of weapons, and a set of locations. One card of each type is randomly selected and secretly placed in an envelope. The remaining cards are randomly dealt to at least three players. (The game is trivial if there are only two players.) The goal of the game is to deduce the three cards hidden in the envelope. Players take turns rolling a pair of dice, moving pieces and possibly making “accusations”. An accusation is really a question posed to the player immediately to the current player’s left of the form “are you holding suspect S or weapon W or room R?” If the player to the left is holding one of S, W, R, that player must show the asking player any one of them. If the player to the left holds none of these, the next player must reveal a card to the asker, and so on. Only the asker sees the card.

Naïve Clue players typically rule out possibilities as they locate other cards, but most do not exploit the vague (negative and disjunctive) information that crosses the table. For example, suppose Player 1 holds the card Mustard, the name of a suspect, and suppose also that Player 1 learned from an earlier query that Player 2 is holding Candlestick, a weapon. If Player 2 asks Player 3, “are you holding any of Mustard, Candlestick or Library”, and Player 3 answers “yes”, Player 1 should be able to deduce that Player 3 holds Library, even without the benefit of seeing the card. Alternately, suppose again that Player 1 knows Player 2 holds Candlestick and remembers that Player 3 previously answered “no” to a query that involved the
Lounge. If Player 2 asks Player 3, “are you holding Plum, Candlestick, or Lounge”, Player 1 should be able to deduce that Player 3 is holding Plum. The preceding examples are fairly simple and show that it is possible to exploit both negative and disjunctive information. The next example, which occurs in a game similar to Clue, shows how complicated the reasoning can get. A more recent game, Mystery at Hogwarts, differs from Clue in that when Player 1 (say) poses a question, all players holding one of the cards named by Player 1 must show Player 1 a card, so that only Player 1 sees it. If there are 5 players, and if Player 1 makes a query, and Players 2, 3, and 4 each show Player 1 a card, Player 5 can deduce that none of the named cards are in the envelope, while not knowing where any of the named cards are. Indeed, a non-player eavesdropping on the game can make this inference. In terms of knowledge representation and reasoning, the same inference could occur in Clue when the same query gets posed three different times to three different people who respond positively. In Clue, players are less likely to manage such information in their heads. Furthermore, it is straightforward enough to construct subtle variations of this example where ordinary players are unlikely to make the necessary inferences. Although my children, for example, came to the right conclusion while playing Hogwarts, the obvious clausal representation results in a complex and deep resolution proof. Perhaps this is as it should be, since Clue, to the best of my knowledge, does not have high stakes international tournaments. However, it turns out that this type of game is about the right size for a knowledge representation project in an AI class. It is small enough to understand quickly, large enough to pose a few challenges, yet small enough to solve them with a few hints. 
The solution, I think, is a good motivation for wider use of first order logic (FOL) in other domains, and refutes at least some of the myths about the inapplicability of FOL in AI. The implementation described herein is based on Prolog technology [Sti88] and, to the extent that Clue motivates interest in formal logic, it also motivates some interest in Prolog.
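The two simple deductions described earlier in this section can be reproduced with a few lines of Python. This is only a sketch of the elimination step, not the theorem prover discussed in this paper; the function name `cards_possibly_shown` and the dictionaries `holder_of` and `not_held_by` are hypothetical.

```python
def cards_possibly_shown(query, responder, holder_of, not_held_by):
    """Cards named in `query` that `responder` could still be holding."""
    candidates = []
    for card in query:
        owner = holder_of.get(card)
        if owner is not None and owner != responder:
            continue  # known to be held by someone else
        if card in not_held_by.get(responder, set()):
            continue  # responder earlier answered "no" to a query naming it
        candidates.append(card)
    return candidates


# First example: Player 1 holds Mustard and knows Player 2 holds Candlestick.
known = {"Mustard": 1, "Candlestick": 2}
print(cards_possibly_shown(["Mustard", "Candlestick", "Library"], 3, known, {}))

# Second example: Player 3 earlier answered "no" to a query naming Lounge.
print(cards_possibly_shown(["Plum", "Candlestick", "Lounge"], 3,
                           {"Candlestick": 2}, {3: {"Lounge"}}))
```

When a "yes" answer leaves exactly one candidate, that card is deduced; the harder cases discussed above are those where several disjunctions have to be combined, which is where a theorem prover earns its keep.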
2 Axioms of Clue
The axioms of Clue are simple enough to state plainly in English and third year students, for the most part, were able to deduce them, as a knowledge representation exercise. These axioms are more or less the same as those in [BDS01] and equivalent to those in [van01]. However, where those papers sought to characterize this game in terms of knowledge, belief and actions, I found that a simple first order axiomatization suffices so long as one is not interested in planning strategies related to the board game. The following axioms use the convention that Player 0 is the envelope where the mystery cards are hidden. They are

1. If Player 0 holds card S of type suspect, Player 0 does not hold any other card of type suspect.
2. Player 0 holds at least one card of type suspect.
3. If Player 0 holds card W of type weapon, Player 0 does not hold any other card of type weapon.
4. Player 0 holds at least one card of type weapon.
5. If Player 0 holds card R of type room, Player 0 does not hold any other card of type room.
6. Player 0 holds at least one card of type room.
7. Every card is held by some player.
8. If some card is held by Player N, it is not held by any other Player M ≠ N.
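As a concrete reading of these eight axioms, the following Python sketch grounds them over a finite universe. It is an illustration under assumptions rather than the representation used in class: a literal is a hypothetical triple (sign, holder, card) standing for holds(holder, card) or its negation, and a clause is a list of such literals.

```python
from itertools import combinations


def clue_axioms(cards_by_type, num_players):
    """Ground clauses for the eight axioms; holder 0 is the envelope."""
    holders = range(num_players + 1)
    clauses = []
    for cards in cards_by_type.values():
        # Axioms 2, 4, 6: the envelope holds at least one card of each type.
        clauses.append([(True, 0, c) for c in cards])
        # Axioms 1, 3, 5: the envelope holds at most one card of each type.
        for c1, c2 in combinations(cards, 2):
            clauses.append([(False, 0, c1), (False, 0, c2)])
    all_cards = [c for cards in cards_by_type.values() for c in cards]
    for card in all_cards:
        # Axiom 7: every card is held by some holder.
        clauses.append([(True, h, card) for h in holders])
        # Axiom 8: no card is held by two different holders.
        for h1, h2 in combinations(holders, 2):
            clauses.append([(False, h1, card), (False, h2, card)])
    return clauses


tiny = {"suspect": ["s1", "s2"], "weapon": ["w1", "w2"], "room": ["r1", "r2"]}
print(len(clue_axioms(tiny, 3)))
```

Even for this tiny deck the pairwise exclusions of axiom 8 dominate the clause count, which gives some feeling for how quickly the ground theory grows with the number of cards and players.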
All the axioms are necessary; otherwise the theorem prover cannot know whether the players are dealt cards (in which case cards are held by one and only one player) or whether players independently choose cards (with replacement). (This is similar to the difference between traditional one-winner-take-all lotteries and state lotteries such as the Canadian 6-49 where different players could hold the same numbers.) As well, the axioms seem to be sufficient for building a good Clue player: there have been games where a program “eavesdropping” on a group of human players has been able to deduce the solution before any of the players, without the benefit of holding any cards, or seeing any cards. A frequent question is whether it is necessary and/or useful to represent the size of the hands held by individual players. It is not difficult to illustrate examples where this information is useful. However, I haven’t yet seen a convenient way to represent this information in a manner compatible with clauses consisting of predicates of the form holds(player, card). So, for the sake of manageability in the classroom setting, I have not looked at this. Unlike [BDS01], I did not use situation calculus or any formalism beyond FOL, apart from equality predicates trivially implemented. Knowledge gained during a game is strictly monotonic and can be simply asserted to the Prolog database or recompiled, on the assumption that no one lies during the game. (It has happened that some of the simulated games (transcriptions from human games) contained errors by the players, and obviously, this will cause problems.) Finally, I have been asked whether better or faster solutions might exist using other AI technologies, for example, constraints, or other logic programming technology. The goal of the exercise was to motivate the use of first order logic to undergraduates in a setting where Prolog was used as a programming language, so other approaches were not considered.
3 Implementation History
A 3 player game of Clue is about determining the truth or falsity of 63 propositions of the general form holds(Player, Card). Because these numbers seem, on first glance, to be about reasonable for an assignment, it came as a surprise that the search space for a full-blown game of Clue (using all 21 cards) was much too large for the stack of a home-made LISP implementation of a linear resolution theorem prover, including the axioms above, even with a few optimizations. (I didn’t formally analyze the problem, but the likely cause was redundancy generated by disjunctive proofs, and possibly a LISP implementation that didn’t do many stack optimizations; see below.) To handle the proof space for the first offering of this challenge, the problem was whittled down to a few artificial scenarios involving nine or so cards.
The next version of the theorem prover was carved out of one of David Poole’s implementations of Theorist [Poo88], a Prolog technology theorem prover [Sti88] that implements a linear resolution theorem prover with (mostly propositional) loop checking as well as negated ancestor resolution. (One major change to that theorem prover was adding the capability of iterative deepening [Kor85], where the prover seeks proofs at ever-increasing fixed depths; this eliminates many problems related to looping, and it is well-known that the cost of iterative deepening is asymptotically the same as that of depth-first search.) The theorem proving system is as follows.

1. Each input clause of the form a ∨ b ∨ … ∨ d is added to a database as a set of Prolog rules including all contrapositive forms of the original input clause, e.g.
   a :- ¬b, ¬c, …, ¬d.
   b :- ¬a, ¬c, …, ¬d.
   c :- ¬a, ¬b, …, ¬d.
   …
   d :- ¬a, ¬b, …, ¬c.
   (Call this form Prolog Normal Form, or PNF.)
2. Unary negation is implemented syntactically as a functor or prefix. The advantage of this representation is that a Prolog meta-interpreter requires only minor modifications to answer queries about such clauses. This allows us to represent both genuine negation and genuine disjunction. (Standard Prolog cannot represent a ∨ b, where both atoms are positive.)
3. The meta-interpreter is modified to maintain a stack of previous goals. The meta-interpreter is also modified to check the current goal against previous goals in two different ways. If the current goal is identically equal to a goal on the stack, the current branch is a loop and the meta-interpreter forces a fail. If the current goal is identically equal to the negation of a goal on the stack, the current proof succeeds. This is called the rule of negated ancestor [H&S97], or rule NA.
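The translation in step 1 is mechanical, and a small Python sketch may make it concrete. The encoding is an assumption made for illustration: literals are strings, a leading "~" plays the role of the syntactic negation prefix, and a rule is a (head, body) pair corresponding to head :- body.

```python
def negate(literal: str) -> str:
    """Syntactic negation: add or strip a leading '~'."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def contrapositives(clause):
    """Return one (head, body) rule per literal of a disjunctive clause."""
    rules = []
    for i, head in enumerate(clause):
        body = [negate(lit) for j, lit in enumerate(clause) if j != i]
        rules.append((head, body))
    return rules


for head, body in contrapositives(["a", "b", "c"]):
    print(f"{head} :- {', '.join(body)}.")
```

An implication such as a → d is first written as the clause ¬a ∨ d, here ["~a", "d"], and then yields both d :- a and ~a :- ~d.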
Rule NA’s pedigree goes back to Loveland [Lov69], where it is called model elimination. The astute reader may notice that it would not have been difficult to implement the equivalent theorem prover as extensions to a LISP resolution theorem prover. Poole’s theorem prover is robust and fast and opened up the possibility of exploring other theorem proving problems. Initially, I hoped that a combination of the robustness and the compilation facilities of SICStus Prolog would be enough to permit a full-blown game of Clue in real time. In fact, the Prolog implementation initially was even more troublesome for at least two reasons. The first problem was created by introducing first order variables. Variables allow a compact representation of the Clue domain axioms of the form
∀X,Y,Z player(X) ∧ player(Y) ∧ X ≠ Y ∧ card(Z) ∧ holds(X,Z) → ¬holds(Y,Z).
This axiom states that a player cannot hold a card held by another player. The dual axiom states that a player holds a card if it can be proven that no other player holds
the card. Consequently, a proof containing unbound variables can result in a loop of the form:
Player X holds card Z if...
  for each Y ≠ X, we can prove Player Y does not hold card Z, which can be proved if...
  for some W, player W holds card Z, which can be proved if...
  for each Y ≠ W, we can prove Player Y does not hold card Z, which can be proved if...
  for some W, player W holds card Z, which can be proved if...
In fact, proofs beginning with ground queries can still enter this kind of loop. The loop checking mechanism described earlier cannot detect this kind of loop because the terms will not be identically equal. A solution (suggested in conversation by Bruce Spencer) might be to count different occurrences of variables on the stack and force failure when the count exceeds the number of objects in the universe. This will terminate some loops, but will not generally be helpful. A solution that worked for this domain was to ground all axioms, i.e., to generate them in propositional form. This dramatically increased the speed of the prover as well as the size of domain it was capable of handling. It still was not capable of proving theorems in real time.

The next enhancement was implementing Spencer and Horton’s method of clause numbering [H&S97] (a variation of what Spencer first called foothold proofs and subsequently ordered clause resolution) to eliminate redundant proofs. Consider the following propositional database:
a ∨ b ∨ c
a → d
b → d
c → d
This pattern of knowledge arises reasonably often, and particularly in Clue. The first sentence is a disjunction expressing some vague knowledge. (For example, it could express the knowledge that some player holds one of a, b, or c.) The last three sentences form a family where each of the disjuncts from the first clause implies the same consequence. The extended theorem prover with true negation and disjunction lets us represent these sentences. Furthermore, it should be obvious that d follows from the above, by cases.
The problem that arises is that if the above set of clauses is converted to Prolog Normal Form, as described earlier, there are three different proofs of d, illustrated in Figure 1. The check mark at the bottom of a branch indicates that a proof branch succeeds by rule NA. This shows that there are three proofs of the same fact with the same clauses although the database contains no duplicate clauses. Unfortunate placement of goal d in the context of a larger computation could unnecessarily treble the cost of a large unsuccessful computation that repeatedly backtracks to a redundant and ultimately doomed proof of d. If a proof rests on many pieces of disjunctive knowledge, the cost of the proof can grow factorially (i.e., at least exponentially). This is very much a property of the game of Clue and requires a solution.
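The redundancy can be made concrete by counting proof trees rather than stopping at the first one. The sketch below is a Python illustration under the same assumptions as a propositional PNF prover (literals as strings, `~` as the negation prefix); `count_proofs` is a hypothetical helper, not part of the system described here, and it does not implement clause numbering.

```python
from math import prod


def neg(lit):
    """Syntactic negation as a prefix: 'a' <-> '~a'."""
    return lit[1:] if lit.startswith("~") else "~" + lit


def to_pnf(clauses):
    """Expand each clause into all of its contrapositive (head, body) rules."""
    return [
        (head, [neg(lit) for j, lit in enumerate(clause) if j != i])
        for clause in clauses
        for i, head in enumerate(clause)
    ]


def count_proofs(goal, rules, ancestors=()):
    """Number of distinct proof trees found with loop checking and rule NA."""
    if neg(goal) in ancestors:   # branch closes by rule NA
        return 1
    if goal in ancestors:        # identical ancestor: loop, no proof on this branch
        return 0
    stack = ancestors + (goal,)
    # A rule contributes the product of the proof counts of its subgoals.
    return sum(
        prod(count_proofs(sub, rules, stack) for sub in body)
        for head, body in rules
        if head == goal
    )


rules = to_pnf([["a", "b", "c"], ["~a", "d"], ["~b", "d"], ["~c", "d"]])
print(count_proofs("d", rules))  # one proof per choice of "first" family member
```

Each of the three proofs uses the disjunction and all three implications; they differ only in which implication is tried first, which is the redundancy clause numbering is meant to remove.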
Clause numbering yields a significant speedup. Figure 2 shows the proofs of Figure 1.
Fig. 1. Redundant negated ancestor proofs use same clauses in different orders
Fig. 2. Proofs of Figure 1, disjunction highlighted
Recall from earlier that the sentences were referred to as consisting of a disjunction and a family of rules based on the disjuncts. Figure 2 illustrates that all three proofs of d are similar in that they all require the disjunction as well as all rules in the family. The only difference between them is that a different family member appears “first”. This is equivalent to doing the cases in a proof by cases in a different order. Horton and Spencer [1997] suggest a simple solution to this problem. Prior to conversion to PNF, attach a unique number to each input clause. (An easy way to do this is by adding the number as a clause parameter that is ignored during unification.) Finally, only allow rule NA to succeed if the smallest (or largest) numbered clause in the “family” appears at the top of the proof. This optimization does not eliminate the redundant proofs themselves. Rather, it eliminates factorially many backtrack points. The result is an order of magnitude
speedup. (This idea has led to a new architecture for automated theorem proving based on clause trees [H&S97].)

Third year students were required only to axiomatize the game, implement the axioms as PNF clauses in the theorem prover as described above, and demonstrate the correctness of their solution. A class of fourth year and graduate students spent a good deal of class time understanding the techniques described in this section, as well as implementation details. Although performance of this system, implemented in SWI-Prolog on a Windows PC, was acceptable, students were challenged to propose possible optimizations, implement them and report on successes or problems. Although no constraints were formally imposed, most of the solutions had relatively straightforward implementations in the context of the Prolog meta-interpreter, i.e., no solution required redesign of the architecture.
4 Further Improvements
4.1 Memoing and Local Lemmas
One improvement resulted from a consolidation of unit clause preference, memoization, and maintaining a run-time list of local lemmas (see below). All of these are relatively straightforward to implement in a meta-interpreter. In the course of proving any particular goal, other predicates may be proved true or false and carried through the proof as part of a difference list. Local lemmas are all those subgoals that lie on the path between a leaf that has been proved using rule NA and its matching negated ancestor. Figure 3 depicts a contrived proof branch containing other branches to the right of the NA proof:
Fig. 3. When local lemmas can be used
Subgoals a, b, c, d lie on the path between e and ¬e, a branch that succeeds by Rule NA. Each of a, b, c, d has conjunctive subgoals to the right, in this case consisting of additional calls to c, d, a and c. Once the branch ending with subgoal ¬e has succeeded by Rule NA, we know that any subsequent proofs of a, b, c, d in the subtree rooted at e will also succeed by Rule NA. Although a, b, c, d cannot be memoed for use in other subtrees, the proof space indicated by the top 3 “blobs” in Figure 3 to the right in the proof tree rooted at e can be pruned away. Recording any predicates successfully proved while attempting to prove some other predicate yielded a noticeable improvement in the theorem prover. However, no noticeable improvement resulted from a simple implementation of local lemmas. (There is an easy implementation of local lemmas that is quadratic in the depth of the stack. Subgoals are pushed on the stack together with an unbound variable. When a proof succeeds by rule NA, all unbound variables along the path to the negated ancestor are marked with the negated ancestor.)
4.2 Positive and Negative “Negation as Failure”
The point of adding true negation and disjunction to a Prolog technology theorem prover is to increase the expressive power of the input language. The nature of the game of Clue is that negative and disjunctive knowledge are the main form of communication, so this is a natural pairing. The preceding sections give a flavour of the combinatorial explosion that results when the language is extended. Another aspect of knowledge is that a theorem prover, for a goal g, might traverse a proof space before eventually failing in the event that g is not true. However, given a language that can represent both positive and negative knowledge, it may be the case that the theorem prover is executing an expensive proof of g when ¬g is already known as a literal, or when ¬g has an inexpensive, shallow proof. The first case is simple enough to handle: when proving a goal g, check whether ¬g is already known, and if so, fail. Obviously, this can be done for positive or negative literals, hence the title of this section. Students observed an improvement of approximately an order of magnitude after implementing this. This seems to make sense. At any point in the game, a goal g might have a costly proof, and if its contrary is known, this reduces the cost to a constant. In the case of Clue, the set of relevant literals is very small, and as the game progresses, the probability of a contrary being true also increases, so the observed speedup seems justified. The next question we considered is whether, given a goal g, it is worth the effort to dig for a shallow proof of ¬g at the same time. That is, it may be the case that ¬g has a fast proof that could be used to terminate an expensive proof of g. In the absence of parallel hardware/software, simple dovetailing would be well worth it when such proofs exist. In general, it seems reasonable to conjecture that this is not of significant value.
For instance, in domains with large, or even, infinite sets of literals that are not highly constrained, it could be that dovetailing positive and negative proofs would in general double the cost of computation. Furthermore, if one built a theorem prover to do this
recursively (i.e., at every node in the proof tree), the cost could increase exponentially. However, Clue is not such a domain. It consists of a relatively small set of highly constrained literals. Placing a card in one player’s hand removes it from the hands of all other players. Moreover, as the game draws to a close, the probability that the truth or falsity of any proposition is known (implicitly or explicitly) becomes quite high and therefore the potential value of extending this inference rule seems high. Furthermore, with some bounding, this seems practical in the context of an iterative deepening theorem prover.
4.3 Performance Results and other Improvements
With the enhancements described herein, and ignoring the matter of the board game, a PTTP-based Clue theorem prover could easily play human players in real time. Assuming a level playing field, where human players were restricted to logic, the program would likely win. The program’s performance is worst early in the game, where little is known and much search is required simply to fail to find a proof of most propositions. However, as the game proceeds, the computer will be able to exploit vague knowledge better than a human player. (Because no convenient user interface for non-computer scientist users was ever built for this software, it wasn’t possible to test the program in a realistic play situation. However, the program completed difficult proofs in seconds.)

Some years ago, a student agreed to provide a set of audit trails of Clue games played by human players to permit a simulation of the computer “eavesdropping” on human games without actually playing. The computer was able to deduce the contents of the envelope before the human players in two of three cases. However, the audit trails revealed that the players were not playing as competitively (or consistently) as they might. As a matter of record, classroom discussion of the game generated some interesting conversations about the way humans use probabilistic clues (names of cards appearing frequently in queries), body language, and board strategy in actual play. Building a player that would avoid giving away its knowledge by not using deduced cards in every query would not be difficult. However, a program capable of board strategies and bluffing is an entirely different type of AI.

The improvements described above mostly involve extensions to the meta-interpreter. Some students tried optimizations at the knowledge level, including partial evaluation, tautology elimination and redundant clause elimination.
Partial evaluation is a natural extension of memoing; elimination of redundant clauses and tautologies cannot hurt. In all of these cases, the problem is the cost of updating the knowledge base during the course of “play”.
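As an illustration of the contrary check of Section 4.2, the sketch below wraps an arbitrary search procedure with a lookup in the set of known literals. It is written in Python for brevity rather than as a meta-interpreter extension; `prove_checked`, `costly_search`, the `~` negation prefix and the literal names are hypothetical. Succeeding immediately on a known goal is an added assumption in the spirit of unit clause preference; the text itself only describes failing when the contrary is known.

```python
def neg(lit):
    """Syntactic negation as a prefix: 'a' <-> '~a'."""
    return lit[1:] if lit.startswith("~") else "~" + lit


def prove_checked(goal, known, search):
    """Consult known literals before falling back to a full proof search."""
    if goal in known:        # assumption: known literals succeed immediately
        return True
    if neg(goal) in known:   # contrary already known: fail at constant cost
        return False
    return search(goal)      # otherwise pay for the real search


calls = []


def costly_search(goal):
    """Stand-in for the theorem prover; records which goals reached it."""
    calls.append(goal)
    return False


known = {"holds(p2,rope)", "~holds(p1,rope)"}
print(prove_checked("holds(p1,rope)", known, costly_search))   # contrary is known
print(prove_checked("holds(p3,knife)", known, costly_search))  # falls through to search
print(calls)
```

The same wrapper works for negative goals, since `neg` strips the prefix: a query for `~holds(p2,rope)` fails without search because `holds(p2,rope)` is known.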
5 Conclusions and Future Work
The work described herein began by way of motivating the use of first order logic as a tool for knowledge representation in game playing to undergraduate and graduate
computer science students. The combined work of instructor and students over the years has resulted in a theorem prover that typically wins a simulated game of Clue that does not take into account any aspects of the board game (how pieces move, etc.). As a player, the program wins consistently, with luck being its main adversary. In some cases, the program can deduce the solution to the mystery as an eavesdropper (i.e., without holding or seeing any cards, but only hearing the queries and responses). The main problem of interest for future work is determining analytically and empirically the value of dovetailing proofs for positive and negative goals as described in Section 4.2. Although an automated Clue-playing program can exploit logic to outplay ordinary human players, Clue is not a spectator sport with a large audience. The main value of this exercise has been to demonstrate that first order logic may have some potential for discovering knowledge implicit in vague (i.e., negative and disjunctive) knowledge.

Thanks to all the students who participated in this exercise over the years. Thanks also to Bruce Spencer for many useful discussions. Lastly, thanks to three referees for a careful reading.
References
[BDS01] Bart, B., Delgrande, J. and Schulte, O. Knowledge and Planning in an Action-Based Multi-Agent Framework: A Case Study. In Advances in Artificial Intelligence, eds. E. Stroulia and S. Matwin, Springer Lecture Notes in AI 2056 (2001) 121-130
[H&S97] Horton, J.D. and Spencer, E.B. Clause Trees: A Tool for Understanding and Implementing Resolution in Automated Reasoning. Artificial Intelligence 92 (1997) 25-89
[Kor85] Korf, R.E. Depth-first iterative deepening: An optimal admissible tree search. Artificial Intelligence 27 (1985) 97-109
[Lov69] Loveland, D. Theorem Provers Combining Model Elimination and Resolution. In Machine Intelligence 4, University Press, Edinburgh (1969)
[Poo88] Poole, David. A logical framework for default reasoning. Artificial Intelligence 36 (1988) 27-47
[Sti88] Stickel, Mark. A Prolog technology theorem prover. Journal of Automated Reasoning 4, 4 (December 1988) 353-380
[van01] van Ditmarsch, Hans P. The description of game actions in Cluedo (November 2001). Submitted to Game Theory and Applications VIII, eds. Petrosian and Mazalov
A Noise Filtering Method for Inductive Concept Learning
George V. Lashkia
Department of Information and Computer Eng., Okayama University of Science
1-1 Ridai-cho, Okayama, 700-0005 Japan
Abstract. In many real-world situations there is no known method for computing the desired output from a set of inputs. A strategy for solving this type of problem is to learn the input-output functionality from examples. However, in such situations it is not known which information is relevant to the task at hand. In this paper we focus on the selection of relevant examples. We propose a new noise elimination method which is based on the filtering of the so-called pattern frequency domain and which resembles frequency domain filtering in signal and image processing. The proposed method is inspired by the bases selection algorithm. A basis is an irredundant set of relevant attributes. By identifying examples that are non-typical in bases determination, noise elimination is achieved. The empirical results show the effectiveness of the proposed example selection method on artificial and real databases.
1 Introduction
In general, the task of concept learning can be divided into two subtasks: deciding which information to use to describe the concept, and deciding how to use this information. The problem of focusing on the most relevant information in a potentially overwhelming quantity of data has become increasingly important. Redundant and irrelevant information degrades the performance of concept learning, both in speed and generalization ability. In this view, the selection of relevant attributes and examples, and the elimination of irrelevant ones, is one of the central problems in machine learning.

In order to improve the performance on domains with irrelevant attributes, a variety of attribute extraction methods have appeared in the literature [1]-[5]. However, most of the proposed methods are based on heuristics and do not help with redundant attributes [3] (relevant attributes can be redundant). In [6] the notion of a basis, which is an irredundant set of relevant attributes, has been introduced, and a way of selecting all bases of training examples and a possible way to identify a basis of the target concept have been described. The experimental results in [6] show that the proposed bases selection procedures were successful. However, noise examples in the training data worsened the performance of the bases detection.

In contrast to attribute extraction, example selection is a less investigated area. Most of the proposed methods (edited nearest neighbor, tree pruning, etc.) embed the selection process within the learning algorithm [7]-[9]. This paper is concerned with the so-called filter approach, which is less common in the machine learning literature. The filter approach can be described as the explicit detection and elimination of noisy data in data preprocessing. The basic idea of most explicit filtering methods [10] is the n-fold cross-validation approach, which is based on the nearest neighbor method. Experimental results provide evidence that such filters can successfully deal with noisy data. However, these methods use the hypothesis that proximity in the data space generally expresses membership of the same class, and therefore a data set which does not satisfy this condition cannot be treated by such an approach.

Noise detection and elimination is more intensively studied in other computer science fields, such as signal and image processing. Here, two basic filter approaches exist: spatial domain filtering and frequency domain filtering. The filtering methods proposed in machine learning resemble spatial domain filtering, since they operate on the instance domain. In general, spatial domain filtering methods require many parameters and threshold values that must be set empirically, and therefore frequency domain filtering methods such as low pass filters are more common in signal and image processing.

In this paper we propose a new filter method which resembles frequency domain filtering in signal and image processing. The approach is based on the basis notion proposed in [6], and is designed to improve the accuracy of bases detection when noise examples are present in the training data. Experiments show the effectiveness of the proposed example selection method, which in combination with bases selection can lead to a large reduction of the training set without sacrificing the classification accuracy.

R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 79–89, 2002. © Springer-Verlag Berlin Heidelberg 2002
2 Bases
In this paper, we consider the problem of automatically inferring a target concept $c$ defined over $X$ from a given training set of examples $D$. Following standard practice, we identify each domain instance with a finite vector of attributes $\tilde{x} = (x_1, \ldots, x_n)$. First, we consider binary valued attributes. Let $X_n = \{0,1\}^n$ be the set of all domain instances. The learner is trying to learn a concept $c$, $c : X \to \{0,1\}$, where $X \subseteq X_n$. The goal of the learner is to find a hypothesis $h$ such that $h(\tilde{x}) = c(\tilde{x})$ for all $\tilde{x}$ from $X$. For a set of attributes $F$, let $\tilde{x}|_F$ denote $\tilde{x}$ with all the attributes not in $F$ eliminated. Similarly, let $D|_F$ be the data set $D$ with each $\tilde{x} \in D$ replaced by $\tilde{x}|_F$.

In the machine learning literature, there are a number of different definitions of what it means for an attribute to be relevant. However, analyzing these definitions it was shown [1] that we need to consider only two degrees of relevance, weak and strong. Let $h$ denote either $c$ or $D$.

Definition 1 An attribute $x_i$ is strongly relevant to $h(\tilde{x})$ if there exist examples $\tilde{a}$ and $\tilde{b}$ in $X$ that differ only in their assignment to $x_i$ and $h(\tilde{a}) \neq h(\tilde{b})$.
Another way of stating this definition is that the attribute $x_i$ is strongly relevant if there exists some example for which the removal of $x_i$ alone affects the classification given by $h$.

Definition 2 An attribute $x_i$ is weakly relevant to $h(\tilde{x})$ if it is possible to remove a subset of attributes so that $x_i$ becomes strongly relevant to $h$.

Features that are weakly relevant may or may not be important to keep, depending on which other attributes are ignored. We say that an attribute is relevant if it is either weakly relevant or strongly relevant, and is irrelevant otherwise. These notions of relevance are useful from the viewpoint of a learning algorithm attempting to decide which attributes to keep and which to ignore. The above definitions are concerned with the relevance of one attribute. Since any learner has to decide which set of attributes to use (relevant attributes could be correlated), [6] introduces additional definitions related to sets of relevant attributes.

Definition 3 A set of relevant attributes is complete for $h$ if $h$ can be expressed unambiguously by this set of attributes.

Definition 4 A complete set of attributes is irredundant if the removal of any attribute from this set makes the remaining attributes not complete.

We call a set of irredundant attributes a basis.

Example Let us consider the instance space $(x_1, x_2, x_3, x_4)$ such that $x_2$ is negatively correlated with $x_3$. The target concept $c(x_1, x_2, x_3, x_4) = x_1 \,\&\, x_2$ contains eight instances and represents a partial boolean function. In this case, the attribute $x_1$ is strongly relevant, the attributes $x_2$ and $x_3$ are weakly relevant, the attribute $x_4$ is irrelevant, the sets $\{x_1, x_2, x_3, x_4\}$, $\{x_1, x_2, x_3\}$, $\{x_1, x_2\}$ and $\{x_1, x_3\}$ are complete, and only $\{x_1, x_2\}$ and $\{x_1, x_3\}$ are bases.

To succeed in learning we need to find a complete set of attributes, and preferably a basis of the target concept.
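The definitions and the example above can be checked mechanically on small data. The following is a brute-force Python sketch, not the PTD1.1 algorithm; the function names are hypothetical, attributes are indexed from 0 (so index 0 is the first attribute), and "negatively correlated" is read here as the third attribute being the complement of the second.

```python
from itertools import combinations, product


def is_complete(data, attrs):
    """attrs is complete if no two examples agree on attrs but differ in label."""
    seen = {}
    for x, label in data:
        key = tuple(x[i] for i in attrs)
        if seen.setdefault(key, label) != label:
            return False
    return True


def find_bases(data, n):
    """All irredundant complete sets, found by enumerating subsets by size."""
    found = []
    for size in range(n + 1):
        for attrs in combinations(range(n), size):
            # Supersets of an already found basis are complete but redundant.
            if is_complete(data, attrs) and not any(set(b) <= set(attrs) for b in found):
                found.append(attrs)
    return found


def strongly_relevant(data, i):
    """True if two examples differ only in attribute i and in their labels."""
    return any(
        la != lb and all((a[j] != b[j]) == (j == i) for j in range(len(a)))
        for (a, la), (b, lb) in combinations(data, 2)
    )


# Eight instances: third attribute is the complement of the second, label is x1 & x2.
data = [((x1, x2, 1 - x2, x4), x1 & x2) for x1, x2, x4 in product((0, 1), repeat=3)]
print(find_bases(data, 4))
print([strongly_relevant(data, i) for i in range(4)])
```

The enumeration is exponential in the number of attributes, so it is only meant to make the definitions concrete on toy data.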
Bases of the training data can be detected by deterministic procedures (see [6] for more details). In our experiments we employ the deterministic asymptotically optimal basis detection algorithm (PTD1.1) developed by the author and available at http://lacom2.ice.ous.ac.jp/lash/rel.html. Since the target concept can have many bases, ranking them by some criteria will be helpful. The number of attributes in the basis, which we define as the length of the basis, is one important characteristic. If the length of the basis is short, the number of possible samples needed to describe a concept is small. Therefore, the detection of a short basis for a concept should be the preference of the attribute selection algorithm. The number of different patterns in $D|_F$, where $F$ is a basis, is a very important parameter for successful learning. The more patterns we have in $D|_F$, the more we know about the target concept. The length of a basis and the number of patterns the training set generates on it can be combined to estimate the usefulness of that particular basis. Denote by $\hat{T}$ the set of all bases of the training set, and by $\hat{T}_{x_i}$, $i = 1, \ldots, n$, the set of all bases from $\hat{T}$ containing the $i$th attribute. We define an info vector as $\tilde{p} = (p_1, \ldots, p_n)$, where $p_i = |\hat{T}_{x_i}| / |\hat{T}|$ is an info weight [11]. It is easy to show that if $p_i = 1$ then $x_i$ is a strongly relevant attribute, and if $p_i = 0$ then $x_i$ is
an irrelevant attribute. The info weight can be considered as a measure of the attribute relevance. By calculating an info vector we can also estimate relevant attributes of the target concept. We can assume that the more bases contain the $i$th attribute, i.e. the higher the value of $p_i$, the more the $i$th attribute is relevant to the target concept. Because the construction of the set of all bases is time consuming, we use a set of short bases instead of $\hat{T}$. For detection of a basis of the target concept, we use the following simple heuristic algorithm originally proposed in [12]. First, the set $T$ containing bases of the training set is formed. Next, an info vector is calculated using $T$. Finally, by choosing attributes with the highest info weights, such that they form either a basis or a complete set of the training set, we construct a candidate basis for the target concept. The set $T$ can be constructed by using a deterministic asymptotically optimal prime test detection algorithm or by using a stochastic procedure. Selecting attributes by tests or an info vector can be considered as a filter approach to attribute selection. Other popular filter methods are sequential selection algorithms [2] and relief [3]. The main shortcoming of the conventional methods is that they do not select a basis, and do not help with redundant attributes [3].

Next let us consider many-valued attribute cases. A relevant attribute can be defined similarly as in the binary valued case. The only requirement is to have an inequality relation $R$ on the attribute space, such as the ordinary inequality $\neq$, or any other specific inequality. We define $R$ using a threshold $tr$ by $a R b$ iff $|a - b| > tr$. The threshold $tr = 0$ defines the ordinary inequality $\neq$, and $tr > 0$ can be used to introduce stronger relations.
An attribute $x_i$ is strongly relevant to $h(\tilde{x})$ if there exist examples $\tilde{a}$ and $\tilde{b}$ such that $|a_1 - b_1| \le tr_1, \ldots, |a_{i-1} - b_{i-1}| \le tr_{i-1}$, $|a_i - b_i| > tr_i$, $|a_{i+1} - b_{i+1}| \le tr_{i+1}, \ldots, |a_n - b_n| \le tr_n$, and $h(\tilde{a}) \neq h(\tilde{b})$. The definition of a basis for many-valued attribute cases remains unchanged.
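The info vector and the heuristic for a candidate basis described in this section can be sketched as follows. This is a hedged Python illustration, not the procedure of [12]: `info_vector`, `candidate_basis` and `is_complete` are hypothetical names, attributes are indexed from 0, ties between equal info weights are broken by attribute index (an assumption), and the two bases are those of the four-attribute example given earlier in this section.

```python
from itertools import product


def is_complete(data, attrs):
    """attrs is complete if no two examples agree on attrs but differ in label."""
    seen = {}
    for x, label in data:
        key = tuple(x[i] for i in attrs)
        if seen.setdefault(key, label) != label:
            return False
    return True


def info_vector(bases, n):
    """p_i = (number of bases containing attribute i) / (number of bases)."""
    return [sum(i in b for b in bases) / len(bases) for i in range(n)]


def candidate_basis(data, bases, n):
    """Add attributes in order of decreasing info weight until the set is complete."""
    p = info_vector(bases, n)
    chosen = []
    for i in sorted(range(n), key=lambda i: -p[i]):  # stable sort: ties by index
        chosen.append(i)
        if is_complete(data, chosen):
            break
    return sorted(chosen)


# Toy data: third attribute is the complement of the second, label is x1 & x2.
data = [((x1, x2, 1 - x2, x4), x1 & x2) for x1, x2, x4 in product((0, 1), repeat=3)]
bases = [(0, 1), (0, 2)]  # the two bases of this toy training set
print(info_vector(bases, 4))
print(candidate_basis(data, bases, 4))
```

On this toy data the first attribute gets weight 1 and the fourth gets weight 0, in line with the remark that these values mark strongly relevant and irrelevant attributes. The greedy loop may return a complete set that is not irredundant, which matches the "either a basis or a complete set" wording.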
3 Tolerating Noise Examples
Noise present in the training set can increase the number of false relevant attributes and degrade the generalization accuracy of the classifiers that use bases. Noise sensitivity motivates the need for noise tolerant extensions of the bases selection method. Elimination of noise examples is one of the most difficult problems in inductive machine learning. Several approaches to handling noisy data have been proposed. In contrast to the so-called noise tolerant procedures for noise handling, such as rule truncation and tree pruning [7], and embedded and wrapper methods [9], this section is concerned with the explicit detection and elimination of noisy examples in data preprocessing. We propose a new noise filtering method, which is designed to improve the accuracy of the bases detection when noise examples are present in the training data. Each training example affects the formation of a basis of the training
set. The proposed method eliminates noise by identifying examples that are non-typical in bases determination. This method is based on the filtering of the so-called pattern frequency domain of training examples. It bears some resemblance to frequency domain filtering in signal and image processing. Noise instances generate patterns with low frequency (as opposed to high frequency in signal and image processing), and therefore by preserving high pattern frequencies and suppressing low pattern frequencies, noise cleaning can be achieved. The term noise is used here as a synonym for outliers: it refers to incorrect examples as well as exceptions. For simplicity we consider a binary attribute space, although it is easy to extend the concept to a many-valued attribute space.

Suppose that a training set is formed from positive instances $P$ and negative instances $N$. Let us define a set of binary patterns $S = P \oplus N$, where $P \oplus N = \{\tilde{a} \oplus \tilde{b} \mid \tilde{a} \in P, \tilde{b} \in N\}$, $\tilde{a} \oplus \tilde{b} = (a_1 \oplus b_1, \ldots, a_n \oplus b_n)$, and $\oplus$ denotes XOR. The pattern frequency domain is a set of pairs $(\tilde{x}, m)$, where $\tilde{x} \in S$ and $m$ is an integer which represents the frequency of appearance of the pattern $\tilde{x}$ while calculating $P \oplus N$. Note that one pattern can be generated by many pairs of positive and negative samples. We define the length of a pattern as the number of 1s in it.

Theorem $\tau$ is a basis of $D$ iff the attributes from $\tau$ form an irreducible disjunction that is consistent with $S$.

Proof Suppose $\tau$ is a basis of $D$. From Definition 3 we have $P|_\tau \cap N|_\tau = \emptyset$ and hence $S|_\tau$ does not contain the pattern $(0, \ldots, 0)$. Therefore, the attributes from $\tau$ form a disjunction consistent with $S$, i.e., every pattern in $S$ has a 1 in at least one attribute of $\tau$. The irreducibility of the disjunction follows from Definition 4. Next, let us suppose that the attributes from $\tau$ form an irreducible disjunction that is consistent with $S$. Since $S|_\tau$ does not contain the $(0, \ldots, 0)$ pattern, $P|_\tau \cap N|_\tau = \emptyset$. Therefore $D$ can be expressed unambiguously by the set $\tau$, and $\tau$ is complete. The irredundancy of $\tau$ follows from its irreducibility. The theorem is thus proved.

Thus, the detection of bases is equivalent to the detection of irreducible disjunctions that are consistent with $S$, and therefore we can change our original attribute space to the pattern frequency domain without losing bases. This transformation is useful since patterns have frequency information. It is natural to consider patterns with low frequencies as noise patterns, and patterns with high frequencies as typical patterns. Suppose $x_1$ is a strongly relevant attribute. This means that there exists a pair of examples $\tilde{a}$, $\tilde{b}$ such that $D(\tilde{a}) \neq D(\tilde{b})$, and $\tilde{a}$ and $\tilde{b}$ differ only in their assignment to $x_1$. If noise results in the relevance of $x_1$, the number of different $\tilde{a}$, $\tilde{b}$ pairs is low, and all such pairs generate the same pattern $(1, 0, \ldots, 0)$. The low frequency of this pattern indicates that the relevance of $x_1$ should be treated as noise, while a high frequency shows that the relevance is real. Similarly, we can show that weak relevance (which is defined by means of strong relevance) caused by noisy examples results in low frequency patterns. The frequency information can be used for noise elimination from the training data. We define a high pass filter (hpf) as a process that eliminates examples that generate low frequency patterns.
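The pattern frequency domain and the theorem can be illustrated on the four-attribute example of Section 2. The Python sketch below is a brute-force illustration with hypothetical names (`pattern_frequencies`, `bases_from_patterns`) and 0-based attribute indices; it reads "consistent with S" as "every pattern in S has a 1 in at least one chosen attribute", which is the reading used in the proof.

```python
from collections import Counter
from itertools import combinations, product


def pattern_frequencies(pos, neg):
    """Map each XOR pattern of a (positive, negative) pair to its frequency."""
    return Counter(tuple(a ^ b for a, b in zip(p, q)) for p in pos for q in neg)


def bases_from_patterns(patterns, n):
    """Minimal attribute sets that cover a 1-position of every pattern."""
    found = []
    for size in range(n + 1):
        for attrs in combinations(range(n), size):
            covers_all = all(any(s[i] for i in attrs) for s in patterns)
            if covers_all and not any(set(b) <= set(attrs) for b in found):
                found.append(attrs)
    return found


# Third attribute is the complement of the second; the concept is x1 & x2.
instances = [(x1, x2, 1 - x2, x4) for x1, x2, x4 in product((0, 1), repeat=3)]
pos = [x for x in instances if x[0] & x[1]]
neg = [x for x in instances if not (x[0] & x[1])]

freq = pattern_frequencies(pos, neg)
for pattern, m in sorted(freq.items()):
    print(pattern, m)
print(bases_from_patterns(freq, 4))
```

The minimal covering sets found from the patterns coincide with the two bases listed in the example of Section 2, as the theorem predicts.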
Table 1. The pattern frequencies of the Monk3

pattern   frequency
110110    256
111111    255
....
000101    5
100000    5
101001    4
101000    4
000100    4
100001    2
Table 2. Recognition results of classifiers on the breast cancer and the heart disease databases

                        Breast cancer
$TFR_{\tilde{\tau}}$    100.0 ± 0.0
rejections (%)          35.0 ± 5.4
$TFR_T$                 97.8 ± 3.1
rejections (%)          14.0 ± 2.2
$kNN$                   96.7 ± 2.7
$kNN_{\tilde{\tau}}$    96.2 ± 2.3
$kNN_T$                 97.7 ± 2.5
$ID3$                   93.8 ± 2.0
$ID3_{\tilde{\tau}}$    93.1 ± 4.0
$ID3_T$                 94.6 ± 3.5
Next, let us concentrate on the patterns themselves. In general, the shorter the patterns we have in $S$, the longer the irreducible disjunction consistent with $S$. Shorter patterns give us less choice for variable selection for the consistent disjunction, since only a few variables cover them. We suppose that the elimination of noise examples, in contrast to the elimination of examples for which the target concept is correct, reduces the length of bases of the training set. We refer to the elimination of examples that produce short low frequency patterns as short pattern filtering and denote it as spf. The proposed filters have linear time complexity and are easy to implement. We demonstrate our novel filtering approach on artificial and real databases.
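A possible reading of hpf and spf is sketched below in Python. The details are assumptions: `pattern_filter` selects the k lowest frequency patterns (restricted to length at most `max_len` for spf, unrestricted for hpf) and removes both members of every pair that generates one of them; the paper does not spell out these choices, and this quadratic pairwise sketch makes no attempt to reproduce the stated linear time complexity. The toy data use the concept "first attribute = 1" with one mislabeled example.

```python
from collections import Counter


def xor(a, b):
    """Bitwise XOR of two binary attribute vectors."""
    return tuple(x ^ y for x, y in zip(a, b))


def pattern_filter(pos, neg, k, max_len=None):
    """Drop examples that generate one of the k lowest frequency patterns.

    max_len=None gives an hpf-style filter; a small max_len gives an spf-style one.
    """
    freq = Counter(xor(p, q) for p in pos for q in neg)
    candidates = [s for s in freq if max_len is None or sum(s) <= max_len]
    lowest = set(sorted(candidates, key=lambda s: (freq[s], s))[:k])
    bad_pos, bad_neg = set(), set()
    for p in pos:
        for q in neg:
            if xor(p, q) in lowest:   # assumption: both members of the pair go
                bad_pos.add(p)
                bad_neg.add(q)
    kept_pos = [p for p in pos if p not in bad_pos]
    kept_neg = [q for q in neg if q not in bad_neg]
    return kept_pos, kept_neg


# Clean concept: label = first attribute. (0, 1, 1) is a mislabeled positive.
pos = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1), (0, 1, 1)]
neg = [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
kept_pos, kept_neg = pattern_filter(pos, neg, k=2, max_len=1)
print(kept_pos)
print(kept_neg)
```

On this toy set the mislabeled example is removed together with two clean negatives, a small-scale analogue of losing some non-noise examples; with so few examples the loss is proportionally much larger than one would expect on a realistic training set.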
Table 3. Recognition results of classifiers with hpf and spf on the breast cancer database
storage (%) | TFR_τ̃ | rejections (%) | kNN_τ̃ | ID3_τ̃
In order to confirm the theoretical results of this paper, experiments were conducted on artificial and real data, and three inductive learning algorithms, kNN, ID3 [7] and TFR [11], were used. The artificial Monk3 database [7,13] has six attributes, and its target concept is given by: (attribute 5 = 3 and attribute 4 = 1) or (attribute 5 ≠ 4 and attribute 2 ≠ 3). The Monk3 database has training and testing sets containing 122 and 432 examples, respectively, and has 5% noise in the training set. This problem has three relevant attributes and three irrelevant ones, i.e. {x2, x4, x5} is the basis of the target concept. The bases detection algorithm (bases were defined using an ordinary inequality, i.e., all tr_i were taken equal to 0) applied to the Monk3 training data outputs only one basis, {x1, x2, x4, x5}. Incorporating this information in kNN, for example, improves its performance from an 82.4% recognition rate on all attributes to 84.3% on the detected basis. With ID3, a larger gain is obtained, with the recognition rate improving from 90.3% on all attributes to 94.4% using the detected basis. However, a few noise examples make attribute 1 relevant, and noise filtering is necessary to improve the classifiers' performance further. The pattern frequencies of the Monk3 database are shown in Table 1. By using hpf and deleting examples that generate the 5 lowest frequency patterns, we eliminate all noise examples from the Monk3 training set, although in this case hpf also causes the loss of 17 non-noise examples. By eliminating the four lowest frequency patterns with length no more than 2, spf removes all noise and loses only 14 non-noise examples. The bases detection algorithm applied to the filtered training data outputs two bases: {x2, x4, x5} and {x1, x2, x3, x5}. Our preference is the short basis {x2, x4, x5} (see Section 2), which exactly identifies all relevant attributes of the Monk3 database.
kNN gives a 94.4% recognition rate on this detected basis, and ID3 and TFR achieve a 100% recognition rate.

For real data, we used the well-known breast cancer database and the heart disease (Cleveland) database from the Machine Learning Database Repository at the University of California. We removed a few samples with missing attribute values from these databases, and in our experiments we used ten-fold cross-validation. Each database was divided into 10 partitions. Nine partitions (90% of the data) were used as a training set. The classification accuracy was evaluated using the remaining partition (10% of the data). Ten such trials were run, each using a different partition as the test set. The breast cancer database is composed of two classes in 9 dimensions. Bases were defined using an ordinary inequality, i.e. the values of tr_i were taken equal to 0. The heart disease database is composed of two classes in 13 attributes. Because this database has attributes on different scales, they were preprocessed by a normalization routine in which each attribute was centered and scaled to unit variance. The values of tr_i were taken equal to 0.6. These are the maximal tr_i values that still generate non-trivial bases.

In each trial, the bases detection algorithm was applied to the training data. First, the set T containing all bases with length l was constructed, where l denotes the minimal length of the detected bases. Then, by calculating an info vector from the set T, we constructed a candidate basis of the target concept τ of length d ≥ l, by taking the d attributes with the highest info weights, such that the selected attributes form either a basis or a complete set of a given training set. Table 2 shows the recognition results on the breast cancer and heart disease databases. The reported accuracies are the mean of the ten accuracies from ten-fold cross-validation. We also show the standard deviation of the mean. By kNN_τ, ID3_τ and TFR_τ we denote kNN, ID3 and TFR evaluated on the attributes defined by the proposed basis τ. The average length of τ was 4 and 12.6 in the cases of the breast cancer and heart disease databases, respectively. kNN_T, ID3_T and TFR_T are kNN, ID3 and TFR evaluated on the set of bases from T. The first line of Table 2 shows the results for TFR using the proposed basis. The second line shows the results for TFR using the set of all detected short prime tests. Although TFR_τ̃ rejects many samples, it achieves a 100% recognition rate. This means that the proposed candidate bases are real. Bases of the target concept were detected in all ten trials. In the case of the heart disease database, TFR rejects a large number of examples. Sets D|τ̃ of the heart disease database are poorly represented and cause many rejections. Another reason for rejections could be noise. Noise instances can increase the length of bases and cause degradation in the performance of the TFR classifier. The candidate basis detection algorithm could not improve the performance of plain kNN and ID3; however, the use of the set of short bases T tends to improve the performance of the plain classifiers.

Next, we evaluate classifiers on the training set filtered by the proposed hpf and spf. First, we consider the breast cancer database. Several experiments were conducted here. Our goal is to investigate the sensitivity of hpf and spf to different parameter values. We denote by hpf(k) a high pass filter that suppresses patterns with frequencies lower than or equal to k. We denote by spf(l, k) a short pattern filter that suppresses patterns of length less than or equal to l that have frequencies lower than or equal to k. In Table 3 we show the results of TFR_τ̃, kNN_τ̃, and ID3_τ̃ applied to the training data filtered by hpf and spf, where τ̃ is a proposed basis of the target concept. The first row, labeled "storage", shows the percentage of the original training set retained after the noise filtering and attribute selection. As we can see, hpf and spf can significantly reduce the training data. As we expected, hpf and spf eliminate outlier examples and the retained training data produces shorter bases. The length of the proposed basis dropped to 3. The shorter candidate basis in turn reduces the number of rejections of TFR. The numbers of errors and rejections for TFR in each trial are given in the rows labeled "errors" and "rejection", respectively. At high values of k, bigger reductions of the training data are achieved, but the results show that the success in bases detection drops and the number of TFR rejections increases. The degradation of TFR's performance for high values of k can be explained by the elimination of non-noise examples. On the other hand, with kNN_τ̃ and ID3_τ̃ a high performance is still obtained at high values of k, even though the training data has been reduced by almost 75%.

Table 4. Recognition results of classifiers with hpf and spf on the breast cancer database using stronger inequalities
storage (%) | TFR_τ̃ | rejections (%) | kNN_τ̃ | ID3_τ̃

Table 5. Recognition results of classifiers with spf(3, 15%) on the breast cancer database

           storage (%)   TFR_τ̃   rejections (%)   kNN_τ̃   ID3_τ̃
Trial 1    26.1          98.2     20.6             94.1     89.7
Trial 2    20.4          98.0     28.0             97.1     95.6
Trial 3    33.2          98.3     11.8             97.1     97.1
Trial 4    30.7          98.3     13.2             92.7     92.7
Trial 5    32.0          100      5.9              98.5     95.6
Trial 6    31.2          100      22.1             97.1     88.2
Trial 7    29.5          100      18.8             98.6     95.7
Trial 8    33.8          100      14.5             98.6     97.1
Trial 9    32.5          98.3     17.4             98.6     92.8
Trial 10   33.1          100      19.1             98.5     91.2
Average    30.2          99.1     17.2             97.1     93.6
Experiments show that, in contrast to hpf, reductions of training data by spf result in a better performance of classifiers. kNN_τ and ID3_τ give better results than their original implementations with no noise elimination. Noise elimination gives the possibility of inducing stronger relevancies on many-valued and real-valued attributes. The high tr_i values, which generate the trivial basis (the set of all attributes) or result in the equality of some positive and negative instances, can now be used. In all previous experiments on the breast cancer database, relevance was defined using ordinary inequality. In all cases of Tables 4 and 5, tr_i values were increased to 1. In Table 5 we show the recognition results of each trial using the spf filter. By spf(3, 15%) we denote a short pattern filter that suppresses 15% of low frequency patterns with length less than or equal to 3. In all cases of Table 4, TFR gained a reasonable improvement in reducing the number of rejections. Again, spf performed better than hpf, and in many cases resulted in improvement of the recognition rates of conventional classifiers. Both the recognition accuracy and the number of eliminated examples demonstrate that the proposed filters can be used as an effective noise handling mechanism.

Next, we consider the heart disease database. The pattern space of the heart disease database contains a huge number of patterns with frequency 1. Therefore, we decided to use only spf, since hpf results in the elimination of almost all samples. The recognition rates of classifiers are shown in Table 6. The results in the left part of Table 6 were obtained using the original threshold values tr_i = 0.6. The length of the proposed basis dropped to 7. The right part shows results using the increased threshold values tr_i = 0.8. In this case, the length of the proposed basis dropped to 6. In both cases, the classification accuracy of kNN and ID3 was similar to that obtained in the original experiments, but was achieved using a hugely reduced training set. Although spf slightly decreases the classification accuracy of TFR (on average, no more than 0.7 examples were misclassified in each case), the significant reduction of the training data and of the number of rejections outweighs the small loss in accuracy and confirms the practical utility of the proposed noise filter. For example, in the case of tr_i = 0.8, only 13.6% of the training data was retained, and the number of TFR rejections dropped by 36%.

Table 6. Recognition results of classifiers with spf on the heart disease database
tr | storage (%) | TFR_τ̃ | rejections (%) | kNN_τ̃ | ID3_τ̃
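The evaluation protocol used in these experiments (ten-fold cross-validation, with attributes centered and scaled to unit variance where scales differ) can be sketched as follows. This is only an illustration under stated assumptions: the function names and the `fit_predict` interface are hypothetical stand-ins for kNN, ID3 or TFR, the normalization statistics are computed on the training partition of each trial, and the reported spread is the sample standard deviation of the ten fold accuracies. None of these choices is specified in the text.

```python
import random
import statistics


def standardize(train, test):
    """Center each attribute and scale it to unit variance (training statistics only)."""
    cols = list(zip(*train))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant attributes

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

    return scale(train), scale(test)


def ten_fold_cv(X, y, fit_predict, folds=10, seed=0):
    """Return (mean accuracy in %, standard deviation) over the folds.

    fit_predict(train_X, train_y, test_X) -> list of predicted labels.
    This interface is an assumption made for the sketch.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::folds] for i in range(folds)]  # disjoint partitions
    accuracies = []
    for test_idx in parts:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        train_X, test_X = standardize(
            [X[i] for i in train_idx], [X[i] for i in test_idx]
        )
        predicted = fit_predict(train_X, [y[i] for i in train_idx], test_X)
        correct = sum(p == y[i] for p, i in zip(predicted, test_idx))
        accuracies.append(100.0 * correct / len(test_idx))
    return statistics.fmean(accuracies), statistics.stdev(accuracies)
```

Any classifier that can be wrapped in a `fit_predict` callable can be plugged in, so the same loop serves for plain classifiers and for classifiers restricted to a detected basis.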
5 Conclusions
In this paper, we addressed the problem of relevance in learning and indicated how relevant examples can be detected. We proposed a novel noise filtering method which resembles frequency domain filtering in signal and image processing. The proposed filters have linear time complexity and are easy to implement. Experiments showed that the proposed filters lead to a reasonable reduction of the training set without sacrificing classification accuracy, and in many cases improve the recognition performance of the classifiers tested.
References
1. R. Kohavi and J. John, Wrappers for subset selection. Artificial Intelligence 97, 273-324, Elsevier, 1997.
2. P. Devijver and J. Kittler, Pattern recognition: a statistical approach, Prentice Hall, 1982.
3. K. Kira and L. Rendell, A practical approach to feature selection. Proceedings of the 9th International Conference on Machine Learning, 249-256, 1992.
4. H. Almuallim and T. Dietterich, Learning with many irrelevant features. Proceedings of the 9th National Conference on Artificial Intelligence, 547-552, 1991.
5. J. Mao and A. Jain, Artificial neural networks for feature extraction and multivariate data projection, IEEE Trans. Neural Networks 6, 296-317, 1995.
6. G. V. Lashkia, Learning with only the relevant features, Proceedings of the IEEE International Conference on Syst., Man, and Cyber., 298-303, 2001.
7. J. Quinlan, C4.5: Programs for machine learning, Morgan Kaufmann, 1993.
8. D. Wilson and T. Martinez, Reduction techniques for instance-based learning algorithms, Machine Learning 38, 257-286, 2000.
9. A. Blum and P. Langley, Selection of relevant features and examples in machine learning, Artificial Intelligence 97, 245-272, Elsevier, 1997.
10. C. Brodley and M. Friedl, Identifying and eliminating mislabeled training instances, Proc. of the 13th National Conference on Artificial Intelligence, AAAI Press, 799-805, 1996.
11. G. V. Lashkia and S. Aleshin, Test feature classifiers: performance and applications, IEEE Trans. Syst., Man, Cybern. Vol. 31, No. 4, 643-650, 2001.
12. G. V. Lashkia, S. Kaneko and M. Okura, On high generalization ability of test feature classifiers, Trans. IEEJ, vol. 121-C, no. 8, 1347-1353, 2001.
13. S. B. Thrun, et al., The Monk's problems: A performance comparison of different learning algorithms. Technical Report CMU-CS-91-197, Carnegie Mellon University, Pittsburgh, 1991.
The Task Rehearsal Method of Life-Long Learning: Overcoming Impoverished Data
Daniel L. Silver¹ and Robert E. Mercer²
¹ Intelligent Information Technology Research Centre, Jodrey School of Computer Science, Acadia University, Wolfville, Nova Scotia, Canada B0P 1X0, [email protected]
² Cognitive Engineering Laboratory, Department of Computer Science, The University of Western Ontario, London, Ontario, Canada N6A 5B7, [email protected]
Abstract. The task rehearsal method (TRM) is introduced as an approach to life-long learning that uses the representation of previously learned tasks as a source of inductive bias. This inductive bias enables TRM to generate more accurate hypotheses for new tasks that have small sets of training examples. TRM has a knowledge retention phase during which the neural network representation of a successfully learned task is stored in a domain knowledge database, and a knowledge recall and learning phase during which virtual examples of stored tasks are generated from the domain knowledge. The virtual examples are rehearsed as secondary tasks in parallel with the learning of a new (primary) task using the ηMTL neural network algorithm, a variant of multiple task learning (MTL). The results of experiments on three domains show that TRM is effective in retaining task knowledge in a representational form and transferring that knowledge in the form of virtual examples. TRM with ηMTL is shown to develop more accurate hypotheses for tasks that suffer from impoverished training sets.
1 Introduction
One of the key aspects of human learning is that individuals face a sequence of learning problems over a lifetime. Humans take advantage of this by transferring knowledge from previously learned tasks to facilitate the learning of a new task. In contrast, the majority of machine learning research has focused on the single task learning (STL) approach where an hypothesis for a single task is induced from a set of training examples. Life-long learning, a new area of research, is concerned with the persistent and cumulative nature of learning [10]. Life-long learning considers situations in which a learner faces a series of different tasks and develops methods of retaining and using task knowledge to improve the effectiveness (more accurate hypotheses) and efficiency (shorter training times) of learning.
R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 90–101, 2002. © Springer-Verlag Berlin Heidelberg 2002
A challenge often faced by a life-long learning agent is a deficiency of training examples from which to develop accurate hypotheses. Machine learning theory tells us that this problem can be overcome with an appropriate inductive bias [5], one source being prior task knowledge [2]. Lacking a method of knowledge transfer [3,10] that distinguishes knowledge from related and unrelated tasks, we have developed one and applied it to life-long learning problems, such as learning a more accurate medical diagnostic model from a small sample of patient data [7]. Our research has focussed on two aspects of selective knowledge transfer : (1) a measure of task relatedness which is used to guide the selective transfer of previously learned knowledge when learning a new task and (2) the retention of learned task knowledge and its recall when learning a new task. In [8,9] we have developed ηMTL, a modified version of the multiple task learning (MTL) method of parallel functional transfer to provide a solution to the first problem of selective transfer. Using a measure of secondary task to primary task relatedness an ηMTL network can favourably bias the induction of a hypothesis for a primary task. This paper introduces the task rehearsal method (TRM) to solve the second problem of retention and recall of learned task knowledge. TRM uses either the standard MTL or the ηMTL learning algorithms as the method of knowledge transfer and inductive bias. Task rehearsal is so named because previously learned tasks are relearned or rehearsed in parallel with the learning of each new task. It is through the rehearsal of previously learned tasks that knowledge is transferred to the new task.
2 Background
The constraint on a learning system’s hypothesis space, beyond the criterion of consistency with the training examples, is called inductive bias [5]. Inductive bias is essential for the development of a hypothesis with good generalization from a practical number of examples. Ideally, a life-long learning system can select its inductive bias to tailor the preference for hypotheses according to the task being learned [11]. One type of inductive bias is knowledge of the task domain. The retention and use of domain knowledge as a source of inductive bias remains an unsolved problem in machine learning. We define knowledge-based inductive learning as a learning method that uses knowledge of the task domain as a source of inductive bias. The method relies on the transfer of knowledge from one or more secondary tasks, stored as neural network representations in a domain knowledge database, to a new primary task. The problem of selecting an appropriate bias becomes one of selecting the appropriate task knowledge for transfer. A multiple task learning (MTL) neural network is a learning system in which knowledge is transferred between tasks. An MTL network is a feed-forward multilayer network with an output for each task that is to be learned [2,3]. The standard back-propagation of error learning algorithm is used to train all tasks in parallel. Consequently, MTL training examples are composed of a set of input
attributes as well as a target output for each task. The sharing of the internal representation (weights of connections) within the network is the method by which inductive bias occurs within an MTL network. This is a powerful method of knowledge transfer. For example, a two output MTL network can learn the logical XOR and not-XOR functions because they share a common internal representation. By comparison, it is not possible to learn these two tasks within the same single task learning (STL) network because their examples would conflict. To optimize the transfer of knowledge within an MTL network, the secondary tasks should be as closely related to the primary task as possible, else negative inductive bias can result in a less accurate hypothesis. The ηMTL algorithm overcomes this problem with a separate learning rate, ηk , for each task output Tk [9]. ηk varies as a measure of relatedness between secondary task Tk and the primary task. In [7] we define task relatedness and develop and test various static, dynamic and hybrid measures of relatedness. In this paper we present the results of a hybrid measure that is composed of a static measure based on the linear correlation of training set target values and a dynamic measure based on the mutual information of hidden node features with respect to training set target values.
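As a rough illustration of the static part of such a measure, the sketch below scales a base learning rate by the absolute linear correlation between the primary and secondary target values of the training set. The scaling rule, the function names and the default base rate are assumptions made for this example; the dynamic, mutual-information component of the hybrid measure is not modelled.

```python
import math


def pearson(xs, ys):
    """Linear correlation of two equally long sequences (0.0 if undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def task_learning_rates(primary_targets, secondary_targets, base_rate=0.1):
    """Return [eta_0, eta_1, ..., eta_t].

    eta_0 (primary task) is the base rate; each eta_k is the base rate
    scaled by |r_k|, the absolute correlation with the primary targets.
    This scaling is an illustrative assumption, not the published measure.
    """
    rates = [base_rate]
    for targets in secondary_targets:
        rates.append(base_rate * abs(pearson(primary_targets, targets)))
    return rates
```

Under this rule a secondary task whose targets are the exact complement of the primary targets, as with XOR and not-XOR, receives the full base rate, while an uncorrelated task is effectively switched off.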
3 Sequential Learning through Task Rehearsal
3.1 The Task Rehearsal Method
The rehearsal of task examples was first proposed in [6] to solve the problem of catastrophic forgetting [4]. The method overcomes the problem of retaining specific training examples of early learning by using the representation of existing models to generate virtual examples that can be relearned along with new examples. We have extended this concept to learning sequences of tasks. TRM, presented diagrammatically in Figure 1, is a knowledge based inductive learning system that relies upon the rehearsal of previously learned tasks when learning a new task. After a task Tk has been learned, its representation is retained in domain knowledge for use during future learning. When learning a new task T0 , the domain knowledge for tasks T1 ...Tk ...Tt can be used to generate a set of virtual training examples for relearning the tasks in parallel within an ηMTL network. The virtual examples can be considered a form of hint [1] that are used to transfer knowledge in a functional manner from T1 ...Tk ...Tt to T0 . The Networks. Two sets of feed-forward neural networks interact during two different phases of operation to produce a knowledge based inductive learning system. The set of single output feed-forward networks labeled domain knowledge is the long-term storage area for the representation of tasks which have been successfully learned. A task is considered successfully learned when a hypothesis for that task has been developed to meet a minimum level of generalization error on an independent test set. The retention of representations of previously learned tasks eliminates the need to retain specific training examples for each task. Domain knowledge representations provide a more flexible source of functional
knowledge for a new task because they are a means of generating a virtual example for any set of input attribute values. Transfer of knowledge within TRM happens at a functional level, at the time of learning a new task, and is specific to the training examples for that task. It is the relationship between the functions of the various tasks and not the relationship between their representations which is important. Consequently, there is no need to consolidate the representations of the domain knowledge networks. In fact, the architectures of the stored networks can be completely different from one another. The fundamental requirement of domain knowledge is an ability to store and retrieve the representations of induced hypotheses and to use these representations to generate accurate virtual examples of the original tasks. The inductive learning system of TRM is the ηMTL back-propagation network. This network can be considered a short-term memory area for learning the new task while transferring knowledge from the domain knowledge networks. The architecture of the ηMTL network must contain enough representation (number of hidden nodes) to develop a sufficiently accurate hypothesis for at least the primary task and potentially for all secondary tasks. There is no requirement that the architecture of the ηMTL network be the same for each new task that is learned. There is an important modification to the back-propagation algorithm used in TRM. The algorithm must be able to accept an impoverished set of training examples for the primary task augmented with additional virtual examples for the secondary tasks. The additional virtual examples are important for relearning accurate hypotheses for the secondary tasks. This poses a problem because the
corresponding target classification values for the primary task are not known. The solution is a modification to the back-propagation algorithm that allows training examples to be marked with an unknown target value for the primary task. The ηMTL algorithm recognizes an unknown target value and automatically considers the primary task error for that example to be zero. Consequently, for such an example, only the error of the secondary tasks affects the development of the neural network's internal representation.
Phases of Operation. The task rehearsal method has two phases of operation. The knowledge recall and training phase concerns the learning of a new task in the ηMTL network. The operation of the network proceeds as if actual training examples were available for all tasks. Each primary task (T0) training example provides n input attributes and a target class value which is accepted by the ηMTL network. The target values for secondary tasks T1...Tk...Tt are, of course, not part of this T0 training data. They must be generated by feeding the n input attributes into the domain knowledge networks. The resulting secondary task target values complete the training examples for the ηMTL network. Because training focuses on the generation of accurate hypotheses for the primary task, there is no need to generate validation or test set virtual examples for the secondary tasks. The domain knowledge networks output continuous values; therefore, the virtual target values will range between 0 and 1. It is a simple matter to convert these to a strict 0 or 1 class identifier based on a cut-off such as 0.5, if so desired; however, it is beneficial to consider leaving the virtual target as a continuous value. Continuous target values will more accurately convey the function of the domain knowledge networks and they provide the means by which dichotomous classification tasks may transfer knowledge from related continuous valued tasks.
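The recall step and the handling of unknown primary targets can be sketched in a few lines of Python. The sketch assumes that each stored domain knowledge network is available as a callable returning a value between 0 and 1, and it uses `None` as a hypothetical marker for an unknown primary target; the function names are illustrative and do not come from the prototype.

```python
UNKNOWN = None  # hypothetical marker for a missing primary-task target


def virtual_targets(inputs, domain_networks, cutoff=None):
    """Feed the input attributes through each stored network to get secondary targets.

    domain_networks: callables mapping an input vector to a value in [0, 1].
    cutoff: if given (for example 0.5), outputs are converted to strict 0/1 labels;
            by default the continuous values are kept.
    """
    outputs = [net(inputs) for net in domain_networks]
    if cutoff is None:
        return outputs
    return [1.0 if o >= cutoff else 0.0 for o in outputs]


def build_example(inputs, primary_target, domain_networks):
    """One training example: inputs plus targets for T0, T1, ..., Tt."""
    return inputs, [primary_target] + virtual_targets(inputs, domain_networks)


def output_errors(targets, predictions):
    """Per-task output error, forced to zero where the target is unknown."""
    return [0.0 if t is UNKNOWN else t - p for t, p in zip(targets, predictions)]
```

An example built with `primary_target=UNKNOWN` therefore contributes only secondary task error to the weight updates, which is the behaviour described above.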
The domain knowledge update phase follows the successful learning of a new task. If the hypothesis for the primary task is able to classify a test set of examples with accuracy above a user specified level, the task is considered successfully learned. The representation of the primary task within the ηMTL network is saved in domain knowledge: the network architecture (number of nodes in each layer) and the weights of the connections between the nodes are recorded.
3.2 Benefits and Consequences of TRM
The benefits of task rehearsal are related to the transfer of knowledge through the use of virtual training examples.
– TRM provides an efficient storage of training examples. The hypothesis representations stored in domain knowledge implicitly retain the information from the training examples in a compressed form.
– TRM provides considerable freedom in the choice of training examples for a new task because there is no need to match specific examples of previously learned tasks. TRM will automatically generate matching virtual target values for the secondary tasks from the input attributes of new task examples. The ability to utilize all available training examples for a new task is a benefit for any life-long learning system.
– The source of inductive bias under TRM is the set of virtual examples generated from domain knowledge for relearning the secondary tasks [3]. When the number of primary task training examples is small, TRM can generate additional virtual examples through the use of primary examples that are marked with the unknown target value. These additional virtual examples can be selected by way of random sampling or by an ordering over the input attribute space.
There are also negative consequences that come with the transfer of knowledge through virtual examples.
– The generation of accurate virtual examples from domain knowledge is essential to TRM because they are the means by which knowledge is transferred from a secondary task to the primary hypothesis, and only accurate virtual examples can ensure that accurate knowledge is transferred. In [7] we examine how the generation of accurate virtual examples depends upon the accuracy of the domain knowledge hypotheses. The more accurate the hypothesis (as recorded at the time of its learning), the more accurate the virtual examples. In addition, accurate virtual examples depend on the choice of input attribute values used to generate the examples. Input attribute values should have the same range of values and level of resolution as those originally used to train, validate and test the domain knowledge hypotheses. Further research is needed into the best choice of attribute values for each new task.
– Generating virtual examples for TRM learning requires computational time and space resources. Two options are available. The target values for the secondary tasks can be computed on-line during learning at the cost of significantly increased learning time. Alternatively, the target values can be generated in batch before learning begins at the cost of additional memory during learning. Our current implementation uses the latter approach. The reader may consider that this consequence conflicts with the benefit of efficient storage of training examples. However, because each training example has matching input attribute values, the additional memory cost is only for the storage of the target values of the secondary tasks. Also recall that there is no need to generate virtual examples for the validation or test sets of data.
3.3 The Prototype Software
A prototype software system has been developed that implements the TRM shown in Figure 1. The system uses enhanced back-propagation ANN software that is capable of single task learning (STL), MTL or ηMTL. The ANN architecture embedded in the software is the standard feed-forward type. The system employs a batch method of back-propagation of error that utilizes a momentum term in the weight update equation to speed the convergence of the network. A
save-best weights method is used to save the representation of the network at minimum error on the validation set. The prototype system uses a sequence table to control the order in which tasks will be learned. For each task in the table, the software moves through the two phases of operation described above. Before learning a new primary task, the examples for the primary task are used to generate the virtual examples for all secondary tasks. A domain knowledge table contains the names of previously learned secondary task representations. If so desired, the table can be populated with names of task representations learned during previous runs of the system. Once a minimum validation error hypothesis has been developed, the TRM software must determine whether the hypothesis is sufficiently accurate to be stored within domain knowledge. The criterion for an accurate hypothesis is that it falls below a user-specified level of error on a set of independent test data. If the test error level is met, the hypothesis is accepted as accurate, the hypothesis representation is stored, and the task name is added to the domain knowledge table. If the test error level is not met, the hypothesis is rejected and no representation or record of the task is kept. A record of the task's name in the domain knowledge table ensures that the associated hypothesis will be considered during the learning of future tasks.
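The control flow of the sequence table and the acceptance test can be summarized by the following sketch. It is not the prototype's code: `learn` and `test_error` are hypothetical callables standing in for ηMTL training with task rehearsal and for evaluation on the independent test set, and the default threshold of 0.35 simply mirrors the maximum misclassification rate used in the experiments of Section 4.

```python
def run_sequence(sequence, learn, test_error, max_test_error=0.35):
    """Learn tasks in order, keeping only hypotheses that pass the test-error check.

    sequence: list of task names (the sequence table).
    learn(task, domain_knowledge) -> hypothesis; hypothetical training routine
        that may rehearse the stored tasks while learning the new one.
    test_error(task, hypothesis) -> misclassification rate on independent test data.
    """
    domain_knowledge = {}  # task name -> stored hypothesis representation
    for task in sequence:
        # Knowledge recall and training phase.
        hypothesis = learn(task, dict(domain_knowledge))
        # Domain knowledge update phase: store only sufficiently accurate hypotheses.
        if test_error(task, hypothesis) <= max_test_error:
            domain_knowledge[task] = hypothesis
    return domain_knowledge
```

A rejected task leaves no entry in the dictionary, so it is never rehearsed when later tasks are learned.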
4 Empirical Studies
4.1 The Domains Studied
TRM has been tested with three different task domains (details can be found in [7]). The Band domain consists of seven synthetic tasks. Each task is a differently oriented band of positive examples across a 2-dimensional input space. The tasks were synthesized so that the primary task, T0 , would vary in its relatedness to the other tasks. A preliminary study showed that T4 , T5 and T6 were the tasks more related to T0 because when individually learned in parallel with T0 they consistently resulted in the most accurate hypotheses for T0 . The Logic domain consists of eight synthetic tasks. Each positive example is defined by a logical combination of 4 of the 11 input variables of the form, T0 : (A > 0.5 ∨ B > 0.5) ∧ (C > 0.5 ∨ D > 0.5). Tasks T1 , T2 and T3 are more related to T0 with T2 being the most related. The Band and Logic domains have been designed so that all tasks are non-linearly separable; each task requires the use of at least two hidden nodes of a neural network to form an accurate hypothesis. The coronary artery disease (CAD) domain contains three real medical diagnostic tasks and four synthesized tasks. Data for the real tasks (clev, hung and vamc) were extracted from the heart disease database in the UCI machine learning repository. Because of the relatively high degree of relatedness between these tasks, data for four additional tasks (A, B, C, and D) that vary in their relatedness to the real tasks were synthesized based on our knowledge of general rules for predicting CAD. Predicting disease for the vamc task is of primary interest.
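To make the form of the Logic domain tasks concrete, the sketch below labels input vectors according to a task of the form (A > 0.5 ∨ B > 0.5) ∧ (C > 0.5 ∨ D > 0.5). Which four of the 11 inputs play the roles of A, B, C and D for each task, and how the inputs were sampled, are not stated here; the index arguments and the uniform sampling are assumptions made purely for illustration.

```python
import random


def logic_task(a, b, c, d):
    """Labelling function for (x[a] > 0.5 or x[b] > 0.5) and (x[c] > 0.5 or x[d] > 0.5)."""
    def label(x):
        return int((x[a] > 0.5 or x[b] > 0.5) and (x[c] > 0.5 or x[d] > 0.5))
    return label


def sample_inputs(n, n_inputs=11, seed=0):
    """Draw n input vectors uniformly from [0, 1)^n_inputs (sampling scheme assumed)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_inputs)] for _ in range(n)]
```

For instance, `logic_task(0, 1, 2, 3)` uses the first four inputs; two tasks that share more of their input variables would be expected to be more closely related.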
Table 1 summarizes the size of the data sets used for training, validating, and testing each task of each domain under study. The tasks are presented in the order in which they were sequentially learned using TRM. Note that each training set is augmented with additional examples with unknown target values for the primary task. The number of additional examples varied for each task so as to ensure there were at least 50, 50 and 100 training examples for the Band, Logic and CAD domains, respectively. These additional examples are used by TRM to generate virtual examples for rehearsing the secondary tasks (see Section 3.3).
Table 1. Training, validation, and test set sizes for each task for the three domains

Domain  Tasks and Size of Training Set                                    Val. Set  Test Set
Band    T1: 50, T6: 35, T2: 30, T5: 25, T4: 10, T3: 20, T0: 10            20        200
Logic   T7: 50, T6: 50, T5: 50, T4: 45, T3: 40, T2: 35, T1: 30, T0: 25    20        200
CAD     A: 123, B: 123, C: 123, D: 123, clev: 148, hung: 30, vamc: 10     6–64      75–96

4.2 Method
The tasks for each domain are learned in the left-to-right order presented in Table 1 using the TRM system with each of the inductive learning methods: STL, MTL, and ηMTL. Under STL, of course there is no inductive bias effect from domain knowledge. The neural networks used are all 3-layer architectures composed of an input layer of as many nodes as input attributes, a hidden layer of sufficient representation for all tasks in the domain and an output layer of as many nodes as there are tasks in the domain. A standard back-propagation learning approach is taken using the validation sets to prevent over-fit of the network to the training data. The test sets are used by TRM to determine if the hypotheses that are learned are sufficiently accurate to be saved in domain knowledge. The maximum misclassification rate allowed on a test data set is 35%, which means that the TRM will not save a hypothesis representation unless 65% of the test examples are properly classified. Analysis of the experimental results focuses on the accuracy of hypotheses developed for each task, particularly those at the end of the learning sequence within each domain. Table 1 shows the progressive impoverishment of the training sets for each task as one moves through the sequence of tasks for each domain. Our goal is to show that the TRM with ηMTL can overcome the impoverished training sets by selectively transferring knowledge from related tasks learned
earlier in the sequence and saved in domain knowledge. The performance of the TRM system under each learning method is based on the accuracy of hypotheses against their respective test sets. The mean number of misclassifications from repeated experiments is the measure of performance. The results shown below are based on 20 repetitions of sequential learning on each domain in which the random initial weights of the networks as well as the training, validation and test examples were resampled.

4.3 Results and Discussion
Figure 2 and Table 2 present the test results for hypotheses developed for each task for each domain in the order in which they were learned. The STL results can be used as a baseline for comparison with the TRM results that used either MTL or ηMTL learning. In Table 2 hypotheses developed under MTL and ηMTL with mean percent misclassifications significantly less than STL hypotheses are indicated in bold (95% confidence based on difference of means t-test). Hypotheses developed under ηMTL with mean percent misclassifications significantly less than MTL hypotheses are shown in parentheses. The very best results are, therefore, in both bold and parentheses.
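For readers who want to reproduce this kind of comparison, the following Python sketch computes a two-sample t statistic on per-run misclassification percentages. It rests on assumptions the text does not settle: a pooled-variance (equal variance) test is used, the critical value is supplied by the caller, and the numbers in the usage lines are toy values, not results from the paper. For two samples of 20 runs each, the two-sided 5% critical value with 38 degrees of freedom is roughly 2.02.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Two-sample t statistic for mean(a) - mean(b), using a pooled variance."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    standard_error = math.sqrt(pooled * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def significantly_lower(method, baseline, critical_t):
    """True if `method` has a significantly lower mean error than `baseline`."""
    return pooled_t_statistic(method, baseline) < -critical_t


# Toy values only, chosen so the arithmetic is easy to check by hand.
toy_method = [1.0, 2.0, 3.0]
toy_baseline = [4.0, 5.0, 6.0]
toy_t = pooled_t_statistic(toy_method, toy_baseline)
```

The function raises a division error if both samples have zero variance, which a fuller implementation would need to handle.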
Table 2. The mean percentage of misclassifications of test set examples by the hypotheses generated by the learning methods under TRM
Domain Band: STL, MTL, ηMTL
Domain Logic: STL, MTL, ηMTL
Domain CAD: STL, MTL, ηMTL
The results indicate that hypotheses developed under STL for tasks that have large numbers of training examples (typically the first four or five tasks) performed as well as or better than hypotheses developed under TRM. Those hypotheses developed under TRM using MTL as the learning method had misclassification rates that were at times significantly higher than that of STL hypotheses. The arbitrary transfer of domain knowledge can have a detrimental effect on learning, particularly when sufficient training examples are available.
(Figure 2 panels: (a) Band Domain, (b) Logic Domain, (c) CAD Domain. Each panel plots the mean percentage of misclassifications for STL, MTL and ηMTL over the tasks in the order learned.)
Fig. 2. Performance results from sequential learning on the three task domains. Shown is the mean percentage of misclassifications by hypotheses generated by each learning method for each task. The results are presented in the order that the tasks were learned
This is most evident in the case of learning the Logic domain task T3, where 40 training examples convey sufficient information to develop relatively accurate hypotheses under STL. Negative inductive bias from unrelated secondary tasks results in MTL hypotheses for T3 having significantly higher error as compared with STL hypotheses. Inductive bias from secondary hypotheses will always have an effect on the internal representation developed within the network. The challenge for a knowledge based inductive learning system is to filter out negative bias for the primary task. As can be seen from Table 2, TRM using ηMTL makes a significant improvement upon the MTL results. The effect of negative inductive bias from unrelated tasks is mitigated by control over the individual learning rates for each of the secondary tasks. The error rates for the first four or five tasks under ηMTL are therefore closer to those of STL. The training data for the final two tasks of each domain are dramatically impoverished as compared to that of the first task in each learning sequence (see Table 1). STL has difficulty developing accurate hypotheses because the training data for these tasks provides so little information. The TRM with ηMTL augments the impoverished data with inductive bias from domain knowledge, resulting in more accurate hypotheses for the tasks. The measure of relatedness reflected in each ηk is able to effect a selective transfer of knowledge from previously learned tasks. The TRM with MTL does not fare as well because a mix of positive and negative inductive bias occurs from all of the domain knowledge tasks.
5 Summary and Conclusions
This paper has presented a life-long inductive learning method called task rehearsal that is able to retain task knowledge and use that knowledge to bias the induction of hypotheses for future tasks. The results of repeated experiments on three task domains demonstrate that TRM with ηMTL produces hypotheses of significantly greater accuracy than either STL or TRM with MTL for tasks with impoverished training data. The success can be attributed to the functional knowledge within the virtual examples generated by the TRM and to the effective use of that knowledge through ηMTL’s ability to select the more related secondary tasks. In a similar manner, the TRM with ηMTL is able to mitigate but not eliminate the effect of the negative inductive bias on tasks that have sufficient training examples. Further work is required so as to limit such ill effects. The space requirements of TRM scale linearly with the number of tasks and the representation of the primary hypothesis. Because TRM uses the backpropagation algorithm its time complexity is O(W^3), where W is the number of weights in the ηMTL network. The number of weights in the network can be reduced by eliminating secondary tasks that do not meet a predefined level of task relatedness. The ability of TRM to generate accurate virtual examples from domain knowledge is essential. Only through the transfer of accurate knowledge from
a related secondary task is the performance of the primary hypothesis increased. The value of a virtual example can be measured by the difference in the mean performance of primary hypotheses developed with and without that example. Supplemental experiments [7] have shown the value of more accurate virtual examples and the incremental value of additional virtual examples when developing hypotheses for related tasks. This agrees with the reasonable expectation that the effort spent on accurately learning tasks early in life will benefit the learner later in life.
References
1. Yaser S. Abu-Mostafa. Hints. Neural Computation, 7:639–671, 1995.
2. Jonathan Baxter. Learning internal representations. Proceedings of the Eighth International Conference on Computational Learning Theory, 1995.
3. Richard A. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
4. Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: the sequential learning problem. The Psychology of Learning and Motivation, 24:109–165, 1989.
5. Tom M. Mitchell. Machine Learning. McGraw Hill, New York, NY, 1997.
6. Anthony V. Robins. Catastrophic forgetting, rehearsal, and pseudorehearsal. Connection Science, 7:123–146, 1995.
7. Daniel L. Silver. Selective Transfer of Neural Network Task Knowledge. PhD Thesis, Dept. of Computer Science, University of Western Ontario, London, Canada, June 2000.
8. Daniel L. Silver and Robert E. Mercer. The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. Connection Science Special Issue: Transfer in Inductive Systems, 8(2):277–294, 1996.
9. Daniel L. Silver and Robert E. Mercer. Selective functional transfer: Inductive bias from related tasks. Proceedings of the International Conference on Artificial Intelligence and Soft Computing (ASC2001), pages 182–189, 2001.
10. Sebastian Thrun. Lifelong learning algorithms. Learning to Learn, pages 181–209, 1997.
11. Paul E. Utgoff. Machine Learning of Inductive Bias. Kluwer Academic Publisher, Boston, MA, 1986.
Recycling the Cycle of Perception: A Polemic
Extended Talk Abstract
Alan Mackworth
Laboratory for Computational Intelligence, Department of Computer Science, University of British Columbia, Vancouver, B.C., Canada V6T 1Z4
[email protected]
http://www.cs.ubc.ca/spider/mack
R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 102–103, 2002. © Springer-Verlag Berlin Heidelberg 2002

The Buddhist Middle Way acknowledges, but avoids, the polarities of asceticism and attachment. Analogously, in computational intelligence, we navigate amongst unipolar views on opposing sides of various apparent dichotomies. Suppose we ask, “What characterizes intelligence?” We might answer with one, or more, of the following nine views. An intelligent agent is:
Proactive: An agent achieves goals, implicit or explicit. Its behaviour is teleological, planned and future-oriented.
Reactive: An agent perceives and reacts to environmental change. Its behaviour is causal and past-determined.
Model-based: It uses models of the world to guide its perception and behaviour.
Learning-oriented: It acquires new behaviours and new models.
Rational: It reasons, solves problems and uses tools.
Social: It collaborates, cooperates, commits and competes with other agents.
Linguistic: It communicates and coordinates using language.
Situated: It is embedded or situated in a world to which it is coupled. It is particular not universal.
Constraint-based: It satisfies and optimizes multiple external and internal constraints.
Each of these aspects represents only a single perspective on intelligence, just as in the Buddhist legend of the elephant encountered by several hunters, each perceiving it idiosyncratically. No single aspect alone is an adequate, or sufficient, characterization of intelligence. Many of the endless controversies in AI, computational vision and robotics come from clashes between single-minded commitments to one of these views. An unexamined theory is not worth believing. Elsewhere, I have characterized the clash between the proactive and the reactive views of agents as the war between GOFAIR and Insect AI. A related clash in vision opposes a top-down, model-based approach with a bottom-up, model-free approach. Both of these clashes exemplify the dangers of extremism in the pursuit of agent theories. In each case we need a theory of the middle way that supports a clean union of both approaches. Otherwise, we’ll continue
to see the oscillation from one pole to the other that characterizes much of our scientific history. The thesis of this informal talk is that a useful way of understanding the evolution of theories of intelligence is an attempt to unify the proactive and reactive views. Consider the schema theories of Kant, Helmholtz, Bartlett, Piaget and Minsky. Piaget’s cycle of assimilation and accommodation aims at integration. So do Neisser’s Perceptual Cycle and Mackworth’s Cycle of Perception, in different ways. The Expectation-Maximization (EM) algorithm involves two complementary phases. The E phase fits evidence to hypotheses and the M phase fits hypotheses to evidence. In logic-based approaches to, say, diagnosis, deduction and abduction play analogous roles for symptoms and diseases. In Bayesian approaches, a Belief Net can determine posteriors on hypotheses, given observations of evidence and vice versa, integrating both. The Kalman filter allows for both uncertainty and dynamics: integrating uncertain evidence with a simple, uncertain, predictive model. These approaches all have in common a cyclic interaction, with mutual accommodation and co-arising, between models and evidence or between the agent and its environment. The obvious dilemma for any synthesis is that a coherent theory must have a single point of view. For example, Mackworth and Zhang’s Constraint-Based Agent theory is, surprisingly enough, Constraint-Based. At the same time it is motivated by the claim that Proactive and Reactive views are, together, the necessary and sufficient conditions for intelligent agency. It also, modestly, claims to subsume each of the Model-based, Learning-oriented, Rational, Social, Linguistic and Situated views in a single architecture. Needless to say, some of these claims are more substantiated than others, to date.
Generalized Arc Consistency with Application to MaxCSP
Michael C. Horsch1, William S. Havens2, and Aditya K. Ghose3
1 Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada S7N 5A9 [email protected]
2 Intelligent Systems Laboratory, School of Computing Science, Simon Fraser University, Burnaby, B.C., Canada V5A 1S6 [email protected]
3 Dept. of Information Systems, University of Wollongong, Wollongong NSW 2522 Australia [email protected]
Abstract. We present an abstract generalization of arc consistency which subsumes the definition of arc consistency in classical CSPs. Our generalization is based on the view of local consistency as a technique for approximation of marginal solutions. These approximations are intended for use as heuristics during search. We show that this generalization leads to useful application in classical CSPs as well as non-classical CSPs such as MaxCSP, and instances of the Semi-ring CSP formalism developed by Bistarelli et al. [2]. We demonstrate the application of the theory by developing a novel algorithm for use in solving MaxCSP.
R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 104–118, 2002. © Springer-Verlag Berlin Heidelberg 2002

In classical constraint satisfaction problems (CSPs), the purpose is to find an assignment of values to variables such that all constraints on these variables are satisfied. When preferences do exist in an application domain, the problem becomes an optimization problem, in which the object is to find an assignment that maximizes the preference measure. For example, if a classical CSP is overconstrained, (i.e., no satisficing solution exists), it may be desirable to maximize the number of satisfied constraints (MaxCSP). Other “non-classical” constraint problems include weighted CSPs, probabilistic CSPs, fuzzy CSPs, etc., in which the objective is to find an assignment that optimizes the total weight, probability, rough membership, or other global property. CSPs are typically solved by a combination of consistency enforcing algorithms, and search [10,12,8,2,13]. Consistency algorithms simplify a CSP to an equivalent CSP (i.e., they have the same set of solutions), so that search algorithms require less time to find solutions. Consistency can also be viewed as a heuristic, distinguishing values (or tuples) which do not appear in a solution, from those that may (or may not) appear in a solution. A marginal solution of a CSP is the projection of the set of solutions to a CSP onto the combined domain of a subset of the variables. When exact marginal
solutions are known, there is enough information to eliminate backtracking; otherwise, some search will be required. An arc consistency algorithm can be seen as a way to approximate marginal solutions, for each individual variable. The better the approximation, the fewer backtracks caused during search. The approximation of marginal solutions, in the form of marginal probabilities for constraint problems, has been shown to be a surprisingly useful heuristic [5]. The algorithm for computing these probabilities is essentially arc consistency. It has also been shown that arc consistency algorithms are essentially equivalent to algorithms for computing marginal probabilities in Bayesian networks [6]. The difference between the use of local methods in these two representations lies in the algebra. This paper presents the generalization of these methods by “factoring out” the algebraic operations. It can be instantiated for classical and non-classical CSPs, by specifying algebraic operations suitable to the instance. Our work is related to consistency algorithms for soft constraint problems [2,13], but differs in several key ways. First, we are not computing equivalent problems, but rather, we focus on computing heuristics to be used dynamically during search. Second, we do not restrict our attention to algebras that have idempotent operators. We call our approach “xAC.” The features of the approach are as follows: local information is used to determine global properties; inference is based on operations defined by the optimization problem; exact results are attainable in special cases, where sub-problem independence holds; in applications that violate the assumption of sub-problem independence, the local operations can be applied iteratively, to generate approximate marginal solutions. In this paper, our approach is demonstrated by developing an algorithm, called MaxAC, for use in solving MaxCSP problems. 
This algorithm is novel, in the sense that it propagates information about the number of constraints which are satisfiable (as opposed to propagating lower bounds). In the next sections, we present the approach in detail, starting with a formal description of terminology and notation. The heart of this paper is Theorem 2, which defines a set of equations that generalize arc consistency in binary CSPs. Section 3 shows how arc consistency and probabilistic arc consistency are instances of the approach. In Section 4 we show how the technique can be used to derive heuristics for MaxCSP, which we evaluate empirically. In Section 6 we summarize the approach and point to future work.
1 Semiring CSPs
In this section, we review the semi-ring constraint satisfaction problem (SCSP) notation of [2]. The SCSP framework provides a common language for many kinds of soft constraint problems, including classical, fuzzy, valued, etc. A semiring is a tuple $\langle A, \times, +, 0, 1 \rangle$, in which $A$ is a set, and $0, 1 \in A$. The operator $\times$ is a closed associative operator over pairs of elements of $A$, with $1$ as its unit element, and $0$ as its absorbing element. The operator $+$ is closed, commutative and associative, with unit element $0$. Note that $0, 1, +, \times$ do not necessarily
denote the usual meanings of these symbols in arithmetic. A c-semiring is a semiring in which $\times$ is commutative, $+$ is idempotent (i.e., $a + a = a$), and $1$ is the absorbing element of $+$. In the SCSP framework, we have a finite set of variables $V$, each taking values from a common domain $D$. Classically, a constraint is defined as a relation on tuples of values from $D$. Equivalently, such a relation can be expressed as a function on tuples, that returns $\top$ (i.e., true) if the tuple is in the relation, and $\bot$ (i.e., false) if it is not. The SCSP framework generalizes this idea by defining a constraint as a pair $\langle def, con \rangle$, where $con \subseteq V$ is the type of the constraint, and $def : D^{|con|} \to A$ is the constraint function, which indicates the degree to which a tuple of the appropriate type satisfies the constraint. As a departure from the notational conventions of [2], we will sometimes write $c.def$ and $c.con$ to refer to the components of a constraint $c$.
Let $con$ be a set of variables, and $t \in D^{|con|}$. Let $con' \subset con$, and let $t' \in D^{|con'|}$. The projection of $t$ onto $con'$ is the set of values in $t$ for the variables in $con'$, and is written $t' = t \downarrow^{con}_{con'}$. If $c_1$ and $c_2$ are two constraints, then define $c = c_1 \otimes c_2$ where $c.con = c_1.con \cup c_2.con$, and
$$c.def(t) = c_1.def(t \downarrow^{c.con}_{c_1.con}) \times c_2.def(t \downarrow^{c.con}_{c_2.con})$$
Note that $\times$ is the multiplicative operator for the semiring. The projection of constraint $c$ onto $I \subseteq V$ is a constraint written $c' = c \Downarrow_I$ where $c'.con = c.con \cap I$ and
$$c'.def(t') = \sum_{t \models t'} c.def(t)$$
Here we use the operator $\sum$ as the prefix expression for the additive semiring operator $+$. Also, we have taken another slight departure from the notation of [2]: for two tuples $t \in D^{|con|}$, $t' \in D^{|con'|}$, $t \models t'$ whenever $t \downarrow^{con}_{con'} = t'$. Thus the sum is over all the tuples $t$ that project into $t'$. The solution of a SCSP is also a generalization of a classical CSP concept. In the SCSP framework, a SCSP problem is a tuple $\langle C, con \rangle$ where $C$ is a set of constraints, and $con$ is a set of variables of interest. The solution is a constraint defined by $Sol(P) = (\bigotimes_{c \in C} c) \Downarrow_{con}$. In this paper, the variables of interest will be the union of the types of all constraints in $C$.
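The combination and projection operators can be made concrete with a small Python sketch. It is an illustration only: the names `Semiring`, `Constraint`, `combine`, `project` and `solution`, the table-based representation of $def$, and the two-valued example problem are assumptions made here, not part of the SCSP framework of [2]. The example uses integer multiplication and addition, so the projected values count solutions.

```python
import operator
from dataclasses import dataclass
from functools import reduce
from itertools import product
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    times: Callable[[Any, Any], Any]
    plus: Callable[[Any, Any], Any]
    zero: Any
    one: Any


@dataclass(frozen=True)
class Constraint:
    con: tuple   # ordered variable names (the type of the constraint)
    table: dict  # maps every tuple in D^|con| to a semiring value (the def)


def combine(s, c1, c2, domain):
    """c1 (x) c2: the type is the union; each value is the times-product of the parts."""
    con = tuple(dict.fromkeys(c1.con + c2.con))  # ordered union of the two types
    table = {}
    for t in product(domain, repeat=len(con)):
        value_of = dict(zip(con, t))
        t1 = tuple(value_of[v] for v in c1.con)  # t projected onto c1.con
        t2 = tuple(value_of[v] for v in c2.con)  # t projected onto c2.con
        table[t] = s.times(c1.table[t1], c2.table[t2])
    return Constraint(con, table)


def project(s, c, keep):
    """Projection onto `keep`: plus-sum c.def(t) over all t that project onto t'."""
    con = tuple(v for v in c.con if v in keep)
    table = {}
    for t, value in c.table.items():
        value_of = dict(zip(c.con, t))
        t_prime = tuple(value_of[v] for v in con)
        table[t_prime] = s.plus(table.get(t_prime, s.zero), value)
    return Constraint(con, table)


def solution(s, constraints, domain, of_interest):
    """Sol(P): combine every constraint, then project onto the variables of interest."""
    combined = reduce(lambda a, b: combine(s, a, b, domain), constraints)
    return project(s, combined, of_interest)


# Toy problem: X != Y and Y != Z over the domain {0, 1}, counting solutions.
counting = Semiring(operator.mul, operator.add, 0, 1)
D = (0, 1)
x_ne_y = Constraint(("X", "Y"), {(a, b): int(a != b) for a, b in product(D, D)})
y_ne_z = Constraint(("Y", "Z"), {(a, b): int(a != b) for a, b in product(D, D)})
marginal_x = solution(counting, [x_ne_y, y_ne_z], D, {"X"})
```

Swapping in `Semiring(operator.and_, operator.or_, False, True)` with boolean-valued tables would give the classical reading, in which the projection says whether some solution extends each value.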
2 xAC: A Framework for Generalized Arc Consistency
The results in this section are presented in the language of the Semiring CSP framework as presented above. For classical CSPs, even though complete solutions could be found by computing the relational join of all the constraints, i.e., $\bigotimes_{c \in C} c$, search is the technique of choice for reasons of space efficiency, and the possibility of using heuristics to avoid many useless inferences. The idea behind the following definition is to project the solution of the SCSP onto a single variable.
Definition 1. Let $P = \langle C, con \rangle$ be a semiring constraint problem with constraints $C$ and variables $con$. The marginal solution of $P$ on variables $I \subseteq con$, written $M_I(P)$, is defined as follows:
$$M_I(P) = \Big(\bigotimes_{c \in C} c\Big) \Downarrow_I$$
When $I$ contains a single variable $X$, we write $M_X = (\bigotimes_{c \in C} c) \Downarrow_{\{X\}}$.
This definition is not an efficient means for computing the marginal solution, since it is derived from the solution of the constraint problem. It is well known that when a CSP has a tree-structured constraint graph (no two nodes have more than one path between them), directed arc consistency can be used to construct a solution without backtracking [3]. In this case, arc consistency computes exact marginal solutions. The following definition and theorem reformulates the idea in terms of marginal solutions. Note that we also generalize slightly, by using the idea of “separability.”
Definition 2. A constraint problem $P$ is separable if there exist $P_1, \ldots, P_m$, such that for every $i \in \{1, \ldots, m\}$, $P_i \subset P$ and a variable $X \in P.con$ such that for every pair of constraints $c_i \in C_i$ and $c_j \in C_j$, $c_i.con \cap c_j.con = \emptyset$, $i \neq j$, and $con_X \cap con_i = N_i$, where $N_i$ is the set of variables in $P_i$ that share a constraint with $X$. The variable $X$ separates $P$ into subproblems $P_1, \ldots, P_m$.
Theorem 1. Let $P_X$ be the set of constraints whose type includes variable $X$, i.e., $P_X = \{c \in C \mid X \in c.con\}$. If $X$ separates $P$ into subproblems $P_1, \ldots, P_m$, then
$$M_X(P) = \bigotimes_{i=1}^{m} \big(C_{XN_i} \otimes M_{N_i}(P_i)\big) \Downarrow_X$$
($C_{XN_i} \in P_X$ is the constraint involving $X$ and its neighbours $N_i$ in $P_i$). ✷
The theorem says that if the problem is separable at $X$, then we can compute the marginal on $X$ if we know the marginals on $X$’s neighbours. There is nothing surprising about this result, since it expresses ideas such as “divide-and-conquer” and “independence” which are general problem solving techniques. However, it provides a basis for local consistency in SCSPs [2,1]. Furthermore, while the theorem seems rather restricted, applicable to separable CSPs only, marginal solutions can be approximated iteratively on a much wider class of CSP. The proof of the theorem is straightforward. It relies on the commutativity and associativity of the semiring operations $+$ and $\times$, but no other properties. It does not require the idempotence of $+$ and an absorbing element $1$ of $+$, as in [2]. If the theorem holds for every variable in a binary SCSP, then the problem has the tree-structured property. In this case, marginal solutions can be computed exactly by purely local computations, and also in polynomial time, as in the following theorem.
Theorem 2. Let $P$ be a binary SCSP constraint problem, defined over the c-semiring $\langle A, \times, +, 0, 1 \rangle$ with a tree-structured constraint graph. For any variable $X$ in $P$, having constraints $C_{XY_i} = \langle def_i, \{X, Y_i\} \rangle$, $i = 1, \ldots, m$ (the $Y_i$ are the neighbours of $X$ in the constraint graph), define the following:
$$S^{(k)}_{XY_i} = \big(C_{XY_i} \otimes P^{(k)}_{XY_i}\big) \Downarrow_X$$
$$M^{(k)}_X = \bigotimes_{i} S^{(k)}_{XY_i}$$
$$P^{(k+1)}_{Y_i X} = \bigotimes_{j \neq i} S^{(k)}_{XY_j}$$
such that
$$P^{(0)}_{Y_i X}.con = \{X\}$$
$$P^{(0)}_{Y_i X}.def(x) = 1$$
If the constraint graph has diameter $d$, then
$$M_X(P) = M^{(d)}_X$$
✷
The theorem defines the xAC equations, and establishes that they compute exact marginal solutions in certain cases. The notation $S^{(k)}_{XY}$ can be interpreted as the support of $X$ from $Y$ in the $k$th iteration. Likewise, $P^{(k)}_{XY}$ is the message propagated to $X$ from $Y$ on the $k$th iteration, and $M^{(k)}_X$ is the $k$th approximation of the marginal solution. Figure 1 gives a view of the theorem in terms of parameters and messages super-imposed on a fragment of a constraint graph. The proof is not difficult, and it is only a very slight generalization of well-known results in constraint reasoning and probabilistic reasoning. In particular, [3] gives the result for solution counting in tree-structured classical CSPs; in [6], we show how these equations are a special case of belief updating in singly-connected Bayesian networks [7]. The purpose of this paper is not to improve upon these existing results, but rather to express the result in the language of SCSPs, and show how the idea can be used in soft constraint problems. While the requirement that the CSP be tree-structured seems very strong, we point out that several instances of this approach can be successfully applied to CSPs which are not tree-structured; AC-1 [10] is one instance, and probabilistic arc consistency [6] is another. We discuss these applications in the next section.
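As a rough illustration of the update equations in Theorem 2, the following Python sketch instantiates them with integer multiplication and addition, so that the marginals count solutions, and checks the result on a small tree-structured problem against brute-force enumeration. The function names, the dictionary-based representation of constraints and messages, and the example problem are assumptions made for this sketch and are not taken from the paper.

```python
from itertools import product
from math import prod


def xac_counting(domains, constraints, iterations):
    """Run the xAC updates with integer * and + and return M after `iterations` steps.

    domains: dict variable -> list of values
    constraints: dict (X, Y) -> function(x, y) returning 0 or 1, one entry per edge
    """
    # Make every constraint available in both directions.
    c = {}
    for (x_var, y_var), f in constraints.items():
        c[(x_var, y_var)] = f
        c[(y_var, x_var)] = lambda y, x, f=f: f(x, y)
    neighbours = {v: [w for (u, w) in c if u == v] for v in domains}

    # p[(X, Y)] is the message propagated to X from Y, a table over D_Y; P^(0) = 1.
    p = {(to, frm): {y: 1 for y in domains[frm]} for (to, frm) in c}

    m = {}
    for _ in range(iterations + 1):
        # s[(X, Y)][x] = sum over y of C_XY(x, y) * P_XY(y): support of X from Y.
        s = {(x_var, y_var): {x: sum(c[(x_var, y_var)](x, y) * p[(x_var, y_var)][y]
                                     for y in domains[y_var])
                              for x in domains[x_var]}
             for (x_var, y_var) in c}
        # m[X][x] = product over neighbours Y of s[(X, Y)][x].
        m = {x_var: {x: prod(s[(x_var, y_var)][x] for y_var in neighbours[x_var])
                     for x in domains[x_var]}
             for x_var in domains}
        # Next message to Y_i from X: product over the other neighbours Y_j of X.
        p = {(y_i, x_var): {x: prod(s[(x_var, y_j)][x]
                                    for y_j in neighbours[x_var] if y_j != y_i)
                            for x in domains[x_var]}
             for (y_i, x_var) in c}
    return m


def brute_force_counts(domains, constraints, var):
    """Count, for each value of `var`, the complete assignments satisfying everything."""
    names = list(domains)
    counts = {v: 0 for v in domains[var]}
    for values in product(*(domains[n] for n in names)):
        a = dict(zip(names, values))
        if all(f(a[x], a[y]) for (x, y), f in constraints.items()):
            counts[a[var]] += 1
    return counts


# Toy chain X - Y - Z with X < Y and Y < Z; the constraint graph has diameter 2.
chain_domains = {"X": [0, 1, 2], "Y": [0, 1, 2], "Z": [0, 1, 2]}
less_than = lambda a, b: int(a < b)
chain_constraints = {("X", "Y"): less_than, ("Y", "Z"): less_than}
chain_marginals = xac_counting(chain_domains, chain_constraints, iterations=2)
```

On this chain the only solution is (0, 1, 2), and the computed marginals agree with `brute_force_counts` for every variable.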
3 Current Applications of xAC
In this section, we discuss applications of Theorem 1 that were developed prior to xAC. These applications target classical constraint problems, with a set V of variables, and a set C of constraints. Each variable X ∈ V has a domain DX , and we will assume that constraints are binary and represented as relation CXY ⊂ DX × DY (here, × is used to denote cross product of sets).
Fig. 1. The xAC parameters, imposed over a constraint graph. Each node $X$ has parameters $M_X$, $C_{XY_i}$ and $S_{XY_i}$. Message $P_{XY_i}$ is received by node $X$ and $P_{Y_i X}$ is received by neighbour $Y_i$ at time step $k$

3.1 Arc Consistency
For classical CSPs, the technique of arc consistency is intended to remove elements from the domain of variables if these elements, on the basis of purely local information, are known not to occur in a solution [10]. The SCSP instantiation for classical binary CSPs is $\langle \{\top, \bot\}, \wedge, \vee, \bot, \top \rangle$, i.e., the language of boolean arithmetic [2]. Applying Theorem 2 to this structure, we get the following.
$$C_{XY_i}.def : D_X \times D_{Y_i} \to \{\top, \bot\}$$
$$S^{(k)}_{XY_i}.def(x) = \bigvee_{y \in D_{Y_i}} C_{XY_i}.def(x, y) \wedge P^{(k)}_{XY_i}.def(y)$$
$$M^{(k)}_X.def(x) = \bigwedge_{i=1}^{m} S^{(k)}_{XY_i}.def(x)$$
$$P^{(k+1)}_{Y_i X}.def(x) = \bigwedge_{j \neq i} S^{(k)}_{XY_j}.def(x)$$
$$P^{(0)}_{Y_i X}.def(x) = \top$$
Note the definition for $S^{(k)}_{XY}.def(x)$, which encodes the definition of arc consistency for the single arc $(X, Y)$. The equation for $M^{(k)}_X.def(x)$ determines if a value $x \in D_X$ should be eliminated from the domain of $X$: if no neighbour $Y$ supports the value, then the conjunction is false, and the $k$th projection onto $X$ is $\bot$.
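The boolean instantiation can be exercised on a small problem that is not tree-structured. The Python sketch below applies the four equations with `any` and `all` playing the roles of $\vee$ and $\wedge$, and repeats the update until the messages stop changing. That stopping rule, the function name `boolean_xac`, and the example triangle problem are assumptions of this sketch; it is not claimed to be the AC-1 procedure of [10].

```python
def boolean_xac(domains, constraints):
    """Iterate the boolean xAC equations to a fixed point and return the pruned domains.

    domains: dict variable -> list of values
    constraints: dict (X, Y) -> function(x, y) returning True or False, one per pair
    """
    # Make every constraint available in both directions.
    c = {}
    for (x_var, y_var), rel in constraints.items():
        c[(x_var, y_var)] = rel
        c[(y_var, x_var)] = lambda y, x, rel=rel: rel(x, y)
    neighbours = {v: [w for (u, w) in c if u == v] for v in domains}

    # p[(X, Y)] is the message to X from Y, a table over D_Y; initially all true.
    p = {(to, frm): {v: True for v in domains[frm]} for (to, frm) in c}

    while True:
        # s[(X, Y)][x]: does some y allowed by the message support x on arc (X, Y)?
        s = {(x_var, y_var): {x: any(c[(x_var, y_var)](x, y) and p[(x_var, y_var)][y]
                                     for y in domains[y_var])
                              for x in domains[x_var]}
             for (x_var, y_var) in c}
        # Message to Y_i from X: conjunction of the supports from X's other neighbours.
        new_p = {(y_i, x_var): {x: all(s[(x_var, y_j)][x]
                                       for y_j in neighbours[x_var] if y_j != y_i)
                                for x in domains[x_var]}
                 for (y_i, x_var) in c}
        if new_p == p:  # messages only ever switch from true to false, so this ends
            break
        p = new_p

    # M_X(x) is the conjunction of all supports; keep the values where it is true.
    return {x_var: [x for x in domains[x_var]
                    if all(s[(x_var, y_var)][x] for y_var in neighbours[x_var])]
            for x_var in domains}


# Toy triangle: X < Y, Y < Z and X != Z over {1, 2, 3}.
triangle_domains = {"X": [1, 2, 3], "Y": [1, 2, 3], "Z": [1, 2, 3]}
triangle_constraints = {
    ("X", "Y"): lambda x, y: x < y,
    ("Y", "Z"): lambda y, z: y < z,
    ("X", "Z"): lambda x, z: x != z,
}
pruned = boolean_xac(triangle_domains, triangle_constraints)
```

On this example the surviving values are 1 for X, 2 for Y and 3 for Z, which is also the unique solution of the toy problem.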
3.2 Solution Counting
This is the problem of counting the number of solutions to a classical CSP. An efficient method is known for tree-structured CSPs [3]. This method is equivalent to the xAC equations on the semiring structure $\langle \{0, \ldots, N\}, \times, +, 0, N \rangle$, where $N = |D^n|$, and $n$ is the number of constraints in the problem, and $\times$ and $+$ represent integer multiplication and addition, respectively. Theorem 2 can be applied, resulting in:
$$C_{XY_i}.def : D_X \times D_{Y_i} \to \{0, \ldots, N\}$$
$$S^{(k)}_{XY_i}.def(x) = \sum_{y \in D_{Y_i}} C_{XY_i}.def(x, y) \times P^{(k)}_{XY_i}.def(y)$$
$$M^{(k)}_X.def(x) = \prod_{i=1}^{m} S^{(k)}_{XY_i}.def(x)$$
$$P^{(k+1)}_{Y_i X}.def(x) = \prod_{j \neq i} S^{(k)}_{XY_j}.def(x)$$
$$P^{(0)}_{Y_i X}.def(x) = 1$$

3.3 Probabilistic Arc Consistency
Probabilistic arc consistency (pAC) is a technique used to approximate solution probabilities for a classical CSP [6]. A solution probability for a variable is a frequency distribution over its values, representing the proportion of the number of solutions that make use of that value. The use of this approximation as a dynamic variable ordering heuristic has been found to reduce backtracking in random CSPs by as many as two orders of magnitude [5]. The equations for pAC can be derived from the xAC equations for solution counting with the addition of a normalization constant in the definition of $M^{(k)}_X.def$. In the pAC equations that follow, $\times$, $+$ are multiplication and addition defined on reals.
$$C_{XY_i}.def : D_X \times D_{Y_i} \to [0, 1]$$
$$S^{(k)}_{XY_i}.def(x) = \sum_{y \in D_{Y_i}} C_{XY_i}.def(x, y) \times P^{(k)}_{XY_i}.def(y)$$
$$M^{(k)}_X.def(x) = \alpha \prod_{i=1}^{m} S^{(k)}_{XY_i}.def(x)$$
$$P^{(k+1)}_{Y_i X}.def(x) = \prod_{j \neq i} S^{(k)}_{XY_j}.def(x)$$
$$P^{(0)}_{Y_i X}.def(x) = 1$$
The quantity $\alpha$ in the definition of $M^{(k)}_X.def$ is a normalization constant that ensures that each marginal distribution sums to 1. It does not come from Theorem 2, but is due to the assumption that the marginal can be determined by
considering the constraints independently. This assumption is called “causal independence” when made in Bayesian networks, for example.

3.4 Causal Independence in Bayesian Networks
A Bayesian network is a DAG in which the nodes are “random variables” and the arcs represent probabilistic dependence. Each node X in the graph has parents pa(X), except for “root” nodes, for whom pa(X) = ∅. The joint probability distribution encoded by the Bayesian network with a set of variables U = {X1 , . . . , Xn } is assumed to have a factorization in terms of prior and conditional probabilities. Each variable Xi has a set of possible values Di ; the probability of a tuple t = (x1 , . . . , xn ) ∈ D1 × . . . × Dn is given by P(t) =
n
U P(t ↓U Xi |t ↓pa(Xi ) )
i=1
(here, and for the remainder of this section, $\prod$ and $\sum$ are multiplication and addition defined on real numbers). For a variable $X_j$, the marginal probability of $x_j \in D_j$ is

\[
P(x_j) = \sum_{t \models x_j} P(t) = \sum_{t \models x_j} P\bigl(x_j \mid t{\downarrow}^{U}_{pa(X_j)}\bigr)\, P\bigl(t{\downarrow}^{U}_{pa(X_j)}\bigr)
\]
Here, the summation is over all tuples that contain $x_j$. Note that $P(x_j \mid t{\downarrow}^{U}_{pa(X_j)})$ is given information, and that $P(t{\downarrow}^{U}_{pa(X_j)})$ is a marginal probability on a tuple from the parents of $X_j$. The reader is invited to compare this equation with Theorem 1. If we assume that the parents of $X_j$ are conditionally independent, then we can write

\[
P\bigl(t{\downarrow}^{U}_{pa(X_j)}\bigr) = \prod_{X \in pa(X_j)} P\bigl(t{\downarrow}^{U}_{X}\bigr)
\]
This last assumption is equivalent to the assumption that the Bayesian network is tree structured (or singly-connected). Finally, we consider the very special case that the conditional probability distribution of a variable $X_j$, given its parents, can be represented as a product of simpler distributions:

\[
P\bigl(x_j \mid t{\downarrow}^{U}_{pa(X_j)}\bigr) = \alpha \prod_{X \in pa(X_j)} P\bigl(x_j \mid t{\downarrow}^{U}_{X}\bigr)
\]

The α is a normalization constant, which ensures that

\[
\sum_{x \in D_{X_j}} P\bigl(x \mid t{\downarrow}^{U}_{pa(X_j)}\bigr) = 1
\]
This assumption is sometimes called “causal independence.” It may seem an unrealistic assumption for most probabilistic models, but it is routinely made
for CSPs: each constraint on a variable acts on that variable independently of any other constraint on that variable. It is this assumption that results in the normalization constant for pAC, as mentioned above. These assumptions allow us to express the marginal probability of $x_j \in D_j$:

\[
P(x_j) = \alpha \prod_{Y \in pa(X_j)} \sum_{y \in D_Y} P(x_j \mid y)\, P(y)
\]
In other words, when the assumptions hold, the marginal on $X_j$ can be determined from the marginals on its parents. This gives a top-down algorithm for determining marginals in the Bayesian network, which is equivalent to directed arc consistency [3]. The reader is invited to compare this equation with the pAC equation for $M_X^{(k)}.\mathit{def}$.
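As a minimal sketch of the marginal computation above, the following Python function multiplies, over the parents Y, the sums of P(x_j | y) P(y) and then normalizes by α. The pAC equation for the marginal on X has the same product-of-sums shape. The function name, the dictionary layout and the toy numbers are assumptions made for illustration; they do not come from the paper.

```python
def marginal_from_parents(domain, parents):
    """Return P(x) = alpha * prod_Y sum_y P(x | y) * P(y) for each x in domain.

    parents: list of (cond, prior) pairs, one per parent Y, where
      cond[(x, y)] is P(x | y) and prior[y] is the marginal P(y).
    """
    unnormalized = {}
    for x in domain:
        product = 1.0
        for cond, prior in parents:
            product *= sum(cond[(x, y)] * p_y for y, p_y in prior.items())
        unnormalized[x] = product
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all values have zero support")
    # Dividing by the total plays the role of the normalization constant alpha.
    return {x: v / total for x, v in unnormalized.items()}


# Toy example (hypothetical numbers): X has two parents with uniform marginals.
domain_x = ["a", "b"]
uniform = {0: 0.5, 1: 0.5}
cond_y1 = {("a", 0): 0.9, ("b", 0): 0.1, ("a", 1): 0.5, ("b", 1): 0.5}
cond_y2 = {("a", 0): 0.2, ("b", 0): 0.8, ("a", 1): 0.6, ("b", 1): 0.4}
marginal = marginal_from_parents(domain_x, [(cond_y1, uniform), (cond_y2, uniform)])
```

Without the final division the two products would be 0.28 and 0.18, which do not sum to 1; this is the reason the normalization constant is needed once the constraints (or parents) are treated independently.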
4 MaxCSP
In this section, we focus on the application of Theorem 2 to the partial CSP problem, also called MaxCSP. In this problem, we take an over-constrained classical CSP, and the goal is to find an assignment that maximizes the number of satisfied constraints. The usual approach involves branch and bound search, possibly with heuristics. In the following, we derive a local consistency algorithm for MaxCSP that can be used as a heuristic during search.

4.1 MaxAC: Constraint Propagation for MaxCSP
In MaxCSP, constraints are classical, but the quantity we are optimizing is the number of satisfied constraints. Thus, if there are n constraints in the problem, the semiring structure for MaxCSP is $\langle \{0, \ldots, n\}, +, \max, 0, n \rangle$. The multiplicative operator of the semiring, ×, is instantiated here as integer addition. In other words, if two constraints are combined, then any given tuple can satisfy either, both, or neither constraint, and the number of constraints satisfied is determined by simply adding $\mathit{def}(t)$ values. The additive operator of the semiring is max. The xAC equations for MaxCSP (the “MaxAC equations”) are as follows:

\[
C_{XY_i}.\mathit{def} : D_X \times D_{Y_i} \to \{0, 1\}
\]
Theorem 2 states that under the assumption of tree-structured constraint graphs, the quantity $M_X^{(k)}.\mathit{def}(x)$ represents the maximum number of constraints satisfied by any solution involving X = x. It is a simple matter to maximize for the best value in $D_X$. Furthermore, the computation of $M_X^{(k)}.\mathit{def}(x)$ can be done efficiently. Therefore, we have the following algorithm: for every variable in some order, we maximize the marginal, fix the variable to the maximizing value, and recompute the marginals. When all the variables have been assigned in this way, a MaxCSP solution has been found.

When the CSP is not tree-structured, the theorem does not apply. The equations, as Theorem 2 describes them, result in marginal solutions that increase without bound as k increases. This same problem occurs in solution counting (the integer precursor to pAC; see Section 3.2). The pAC marginals are constrained by normalization to be in the range [0, 1]. The convergence of pAC has been demonstrated empirically. Thus, we hypothesized that if the MaxAC equations were augmented by some means of limiting the range of the marginals, useful convergence might be obtained. We used the following method. Define $M_X^{(k)}.\mathit{def}$ so that the marginal distribution is represented in terms of the difference from the minimum value:

\[
M^{(k)}_{X}.\mathit{def}(x) = -\beta + \sum_{Y} S^{(k)}_{YX}.\mathit{def}(x)
\qquad \text{where} \qquad
\beta = \min_{x} \sum_{Y} S^{(k)}_{YX}.\mathit{def}(x)
\]
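The difference-from-minimum marginal can be sketched in Python as follows. The update rules in this sketch are an assumption: they are obtained by replacing the sum in the pAC equations with max and the product with integer addition, as the semiring instantiation of this section suggests, with messages starting at 0. The function name `maxac_marginals`, the data layout and the fixed iteration count are hypothetical.

```python
def maxac_marginals(domains, constraints, iterations=10):
    """Sketch of MaxAC-style propagation with difference-from-minimum marginals.

    domains:     {variable: list of values}
    constraints: {(X, Y): function(x, y) -> 0 or 1}, one entry per constrained pair
    """
    # Make every constraint available in both directions.
    score = {}
    for (x_var, y_var), allowed in constraints.items():
        score[(x_var, y_var)] = allowed
        score[(y_var, x_var)] = lambda y, x, f=allowed: f(x, y)
    neighbours = {v: [] for v in domains}
    for (x_var, y_var) in score:
        neighbours[x_var].append(y_var)

    # message[(Y, X)][y]: what Y reports to X about its value y; zero at k = 0.
    message = {(y, x): {val: 0 for val in domains[y]} for (x, y) in score}

    def support(x_var, y_var):
        # Assumed form: S(x) = max over y of ( C(x, y) + message from Y to X about y ).
        return {
            x: max(score[(x_var, y_var)](x, y) + message[(y_var, x_var)][y]
                   for y in domains[y_var])
            for x in domains[x_var]
        }

    for _ in range(iterations):
        supports = {(x, y): support(x, y) for (x, y) in score}
        # The message from X to Y leaves out the support that came from Y itself.
        message = {
            (x, y): {val: sum(supports[(x, z)][val] for z in neighbours[x] if z != y)
                     for val in domains[x]}
            for (x, y) in score
        }

    supports = {(x, y): support(x, y) for (x, y) in score}
    marginals = {}
    for x_var in domains:
        totals = {x: sum(supports[(x_var, y)][x] for y in neighbours[x_var])
                  for x in domains[x_var]}
        beta = min(totals.values())          # beta = min_x sum_Y S(x)
        marginals[x_var] = {x: t - beta for x, t in totals.items()}
    return marginals


# Toy tree-structured problem: B is constrained by A, C and D.
toy_domains = {"A": [0], "B": [0, 1], "C": [0], "D": [0]}
toy_constraints = {
    ("A", "B"): lambda a, b: int(a == b),
    ("C", "B"): lambda c, b: int(c == b),
    ("D", "B"): lambda d, b: int(d != b),
}
toy_marginals = maxac_marginals(toy_domains, toy_constraints)
```

In the toy problem, B = 0 satisfies two constraints and B = 1 satisfies one, so after subtracting β the marginal on B records only the difference between the two counts.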
This approach is motivated by the empirical observation that the count of satisfied constraints may increase with each iteration, but the difference between the counts may not change very much. In the following section we show that this very simple idea has merit as a heuristic, and it will be worthwhile to investigate more sophisticated ways to bound the range of the MaxAC equations.

4.2 Preliminary Evaluation
In this section, we briefly present some preliminary empirical evidence that suggests that the MaxAC equations as defined in the previous section can be informative. The quality of the information provided by the MaxAC equations is shown by a positive correlation of the approximate marginals with the exact marginals. The computation of exact values limits the size of the problems feasible for the comparison. The MaxAC equations were applied (using difference from minimum as the normalizing step) to random over-constrained CSPs. We used two samples of 150 problems of 10 variables and 5 values each. The first sample was computed using the “flawed” random CSP model as discussed in [4], for p1 ∈ [0.5, ..., 1.0] and p2 ∈ [0.5, ..., 0.9]. The second sample was constructed using a unique random CSP model, which is based on the “flawed” model, but in about one-third of the constraints, the disallowed pairs are chosen from a smaller subset of possible pairs. In effect this model puts clusters, or “holes,” of disallowed pairs into the constraint. This model was constructed to avoid “uniformity” of
typical random models. The second sample was generated with p1 ∈ [0.5, ..., 1.0] and p2 ∈ [0.5, ..., 0.9], as before. The comparison determined the correlation coefficient between the approximate marginal and the exact marginal for each variable in the CSP, and then averaged over all variables. The results are shown in Table 1. For the first sample, the correlation ranges from just below zero (indicating no correlation) to just below 1.0 (indicating nearly perfect correlation). For this sample, the median is 0.625, which indicates that for half of the problems, the MaxAC approximation is strongly correlated with the exact values. The second sample has a stronger correlation between the approximate and exact marginals, as indicated by the median correlation of 0.929. These results show that MaxAC approximations can be very good.

Table 1. Correlation between approximate marginals, computed by MaxAC, and the exact marginals for two sets of random over-constrained CSPs

            Min       Max     Median   Average   Std. Dev.
Sample 1    -0.0936   0.995   0.625    0.620     0.253
Sample 2     0.292    1.0     0.929    0.853     0.163
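The comparison described above can be sketched as follows: a Pearson correlation coefficient is computed per variable and the coefficients are averaged. How variables with constant marginals were treated is not stated in the text, so skipping them here is an assumption, as are the function names and the data layout.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, or None when either sequence is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant marginal
    return cov / sqrt(var_x * var_y)


def average_correlation(approx, exact):
    """approx, exact: {variable: {value: marginal}} over the same variables and values."""
    scores = []
    for var in exact:
        values = sorted(exact[var])
        r = pearson([approx[var][v] for v in values],
                    [exact[var][v] for v in values])
        if r is not None:          # assumption: undefined coefficients are skipped
            scores.append(r)
    return sum(scores) / len(scores) if scores else None
```

A usage example would pass the MaxAC marginals as `approx` and marginals obtained by exhaustive enumeration as `exact`, one call per problem instance.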
[Figure 2 plot: number of problems solved (#Solved, 0 to 20) against number of consistency checks (#CC, 0 to 1.8e+08), with curves for MaxAC, Unordered, and Random]

Fig. 2. A plot of the number of consistency checks required to find the MaxCSP assignment for flawed random CSPs, using various heuristics
The second part of our preliminary results made use of MaxAC as a heuristic during search. We constructed over-constrained random classical CSPs, and employed branch and bound search with MaxAC as a heuristic, without heuristic guidance (i.e., lexicographical ordering) and also with a random value ordering. The MaxAC marginals were used as a static value ordering heuristic (i.e., exploring values with high MaxAC counts first). As well, the MaxAC marginals were used to find a lower bound on the MaxCSP solution. This was done by constructing an assignment based on the MaxAC marginals in a greedy manner: choose the value that maximized the MaxAC marginal for each variable. The results are shown in Figures 2 and 3. We constructed 20 CSPs with 10 variables and 10 values with p1 = 0.8 and p2 = 0.8. We recorded the number of consistency checks required to find the MaxCSP assignment (the search carries on in vain trying to find a better solution). The graphs show how many of the problems were solved using a given number of consistency checks. Clearly, using MaxAC is superior to the uninformed strategies, and therefore MaxAC does provide information that is valuable during search. We note that this evidence is preliminary, and that comparison to other constraint propagation methods for MaxCSP (e.g., [9,13]) is necessary to establish the value of this scheme.
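The search setup described above can be sketched as a small branch and bound procedure in which the marginals supply both a static value ordering and an initial lower bound from a greedy assignment. This is an illustrative sketch only: the variable order, the simple upper bound and all names are assumptions, and consistency checks are not counted as they were in the experiments.

```python
def greedy_assignment(domains, marginals):
    """Pick, for each variable, the value with the largest marginal."""
    return {v: max(domains[v], key=lambda val: marginals[v][val]) for v in domains}


def count_satisfied(assignment, constraints):
    """Number of constraints whose two variables are assigned and compatible."""
    return sum(1 for (x, y), ok in constraints.items()
               if x in assignment and y in assignment
               and ok(assignment[x], assignment[y]))


def branch_and_bound(domains, constraints, marginals):
    variables = list(domains)
    best = greedy_assignment(domains, marginals)
    best_score = count_satisfied(best, constraints)   # lower bound from greedy choice

    def violated(assignment):
        return sum(1 for (x, y), ok in constraints.items()
                   if x in assignment and y in assignment
                   and not ok(assignment[x], assignment[y]))

    def search(index, assignment):
        nonlocal best, best_score
        # Upper bound: every constraint not yet violated might still be satisfied.
        if len(constraints) - violated(assignment) <= best_score:
            return
        if index == len(variables):
            best = dict(assignment)
            best_score = count_satisfied(assignment, constraints)
            return
        var = variables[index]
        # Static value ordering: values with high marginals are explored first.
        for val in sorted(domains[var], key=lambda v: -marginals[var][v]):
            assignment[var] = val
            search(index + 1, assignment)
            del assignment[var]

    search(0, {})
    return best, best_score


# Over-constrained toy problem: three mutually different variables, two values.
not_equal = lambda a, b: a != b
tri_domains = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
tri_constraints = {("A", "B"): not_equal, ("B", "C"): not_equal, ("A", "C"): not_equal}
flat_marginals = {v: {0: 0, 1: 0} for v in tri_domains}   # uninformative heuristic
solution, satisfied = branch_and_bound(tri_domains, tri_constraints, flat_marginals)
```

With uninformative marginals the greedy bound is weak and the search has to do the work; better marginals tighten the initial bound and move good values to the front of each branch.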
[Figure 3 plot: number of problems solved (#Solved, 0 to 20) against number of consistency checks (#CC, 0 to 3e+08), with curves for MaxAC, Unordered, and Random]

Fig. 3. A plot of the number of consistency checks required to find the MaxCSP assignment for skewed random CSPs using various heuristics
5 Related Work
In [1], constraint propagation for the SemiRing CSP framework is discussed. Local consistency is computed for each variable, and the resulting constraint on the variable is added to the constraint store. This technique is correct when the constraints in the store are combined with an idempotent operator. Without idempotence, this technique would over-count A values for every constraint; idempotence means that a value a ∈ A can be combined any number of times without over-counting, since a × a = a. The xAC equations avoid this over-counting by recomputing the marginal solutions, as opposed to recombining new information with the old. In Theorem 2, $M_X^{(k)}$ depends only on information from its neighbours, and not on $M_X^{(k-1)}$. In fact, all neighbours of X factor out information directly from X in their messages to X. In tree-structured problems, this is enough to prevent over-counting. In more general problem classes, double counting is possible.

The restriction of idempotence is true of several varieties of constraint problems, such as fuzzy CSPs, and by the nature of boolean arithmetic, is true of classical constraint problems as well. However, it is not true of other models, such as MaxCSP, and probabilities (e.g., Bayesian networks). In particular, their notion of local consistency [...] The xAC equations have been successfully applied to problem classes that are not idempotent.

In [13], Schiex proposes an algorithm that extends the notion of local consistency to classes of constraint problem which do not naturally use an idempotent combination operator. In this work, local consistency is computed for each variable, and as a result, some of the constraints in the constraint store are replaced (as opposed to combined into the store using ×, as in [1]).
In particular, Schiex’s algorithm maintains a single unary constraint on each variable, and when local consistency is computed for that variable the unary constraint is replaced by a unary constraint that reflects the current minimum of local consistency for that variable. As well, in order to account for possible double counting of local consistency values, Schiex’s algorithm also “compensates” for the new unary constraint, replacing the n-ary constraint used to infer local consistency. The replacement constraint compensates for the modified unary constraint by “subtracting” the minimum level of consistency from tuples within the constraint. The correctness of this algorithm depends on a natural condition called “fairness,” which basically says that the algebra of local consistency allows for compensation while maintaining an equivalence between the original and the compensated problem.

Schiex’s algorithm, when applied to the MaxCSP problem, computes lower bounds on the cost of solutions. It is shown that the lower bounds are at least as good as those computed by directed arc consistency methods [14,9]. Good lower bounds are important for MaxCSP, which is generally solved using branch-and-bound search.

The xAC approach is similar to Schiex’s algorithm. If the constraint store changes, the marginal solution is recomputed and replaced. However, xAC does
not compensate the original constraints between variables. The xAC marginal is, in the restricted case of a tree-structured problem, not a lower bound, but an exact count of the number of violated constraints. Thus, for this restricted problem class, the xAC equations compute lower bounds which cannot be improved. In the more general case, xAC approximates marginal solutions; there is no intention that the marginal solution represent an identical problem; instead, the marginal solutions’ main utility is in providing dynamic variable and value ordering heuristics for search. Directed arc consistency methods [14,9] and Schiex’s arc consistency algorithm for MaxCSP [13] can be seen as algorithms that approximate marginal solutions by computing lower bounds on costs. These results can in principle be used in conjunction with xAC, combining heuristics and bounds.
6 Summary and Future Work
We have presented a constraint propagation algorithm schema, xAC, that generalizes classical arc consistency, based on the idea of approximating the marginal solution of the CSP. The schema can be instantiated by providing operators for combining and projecting constraints. xAC uses local information to compute global properties such as satisfiability or probability whenever the CSP has a constraint graph that is tree structured. When the constraint graph is not tree structured, xAC can still be applied, as an approximation method.

Our approach differs from other approaches to local consistency, in that the xAC equations are not intended to compute an equivalent simpler problem. The intention is to compute a global property directly, when possible, and approximate it when necessary. Local consistency algorithms can be seen as heuristics; the problem is that they are not designed with heuristic guidance in mind, and therefore may be rather poor in that role. For example, arc consistency in classical CSPs eliminates values from domains, but the remaining values are not given any priority. On the other hand, the instances of xAC provide a heuristic ordering of domain values which is often highly correlated with the ordering implied by the exact marginal solutions.

As well, xAC does not depend on the idempotence of the multiplicative operator used to combine constraint values. This is an advantage when dealing with a problem without an idempotent operator, but also a disadvantage, as it means that in some cases, convergence is not guaranteed. In some applications, such as classical arc consistency, convergence to a reasonable approximation is guaranteed by the operators. In other cases, convergence is not guaranteed, but in some of these cases, convergence can be assisted by the choice of a normalization operator.
Several instances of xAC have been observed to converge to useful approximations when convergence is not guaranteed theoretically, such as pAC [6] and belief propagation in Bayesian networks [11]. Future work includes a more detailed treatment of convergence. We believe that convergence is a desirable property, but not a necessary one. We do not
expect a heuristic to have perfect foresight, and if the iteration of the xAC equations must be interrupted due to oscillation, a search procedure can make a choice that is not informed by xAC in this situation, and carry on. In practice, it is important to detect divergence as early as possible. How best to do this, and how best to carry on is still an issue to be investigated.
References

1. S. Bistarelli, R. Gennari, and F. Rossi. Constraint propagation for Soft Constraint Satisfaction Problems: Generalization and Termination Conditions. In Proceedings of the Sixth International Conference on Principles and Practice of Constraint Programming, 2000.
2. S. Bistarelli, U. Montanari, F. Rossi, T. Schiex, G. Verfaillie, and H. Fargier. Semiring-Based CSPs and Valued CSPs: Frameworks, Properties and Comparison. Constraints, 4(3):199–240, 1999.
3. Rina Dechter and Judea Pearl. Network-based heuristics for constraint-satisfaction problems. Artificial Intelligence Journal, 34:1–34, 1988.
4. Ian Gent, Ewan MacIntyre, Patrick Prosser, Barbara Smith, and Toby Walsh. Random constraint satisfaction: Flaws and structure. Technical Report 98.23, University of Leeds, School of Computer Studies, 1998.
5. Michael C. Horsch and William S. Havens. An empirical evaluation of Probabilistic Arc Consistency as a variable ordering heuristic. In Proceedings of the Sixth International Conference on Principles and Practice of Constraint Programming, pages 525–530, 2000.
6. Michael C. Horsch and William S. Havens. Probabilistic Arc Consistency: A connection between constraint reasoning and probabilistic reasoning. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 282–290, 2000.
7. Jin H. Kim and Judea Pearl. A computational model for causal and diagnostic reasoning in inference systems. In Proceedings of the Eighth International Joint Conference on Artificial Intelligence, pages 190–193, 1983.
8. Vipin Kumar. Algorithms for constraint-satisfaction problems: A survey. AI Magazine, 13(1):32–44, 1992.
9. Javier Larrosa, Pedro Meseguer, Thomas Schiex, and Gerard Verfaillie. Reversible DAC and Other Improvements for Solving Max-CSP. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.
10. Alan K. Mackworth. Consistency in networks of relations. Artificial Intelligence Journal, 8(1):99–118, 1977.
11. Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 467–475, 1999.
12. B. Nadel. Tree search and arc consistency in constraint satisfaction problems. In L. Kanal and V. Kumar, editors, Search in Artificial Intelligence, pages 287–342. 1988.
13. Thomas Schiex. Arc consistency for soft constraints. In Proceedings of the Sixth International Conference on Principles and Practice of Constraint Programming, pages 411–424, 2000.
14. Richard J. Wallace. Directed arc consistency preprocessing as a strategy for maximal constraint satisfaction. In ECAI-94 Workshop on Constraint Processing, pages 69–77, 1994.
Two-Literal Logic Programs and Satisfiability Representation of Stable Models: A Comparison

Guan-Shieng Huang¹, Xiumei Jia², Churn-Jung Liau³, and Jia-Huai You²

¹ Department of Computer Science, National Taiwan University, Taipei, Taiwan
² Department of Computing Science, University of Alberta, Edmonton, Canada
³ Institute of Information Science, Academia Sinica, Taipei, Taiwan
Abstract. Logic programming with the stable model semantics has been proposed as a constraint programming paradigm for solving constraint satisfaction and other combinatorial problems. In such a language one writes function-free logic programs with negation. Such a program is instantiated to a ground program and its stable models are computed. In this paper, we identify a class of logic programs for which the current techniques in solving SAT problems can be adopted for the computation of stable models efficiently. These logic programs are called 2-literal programs where each rule or constraint consists of at most 2 literals. Many logic programming encodings of graph-theoretic, combinatorial problems given in the literature fall into the class of 2-literal programs. We show that a 2-literal program can be translated to a SAT instance in polynomial time without using extra variables. We report and compare experimental results on solving a number of benchmarks by a stable model generator and by a SAT solver.
1 Introduction
The satisfiability problem (SAT) is to determine whether a set of clauses in propositional logic has a model. When a model exists, a typical SAT solver also generates some models. The close relationship between the satisfiability problem and the stable model semantics [6] has attracted some attention to the issues related to representation and search efficiency. On the one hand, since both are among the hardest problems in NP, and thus there exists a polynomial-time reduction from one to the other, it is interesting to know how the two may be differentiated from the representation point of view. On the other hand, the close relation between the implementation techniques for building SAT solvers and those for building stable model generators has prompted the questions of how these solvers may be compared on classes of programs and the problems they encode.

R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 119–131, 2002. © Springer-Verlag Berlin Heidelberg 2002
On the issues of representation, it is known that SAT instances can be translated to logic programs locally in linear time. Conversely, the translation from logic programs to SAT instances is much more difficult. First, there is no modular translation [12]. Second, the currently known translations require a substantial number of extra variables, on the order of $n^2$ in the worst case, where n is the number of variables in the given program [2,16]. The use of extra variables could have a significant impact on search efficiency: since each variable (a proposition) is either true or false in an interpretation, each additional variable could double the search space, which amounts to an exponential increase.

On the issues of implementation techniques, it is known, for example, that the smodels system [15] is implemented by adopting and improving upon effective methods for building a Davis-Putnam procedure [4], on which some of the complete SAT solvers are also based. It has been demonstrated experimentally that smodels is superior to some of the well-known SAT solvers on problems such as the Hamiltonian cycle problem, whose logic programming encoding is much more compact than its satisfiability encoding [14]. For other satisfiability problems, it is often remarked that the performance of smodels is comparable to the performance of efficient SAT solvers (cf. [9,15]).

In this paper,
– we identify a class of logic programs which are reducible to SAT instances in polynomial time without using extra variables;
– we report experimental results on solving a number of benchmarks by smodels, and by reducing these logic programs to SAT instances and solving them by an efficient SAT solver, SATO [18].

Smodels is one of the most widely used systems for computing stable models.¹ Sato is one of the most efficient SAT solvers around.² The choice of sato is also due to the fact that it is designed to generate all models when a SAT instance is satisfiable.
We conducted two sets of experiments. In the first one, the benchmarks are 2-literal programs. These programs are translated to sets of clauses without using extra variables. It turns out that for these benchmarks sato outperformed smodels consistently. In the second set of experiments, we are interested in two questions. The first is whether the advantage of sato for 2-literal programs remains for non-2-literal programs that can also be translated to SAT instances without using extra variables. The second question is whether it is worthwhile to translate such a program to a 2-literal one using a linear number of extra variables. In this set of experiments, we choose the Blocks World encoding of Niemelä [12], whose completion formula is known to characterize the stable model semantics [1]; thus there is a translation without using extra variables. It turns out that sato performed well with the “right” value for the parameter g. A wrong choice could degrade sato’s performance dramatically.³ As for a translation to a 2-literal program with a linear number of extra variables, it turns out that sato performed badly.

The paper is organized as follows. The next section gives the background on the stable model semantics. We show in Section 3 that a number of well-known benchmarks given in the literature are essentially 2-literal programs. We also show how in general a 2-literal program can be translated into a SAT instance without using extra variables, and provide experimental results. Section 4 reports experiments with the Blocks World example. Section 5 comments on related work, and Section 6 concludes the paper.

¹ DLV [8] is another system designed to compute stable models of disjunctive programs. Performance comparison with smodels can be found in [13] (also see [7]).
² For a comparison with other SAT solvers, see [11].
2 Stable Models of Logic Programs
In this paper, we consider (normal) logic programs, which are finite sets of rules

\[
a \leftarrow b_1, \ldots, b_m, \mathit{not}\ c_1, \ldots, \mathit{not}\ c_n
\]

where $a$, $b_i$ and $c_i$ are atoms of the underlying propositional language L. Here an atom with not in front is called a default negation. Programs without default negations are said to be positive. The stable models are defined in two stages [6]. The idea is that one guesses a set of atoms, and then tests whether the atoms in the set are precisely those that can be derived. In the first stage, given a program P and a set of atoms M, the reduct of P w.r.t. M is defined as:

\[
P^M = \{\, a \leftarrow b_1, \ldots, b_m \mid a \leftarrow b_1, \ldots, b_m, \mathit{not}\ c_1, \ldots, \mathit{not}\ c_n \in P \text{ and } \forall i \in [1..n],\ c_i \notin M \,\}
\]

Since $P^M$ is a positive program, its deductive closure

\[
\{\, \phi \mid P^M \vdash \phi,\ \phi \text{ is an atom in } L \,\}
\]

is the least model of $P^M$. Then, M is a stable model of P iff M is the least model of $P^M$.

A constraint is of the form

\[
\leftarrow b_1, \ldots, b_m, \mathit{not}\ c_1, \ldots, \mathit{not}\ c_n
\]

which can be viewed as representing a rule of the form

\[
f \leftarrow b_1, \ldots, b_m, \mathit{not}\ c_1, \ldots, \mathit{not}\ c_n, \mathit{not}\ f
\]

where f is a new symbol. In the sequel, a program consists of program rules as well as constraints. In general, we may write nonground, function-free programs. These programs are instantiated to ground ones for the computation of stable models.

³ At each failed branch in the Davis-Putnam algorithm, sato tries to identify the source of the failure and create a new clause. The value of g specifies the maximal length of the created clauses to be saved.
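The two-stage definition of stable models given in this section can be sketched directly in Python for ground programs. The rule representation, a triple of head, positive body and default-negated body, and the function names are assumptions for illustration; a constraint is encoded with a fresh atom f as described above.

```python
def least_model(positive_rules):
    """Least model of a positive program given as (head, body_atoms) pairs."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, body in positive_rules:
            if head not in model and all(b in model for b in body):
                model.add(head)
                changed = True
    return model


def reduct(program, candidate):
    """Keep the rules whose default negations avoid the candidate set, minus 'not' parts."""
    return [(head, pos) for head, pos, neg in program
            if not any(c in candidate for c in neg)]


def is_stable_model(program, candidate):
    """M is stable iff M equals the least model of the reduct of the program w.r.t. M."""
    candidate = set(candidate)
    return least_model(reduct(program, candidate)) == candidate


# p <- not q.   q <- not p.
choice_program = [("p", [], ["q"]), ("q", [], ["p"])]
# The constraint "<- p" written as  f <- p, not f  with a new symbol f.
with_constraint = choice_program + [("f", ["p"], ["f"])]
```

For `choice_program` the sets {p} and {q} pass the test while the empty set and {p, q} do not; adding the constraint removes {p}.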
3 Two-Literal Programs
A 2-literal program is a program where each rule or constraint consists of at most two literals. A literal in this case refers to an atom φ or a default negation not φ. Two-literal programs are syntactically similar to 2-SAT instances, which are finite sets of clauses each consisting of two (classical) literals. Although 2-SAT is linear-time solvable, the existence problem for 2-literal programs is NP-complete.

Theorem 1. Deciding if a 2-literal program has a stable model is NP-complete.

A number of NP-complete problems can be expressed as a 2-literal program.

3-SAT The 3-SAT problem is NP-complete. The reduction as given in [12,17] can be adopted to reduce a 3-SAT instance to a 2-literal program. Let (U, C) be an instance of 3-SAT where C is the set of clauses and U the set of variables. For each clause $x_i \vee y_i \vee z_i \in C$, where $x_i$, $y_i$, and $z_i$ are literals, put three program rules and one constraint into P:

\[
c_i \leftarrow \mathit{not}\ x'_i \qquad
c_i \leftarrow \mathit{not}\ y'_i \qquad
c_i \leftarrow \mathit{not}\ z'_i \qquad
\leftarrow \mathit{not}\ c_i
\]

where $c_i$ is a new symbol, $x'_i = x_i^{\sim}$ if $x_i$ is an atom, and $x'_i$ is the atomic part of $x_i$ if $x_i$ is a negated atom (i.e., if $x_i$ is ¬a then $x'_i$ is a); similarly for $y'_i$ and $z'_i$. In addition, for each variable x ∈ U, the program P has two rules $x \leftarrow \mathit{not}\ x^{\sim}$ and $x^{\sim} \leftarrow \mathit{not}\ x$. Clearly, the resulting program is a 2-literal one. It can be verified easily that M is a stable model of P iff $M - \{c_i\} - \{x_i^{\sim}\}$ is a model of (U, C).

The following programs are taken from Niemelä [12] and will be used in this paper as benchmarks.

K-colorability The problem is, given facts about vertex(v), arc(v, u), and available colors col(c), find an assignment of colors to vertices such that vertices connected with an arc do not have the same color.

    color(V, C) ← vertex(V), col(C), not othercolor(V, C)
    othercolor(V, C) ← vertex(V), col(C), col(D), C ≠ D, color(V, D)
    ← arc(V, U), col(C), color(V, C), color(U, C)

An assignment of colors to vertices is represented by the predicate color(V, C): vertex V is colored by color C.
The first two rules above generate candidate solutions using an auxiliary predicate othercolor(V, C), while the third rule eliminates illegal ones. A stable model of the program then corresponds to a solution
of K-colorability, containing instances of color(V, C) for each vertex V and the color C it gets. The ground instantiation of this program can be easily reduced to a 2-literal program. Once the facts about vertex(v), arc(v, u), and col(c) are given as true atoms, their occurrences can be removed from the body of a rule. For the second rule, we ensure in addition that C and D are instantiated to different colors so that C ≠ D is true and can also be removed. For example, suppose we have three colors, col(red), col(blue), and col(yellow). For any vertex n, the ground instantiation of the first two rules includes

    color(n, red) ← not othercolor(n, red)
    othercolor(n, red) ← color(n, blue)
    othercolor(n, red) ← color(n, yellow)

For example, if color(n, red) is in a stable model M, othercolor(n, red) must be false in M, which in turn forces color(n, blue) and color(n, yellow) to be false too.

Queens The problem is to place n queens on an n × n board so that no queen attacks any other queen.

    q(X, Y) ← d(X), d(Y), not negq(X, Y)
    negq(X, Y) ← d(X), d(Y), not q(X, Y)
    ← d(X), d(Y), d(X′), q(X, Y), q(X′, Y), X ≠ X′
    ← d(X), d(Y), d(Y′), q(X, Y), q(X, Y′), Y ≠ Y′
    ← d(X), d(Y), d(X′), d(Y′), q(X, Y), q(X′, Y′), X ≠ X′, Y ≠ Y′, abs(X − X′) = abs(Y − Y′)
    ← d(X), not hasq(X)
    hasq(X) ← d(X), d(Y), q(X, Y)

Instances of d(X) and d(Y) are given as facts providing dimensions of the board. An instance of q(X, Y) describes a legal position of a queen. Thus, in a stable model, instances of q(X, Y) are queens’ positions on the board. The first two rules generate all candidate board positions, whereas the next three constraints remove illegal ones. The last rule and the constraint above it ensure that every queen gets a position. When this program is instantiated, the true facts of predicate d(X), along with inequality and equality among absolute values, are removed so that the resulting ground program is a 2-literal one.
Pigeons The problem is to put M pigeons into N holes so that there is at most one pigeon in a hole.

    pos(P, H) ← pigeon(P), hole(H), not negpos(P, H)
    negpos(P, H) ← pigeon(P), hole(H), not pos(P, H)
    ← pigeon(P), not hashole(P)
    hashole(P) ← pigeon(P), hole(H), pos(P, H)
    ← pigeon(P), hole(H), hole(H′), pos(P, H), pos(P, H′), H ≠ H′
    ← pigeon(P), pigeon(P′), hole(H), pos(P, H), pos(P′, H), P ≠ P′
Again, given facts about pigeon(P) and hole(H), the first two rules generate all possible arrangements of pigeons and holes. The next constraint and rule ensure that every pigeon gets a hole, followed by the constraint that no pigeon gets more than one hole. The last constraint says that no hole holds more than one pigeon. The ground instantiation of this program yields a 2-literal program. It can be shown by a similar instantiation process that the program given by Niemelä in [12] to solve the Schur problem, and the program by Marek and Truszczyński [10] to solve the Clique problem are also 2-literal programs.
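The 3-SAT reduction described earlier in this section can be sketched as a small generator of ground rules. The representation is an assumption made for illustration: a literal is a string such as "a" or "-a", the complementary atom is written with a trailing "~", and a rule is a triple of head, positive body and default-negated body, with head None marking a constraint.

```python
def three_sat_to_two_literal(variables, clauses):
    """Return the 2-literal program for a 3-SAT instance as a list of rule triples."""
    program = []
    for x in variables:
        # x <- not x~   and   x~ <- not x : exactly one of x, x~ holds.
        program.append((x, [], [x + "~"]))
        program.append((x + "~", [], [x]))
    for i, clause in enumerate(clauses):
        c = "c%d" % i                       # new symbol for clause i
        for literal in clause:
            if literal.startswith("-"):
                blocker = literal[1:]       # literal is (not a):  c_i <- not a
            else:
                blocker = literal + "~"     # literal is a:        c_i <- not a~
            program.append((c, [], [blocker]))
        program.append((None, [], [c]))     # <- not c_i : clause i must be satisfied
    return program


def literal_count(rule):
    """Number of literals in a rule, counting the head when there is one."""
    head, positive, negated = rule
    return (head is not None) + len(positive) + len(negated)


# (a or not b or c) over the variables a, b, c.
example_program = three_sat_to_two_literal(["a", "b", "c"], [["a", "-b", "c"]])
```

Every generated rule has a head and one default negation, and every constraint has a single default negation, so the output stays within the 2-literal class.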
3.1 Translating Two-Literal Programs to SAT Instances
The class of 2-literal programs can be translated to SAT instances without using extra variables. Theorem 2. There exists a polynomial time reduction from a 2-literal program P to a set of clauses S, without using extra variables, such that M is a stable model of P iff M is a model of S. We prove this theorem using Dung’s result on fixpoint completion [5], which is based on a mechanism of reducing a program to a quasi-program. We sketch our proof below. A quasi-program is a collection of quasi-rules. A quasi-rule is of the form a ← not c1 , ..., not cn where n ≥ 0. Given program rules h ← a1 , ..., ak , not c1 , ..., not cn ai ← not di,1 , ..., not di,mi 1≤i≤k the first rule above can be reduced to h ← not d1,1 , ..., not d1,m1 , ..., not dk,1 , ..., not dk,mk , not c1 , ..., not cn A fixpoint construction is defined so that every program P can be reduced to a quasi-program Pquasi . Dung shows that there is a one-to-one correspondence between the stable models of P and those of Pquasi . However, the reduction is not a polynomial time process in the general case. However, if P is a 2-literal program, the fixpoint construction above is bounded by O(m2 ) where m is the number of rules in the program. Once we get a quasi-program, we can use Clark’s predicate completion [3]. Suppose Π is a propositional program. The Clark completion of Π, denoted Comp(Π), is the following set of formulas: for each atom φ ∈ L, – if φ does not appear as the head of any rule in Π, φ ↔ F ∈ Comp(Π) (F stands for falsity here); – otherwise, φ ↔ B1 ∨ ... ∨ Bn ∈ Comp(Π) (with default negations replaced by negative literals), if there are exactly n rules φ ← Bi ∈ Π with φ as the head. We write T (tautology) for Bi if Bi is empty.
– for any constraint $\leftarrow b_1, \ldots, b_m, \mathit{not}\ c_1, \ldots, \mathit{not}\ c_n$ in Π, $\neg b_1 \vee \ldots \vee \neg b_m \vee c_1 \vee \ldots \vee c_n$ is in Comp(Π).

For any quasi-program $P_{quasi}$, it is well-known that the models of Comp($P_{quasi}$) correspond to the stable models of $P_{quasi}$. Note that for a 2-literal program P, Comp($P_{quasi}$) can be translated to a set of clauses in a simple way. For example, suppose we have n rules with proposition p as the head: $p \leftarrow l_1, \ldots, p \leftarrow l_n$. Then, the completion formula is $p \leftrightarrow l_1 \vee \ldots \vee l_n$ (where the occurrences of not are replaced by ¬), which is equivalent to the clause $\neg p \vee l_1 \vee \ldots \vee l_n$ together with the n clauses $p \vee \neg l_i$, $1 \le i \le n$. In contrast, when rule bodies may contain more than one literal, the translation of an equivalence to a set of clauses requires exponential time and space in the worst case, since it involves converting a disjunctive normal form to a conjunctive normal form. This is how the class of 2-literal programs distinguishes itself from the class of 3-literal programs (where each rule consists of at most three literals). Finally, since this proof is independent of the size of any constraint, the restriction that a constraint has at most two literals in the given program can be removed without affecting the claim stated in the theorem.

3.2 Experiments with Two-Literal Programs
We compare search efficiencies for 2-literal programs experimentally. We tested three problems: K-colorability, Queens, and Pigeons. For each problem, we give a table listing the search times in seconds under the setting described below, which is followed by a chart for easy comparison.

The logic programs for these problems and the graph instances were taken from the smodels web site⁴. These programs were run in smodels (Version 2.2.6), and the search times reported by smodels were recorded. The search time here excludes the time spent by lparse, an interface to smodels whose main function is to instantiate a function-free program to a ground program. For a problem that can be solved within a few seconds, the time used by lparse could be a significant factor. For harder problems, it becomes insignificant.

We used the same encodings for translation. First, a logic program was grounded by lparse (Version 1.0.6). Then the ground program was translated to a SAT instance. The translation was implemented by a Prolog program which generated clauses in the format required by sato. We then ran the translated SAT instance in sato (Version 3.2.1) with the default setting of the parameter g⁵, and recorded the search time reported by sato. In the tables these search times are given under the header Sato-Translation. In sato the search time plus the “build time” is the user CPU time of the unix time command. The build time is usually a small fraction of the entire user CPU time. Note that, since we are interested only in search efficiency, neither grounding by lparse nor the translation to a SAT instance was considered part of the execution.

⁴ At http://saturn.tcs.hut.fi/Software/smodels/
⁵ The default is g = 10. It was said to be g = 20 in the user manual, and clarified via an email exchange with the author.
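The translation from a ground quasi-program to clauses, described in Section 3.1, can be illustrated with a small sketch. The translator used in the experiments was a Prolog program that is not reproduced here; the Python below is only an illustration under stated assumptions. The function names, the (atom, polarity) encoding of literals, and the brute-force model enumeration are hypothetical choices, and the input is assumed to be a quasi-program in which every rule body holds at most one default-negated atom.

```python
from itertools import product

def completion_clauses(atoms, rules, constraints=()):
    """Clauses of Comp(P) for a quasi-program with at most one body literal.

    rules: list of (head, c) meaning 'head <- not c', or (head, None) for a fact.
    constraints: list of (pos_atoms, neg_atoms) meaning '<- pos..., not neg...'.
    A literal is a pair (atom, polarity); no extra atoms are introduced.
    """
    clauses = []
    for p in atoms:
        bodies = [c for (h, c) in rules if h == p]
        if not bodies:                       # p <-> F
            clauses.append([(p, False)])
        elif None in bodies:                 # p <-> T (some body is empty)
            clauses.append([(p, True)])
        else:                                # p <-> (-c1) v ... v (-cn)
            clauses.append([(p, False)] + [(c, False) for c in bodies])
            clauses.extend([[(p, True), (c, True)] for c in bodies])
    for pos, neg in constraints:             # -b1 v ... v -bm v c1 v ... v cn
        clauses.append([(b, False) for b in pos] + [(c, True) for c in neg])
    return clauses

def models(atoms, clauses):
    """Enumerate all models of a clause set by brute force (small inputs only)."""
    found = []
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if all(any(world[a] == pol for a, pol in cl) for cl in clauses):
            found.append(frozenset(a for a in atoms if world[a]))
    return found

atoms = ["p", "q"]
rules = [("p", "q"), ("q", "p")]             # p <- not q.   q <- not p.
clauses = completion_clauses(atoms, rules)
print(sorted(sorted(m) for m in models(atoms, clauses)))
```

For this two-rule program the models of the clause set are {p} and {q}, which are also its two stable models. A real translator would emit the clauses in the input format of the SAT solver instead of enumerating models.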
Sato comes with a number of benchmarks encoded as SAT instances, including Queens and Pigeons. We also ran these SAT benchmarks for comparison. In the tables, they are reported under the header Sato-Benchmark. All of the experiments were performed on a Linux machine with a Pentium IV 1.5 GHz processor and 1 GB of main memory. When a solution exists, the recorded time is for the generation of the first solution. “Too long” in an entry means that no answer was returned within 2 hours. Currently, sato can accommodate at most 30,000 variables. “Too many variables” in an entry means that the SAT instance is over that limit. In testing Pigeons we used the data where the number of pigeons was one more than the number of holes, so that the problem had no solution. For colorability, the test results were generated by using 3 colors.
Table 1. Search times for Pigeons Pigeons 6 7 8 9 10 11 12
Smodels 0.023 0.073 0.348 1.672 4.450 15.115 25.146 11.750 153.730 251.836 too long
Table 2. Search times for Queens
[Plot: Time (sec.) on a logarithmic axis from 10^-3 to 10^3, against problem sizes from 8 to 26, for SATO Benchmark, SATO Translation, and Smodels.]
Fig. 2. Queens
4 Experiments with a Non-2-Literal Program

A logic programming encoding of the Blocks World is given by Niemelä [12]. It is shown in [1] that the completion formula of the program characterizes the stable model semantics. We give two tables showing different kinds of experimental results. The setting under which the experiments were conducted is the same as before, except that we experimented with different values of the parameter g in the case of sato. It turns out that this is important.
Table 3. Search times for 3-colorability

Nodes   Sato-Translation     Smodels
20      0.00                 0.01
25      0.00                 0.02
29      0.00                 0.02
100     0.01                 0.03
300     0.01                 0.10
600     0.02                 0.25
1000    0.04                 0.435
6000    too many variables   3.05
[Plot: Time (sec.) on a logarithmic axis from 10^-3 to 10^1, against problem size on a logarithmic axis from 10^1 to 10^4, for SATO Translation and Smodels.]
Fig. 3. 3-colorability
In the first table, the data files large.c, large.d, and large.e (which specify the blocks, their initial configuration and the target configuration) were taken from Niemelä’s website. For each instance, two cases were tested: the case with the smaller number of steps had no solution, whereas the one with the next larger number had a solution. For large.c, for instance, 7 steps cannot transform the given configuration to the target configuration while 8 steps can. We created the data file large.f with 21 blocks, which requires 11 steps to solve. This instance turns out to be important in illustrating the changing behavior of sato. The header Sato (g=20) means that the SAT instance translated from the logic program by completion was run by setting the parameter g to 20. The last column, with the header Sato-2 (g=20), reports the experiments for the approach where a non-2-literal rule was first translated to a number of 2-literal rules using one extra variable, as follows:
If r is $h \leftarrow l_1, \ldots, l_m$ with $m \ge 2$, then translate r into $h \leftarrow \mathit{not}\ x_r$, and for each $i \in [1..m]$, $x_r \leftarrow \mathit{not}\ l_i$ if $l_i$ is an atom and $x_r \leftarrow φ$ if $l_i$ is the default negation not φ. Essentially, this breaks rule r into several 2-literal ones by representing the body of the rule by a new variable. It can be shown that the class of programs that can be faithfully translated this way includes the class of programs whose completion characterizes the stable model semantics. Since the latter is the case for the Blocks World program of Niemelä, there is a one-to-one correspondence between the stable models of the original program and those of the translated program. Then, the resulting 2-literal program was translated to a SAT instance, and run in sato. As the reader can see, even with a linear number of extra variables, the performance became unacceptable very quickly.

In the second table, we illustrate the performance variations of sato on the two larger instances. It turns out that sato is very slow with the default value g = 10, and extremely efficient when g is set to 90.
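The rule-splitting step above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the experiments: the function name, the (atom, is_positive) encoding of body literals, and the naming scheme x_<id> for the fresh atom are all assumptions made for the example.

```python
def to_two_literal(rule_id, head, body):
    """Split 'head <- l1, ..., lm' into 2-literal rules with one fresh atom.

    body is a list of (atom, is_positive) pairs, where is_positive=False
    stands for the default negation 'not atom'. Each returned rule is a pair
    (head, body) in the same encoding. Rules with fewer than two body
    literals are already 2-literal and are returned unchanged.
    """
    if len(body) < 2:
        return [(head, list(body))]
    x = f"x_{rule_id}"                       # fresh atom x_r (hypothetical naming)
    result = [(head, [(x, False)])]          # h <- not x_r
    for atom, is_positive in body:
        # an atom l_i yields x_r <- not l_i; a literal 'not phi' yields x_r <- phi
        result.append((x, [(atom, not is_positive)]))
    return result

# a <- b, not c   becomes   a <- not x_1.   x_1 <- not b.   x_1 <- c.
for h, b in to_two_literal(1, "a", [("b", True), ("c", False)]):
    print(h, "<-", ", ".join(a if pos else f"not {a}" for a, pos in b))
```

The sketch introduces exactly one new atom per split rule, which matches the linear number of extra variables mentioned in the text.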
Table 4. Search times for Blocks World

Problem   Blocks   Steps   Sato-2 (g=20)
large.c   15       7       1.17
                   8       1.97
large.d   17       8       7.50
                   9       11.26
large.e   19       9       11.84
                   10      71.49
large.f   21       10      too long
                   11      too long
Table 5. Comparison with different values of g

Problem   Blocks   Steps   Sato (g=10)   Sato (g=20)   Sato (g=90)
large.e   19       9       2.41          1.58          2.08
                   10      761.93        6.17          3.34
large.f   21       10      55.88         307.07        3.87
                   11      too long      29.59         8.70
5 Related Work
Babovich et al. [1] also reported experiments with smodels and sato for the Blocks World problem. To a large extent, our experimental results with g = 20 are comparable to theirs. However, the creation of the instance large.f in our case revealed the changing behavior of sato. In contrast, smodels behaved gracefully and proportionally.

Given a program P, if its completion formula characterizes the stable model semantics, then obviously P can be translated to a SAT instance without using extra variables. Note that this in general doesn’t guarantee that the resulting SAT instance is of polynomial size, since it involves converting a disjunctive normal form to a conjunctive normal form.

There are problems whose “natural” solutions are 2-literal programs but whose completions may not characterize the stable model semantics. One of these problems is the reachability problem (known to be NL-complete, i.e. nondeterministic log-space complete): given a graph G = (V, E) and two vertices s and t, determine whether there is a path from s to t. The problem can be solved by the following program:

reached(V) ← arc(U, V), reached(U)
reached(V) ← arc(s, V)
reached(t) ← not reached(t)

The ground instances of the first rule become 2-literal rules after true instances of arc(U, V) are removed. Now consider the digraph with the set of vertices V = {s, t, u} and the set of arcs E = {(t, u), (u, t)}. Though the program has no stable models, its completion formula has a model.
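The reachability counterexample is small enough to check by brute force. The sketch below is an illustration only: the atom names r_s, r_t, r_u (abbreviating reached(s), reached(t), reached(u)), the rule encoding, and the helper functions are assumptions introduced here. Stability is tested with the Gelfond-Lifschitz reduct [6], and the completion is tested directly from its definition.

```python
from itertools import combinations

ATOMS = ["r_s", "r_t", "r_u"]
# Ground rules for arcs (t, u) and (u, t), with the true arc atoms removed.
# Each rule is (head, positive_body, negated_body).
RULES = [
    ("r_u", ["r_t"], []),        # reached(u) <- reached(t)
    ("r_t", ["r_u"], []),        # reached(t) <- reached(u)
    ("r_t", [], ["r_t"]),        # reached(t) <- not reached(t)
]

def least_model(definite_rules):
    """Least model of a negation-free program by naive iteration."""
    model, changed = set(), True
    while changed:
        changed = False
        for head, pos in definite_rules:
            if head not in model and all(b in model for b in pos):
                model.add(head)
                changed = True
    return model

def is_stable(candidate):
    """Gelfond-Lifschitz test: least model of the reduct equals the candidate."""
    reduct = [(h, pos) for h, pos, neg in RULES
              if not any(c in candidate for c in neg)]
    return least_model(reduct) == candidate

def satisfies_completion(candidate):
    """Each atom is true exactly when the body of one of its rules is true."""
    for a in ATOMS:
        body_true = any(all(b in candidate for b in pos) and
                        all(c not in candidate for c in neg)
                        for h, pos, neg in RULES if h == a)
        if (a in candidate) != body_true:
            return False
    return True

candidates = [set(c) for n in range(len(ATOMS) + 1)
              for c in combinations(ATOMS, n)]
stable = [sorted(c) for c in candidates if is_stable(c)]
completion = [sorted(c) for c in candidates if satisfies_completion(c)]
print("stable models:", stable)
print("completion models:", completion)
```

The enumeration finds no stable model, while {r_t, r_u} satisfies the completion: the two atoms support each other through the cycle between t and u, which completion accepts and the stable model semantics rejects.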
6 Conclusion
The main result of this paper is the discovery that an efficient SAT encoding requiring no extra variables exists for the class of 2-literal programs. Our experimental results indicate that the SAT translations of these programs can be solved efficiently by a competent SAT solver. Whether or not a translation uses extra variables could be significant, as extra variables may increase the search space exponentially. Our experimental results also suggest that even with a linear number of extra variables, the performance can be degraded significantly. We also reported the changing behavior of sato for some larger Blocks World instances, which was not known previously. This seems to suggest that the advantage of sato on search efficiency is not at all obvious for non-2-literal programs in general, even if some of these programs can be translated to SAT instances without using extra variables.
References

1. Y. Babovich, E. Erdem, and V. Lifschitz. Fage’s theorem and answer set programming. In Proc. Int’l Workshop on Non-Monotonic Reasoning, 2000.
2. R. Ben-Eliyahu and R. Dechter. Propositional semantics for disjunctive logic programs. Annals of Math. and Artificial Intelligence, 14:53–87, 1994.
3. K. L. Clark. Negation as failure. Logics and Databases, pages 293–322, 1978.
4. M. Davis, G. Logemann, and D. Loveland. A machine program for theorem proving. Communications of the ACM, 5(7):394–397, July 1962.
5. P. Dung. A fixpoint approach to declarative semantics of logic programs. In Proc. North American Conf. on Logic Programming, pages 604–625, 1989.
6. M. Gelfond and V. Lifschitz. The stable model semantics for logic programming. In Proc. 5th ICLP, pages 1070–1080. MIT Press, 1988.
7. T. Janhunen, I. Niemelä, P. Simons, and J. You. Unfolding partiality and disjunctions in stable model semantics. In Proc. KR 2000, pages 411–424. Morgan Kaufmann, April 2000.
8. N. Leone et al. DLV: a disjunctive datalog system, release 2000-10-15. At http://www.dbai.tuwien.ac.at/proj/dlv/, 2000.
9. V. Lifschitz. Answer set programming. In K. R. Apt et al., editor, The Logic Programming Paradigm: A 25-Year Perspective, pages 357–371. Springer, 1999.
10. V. Marek and M. Truszczyński. Stable models and an alternative logic programming paradigm. In K. R. Apt et al., editor, The Logic Programming Paradigm: A 25-Year Perspective, pages 375–398. Springer, 1999.
11. M. Moskewicz, C. Madigan, Y. Zhao, L. Zhang, and S. Malik. Chaff: Engineering an efficient SAT solver. In Proc. 38th ACM Design Automation Conference, pages 530–535, June 2000.
12. I. Niemelä. Logic programs with stable model semantics as a constraint programming paradigm. Annals of Math. and Artificial Intelligence, 25(3-4):241–273, 1999.
13. I. Niemelä and P. Simons. Extending the Smodels system with cardinality and weight constraints, pages 491–521. Kluwer Academic Publishers, 2000.
14. C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
15. P. Simons. Extending and Implementing the Stable Model Semantics. PhD thesis, Helsinki University of Technology, Helsinki, Finland, 2000.
16. M. Truszczyński. Computing large and small stable models. In Proc. ICLP, pages 169–183. MIT Press, 1999.
17. J. You, R. Cartwright, and M. Li. Iterative belief revision in extended logic programs. Theoretical Computer Science, 170:383–406, 1996.
18. H. Zhang. Sato: an efficient propositional prover. In Proc. CADE, pages 272–275, 1997.
Using Communicative Acts to Plan the Cinematographic Structure of Animations

Kevin Kennedy and Robert E. Mercer

Cognitive Engineering Laboratory, Department of Computer Science
The University of Western Ontario, London, Ontario, Canada
Abstract. A planning system that aids animators in presenting their cinematic intentions using a range of techniques available to cinematographers is described. The system employs a knowledge base of cinematic techniques such as lighting, color choice, framing, and pacing to enhance the expressive power of an animation. The system demonstrates the ability to apply cinematography knowledge to a pre-defined animation in a way that alters the viewer’s perceptions of that animation. The tool is able to apply cinematography techniques to communicate information, emotion, and intentions to the viewer. The application of cinematography knowledge to animations is performed by a communicative act planner that draws on techniques from Rhetorical Structure Theory (RST). The RST-guided planning paradigm generates coherent communicative plans that apply cinematography knowledge in a principled way. An example shows how the system-generated cinematography structure can enhance the communicative intent of an animation.
1 Introduction
In our research we are building a system that will aid animators in presenting their cinematic intentions using the large range of techniques which are available to cinematographers. An experienced animator can use techniques far more expressive than the simple presentation of spatially arranged objects. Our system employs a knowledge base of cinematographic techniques such as lighting, color choice, framing, and pacing to enhance the expressive power of an animation. By harnessing cinematography information with a knowledge representation system, the computer can plan an animation as a presentation task and create cinematography effects on behalf of the animator.

The system described in this paper contains a knowledge base of cinematography effects and tools, which is employed by a planner to automate the presentation of animations. The planner reasons directly about the communicative intent that the animator desires to express through the animation. Utilizing this system, an animator can take an existing animation and ask the computer to create a communicative plan that uses cinematography techniques to reinforce the content of the animation, or create new meanings which would otherwise not be expressed. The cinematography planning system acts to increase the visual vocabulary of an animator by acting as an expert assistant in the domain of cinematography technique.

The computer program that has been implemented as part of this research does not create animations, but works as a semi-automated tool to assist in generating an animation. The techniques presented here are intended to enhance the communication skill of an inexperienced animator when working in this medium or to assist the communication skill of an experienced animator by presenting multiple cinematographic presentation choices.

This research was funded by NSERC Research Grant 0036853.

R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 132–146, 2002. © Springer-Verlag Berlin Heidelberg 2002

1.1 Animation
Animation is the process of generating a motion picture one frame at a time using methods such as drawing, photographing, photocopying, or clay modelling. The illusion of smooth motion is created by making slight alterations to each successive frame in a series and playing them quickly in their proper sequence. For much of its history, animation has been a painstaking task heavily burdened with manual labour. Computers have been applied to animation to help automate some of the more mechanical aspects of animation production with some success [15].

A successful area for applying computers to animation is the art of what is generally called computer animation. In a computer animation, the computer takes over the entire act of synthesizing the images, though humans generally control the image content. In this sense computer animation refers to the use of computer-generated images in forming an animation. Computer animation’s success is the result of mature computer graphics algorithms for synthesizing (or rendering) images from geometric models. Today it is possible to render very realistic images from a geometric model of the objects, backgrounds, lighting, and atmospheric effects present in a scene. However, rather than being interested in these aspects of computer animation, we are interested in communicating information, emotions and themes that the human director would like to convey with their animation.

1.2 Cinematography as a Means of Communication
Cinematography is the art of photographing a moving film image. The art encompasses placing the camera, choosing the film and colour filters, and controlling (and compensating for) the lighting. The cinematographer, usually referred to as the director of photography, controls the equipment that captures light onto the film media and manipulates the visual look of the film. In the realm of computer animation, the mechanism of filming is significantly simplified. A virtual camera can be operated without care for shutter speed, focus, or physical size and weight (of course these things could be simulated if so desired). Inside the computer there are no cloudy or sunny days, and the cameraman never makes mistakes. This lack of physical constraints reduces the role of the virtual cinematographer to its very essence: controlling the visual message of the film.
Viewer Attention. The most basic method of communication with cinematography is directing the viewer’s attention. When telling a story visually, it is important that it be easy for the viewer to follow the action. There are many ways that the film-maker can do this, including directed lighting, colour contrasts, camera focus, and framing. By strongly lighting important objects, for example, the film-maker can restrict the viewer’s attention.

Information. The use of cinematography to provide information to the viewer is also pervasive. Many effects are used by film-makers to tell the viewer something without any direct statements. In a novel, such information would be provided by the narrator, but in film it can be presented in a more subtle, subliminal way. The size and weight of objects is implied by the framing of objects on the screen. A tall character is often shot from a low position to enhance her height or menace. The time of day is indicated by the lighting quality of the scene. Passing time is communicated through slow dissolves and fade-outs.

Themes and Emotion. The emotional predisposition of a viewer can be altered by applying the proper visual effect. Though the psychological effects of film techniques are hard to quantify, they are still relentlessly sought by film-makers. There is probably a certain level of learned response involved, as audiences are repeatedly exposed to the same stimuli in similar situations [20]. The broad approaches to setting mood are well understood. To set up a happy scene, one should use bright lights and vivid colours to create a “feeling-up” mood. A sad, dramatic, or scary scene should use low-key, “feeling-down” effects. De-saturation of light and colour can be used to draw viewers into the scene, or high saturation can make them feel like outside observers. These techniques are applied constantly in the medium to give films a specific mood.
2 Achieving a Communicative Act
The way in which cinematography can be used to communicate is best shown with an example. Figures 1 and 2 demonstrate how the manipulation of an animation’s cinematography can enhance the story and provide the viewer with more information. Figure 1 shows two still frames from a simple animation. “Superball” and “Evilguy” are squared off in a tense confrontation. Superball sizes up the situation and concludes that discretion is the better part of valour and thus makes a hasty exit to stage right. Figure 1 shows how this animation can be shown with a dispassionate outside observer’s point-of-view.

However, cinematography gives us better tools to present this animation. By changing the cinematography, the animator can add information, emotions, and drama to this simple scene. Figure 2 shows an alternative presentation of the animation that expresses several communicative acts through cinematography. Figure 2(a) sets up the scene by showing the confrontation in screen center with a more dramatic lighting effect than Figure 1(a). Figure 2(b) gives us a close-up of Evilguy who is looking suitably menacing. Superball’s reaction, and his uncomfortable feelings are shown in Figure 2(c). Finally, Superball beats his retreat which is enhanced by showing him zooming off into the distance (Figure 2(d)).

Fig. 1. Original Flat Animation (panels (a) and (b))

Fig. 2. Communicative Act Animation (panels (a) to (d))

Figures 2(a)–(d) are sample frames from all four shots of the new cinematographically-enhanced animation. These frames were generated automatically by our cinematography planning tool. The only user inputs were a geometric description and a set of communicative acts to be achieved by the animation. To create the enhanced animation the planner was told to communicate the following information to the viewer:

– Superball is diminutive and frightened
– Evilguy is domineering
– the scene is dramatic
– Superball makes a hasty exit to stage right
The planning system created the output animation shown in Figure 2 based on this limited set of inputs. In Figure 2(a), drama is created by a stark lighting choice and a slightly pulled-back camera. Figure 2(b) shows Evilguy’s domineering nature by placing the camera close to him, looking upwards. The effect is further enhanced by positioning him to touch the top edge of the frame. Superball’s reaction of fear is shown through a harsh spotlight from above which casts a dark shadow beneath him. His short stature is shown with a high camera looking down. Finally, the exit to stage right is enhanced by aligning the action with the z-axis (into the frame) as Superball recedes into the distance.
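The correspondence between the requested communicative acts and the effects chosen in this example can be summarized as a simple lookup. The sketch below is purely illustrative: the table, its keys, and the function name are hypothetical, and a flat dictionary is a stand-in for, not a description of, the planner and knowledge base that actually produced Figure 2.

```python
# Hypothetical table: each communicative act from the example is paired with
# the effects the text attributes to it. The real system plans these choices;
# this lookup only mimics the outcome for the Superball scene.
EFFECTS = {
    "dramatic": ["stark lighting", "slightly pulled-back camera"],
    "domineering": ["close camera looking upwards",
                    "subject touches top edge of frame"],
    "frightened": ["harsh spotlight from above",
                   "dark shadow beneath subject"],
    "diminutive": ["high camera looking down"],
    "hasty exit": ["action aligned with the z-axis, receding into the distance"],
}

def plan_effects(acts):
    """Expand (subject, act) pairs into (subject, effect) pairs, in order."""
    plan = []
    for subject, act in acts:
        for effect in EFFECTS.get(act, []):   # unknown acts add nothing
            plan.append((subject, effect))
    return plan

acts = [("scene", "dramatic"), ("Evilguy", "domineering"),
        ("Superball", "frightened"), ("Superball", "diminutive"),
        ("Superball", "hasty exit")]
for subject, effect in plan_effects(acts):
    print(f"{subject}: {effect}")
```

A table like this cannot resolve competing goals or choose between alternative plans, which is the part of the task the planner is designed to handle.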
3 System Description
The implemented system described here is meant to be used as an interactive tool to refine the cinematography plan for an animation. A pre-defined description of character action is interactively re-filmed to achieve the animator’s requested communicative acts. The animator can change the communication goals or request alternate cinematography plans to perfect the animation. The current system, which is intended as a proof of concept, has a traditional AI architecture consisting of a knowledge base, a planner, and an acting agent, the latter being a graphical renderer in this case. A real animation production environment which uses a mixed initiative strategy [11] would be an obvious next step. The current system has been shown to work on several examples of short action sequences, and is able to recreate scenes that are similar to pre-existing animations. The knowledge base stores knowledge about space, time, solid objects, lights, colours, cameras, scenes, shots, and cinematography effects. This knowledge is implemented using the language LOOM [6]. The planner creates a plan that implements the desired communicative acts. The renderer transforms the cinematography plan created by the planner into a sequence of animation frames which constitute the output of the program. The renderer makes use of the POVray ray-tracing system to create the graphical images.
Fig. 3. Module Interactions
Figure 3 shows the interactions between the various modules. The knowledge base acts as a source of information about cinematography, and as a store of knowledge while the animation plan is being built up. It is also used to store the planning rules that the planner assembles to create a solution. Hence the interaction between the planner and the knowledge base is bi-directional. The renderer, on the other hand, only retrieves information from the knowledge base; it does not add knowledge. There is also a direct connection from the planner to the renderer because the planner provides the renderer with the complete scene description that is to be rendered. Finally, the renderer outputs data to the POVray ray-tracer, which is an external program.

The virtual director of photography acts as a tool to be used by a human operator or director. The human and computer work together as a team, just like the cooperation between a film director and her director of photography on a movie set. The director must do a great deal of the work involved with specifying the action, character blocking, and narrative. The human must also express her narrative goals in terms of the communicative acts that the system understands. The computer will, however, assemble these elements together, position the camera and lights, and generate the sequence of images that create the animation. Since this is meant as a semi-automated approach, the director has the capability to overrule the computer and tell it to keep searching for a better cinematography “solution”.

3.1 Cinematography Knowledge Base
The knowledge representation language used for this paper (LOOM) is a description logic that allows a programmer to use aspects of several knowledge representation schemes. LOOM is primarily a frame-based language; however it encodes knowledge in a hierarchy of concepts and roles (also known as objects and relations) that is very similar to conceptual graphs. Concept definitions can contain simplified predicate logic statements to provide a limited element of logic-based representation. LOOM also provides the ability to encode rules and actions just like a production system.

The knowledge base is our attempt to capture the “common sense” of cinematography. The following areas are represented in the knowledge base:

– cameras, camera positions, field of view
– lights, fill lights, spot-lights, lightsets
– colors, color energies, color saturation
– scenes, foreground/background, stage positions
– spatial objects/relationships, 3D vectors, occlusion
– moods and themes; color/light effects to present them
The source we have chosen for our common-sense knowledge is a Film Studies textbook by Herbert Zettl [20]. This book is written as an introduction to cinematography and contains many rules which lend themselves to our knowledge representation technique. Figure 4 shows an example of some of the knowledge presented by Zettl in several chapters on cinematography lighting. In this figure we have broken down the techniques described into their major classifications, arranging them from left to right according to the visual “energy” they convey. The terms written below each lighting method are the thematic or emotional effects that Zettl associates with these techniques. It is these effects that the animator can select when constructing a scene with our program. It should be noted that, though the lighting techniques are modelled in detail by the knowledge base, the thematic and emotional effects are not. For example, there is no attempt to model the meaning of “wonderment” in the knowledge base. These are simply text tags which are meaningful only to a human operator.

In addition to lighting techniques, the knowledge base represents camera effects like framing, zooms, and wide-angle or narrow-angle lenses. Colour selections for objects and backgrounds, as well as their thematic meanings, are also contained in the knowledge base. These three major techniques (lighting, colour, and framing) can be used to present a wide variety of effects to the viewer. The actions that make up the animation can be any combination of move, jump, stretch, squash, turn, and tilt. This allows expression of simple cartoon-like animations.

Cinematography Techniques. To accomplish the task laid out above, the computer cinematographer must have a knowledge of cinematography and must be able to apply it. The types of cinematography knowledge the system contains are described below.

Lighting. Lighting is used to set mood, direct viewer attention, and provide information. The computer cinematographer can apply lighting to characters and backgrounds independently. The quality of lighting can be adjusted to alter the amount and sharpness of shadows. The brightness and direction of lighting are changed to achieve communicative acts as required.
Fig. 4. Semantic Deconstruction of Zettl’s Lighting Models
Colour. The computer has limited control of colour for scene objects. When object models are created they are also created with colour sets which can be applied to them before rendering. The colour sets fall into several overlapping classes of colour energy and saturation. The system can select a specific colour set which satisfies constraints imposed by the director’s communicative goals. The system described here does not contain a general model for the aesthetics of colour, but relies on the programmer to classify colours in terms of energy, temperature, and saturation.

Camera Placement. The computer director of photography takes ownership of the virtual camera and its point of view. Given a scene containing background objects and characters, the system will orient the camera in a way that achieves the desired effects. The system presented here can only function with objects that are “well behaved”. The analogy used is that of a stage or small set. The computer can deal with objects that move through different positions on this small set, and arrive at proper camera placement solutions. An animation that involves large sweeping movements, interaction of convoluted objects, or highly constrained environments cannot be expected to work correctly.

Framing. Closely related to camera placement is the framing of objects within the two dimensional field. When prompted by the director’s communication goals, the computer will attempt to frame objects in certain zones of the screen to achieve corresponding visual effects.

Shot Structure. In what is a step outside of the duties of a director of photography, the system takes on some of the duties of a film editor. Given overall goals of pacing and rhythm, the computer will make decisions about where to place cuts in the film time-line. To assemble an overall viewer impression of the scene environment, the computer will assemble short sequences of shots that portray important objects and relationships within a scene.
The director can choose either an inductive or deductive approach to shot sequencing. The animator must also supply information about which objects and characters are important and what meaningful relationships exist between these objects. Qualitative Representation and Reasoning Qualitative physics is the basis for the knowledge representation of most spatial qualities and physical measurements used in this paper [19]. For example, the size of an object is specified as being tiny, small, medium-sized, large, or very-large. Reasoning about where to place cameras is based upon these simplified measurements. Qualitative measurements are sufficient to capture the knowledge required for this domain. Though the use of standard quantitative representations would allow finer control of physical quantities and measurements, such greater accuracy would not benefit the reasoning process. The camera positioning really only needs to distinguish between close-up, medium, and long shots. Whether a camera is 4.0 or 4.1 units from a target is of little significance. Similarly, the
Using Communicative Acts to Plan the Cinematographic Structure
actual physical quantities like the number of lumens emitted by a light are not important; a light only needs to be bright, medium-strength, or dim. Much of the information presented in Zettl is qualitative in nature and lends itself to representation in this format [20]. For example, colour energies are discussed in terms of low-energy, neutral, and high-colour-energy. This type of knowledge is easily represented and reasoned about using qualitative methods.

3.2 Rhetorical Structure Theory
In a strict sense, the system presented in this paper is concerned with the translation from communicative acts into animation instructions to be rendered. The communicative acts are the information, emotions, and themes that the human director would like to convey with their animation. The computer takes these communicative acts and transforms them into visual effects that can be rendered into digital images. The set of communicative acts understood by the computer cinematographer acts as a sort of vocabulary that the human animator/director can use to add to the basic action they have specified to take place in their animation. The communicative acts understood by the computer include things like:
– Show that character A is important
– Increase viewer involvement
– Promote viewer discomfort
The task of the computer is to sort out competing goals, apply standard “default techniques”, and create final rendered images. Like a true director of photography, the computer acts as an assistant to the director. The tasks of directing the actors and creating the narrative are left to the human in charge. The computer takes control of the camera, colour, and lighting and presents final animations that comply with the director’s vision.

In our research, Rhetorical Structure Theory is used to guide our representation and planning of communicative acts. Rhetorical Structure Theory (RST) is concerned with describing the structure of text [16]. The main focus of RST is textual coherence. A coherent text is one in which every segment has a reason for being there. A coherent text consists of well-phrased sentences devoid of non-sequiturs and gaps. RST can describe the structure of any coherent text. When describing the structure of a text, RST uses a framework of nucleus-satellite relations. A nucleus is the central idea or fact that is being presented by a portion of text.
The satellite is a secondary phrase or sentence that in some way supports the rhetorical purpose of the nucleus. These nucleus-satellite relations are called rhetorical relations. There are approximately 50 rhetorical relations recognized for textual analysis. The following are some examples of rhetorical relations (RR), nuclei (N), and satellites (S): – RR: Background; N: text whose understanding is being facilitated; S: text for facilitating understanding.
Kevin Kennedy and Robert E. Mercer
– RR: Elaboration; N: basic information; S: additional information.
– RR: Preparation; N: text to be presented; S: text which prepares the reader to expect and interpret the text to be presented.
– RR: Contrast; N: one alternate; S: the other alternate.

The above description of RST shows its application as a tool for analyzing text; however, it is also useful for generating text. For example, the ILEX system [17] creates hypertext descriptions of objects in a virtual museum tour. This paper, however, is not concerned with the generation of texts; it is concerned with the generation of visual effects to communicate with the animation viewers. RST is used as a way to connect the wishes of the animator with the actions of the renderer. It provides a methodology for transforming intent into action. The communicative acts are not composed of sentences, but are assembled from the structure and presentation of the scene.

As an example of how RST concepts can be used to plan a scene, consider the RST relation elaborate. Elaboration occurs when one or more pieces of text present additional detail about another portion of text. Our animation planner would use this approach when trying to present a non-obvious concept. For example, when asked to show that a character is unhappy, the planner could elaborate by using both dim lighting and a muted colour scheme around the character.

Other systems have also used RST in the visual domain. André and Rist [1] used RST to aid in the creation of diagrams. These diagrams were intended to instruct people in the operation of mechanical devices. In this case RST was used because it gave a structure for reasoning about communicative acts. Each image in an instruction manual has a very specific communication intent; for example, “turn this knob” or “pay heed to this indicator”. RST provides a straightforward structure for codifying these concepts.
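The nucleus-satellite framing above can be illustrated with a small data structure. The following is only a sketch under assumptions: the class name `Relation`, its fields, and the helper `effects` are hypothetical and are not taken from the system described in this paper. It encodes the "unhappy character" elaboration example, with the communicative act as the nucleus and the visual effects as satellites.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    """One nucleus-satellite relation (illustrative names, not the paper's)."""
    name: str                      # rhetorical relation, e.g. "Elaboration"
    nucleus: str                   # the central communicative act
    satellites: list = field(default_factory=list)  # supporting effects or sub-relations


def effects(relation):
    """Flatten a relation tree into the visual effects it calls for."""
    found = []
    for sat in relation.satellites:
        if isinstance(sat, Relation):
            found.extend(effects(sat))   # descend into a nested relation
        else:
            found.append(sat)            # a leaf is a concrete visual effect
    return found


# The elaboration example from the text: two effects support one act.
unhappy = Relation(
    name="Elaboration",
    nucleus="show that character A is unhappy",
    satellites=["dim lighting", "muted colour scheme"],
)
print(effects(unhappy))
```

Keeping satellites as either plain effects or nested relations lets one structure describe both a single relation and a whole tree of them.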
The transformation to the visual domain requires adapting many rhetorical relations and creating new ones where necessary. Transforming RST to the area of animation requires three analogies to be created between text and animations: (1) Author ⇒ Animator. (2) Text ⇒ Images. (3) Phrases ⇒ Scene Presentation and Structure. These analogies allow us to apply RST principles to the concepts of visual communication. The scene presentation and structure being referred to here are specifically the techniques outlined previously as being a part of cinematography. RST provides a structure for reasoning about high-level communicative acts, and producing coherent visual “texts” for performing these acts.

3.3 RST/Planning Integration
The planner itself is a forward-chaining planner. Planners were initially envisioned as action planners for agents interacting with the real world. The planning performed for this paper is somewhat like the planning that is required to produce a paragraph of text. Although a paragraph of text does unfold over time, the arrangement of sentences is guided more by the requirements of logic and rhetoric than by the strict requirements of cause and effect.
A planner works by assembling discrete plan steps together in a logical framework that achieves its goals. The plan steps are actually RST-style rhetorical relations. The planner works by combining these relations into an RST tree that achieves the communicative act “goals” of the animator who is presenting the animation to the viewer. The RST plan is thus a rhetorical structure that implements a plan of action to communicate something to the viewer. The main task of the planner is to create a plan that does not contain contradictions.

The communicative acts requested by the animator are matched to high-level RST plan steps that have the desired effect when implemented. Higher-level RST plan steps will contain lists of lower-level plan steps that are required to attain the goal of the higher-level plan step. These lower-level plan steps are usually expressed as AND/OR/SOME combinations, meaning that all, one, or more than one of the lower-level plan steps must be successful for the higher-level plan step to be considered accomplished. At some point, the RST-style plan steps give way to direct cinematography actions that must be attempted. At this point, a contradiction is easily encountered because of conflicts with earlier branches of the RST plan tree. For example, an earlier plan step might have called for a calming mid-left screen placement for a character, but a new plan step calls for placement near a screen edge. The planner must prevent these conflicts from taking place and must circumvent them by backtracking.

Although the planner does act at two distinct levels, that is, the RST plan step level and the cinematography action level, the planner cannot be considered a hierarchical planner. This is because the lowest level of the tree, a cinematography action, does not actually require any planning. Cinematography actions are called for by RST plan steps and they must be carried out for the RST plan step to be fulfilled.
Hence all actual planning is restricted to the RST planning and does not occur for the lower levels of the plan tree. Future work in this area may necessitate a move to an explicitly hierarchical planner that can plan about communicative acts at the levels of narrative and acting as well as cinematography.
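The AND/OR/SOME expansion and the backtracking on contradictions described in this section can be sketched as a small generator-based search. This is an illustration under assumptions, not the planner used in the paper: the tuple encoding of plan steps, the attribute names, and the functions `plan` and `_all` are all hypothetical, and a plain depth-first search stands in for whatever control strategy the actual planner uses.

```python
from itertools import combinations


def plan(step, state):
    """Yield every consistent extension of `state` that fulfils `step`.

    `state` maps a cinematography attribute to its committed value.
    Generators provide the backtracking: when a later step conflicts,
    the caller simply asks for the next alternative.
    """
    kind = step[0]
    if kind == "ACT":                      # ("ACT", attribute, value)
        _, attr, value = step
        if state.get(attr, value) == value:
            yield {**state, attr: value}   # no contradiction, so commit
    elif kind == "AND":                    # all children must succeed
        yield from _all(step[1], state)
    elif kind == "OR":                     # one child must succeed
        for child in step[1]:
            yield from plan(child, state)
    elif kind == "SOME":                   # more than one child must succeed
        for size in range(len(step[1]), 1, -1):
            for subset in combinations(step[1], size):
                yield from _all(subset, state)


def _all(children, state):
    """Fulfil every child in order, threading the state through."""
    if not children:
        yield state
        return
    for partial in plan(children[0], state):
        yield from _all(children[1:], partial)


# A calming step offers two options; a later step demands an edge placement.
goal = ("AND", [
    ("OR", [("ACT", "placement", "mid-left"),
            ("ACT", "lighting", "soft")]),
    ("ACT", "placement", "screen-edge"),
])
first = next(plan(goal, {}), None)
print(first)
```

In this example the search first commits to the mid-left placement, hits the contradiction with the edge placement, and backtracks to the soft-lighting alternative, mirroring the screen-placement conflict described above.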
4 Related Work
Much recent work has investigated the problem of automatically placing a camera to view a 3D environment. Gleicher and Witkin built a system that allows the user to control the camera by placing constraints on the image that the camera should produce [12]. Philips et al. created a system that integrates camera movement with the movement of virtual human characters so that the camera views the tasks undertaken by the virtual humans. The CINEMA system developed by Drucker et al. provides a language for users to specify camera movements procedurally [10]. Bares et al. have written several papers on the problems of positioning a camera correctly in complex 3D worlds [2,4]. They use a real-time constraint solver to position a camera in a complex interactive 3D world in a way that views characters and fulfills a cinematic goal. Another approach they have taken is to model the user’s preferences to create a user-optimal camera placement to view a virtual 3D world [3]. Bares et al. also use a constraint solver to find a solution to various user-imposed camera viewing requirements [5]. Halper and Olivier use a genetic algorithm to find a good camera placement in highly constrained situations [13].

Christianson et al., using established camera placement idioms from film textbooks, created a language called the Declarative Camera Control Language (DCCL) for encoding cinematography idioms [8], for example, a three-shot sequence depicting one actor approaching another. They also created a virtual cinematographer that used a finite state machine approach to operationalize cinematography idioms used in filming conversations [14]. Tomlinson et al. [18] integrate a cinematography agent into a virtual animated agent environment. This cinematography agent uses lighting changes and camera angles to reflect emotions of the virtual agents. Butz [7] created a system to create short animations to explain the function of technical devices. Communicative goals are achieved using effects like pointing the camera and shining spotlights to highlight certain components and actions.

Whereas these other works deal with automation of technical aspects of cinematography, the system that we present combines these techniques to bolster the communicative nature of cinematography. Some of these systems could be integrated with the system outlined in this paper in a way that would enhance the generality of the system to handle more difficult camera placement requirements. The camera placement idiom research is applicable to this paper in that it provides a general way to represent cinematography idioms, a task that is not attempted in this paper. This paper makes use of idioms that achieve certain communicative goals and applies them directly, without concern for a global representation of such idioms.
5 Conclusion and Future Work
This paper has described the area of computer animation, and how it bridges both traditional animation and real-life film. The approach taken in this paper is to take the cinematography techniques of traditional film, and apply them using automated methods in the computer graphics medium. Animation is fundamentally a form of communication between animator and viewer. By codifying the communicative acts that can be achieved with cinematography, we have created a tool that helps animators in enhancing the communicative power of their animations using cinematography techniques. In performing this task we have identified a large body of knowledge in cinematography textbooks that can be captured using a qualitative approach to knowledge representation. The applied techniques of cinematography are governed by discrete and concise rules that can be captured with modern AI techniques. The main task of the knowledge engineering lies in capturing the meaning of the visual vocabulary of cinematography. Our research achieves this in several important areas of cinematography.
Operationalization of this cinematography knowledge requires the ability to plan the application of the knowledge to the problem domain. By considering the act of creating an animation as a type of visual communication, the application of cinematography knowledge to real animations is performed by a communicative act planner that draws on techniques from Rhetorical Structure Theory. This planner allows an animator to enact specific communicative acts through applied cinematography. By using an RST-inspired planning paradigm, the animation assistant can generate coherent communicative plans that apply cinematography knowledge in a rational fashion. Future areas of research extending from this paper could involve: – Robust reasoning about temporal organization. The current system handles the arrangement of cuts between shots using a script-like system. It would be more interesting to reason more fully about the mechanisms of shot sequence structures and the influence of animation action on choosing cut timing. This aspect of the research would also require solving problems dealing with the narrative structure. Others have developed methods for dealing with space and time in image sequences (such as the logic developed in [9]). However, these methods are interested in media in which real time and media time are essentially equivalent. In cinematography time can be used to create effects or to map parallel events to sequential camera shots. These issues need to be captured in a temporal reasoner. – More robust camera placement. Other cinematography research has dealt with the issue of placing cameras in tightly constrained situations, and actively evaluating placements based on the output images. It would be interesting to coordinate this research with such a system to create a more general cinematography tool. Such an extension would also allow a wider range of actions or settings to be expressed. – Narrative and virtual characters. 
It is possible to imagine a system that integrates animation presentation issues with character actions, intentions and beliefs. The organization of these characters in front of the camera, all in the name of satisfying communicative acts, poses an interesting problem.
References
1. E. André and T. Rist. The Design of Illustrated Documents as a Planning Task. In M. T. Maybury, editor, Intelligent Multimedia Interfaces, pages 94–116. American Association for Artificial Intelligence, 1993.
2. W. H. Bares, J. P. Gregoire, and J. C. Lester. Realtime Constraint-Based Cinematography for Complex Interactive 3D Worlds. In Proceedings of the Tenth National Conference on Innovative Applications of Artificial Intelligence, pages 1101–1106. American Association of Artificial Intelligence, July 1998.
3. W. H. Bares and J. C. Lester. Cinematographic User Models for Automated Realtime Camera Control in Dynamic 3D Environments. In Proceedings of the Sixth International Conference on User Modelling, pages 215–226, June 1997.
4. W. H. Bares and J. C. Lester. Intelligent Multi-Shot Visualization Interfaces for Dynamic 3D Worlds. In Proceedings of the 1999 International Conference on Intelligent User Interface, pages 119–126, 1999.
5. W. H. Bares, S. Thainimit, and S. McDermott. A Model for Constraint-Based Camera Planning. In Smart Graphics: Papers from the 2000 AAAI Symposium, pages 84–91. AAAI Press, 2000.
6. D. Brill. LOOM reference manual version 2.0. Technical report, University of Southern California, Los Angeles California, 1993.
7. A. Butz. Anymation with CATHI. In AAAI’97 Proceedings of the 14th National Conference on Artificial Intelligence, pages 957–962. AAAI Press, 1997.
8. D. B. Christianson, S. E. Anderson, L. He, D. H. Salesin, D. S. Weld, and M. F. Cohen. Declarative Camera Control for Automatic Cinematography. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference, volume 1, pages 148–155. AAAI Press/The MIT Press, 1996.
9. A. Del Bimbo and E. Vicario. Specification by-Example of Virtual Agents Behavior. IEEE Transactions on Visualization and Computer Graphics, 1(4):350–360, December 1995.
10. S. M. Drucker, T. A. Galyean, and D. Zeltzer. CINEMA: A System for Procedural Camera Movements. In Proceedings 1992 Symposium on Interactive 3D Graphics. ACM, March–April 1992.
11. G. Ferguson, J. Allen, and B. Miller. TRAINS-95: Towards a Mixed-Initiative Planning Assistant. In Proceedings of the Third Conference on Artificial Intelligence Planning Systems, pages 70–77. American Association of Artificial Intelligence, May 1996.
12. M. Gleicher and A. Witkin. Through-the-Lens Camera Control. Computer Graphics, 26(2):331–340, July 1992.
13. N. Halper and P. Olivier. CAMPLAN: A Camera Planning Agent. In Smart Graphics: Papers from the 2000 AAAI Symposium, pages 92–100. AAAI Press, 2000.
14. L. He, M. F. Cohen, and D. H. Salesin. The Virtual Cinematographer: A Paradigm for Automatic Real-Time Camera Control and Directing. In Computer Graphics Proceedings, Annual Conference Series, pages 217–224. ACM, August 1996.
15. P. C. Litwinowicz. Inkwell: A 2 1/2-D Animation System. Computer Graphics, 25(4):113–121, July 1991.
16. W. C. Mann and S. A. Thompson. Rhetorical Structure Theory: Toward a Functional Theory of Text Organization. Text, 8(13):243–281, 1988.
17. C. Mellish, M. O’Donnell, J. Oberlander, and A. Knott. An Architecture for Opportunistic Text Generation. In Proceedings of the Ninth International Workshop on Natural Language Generation, 1998.
18. B. Tomlinson, B. Blumberg, and D. Nain. Expressive Autonomous Cinematography for Interactive Virtual Environments. In Proceedings of the Fourth International Conference on Autonomous Agents, pages 317–324. ACM Press, June 2000.
19. D. Weld and J. de Kleer. Qualitative Reasoning about Physical Systems, ed. Morgan Kaufmann, 1990.
20. H. Zettl. Sight Sound Motion: Applied Media Aesthetics. Wadsworth Publishing Company, 1990.
Mining Incremental Association Rules with Generalized FP-Tree

Christie I. Ezeife¹ and Yue Su²

¹ School of Computer Science, University of Windsor, Windsor, Ontario, Canada N9B 3P4
[email protected]
http://www.cs.uwindsor.ca/users/c/cezeife
² School of Computer Science, University of Windsor
Abstract. New transaction insertions and old transaction deletions may lead to previously generated association rules no longer being interesting, and new interesting association rules may also appear. Existing association rule maintenance algorithms are Apriori-like and mostly need to scan the entire database several times in order to update the previously computed frequent or large itemsets, in particular when some previously small itemsets become large in the updated database. This paper presents two new algorithms that use the frequent patterns tree (FP-tree) structure to reduce the required number of database scans. One proposed algorithm is the DB-tree algorithm, which stores all the database information in an FP-tree structure and requires no re-scan of the original database for all update cases. The second algorithm is the PotFP-tree (Potential frequent pattern) algorithm, which uses a prediction of future possible frequent itemsets to reduce the number of times the original database needs to be scanned when previously small itemsets become large after a database update.

Keywords: Incremental maintenance, association rules mining, FP-tree structure
1 Introduction
Databases, and in particular data warehouses, contain large amounts of data that grow or shrink over time due to insertion, deletion and modification of transactions (records). Association rule mining is a data mining technique which discovers strong associations or correlation relationships among data. Given a set of transactions (similar to database records in this context), where each transaction consists of items (or attributes), an association rule is an implication of the form X → Y,
This research was supported by the Natural Science and Engineering Research Council (NSERC) of Canada under an operating grant (OGP-0194134) and a University of Windsor grant.
R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 147–160, 2002. © Springer-Verlag Berlin Heidelberg 2002
where X and Y are sets of items and X ∩ Y = ∅. The support of this rule is defined as the percentage of transactions that contain both X and Y, while its confidence is the percentage of the transactions containing X that also contain Y. In association rule mining, all itemsets with support no lower than a specified minimum support are called large or frequent itemsets. An itemset X is called an i-itemset if it contains i items. Agrawal et al. [1] present the concept of association rule mining; an example of a simple rule is “80% of customers who purchase milk and bread also buy eggs”. Since discovering all such rules may help market basket or cross-sales analysis, decision making, and business management, algorithms presented in this research area include [1,7,6]. These algorithms mainly focus on how to efficiently generate frequent patterns and how to discover the most interesting rules from the generated frequent patterns. However, when the database is updated, the discovered association rules may change. Some old rules may no longer be interesting, while new rules may emerge.

Four types of scenarios may arise with the generated frequent itemsets (frequent patterns) when an original database is updated: (1) frequent itemsets in the original database (F) remain frequent in the newly updated database (F’), (2) frequent itemsets in the original database (F) now become small itemsets in the new database (S’), (3) small itemsets in the original database (S) remain small in the new database (S’), and (4) small itemsets in the original database (S) now become frequent in the new database (F’). The symbols F and F’ stand for frequent patterns in the original and new databases respectively, while S and S’ stand for small itemsets in the original and new databases respectively. The four types of changes that may occur with frequent patterns following an update in the database can be summarized as: (1) F → F’, (2) F → S’, (3) S → S’, (4) S → F’.
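As a concrete illustration of these two measures, the following sketch computes support and confidence for one rule. It assumes the usual convention that the support of X → Y counts the transactions containing every item of X and of Y; the function names and the basket data are invented for this example and do not come from the paper.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y.

    Assumes X occurs in at least one transaction (no zero-division guard).
    """
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Hypothetical market baskets, invented for this illustration.
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]
x, y = {"milk", "bread"}, {"eggs"}
print(support(x | y, baskets))      # support of the rule X -> Y
print(confidence(x, y, baskets))    # confidence of the rule X -> Y
```

Here two of the five baskets contain milk, bread and eggs, and three contain milk and bread, so the rule has support 2/5 and confidence 2/3.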
Incremental maintenance of association rules involves a technique that uses mostly only the updated part of the database, not the entire new database (consisting of the original data plus the updated part), to maintain association rules. Existing work on incremental maintenance of association rules with better performance than the Apriori algorithm [1] includes [3,2,8]. These incremental maintenance algorithms are still Apriori-like: long lists of candidate itemsets are generated at each level, and either the updated part of the new database (for F → F’, F → S’, and S → S’) or the entire new database (for S → F’) has to be re-scanned to obtain their supports. Han et al. [5] propose a tree structure for mining frequent patterns without candidate generation, but applying this technique directly to the problem of incremental maintenance of association rules does not produce optimal results, as the new database needs to be re-scanned for the S → F’ case.

In this paper, we present two new algorithms (DB-tree and PotFP-tree algorithms) for efficiently mining association rules in an updated database using the FP-tree structure. These algorithms aim at removing the need to re-scan the entire new database and re-construct the FP-tree when the case S → F’ arises. The algorithms work by only scanning the updated parts of the database once (at
most twice) without generating candidate sets like previous incremental association rules algorithms. Thus, the proposed algorithms achieve good performance.

1.1 Related Work
The problem of mining association rules is decomposed into two subproblems, namely (1) generating all frequent itemsets in the database and (2) generating association rules in the database according to the frequent itemsets generated in the first step. The Apriori algorithm [1] is designed for generating association rules. The basic idea of this algorithm is to find all the frequent itemsets iteratively. In the first iteration, it finds the frequent 1-itemsets L1 (each frequent 1-itemset contains only one item). To obtain L1, it first generates the candidate set C1, which contains all 1-itemsets of the basket data; then the database is scanned to compute the support of each itemset in C1. The itemsets with support greater than or equal to the minimum support (minsupport) are chosen as the frequent itemsets L1. The minsupport is provided by the user before mining. In the next iteration, the apriori-gen function [1] is used to generate the candidate set C2 by joining L1 with itself and keeping all unique itemsets with 2 items in each. The frequent itemsets L2 are again computed from the set C2 by selecting the itemsets that meet the minsupport requirement. The iterations go on by applying the apriori-gen function until Li or Ci is empty. Finally, the frequent itemsets L are obtained as the union of all L1 to Li−1.

The FUP2 algorithm is proposed in [3] to address the incremental maintenance problem for association rule mining. This algorithm utilizes the idea of the Apriori algorithm to find the frequent itemsets iteratively by scanning mostly only the updated part of the new database. It scans the entire new database when small itemsets become large. The MAAP algorithm [8] computes the frequent patterns from the higher level, reusing the old frequent patterns and reducing the need to generate long lists of lower-level candidate sets; it thus performs better than the FUP and FUP2 techniques, but it is still level-wise. In [4,5], a compact data structure called frequent pattern tree (FP-tree) is proposed.
The FP-tree stores all the frequent patterns on the tree before mining the frequent patterns using the FP-tree algorithm. Constructing the FP-tree requires two database scans (one scan for constructing and ordering the frequent patterns and the second for building branches of the tree). The FP-tree algorithm brings about better performance than the Apriori-like algorithms because it needs far fewer database scans. However, applying this technique directly for incremental maintenance of frequent patterns still requires the usual maximum of two database scans on the entire new database when previously small items become large.

1.2 Contributions
This paper contributes by proposing two new algorithms, the DB-tree and Potential Frequent Pattern tree (PotFP-tree) algorithms, for efficiently mining incremental association rules in an updated database. The proposed algorithms store frequent patterns on a more generalized FP-tree, which stores tree branches for all items
in the database (in the case of DB-tree), or for all items that are frequent or potentially frequent in the near future (in the case of PotFP-tree). The proposed algorithms eliminate the need to scan the old database in order to update the FP-tree structure when previously small itemsets become large in the new database.

1.3 Outline of the Paper
The rest of the paper is organized as follows: section 2 presents an example mining of a database and its update using the basic FP-tree algorithm; section 3 presents formal details of the proposed algorithms with examples; section 4 presents performance analysis of the algorithms; and finally, section 5 presents conclusions and future work.
2 Mining An Example Database with FP-Tree
The DB-tree and PotFP-tree algorithms being proposed in this paper are more generalized forms of the FP-tree algorithm, and we first apply the FP-tree algorithm to an example in section 2.1 before applying FP-tree to an updated database in section 2.2.

2.1 FP-Tree Algorithm on a Sample Database
Suppose we have a database DB with set of items, I={a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p} and MinSupport=60% of DB transactions. A simple database transaction table for illustrating the idea is given in the first two columns of Table 1.
Table 1. The Example Database Transaction Table with Ordered Frequent Items

TID   Items   Ordered Frequent Items
100   …       f, c, a, m, p
200   …       f, c, a, b, m
300   …       f, b
400   …       c, b, p
500   …       f, c, a, m, p
To compute the frequent itemsets in the old database (Table 1), FP-tree algorithm first constructs the FP-tree from the original database to obtain the extended prefix-tree structure (Figure 1), which stores all the frequent patterns in each record of the database as shown in Figure 1. Secondly, the algorithm
Fig. 1. The FP-tree of the Example Database
mines or computes the frequent itemsets in the database by mining the frequent patterns on the FP-tree that satisfy minimum support.

To construct the FP-tree, the algorithm scans the original DB (first two columns of Table 1) once to obtain, in descending order of support, all the frequent 1-itemsets L1 (with support of at least 3 records): < (f : 4), (c : 4), (a : 3), (b : 3), (m : 3), (p : 3) >. The number accompanying each item is its support, that is, the number of records of the DB in which it occurs. Using this sorted list obtained after the first scan of the database, each record of the database is then scanned again to obtain the third column of Table 1, which shows the frequent items present in each transaction in descending order of frequency. Starting from the first transaction, once the frequent items in a transaction are obtained, they are inserted in the appropriate branch of the FP-tree. For example, since the first transaction (100) has frequent items (f,c,a,m,p), a branch of the tree is constructed from the root with f as its immediate child, c as the child of f, a as the child of c, and so on. Since these tree nodes have occurred only once so far, a count of 1 is recorded for each node. Reading the second transaction (200) with frequent items (f,c,a,b,m) causes the previous nodes f, c, and a of the first branch to have a count of 2 each. However, a child node of a is now b with a count of 1, while a child node of b is m with a count of 1. To handle the frequent items (f,b) of transaction 300, a branch b with a count of 1 is created from node f, which now has a count of 3. For the frequent items (c,b,p) of transaction 400, since there is no current branch on the tree to share a path with, a branch c with count 1 is created from the root. Finally, the frequent items for transaction 500 (f,c,a,m,p) are inserted on the first branch of the tree and the counts of the nodes are incremented to give the final FP-tree shown in Figure 1.
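The insertion procedure just described can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `Node`, `build_fp_tree` and `show` are hypothetical, and the sketch starts from the ordered frequent-item lists quoted in the walk-through, so the first database scan is not shown.

```python
class Node:
    """One FP-tree node: an item, its count, and its children by item."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}


def build_fp_tree(ordered_transactions):
    """Insert each ordered frequent-item list as a path from the root."""
    root = Node(None)
    header = {}                       # item -> list of its nodes (node-links)
    for items in ordered_transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:         # no shared prefix here: open a new branch
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1          # one more transaction passes through
            node = child
    return root, header


def show(node, depth=0):
    """Return the tree as indented 'item:count' lines, depth first."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(show(child, depth + 1))
    return lines


ordered = [
    ["f", "c", "a", "m", "p"],   # TID 100
    ["f", "c", "a", "b", "m"],   # TID 200
    ["f", "b"],                  # TID 300
    ["c", "b", "p"],             # TID 400
    ["f", "c", "a", "m", "p"],   # TID 500
]
root, header = build_fp_tree(ordered)
print("\n".join(show(root)))
```

The `header` dictionary plays the role of the item header table with node-links: for example, it records two separate p nodes, one on the f branch and one on the c branch.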
To mine the frequent patterns from the constructed FP-tree, starting from the lowest frequent item on the item header list (pointing to the nodes of the tree containing this frequent item), it derives the frequency of this item and continues to define all paths (branches) that occurred with this item and their counts. These paths make the item’s conditional pattern base. The frequent patterns derived from these pattern bases consist of only those prefix items (before the
item) whose combined support is at least the minimum support. For example, from the FP-tree of Figure 1, the lowest frequent item is p and, following the header pointer, a frequent pattern (p:3) can be derived from the two nodes in which it appears. The two p nodes come from the conditional pattern bases < (f : 2, c : 2, a : 2, m : 2) > and < (c : 1, b : 1) >. Since only the prefix c has a combined support of 3 from these conditional pattern bases, it is the only prefix that can combine with p to form the frequent pattern cp:3. This concludes the search for frequent patterns of p. For item m, the conditional pattern bases are < (f : 2, c : 2, a : 2) > and < (f : 1, c : 1, a : 1, b : 1) >. Only the items in the path before the considered frequent item are considered in the analysis. The frequent pattern of m that can be derived from these bases is < (f : 3, c : 3, a : 3) >. Thus, the derived frequent pattern is fcam:3. Obtaining fcam as a frequent itemset is equivalent to having all of its subsets confirmed frequent. Thus, {(m:3), (am:3), (cm:3), (fm:3), (cam:3), (fam:3), (fcam:3), (fcm:3)} are all frequent. The conditional pattern base and conditional FP-tree of other items can be obtained in a similar way to give Table 2. Thus, the result of mining the original DB with FP-tree is the set of maximal frequent patterns {cp}, {fcam} and {b}. All the subpatterns of these frequent patterns are also frequent.
Table 2. Mined Conditional Pattern Bases and Frequent Patterns of Example DB

Item   Conditional Pattern Bases
p      {(f:2,c:2,a:2,m:2), (c:1,b:1)}
m      {(f:2,c:2,a:2), (f:1,c:1,a:1,b:1)}
b      {(f:1,c:1,a:1), (f:1), (c:1)}
a      {(f:3,c:3)}
c      {(f:3)}
f      ∅
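The derivation of the conditional pattern bases can be illustrated with a small Python sketch. It makes a simplifying assumption that is not part of the original algorithm: instead of following node-links through the FP-tree, it groups identical prefixes directly from the sorted frequent items of each transaction, which yields the same bases because identical prefixes share one tree path. The function names are hypothetical.

```python
from collections import Counter


def conditional_pattern_base(sorted_transactions, item):
    """Map each non-empty prefix that precedes `item` to its count."""
    base = Counter()
    for items in sorted_transactions:
        if item in items:
            prefix = tuple(items[:items.index(item)])
            if prefix:
                base[prefix] += 1
    return base


def frequent_prefix_items(base, min_count):
    """Prefix items whose combined support over the base reaches min_count."""
    support = Counter()
    for prefix, count in base.items():
        for prefix_item in prefix:
            support[prefix_item] += count
    return {i: n for i, n in support.items() if n >= min_count}


# Frequent items of transactions 100 to 500, in descending order of support.
transactions = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

base_p = conditional_pattern_base(transactions, "p")
base_m = conditional_pattern_base(transactions, "m")
frequent_with_p = frequent_prefix_items(base_p, 3)
frequent_with_m = frequent_prefix_items(base_m, 3)
```

Here `frequent_with_p` contains only c with support 3 (the pattern cp:3), while `frequent_with_m` contains f, c and a with support 3 each (the pattern fcam:3).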
If we update the DB such that transactions 600 (p,f,m,g,o,l), 700 (q,f,b,p,l,w), 800 (f,c,b,e,m,o), 900 (f,b,p,m,a,l) and 1000 (f,t,a,b,p,l,o) are added, we get an updated database DB’ to which the FP-tree algorithm can again be applied. The set of items in the DB has also changed to include the items {q, t, w}. Since many of the incremental maintenance algorithms, such as FUP [2], FUP2 [3], and MAAP [8], aim at improving performance by updating the generated frequent itemsets without scanning the entire database but only the updated part, this section discusses how the FP-tree algorithm can be used to update the frequent itemsets without the need to re-scan the entire database. With the above update (insertions) made to the example database of Table 1, scanning only the updated part of the database reveals that the frequent patterns
from only the updated part of the database are (p:4, l:4, m:3, o:3), while the previous frequent patterns of the old database on which the original FP-tree is constructed are (f:4, c:4, a:3, b:3, m:3, p:3). Without the need to re-scan the entire database consisting of the old database and the updated part, the updated frequent patterns in descending order can be obtained from the two FP lists above as: (p:7, m:6, f:4, c:4, l:4, a:3, b:3, o:3). It can be seen that the order of frequency has changed for some items, such as p (going from the least frequent item to the most frequent item) and m (moving from the second least frequent item to the second most frequent item). Some previously infrequent items, such as l and o, are now frequent. With these kinds of changes to the database, the only way to update the FP-tree is to rescan the entire database at least one more time to re-construct the tree, for two reasons. First, correct mining depends on the order of the items. Second, transactions of the old database that contain previously infrequent items like l and o have to be scanned again to include these newly frequent items in their tree branches. The cost of re-scanning and re-constructing the FP-tree when the database contains thousands of items and millions of rows or transactions could be quite significant. Thus, this paper proposes two ways to eliminate the need to re-scan the entire database when the database is updated through insertions, deletions or modifications of transactions.
3 The Proposed Incremental Generalized FP-Tree Algorithms
This section presents two algorithms proposed for mining frequent itemsets incrementally using a more generalized FP-tree structure. Section 3.1 discusses the DB-Tree algorithm, while Section 3.2 presents the PotFP-Tree (potential FP-Tree) algorithm. The PotFP-Tree algorithm keeps information about some previously infrequent items which are predicted to have a high potential of becoming frequent in the near future, and the DB-tree keeps the entire database on a generalized FP-tree with minimum support of 0 or 1.
3.1 Mining Incremental Rules with DB-Tree
DB-Tree is a generalized form of FP-tree which stores, in descending order of support, all items in the database, as well as the counts of all items in all transactions in the database, in its branches. The DB-tree is constructed the same way the FP-tree is constructed except that it includes all the items instead of only the frequent 1-itemsets. Thus, like the FP-tree, it takes two database scans to construct the DB-tree. The DB-tree has more branches and more nodes than the FP-tree and thus needs more storage than the FP-tree. However, the DB-tree is still much smaller than the database since items share paths in the tree structure. A DB-tree can be seen as an FP-tree with a minimum support of 0. This means that the DB-tree contains an FP-tree on top. At any point in time, the desired FP-tree can be projected from the DB-tree based on the minimum support.
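The relation between the two structures can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tree is a nested dictionary, the four transactions are hypothetical (they are not the database of Table 1), ties in support are broken alphabetically, and the function names are invented for the example. Because items are stored in descending order of support, every node below a small item is also small, so the projection can stop at the first small node on each branch.

```python
from collections import Counter


def build_db_tree(transactions):
    """Build a tree over ALL items, ordered by descending global support."""
    counts = Counter(item for t in transactions for item in set(t))
    root = {}
    for t in transactions:
        children = root
        # Sort by descending support, ties broken by item name (an assumption).
        for item in sorted(set(t), key=lambda i: (-counts[i], i)):
            node = children.setdefault(item, {"count": 0, "children": {}})
            node["count"] += 1
            children = node["children"]
    return root, counts


def project_fp_tree(children, counts, min_count):
    """Copy only the top part of the tree whose items are still large."""
    return {
        item: {
            "count": node["count"],
            "children": project_fp_tree(node["children"], counts, min_count),
        }
        for item, node in children.items()
        if counts[item] >= min_count
    }


# Hypothetical toy database, used only to illustrate the projection.
transactions = [["a", "b", "x"], ["a", "b"], ["a", "c", "y"], ["b", "c"]]
db_tree, counts = build_db_tree(transactions)
fp_tree = project_fp_tree(db_tree, counts, min_count=2)
```

In this toy case the small items x and y hang at the leaves of `db_tree`, and `fp_tree` is the same tree that would be built from the transactions after removing x and y.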
Fig. 2. DB-Tree of Example Database (tree diagram with its item header table and node-links)
The lower the minimum support of a desired FP-tree, the closer it is to a DB-tree. The DB-tree of the database in Table 1 is given as Figure 2. In Figure 2, all patterns of the database are included; solid circles indicate frequent items while dotted circles indicate small items. On top of the DB-tree is the FP-tree. Mining the frequent patterns from the DB-tree requires first projecting the FP-tree from the DB-tree and then mining the patterns from the FP-tree. Suppose two transactions, 600 (a, c, f, m, g, o, l) and 700 (f, b, a, c, l, m, o, n), are inserted into the database of Table 1, and we want to update the frequent itemsets with a support of 60%. This update will not cause any database scan using the DB-tree algorithm, while with previous algorithms like FUP, FUP2 and MAAP, the original database will need to be scanned several times. The original FP-tree algorithm also needs to scan the original database once to update the FP-tree, since some items (like l and o) that were not previously large have now become large. Handling this update with the DB-tree algorithm requires just scanning the two new transactions to update the DB-tree. The occurrences after the two transaction insertions are: (f:6), (c:6), (a:5), (b:4), (m:5), (l:4), (o:4), (p:3), (g:2), (n:2), (j:1), (k:1), (s:1), (e:1). The DB-tree after the insertion is given as Figure 3. The minimum occurrence in the updated database is 4 for frequent itemsets. It can be seen that item p is no longer large in the new database. It is not necessary to re-order the tree when only the relative order of the frequent items changes; it is necessary to re-order when either frequent items become small or small items become frequent.
Fig. 3. DB-Tree after Inserted Transactions (tree diagram with its item header table and node-links)
To project the FP-tree from the DB-tree, we start from the root of the DB-tree and follow each branch for as long as the next node is still large. The cost of projecting an FP-tree is equal to the cost of traversing the FP-tree once. The projected FP-tree from the updated DB-tree of Figure 3 is given as Figure 4.
3.2 Mining Incremental Rules with PotFP-Tree Algorithm
The PotFP-tree picks the items to store on the tree using a principle that lies in between the two extremes of the FP-tree (storing only frequent items) and the DB-tree (storing all items). In addition to the frequent items, the PotFP-tree stores items that are not frequent at present but have a high probability of being frequent after the database update. Updating the database would entail following the PotFP-tree to update each node’s count (for deletion or insertion), or adding new branches into the PotFP-tree (for some insertions). The small items in the original database can be divided into two groups, namely, (1) those that are not large now but may be large after the database update (called the potentially frequent items, P), and (2) those that are small now and are highly likely to remain small after the update of the database, M. We can give a tolerance t when constructing and re-constructing the FP-tree, which is equivalent to the watermark in [5]. The watermark is defined as the minimum support that most mining processes are based on over a period of time. For example, if over a year, 60% of all mining processes were based on minimum support of ≥20, then 20 is the watermark, meaning that if more than
Fig. 4. FP-tree of the Updated DB projected from DB-tree (tree diagram with its item header table and node-links)
20 transaction updates occur, the FP-tree needs to be re-constructed. Thus, the idea here is to use a tolerance t that is slightly lower than the support that most mining processes have been based on recently (the average minsupport). Keeping small items with support less than the average minsupport, but at least t, will benefit the incremental mining process. Database items with support s, where t ≤ s ≤ average minsupport, are the potentially frequent items, which are not part of the current FP-tree but are included in the PotFP-tree structure so that the entire old database need not be scanned to update the FP-tree when updates occur in the database. As in the DB-tree, the FP-tree sits on top of the PotFP-tree while the patterns involving potentially frequent items are near the leaves of the tree. The advantage of the PotFP-tree over the FP-tree is that if all the items that become large after a database update belong to group P, the PotFP-tree algorithm does not require a scan of the original database. However, if some items in group M become large, it needs to scan the original database like the FP-tree. This algorithm thus reduces the number of times the entire database needs to be scanned due to updates. How much is gained in response time by not scanning the entire database depends on the choice of t. An experiment in the next section is used to examine what would constitute reasonable values for t.
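The split of items into frequent items, group P and group M can be sketched as below. This is a minimal sketch under assumptions: supports are expressed as absolute transaction counts, an item whose support equals the minimum support is treated as frequent, and the item counts, the function name `group_items` and the chosen values of the minimum support and t are hypothetical.

```python
def group_items(counts, min_count, tolerance):
    """Split items into frequent, potentially frequent (P) and small (M)."""
    frequent, potential, small = set(), set(), set()
    for item, support in counts.items():
        if support >= min_count:
            frequent.add(item)    # stored in the FP-tree part on top
        elif support >= tolerance:
            potential.add(item)   # group P: kept near the leaves of the PotFP-tree
        else:
            small.add(item)       # group M: not stored on the PotFP-tree
    return frequent, potential, small


# Hypothetical item supports, minimum support of 3 and tolerance t of 2.
counts = {"f": 4, "c": 4, "a": 3, "l": 2, "o": 2, "d": 1}
frequent, potential, small = group_items(counts, min_count=3, tolerance=2)
```

With these values, l and o fall into group P, so an update that makes only l and o large would not force a scan of the old database, whereas d falls into group M and would force one if it became large.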
4 Experimental and Performance Analysis
A performance comparison of the DB-tree and PotFP-tree algorithms with the original FP-tree and Apriori algorithms was conducted, and the results of the experiments are presented in this section. All four algorithms were implemented and run on the same datasets, generated using the synthetic dataset generation code [1] downloaded from http://www.almaden.ibm.com/cs/quest/syndata.html. The correctness of the
implementations was confirmed by checking that the frequent itemsets generated for the same dataset by the four algorithms are the same. The experiments were conducted on a 733 MHz P3 PC with 256 megabytes of main memory running the Linux operating system. The programs were written in C++. The transactions in the dataset mimic the transactions in a retail environment. The results of two experiments are reported as follows.
– Experiment 1: Given a fixed size dataset (the inserted and deleted parts of the dataset are also fixed), we test CPU execution time at different support thresholds to compare the DB-tree, PotFP-tree, FP-tree and Apriori algorithms. The aim of this experiment is to show that the performance of the PotFP-tree algorithm is better than that of the FP-tree and Apriori algorithms at different levels of support using the same dataset size. The number of transactions (D) in this dataset is one hundred thousand records, that is, |D| = 100,000; the average size of transactions (number of items in a transaction) (T) is 10, or |T| = 10; the average length of maximal patterns (that is, the average number of items in the longest frequent itemsets) (I) is 6, or |I| = 6; and the number of items (N) (the total number of attributes) is one thousand, N = 1000. The size of the inserted dataset is 10,000 records and the size of the deleted dataset is 10,000 records. These parameters are abbreviated as T10.I6.D100K-10K+10K with 1000 items. The support thresholds are varied between 0.1% and 6%, meaning that for a support level of 0.1%, an itemset has to appear in 100 (one hundred) or more transactions to be taken as a frequent itemset, while with a support of 6%, an itemset has to appear in 6000 (six thousand) or more transactions to be large. An experimental result is shown in Table 3, while its graphical representation is given in Figure 5.
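The conversion from a percentage support threshold to a minimum transaction count can be checked with a short sketch. The function name is hypothetical, and exact rational arithmetic is used here only as a design choice to avoid floating-point rounding when the percentage is a decimal such as 0.1.

```python
from fractions import Fraction
from math import ceil


def min_transaction_count(support_percent, num_transactions):
    """Smallest number of transactions that meets the support threshold.

    `support_percent` is passed as a string (for example "0.1") so that the
    percentage is represented exactly rather than as a binary float.
    """
    return ceil(Fraction(support_percent) / 100 * num_transactions)


low = min_transaction_count("0.1", 100_000)   # support of 0.1% on |D| = 100,000
high = min_transaction_count("6", 100_000)    # support of 6% on |D| = 100,000
```

For the dataset of Experiment 1 this gives 100 transactions at 0.1% support and 6000 transactions at 6% support, matching the figures quoted in the text.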
Table 3. Execution Times for Dataset at Different Supports

Algorithms    CPU Time (in secs) at support threshold (%)
              0.1   0.2   0.3   0.5   0.7   1     2
Apriori
FP-tree       54    51    49    43    39    33    19
DB-tree       44    44    44    43    43    43    42
PotFP-tree    42    40    38    34    30    26    13
From the experimental result, we can see that (i) as the support increases, the execution time of all the algorithms decreases; (ii) for the same support, the execution time of the PotFP-tree algorithm is less than that of the FP-tree and Apriori algorithms; and (iii) as the support increases, the difference in execution times between the PotFP-tree algorithm and the FP-tree algorithm diminishes. In this experiment, the DB-tree shows only a small advantage over the FP-tree, and only when the minimum support is very small (less than 0.5%).
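Observation (iii) and the remark about the DB-tree can be checked directly against the values of Table 3. The sketch below only re-uses the numbers reported in the table; the variable names are illustrative.

```python
# Execution times (seconds) from Table 3, indexed by support threshold (%).
supports = [0.1, 0.2, 0.3, 0.5, 0.7, 1, 2]
fp_tree = [54, 51, 49, 43, 39, 33, 19]
db_tree = [44, 44, 44, 43, 43, 43, 42]
potfp_tree = [42, 40, 38, 34, 30, 26, 13]

# Gap between FP-tree and PotFP-tree at each support threshold.
gaps = [f - p for f, p in zip(fp_tree, potfp_tree)]

# Support thresholds at which DB-tree is strictly faster than FP-tree.
db_tree_faster = [s for s, d, f in zip(supports, db_tree, fp_tree) if d < f]
```

The gaps come out as 12, 11, 11, 9, 9, 7 and 6 seconds, so they never grow as the support increases, and the DB-tree is strictly faster than the FP-tree only at the thresholds 0.1, 0.2 and 0.3.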
Fig. 5. Execution Times At Different Support Levels (x-axis: support thresholds (%), from 0.1 to 2.0; y-axis: execution times in seconds; series: FP-tree, DB-tree, PotFP-tree)
– Experiment 2: Given a fixed size dataset (including inserted and deleted datasets) and a fixed support, we test CPU execution times when different numbers of old frequent itemsets are allowed to change in the new database. Since the number of frequent itemsets changed may affect the CPU time of the PotFP-tree algorithm, this experiment is conducted to compare the performance of the PotFP-tree, FP-tree and DB-tree algorithms. The dataset used for this experiment is the same as for Experiment 1 above (with parameters abbreviated as T10.I6.D100K), except that the number of changed transactions is varied from 10K to 50K. The experiment is conducted at three different minimum supports of 0.5% (low), 1% (medium) and 6% (high). The result of this experiment is shown in Table 4 for a minimum support of 0.5%. It can be seen that as the number of changed transactions grows, all the execution times increase. The PotFP-tree algorithm always performs better than the DB-tree algorithm and the original FP-tree algorithm, with a larger performance gain at lower minimum supports.
Table 4. Execution Times at Different Transaction Sizes on Support 0.5%

Algorithms           Different Changed Transaction Size
(times in secs)      10K    20K    30K    40K    50K
FP-tree              43     69     100    125    158
DB-tree              43     68     98     124    154
PotFP-tree           34     54     71     89     111
Fig. 6. Execution Times at Different Sizes of Changed Transactions (x-axis: size of changed transactions, from 10k to 50k; y-axis: execution times in seconds; series: FP-tree, DB-tree, PotFP-tree)
An experiment on what would constitute a reasonable tolerance shows that the best result is achieved with a tolerance value equal to 90% of the minimum support, and the worst with a tolerance value equal to 50% of the minimum support. Thus, we can deduce that the closer the tolerance is to the minimum support, the better the performance. However, a tolerance value that is too close to the minimum support loses the advantage gained by using the PotFP-tree.
5 Conclusions and Future Work
This paper presents two new algorithms, the DB-tree and PotFP-tree algorithms, for incrementally maintaining association rules in an updated database. These algorithms are based on a generalized FP-tree structure that stores more items on the tree than only those that are frequent. The contribution of these algorithms is better response time, in particular when the minimum support is low. Future work should include looking for a theoretical method to decide the most beneficial tolerance value t for the PotFP-tree scheme, and considering a partitioned version of the DB-tree to improve its performance. The application of this method, and of incremental mining approaches in general, to web usage mining should also be investigated.
References
1. Agrawal, R. and Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases, In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, 1994, pp. 487-499.
2. Cheung, D. W., Han, J., Ng, V. T., Wong, C. Y.: Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique, In Proceedings of the 12th International Conference on Data Engineering, New Orleans, Louisiana, 1996.
3. Cheung, D. W., Lee, S. D., Kao, B.: A General Incremental Technique for Maintaining Discovered Association Rules, In Proceedings of the Fifth International Conference on Database Systems for Advanced Applications, Melbourne, Australia, Jan, 1997.
4. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
5. Han, J., Pei, J., Yin, Y.: Mining Frequent Patterns without Candidate Generation, ACM SIGMOD 2000, Dallas, TX, U.S.A.
6. Holsheimer, M., Kersten, H., Mannila, M., Toivonen, H.: A Perspective on Databases and Data Mining. First International Conference on Knowledge Discovery and Data Mining, Montreal, Canada, AAAI Press, 1995, pp. 150-155.
7. Park, J. S., Chen, M. S., Yu, P. S.: An Effective Hash Based Algorithm for Mining Association Rules, In Proceedings of the ACM SIGMOD Conference on Management of Data, San Jose, California, May, 1995.
8. Zhou, Z., Ezeife, C. I.: A Low-Scan Incremental Association Rule Maintenance Method Based on the Apriori Property, In Proceedings of the Fourteenth Canadian Conference on Artificial Intelligence, AI 2001, June, 2001, Ottawa.
Topic Discovery from Text Using Aggregation of Different Clustering Methods
Hanan Ayad and Mohamed Kamel
Pattern Analysis and Machine Intelligence Lab, Systems Design Engineering
University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
{hanan,mkamel}@pami.uwaterloo.ca
http://pami.uwaterloo.ca/
Abstract. Cluster analysis is an un-supervised learning technique that is widely used in the process of topic discovery from text. The research presented here proposes a novel un-supervised learning approach based on aggregation of clusterings produced by different clustering techniques. By examining and combining two different clusterings of a document collection, the aggregation aims at revealing a better structure of the data rather than one that is imposed or constrained by the clustering method itself. When clusters of documents are formed, a process called topic extraction picks terms from the feature space (i.e. the vocabulary of the whole collection) to describe the topic of each cluster. It is proposed at this stage to re-compute term weights according to the revealed cluster structure. The work further investigates the adaptive setup of the parameters required for the clustering and aggregation techniques. Finally, a topic accuracy measure is developed and used along with the F-measure to evaluate and compare, respectively, the extracted topics and the clustering quality before and after the aggregation. Experimental evaluation shows that the aggregation can successfully improve the clustering quality and the topic accuracy over individual clustering techniques.
1 Introduction
In today’s information age, users are struggling to cope with the overwhelming amount of information that is made available to them every day. The need for tools that can help analyze the contents of the information to discover and approximately describe the topics from text is becoming more and more critical. Cluster analysis is an un-supervised learning technique used in exploratory data analysis. It tries to find the intrinsic structure of data by organizing patterns into groups or clusters [1]. Clustering is called un-supervised learning because no category labels denoting an a priori partition of the objects are used. Several clustering techniques exist. However, each single technique comes with a single result, and only knowledge about the structure of the data can justify the proper choice of the technique used. But the true structure of the data is unknown, so
This work was partially funded by an NSERC strategic grant.
R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 161–175, 2002. © Springer-Verlag Berlin Heidelberg 2002
this appears to be an endless problem [2]. The research presented here proposes to combine clusterings produced by different clustering techniques for the discovery of topics from text. When a clustering of a document collection is generated, the topic of each cluster needs to be extracted. The truncated centroids are a commonly used representation of topics, where the centroid of each cluster is computed and then truncated to a specified maximum length l, by keeping only the l highest weighted terms. Instead of using the truncated centroid, it is proposed to re-compute the term weights according to the revealed cluster structure. The attributes with the greatest weights towards the uncovered cluster structure may be used to find a set of descriptors for each cluster. Finally, an evaluation metric that can be used to assess the topic accuracy is introduced.
2 System Architecture
2.1 Overview
Figure 1 shows a schematic diagram of the system architecture for topic discovery from text. In the first phase, called Automatic Text Analysis, the textual content of the documents is analyzed and represented using the vector space model. The details of text analysis and representation are covered in Section 2.2. The second phase, called the Clustering System, consists of several different clustering components (hierarchical, incremental, and partitional), in addition to the aggregation component. The details of each of these components will be covered later. It suffices to mention here that each of these components generates a clustering of the input document collection. Finally, the third phase is the Topic Extraction, which finds an approximate description of the topic of each cluster
Fig. 1. System architecture for topic discovery from text (diagram blocks: Document Collection; Automatic Text Analysis; Document Vectors; Clustering System with Hierarchical, Partitional and Incremental Clustering Components and Aggregation; Document Clusterings; Topic Extraction; Topics)
for any of the clusterings generated from the previous phase. The topic extraction process uses a cluster-specific weighting scheme to evaluate the importance of each attribute to the cluster structure at hand. The highest weighted attributes are used to approximately describe the topics of the clusters. Further details about this process are given later in Section 2.3.
2.2 Automatic Text Analysis
To automatically discover the topics from the textual contents of a document collection, the documents must first be represented in a form that can be processed by the chosen clustering techniques. For our algorithms, documents are represented using the vector-space model, which is one of the most common representation techniques in Information Retrieval, also called full text indexing [3]. The major assumption of this representation is that the content of a document can be represented purely by the set of words contained in the text. In the vector space model, a vector is used to represent each item or document in a collection. Each component of the vector reflects a particular word, or term, associated with the given document. All terms from the document are taken and no ordering of terms or any structure of text is used. The value assigned to each component of the vector reflects the importance of the term in representing the semantics of the document. A term-weighting technique called Term Frequency-Inverse Document Frequency, or TF-IDF, is commonly used in Information Retrieval. TF-IDF assumes that the best indexing terms are those that occur frequently in individual documents but rarely in the remainder of the collection. As shown in Equation 1, the weight $w_{ij}$ of term j in document i is calculated as the product of the term frequency $tf_{ij}$, that is, the frequency of occurrence of term j in document i, and the inverse document frequency $idf_j = \log \frac{N}{df_j}$, where N is the total number of documents and $df_j$ is the document frequency, that is, the number of documents in which term j occurs at least once. The exact formulas used in different approaches may vary slightly, but the idea remains the same. For instance, some factors might be added, and normalization is usually performed [4].

\[ w_{ij} = tf_{ij} \log \frac{N}{df_j} \qquad (1) \]
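Equation 1 can be illustrated with a short Python sketch. It is a minimal example under assumptions: documents are given as lists of already extracted terms, the natural logarithm is used because the text does not state the base, no normalization is applied, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document, following Eq. 1."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # term frequency within this document
        weights.append(
            {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
        )
    return weights


# Hypothetical toy collection of two documents.
docs = [["apple", "apple", "pear"], ["pear", "plum"]]
w = tf_idf(docs)
```

In this toy collection the term "pear" occurs in every document, so its weight is zero, while "apple" occurs twice in the first document only and therefore receives the largest weight there.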
Once the documents are represented as vectors, it is important to exploit the geometric relationships between document vectors to model similarities and differences in content [5]. One common measure of similarity is the cosine of the angle between the document vectors. It is defined as:

\[ Sim(DOC_i, DOC_j) = \frac{\sum_{k=1}^{t} (w_{ik} \cdot w_{jk})}{\sqrt{\sum_{k=1}^{t} (w_{ik})^2 \cdot \sum_{k=1}^{t} (w_{jk})^2}} \qquad (2) \]

where t is the number of terms. In the implemented prototype, the document indexing process involves the following steps. First, textual contents of documents are extracted and textual
structure and organization are ignored. Second, a stop list, comprising a few hundred high frequency function words, such as “and”, “of”, “or”, and “but”, is used to eliminate such words from consideration in the subsequent processing. Third, all remaining words in a text are reduced to their stems using the Porter suffix-stripping algorithm [6]; the stems are called terms. This procedure reduces redundancy: for instance, computer, computers, computing, and computability all reduce to comput. Fourth, the terms of each document are weighted using a modified TF-IDF equation adopted from previous work [7]. It is given by:

\[ w_{ij} = (coef_{ij}) [\log N - \log df_j + 1] \qquad (3) \]

where

\[ coef_{ij} = \begin{cases} 1 & \text{if } tf_{ij} = 1 \\ 1.5 & \text{if } 1 < tf_{ij} \le 5 \\ 2 & \text{if } 5 < tf_{ij} \le 10 \\ 2.5 & \text{if } tf_{ij} > 10 \end{cases} \qquad (4) \]
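Equations 3 and 4 translate into a small Python sketch. The function names are hypothetical, the natural logarithm is an assumption (the base is not stated), and the term frequency is assumed to be at least 1 since the weight is only computed for terms that occur in the document.

```python
import math


def coef(tf):
    """Step-wise coefficient of Equation 4 for a term frequency tf >= 1."""
    if tf == 1:
        return 1.0
    if tf <= 5:
        return 1.5
    if tf <= 10:
        return 2.0
    return 2.5


def modified_weight(tf, df, n_docs):
    """Weight of Equation 3: coef * [log N - log df + 1]."""
    return coef(tf) * (math.log(n_docs) - math.log(df) + 1)


# A term occurring 3 times and a term occurring 300 times in a document,
# both appearing in 10 of 1000 documents (hypothetical numbers).
moderate = modified_weight(tf=3, df=10, n_docs=1000)
extreme = modified_weight(tf=300, df=10, n_docs=1000)
```

Under these assumptions the second weight is only 2.5 / 1.5 times the first, although the term is a hundred times more frequent, which is the damping effect the modification is intended to achieve.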
As explained in [7], the original TF-IDF equation is modified to ensure that a single term does not obtain an extremely high weight solely due to its frequency of occurrence in the document. This can be noticed in long documents, where the frequency of occurrence of terms can be relatively high, which in turn produces relatively higher weights. In order to reduce the number of terms used in representing the documents, Document Frequency (DF) Thresholding is performed. In this step, terms that have either too low or too high a DF value are removed. For instance, terms that appear only once or twice, or appear in over 90% of the documents, are removed. For further reduction of the terms, only the n highest weighted terms in each document are assigned to the document set as index terms. The default value for n in the experiments presented in this paper is 25, as determined in a previous study by Larsen and Aone [8]. Their experiments were done on the same document collection from which we sampled our collections. Finally, the term weights of a document vector are normalized. A challenging problem in processing document representations obtained by full text indexing is that the number of index terms is very large even with document collections of moderate size. In the vector space model, documents are represented as vectors belonging to a very high-dimensional feature space spanned by the vocabulary contained in the whole collection. In all our algorithms, instead of having an active representation of all the document vectors, only two vectors are built at a time, since only two vectors need to be compared at one time. This way, the memory requirements as well as the overhead on the processing time due to memory congestion are significantly reduced.
2.3 Clustering System
When documents are represented in the vector space model, as described above, similarity between them can then be computed. By determining the similarity between various pairs of documents, it then becomes possible to construct
clusters of documents, such that the documents within a given cluster exhibit substantial similarities to each other.
Hierarchical Clustering
A hierarchical classification is a nested sequence of partitions. A hierarchical clustering algorithm called the Agglomerative Algorithm [9] starts with the disjoint clustering, which places each of the n objects to be clustered in an individual cluster, and then proceeds by gradually merging the most similar objects or clusters of objects into one cluster. The way in which similarity between clusters (as opposed to individual documents) is calculated can be varied. Those variations include the single-link and the complete-link methods. In single-link clustering, the similarity between the most similar pair of documents from the two clusters is used as the similarity between the two clusters. In complete-link clustering, the similarity between the least similar pair of documents from the two clusters is used as the cluster similarity. In the system described here, the complete-link version is used. In this work, the algorithm is set up to stop the merging process when the maximum similarity between any two objects or clusters of objects reaches a value below a chosen stopping threshold, creating the resulting clustering. The stopping threshold is computed here as a function of the similarities between all pairs of documents in the given dataset. This approach makes the value of the threshold adaptive to different sets of input data. It is computed as shown in Equation 5.

\[ \text{Stopping Threshold} = a \cdot \frac{\sum_{i=1}^{N} \sum_{j=i+1}^{N} Sim(DOC_i, DOC_j)}{\frac{N^2 - N}{2}} \qquad (5) \]
where a is a real number such that a > 0 and can be determined experimentally. The larger the value of a, the larger the resulting number of clusters, and vice versa. The limitation of the agglomerative clustering algorithm is its inadequacy for large text mining problems, as its processing time is $O(N^2)$ with respect to the number of documents N.
Incremental Clustering
Incremental clustering, also referred to as single-pass clustering, is one of the simplest clustering techniques, yet one that has been used before for discovering topics from textual contents of documents [10,11]. In spite of the simplicity of this technique, it enables the discovery of valuable information [11]. The algorithm used here is called the Leader Clustering Algorithm [12]. It performs a single pass through the data objects (document vectors) in order to generate clusters of similar vectors. The algorithm works by assigning the first document $DOC_1$ as the representative of cluster $C_1$. Then, each document vector is compared to the centroids of all the clusters formed so far. The vector is absorbed by the most similar cluster unless its similarity is lower than a
certain threshold, in which case the vector forms a cluster by itself. The centroid of each cluster is re-computed each time a new vector is added to it. A reallocation method is also applied to the resulting clustering, by re-comparing all the documents to the cluster centers. Documents are re-assigned to the closest re-calculated centroids. The Leader algorithm has several drawbacks; the most noticeable is that it is not invariant under reordering of the vectors. However, one very important strength of the algorithm is that it is fast and memory efficient. It requires only one pass over the data.
Partitional Clustering
The problem of partitional clustering can be formally stated as follows: given N patterns in a d-dimensional metric space, determine a partition of the patterns into K groups, or clusters, such that the patterns in a cluster are more similar to each other than to patterns in different clusters. A clustering criterion, such as the sum of square-error shown in Equation 6, is adopted.

\[ E^2 = \sum_{i=1}^{K} \sum_{j=1}^{n_i} (x_{ij} - c_i)^T (x_{ij} - c_i) \qquad (6) \]
where $c_i$ is the centroid vector of cluster $C_i$, $n_i$ is the number of elements in cluster $C_i$, and $x_{ij}$ is the j-th element of cluster $C_i$. Instead of the minimum square error criterion, we use a criterion function that is based on the cosine similarity between the vectors in a cluster and the cluster centroid. The new criterion function, shown in Equation 7, is defined as the sum of the cosine measure between the patterns and their cluster centers.

\[ S = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \cos(x_{ij}, c_i) \qquad (7) \]
There are a number of partitional clustering techniques. In this study, we used the K-means algorithm which is also widely used in document clustering. The K-means clustering technique is presented below as Algorithm 1.

Algorithm 1 K-means Clustering Algorithm
1. Select K points as the initial centroids.
2. Assign all points to the closest centroid.
3. Recompute the centroid of each cluster.
4. Repeat steps 2 and 3 until the centroids don’t change.
5. Set a maximum number of iterations, and repeat steps 1-4 with different initial centroids.
6. Select the partition that realizes the maximum value of the criterion function of Equation 7.
The K-means clustering algorithm is fast but requires an estimate of the number of clusters and a selection of the initial seeds. These can be chosen randomly or taken from the results of other clustering components [9], such as the agglomerative or the incremental algorithms. Given an initial clustering consisting of C clusters, the centroid of each cluster is computed and K ≤ C centroids are randomly picked to become the initial centers for the K-means clustering algorithm. The value of K is computed as K = b × C, rounded to the closest integer, where C is the number of clusters produced by another clustering algorithm and b is a non-zero fraction that can be determined experimentally. The choice of the K centroids can alternatively be made by picking the centers that satisfy some criterion, such as being as dispersed from each other as possible. However, for simplicity we just picked them randomly. Because the clustering result of the K-means algorithm depends on the initial seeds, the clustering is repeated with different initial centers, and the best clustering (that is, the one that realizes the maximum value of the similarity criterion) is chosen as the clustering result of the partitional clustering component.

Aggregation. The aggregation process compares two different clusterings of the input document collection and generates what is called an aggregated or combined clustering. The aggregation algorithm requires a parameter called the aggregation threshold. It uses the value of the threshold to decide whether to merge or separate the compared clusters. In short, the aggregation algorithm produces a clustering of the input document collection by analyzing and modifying two different clusterings of the data. The value of the aggregation threshold parameter can be determined experimentally. However, we tried to compute it from statistical information gathered from the input data. The way we computed this threshold is similar to the computation of the stopping threshold, with possibly different values for the “a” parameter (see Equation 5). Pseudo-code for the aggregation algorithm is listed in Algorithm 2.

Essentially, the aggregation algorithm works as follows. It takes two different clusterings A and B at a time and generates a combined clustering G. When the algorithm starts, the combined clustering G is initially empty, and each cluster Bj in B is marked as unprocessed. For each cluster Ai in clustering A, the algorithm proceeds by identifying intersecting clusters Bj in B that have not been processed. The objective of this step is to identify the commonalities between the two clusterings. This is performed in an incremental fashion. That is, clusters are processed once, in the order they appear, with no backtracking. As with the incremental clustering technique, this makes the algorithm fast and less complicated, but it is not invariant under cluster re-ordering. For each pair of intersecting clusters Ai and Bj, the identified commonality may consist, in the first case, of a complete overlap of the two sides, that is, Ai = Bj. In this case, the set Ai = Bj is added to the combined clustering G. In the second case, the commonality may consist of partial overlap from one side
Algorithm 2 Aggregation Algorithm
1. Initialization: Each cluster Bi in clustering B is marked as un-processed, i = 1, 2, ..., size(B), and G ← φ
2. Perform Aggregation:
for all clusters Ai in clustering A, such that i = 1, 2, ..., size(A) do
  W ← Ai
  for all clusters Bj in B, such that j = 1, 2, ..., size(B) do
    if Bj is marked as un-processed AND (Bj ∩ W) ≠ φ then
      diffAB ← (W − Bj) and diffBA ← (Bj − W)
      if diffAB = φ then
        if diffBA ≠ φ then
          newCluster ← φ, c0 ← Centroid(Ai ∩ Bj)
          Add each document dk in diffBA either to W, if its similarity with c0 exceeds or is equal to the aggregationThreshold, or to newCluster, if the similarity is less.
          Add W to G, if it is not a duplicate, and add newCluster to G, if it is non-empty and not a duplicate.
        else
          if ((Ai ∩ Bj) ∉ G) then
            add (Ai ∩ Bj) to G
          end if
        end if
      else
        if diffBA ≠ φ then
          c1 ← Centroid((Ai − Bj)), c2 ← Centroid((Bj − Ai))
          if Similarity(c1, c2) ≥ aggregationThreshold then
            Add each document dk in diffBA to W, such that k = 1, 2, ..., size(diffBA)
            Add W to G if it is not a duplicate.
          else
            For each document dk in (Ai ∩ Bj), compare s1 ← Similarity(dk, c1) and s2 ← Similarity(dk, c2)
            Add dk to diffAB if s1 > s2, or add it to diffBA and remove it from W if s2 > s1, or add it to both diffAB and diffBA if s1 = s2
            Add diffAB and diffBA to G if they don’t represent a duplication.
          end if
        else
          newCluster ← φ
          c0 ← Centroid((Ai ∩ Bj))
          Add each document dk in diffAB to newCluster and remove it from W if its similarity with c0 is less than the aggregationThreshold
          Add W to G, if it is not a duplicate, and add newCluster to G, if it is non-empty and not a duplicate.
        end if
      end if
      mark Bj as processed.
    end if
  end for
end for
3. Handling Overlap: To minimize the overlap between the clusters in G, each document that appears in more than one cluster is kept only in the cluster(s) whose centroid it is most similar to, and is removed from the others.
but complete overlap from the other. In other words, Ai is a subset of Bj, or Bj is a subset of Ai. In this case, the similarity between each element in the set difference and the center of the intersection set (Ai ∩ Bj) is computed. Each element in the set difference is then merged with the set (Ai ∩ Bj) if the similarity is larger than the aggregation threshold. The elements that do not get merged are placed together in a new cluster. The resulting clusters are then added to the combined clustering G. In the third case, the commonality may consist of partial overlap from both sides. That is, the two sets intersect and the two set differences (Ai − Bj) and (Bj − Ai) are non-empty. In this case, the centers of the set differences are computed and compared. If the similarity between the centers of those two sets is larger than the aggregation threshold, the two clusters Ai and Bj are merged into one cluster Gk, which is then added to G. The reason for taking out the intersection (Ai ∩ Bj) when computing the centers is to avoid its influence when measuring the similarity between the different clusters. On the other hand, if the similarity between the centers of the set differences is not larger than the aggregation threshold, each element in the intersection is merged with the most similar of the two set differences (Ai − Bj) and (Bj − Ai). It can also be merged into both clusters if it is equally similar to both. The clusters resulting from this splitting process are added to the combined clustering G.

2.4 Topic Extraction
In order to generate the topics of the clusters, the approach that is commonly adopted is to use the clusters’ centroids [10]. This can be performed by truncating the centroids with respect to a predefined length threshold. Alternatively, the terms in the centroid can be filtered against a weight threshold. However, a common criticism of the statistical approaches to clustering, like the techniques described above, is that they are useful for finding cohesive sets of objects in large collections of data, but make no attempt to ensure that their result corresponds to an intuitive concept [13,14]. To overcome this limitation, Perkowitz and Etzioni [14] use statistical clustering techniques to find cohesive clusters and then evaluate the cluster-specific importance of attributes. The attributes with the greatest weights towards a cluster (or cluster structure) are used to form a conjunctive concept that approximately describes the cluster. In this work, we chose to adopt a similar approach, by re-computing the term weights according to the revealed cluster structure. The weights are re-computed using a cluster-specific version of the TF-IDF formula, as given in Equation 8. In this case, it is assumed that the best terms are those that occur frequently inside the cluster but rarely in the other clusters. First, all non-zero entries of each cluster center are selected. Then, the term weights are re-computed as follows.

$$w_{ij} = tf_{ij} \log \frac{C}{cf_j} \qquad (8)$$

where $w_{ij}$ is the weight of term j in cluster i. It is calculated as the product of the term frequency $tf_{ij}$, that is, the frequency of occurrence of term j in
cluster i (the number of documents in cluster i in which term j occurs), and the inverse cluster frequency $icf_j = \log (C / cf_j)$, where C is the total number of clusters and $cf_j$ is the cluster frequency, that is, the number of clusters in which term j occurs at least once. Cluster-specific weights give a way of finding approximate descriptions for every cluster separately. Each generated topic Ti is represented, as shown in Equation 9, by a vector of l (term, weight) pairs.
A prototype system for topic discovery from text was implemented in Java, using Sun’s Java 2 Platform (www.javasoft.com). The system takes as input a document collection, and generates groupings of the documents and extracted topics using each of the clustering system components, that is, the different clustering techniques and the aggregation. The goal of the experiments is to compare the quality of the results before and after the aggregation.

3.1 Test Data Collection
Experiments in this paper draw on data from the Reuters-21578 collection¹. The Reuters-21578 collection is a resource for research in information retrieval, machine learning, and other corpus-based research. The documents in the Reuters-21578 collection are Reuters newswire stories that appeared on the Reuters newswire in 1987. In the experiments, we use only news stories that have topic labels. Each document (story) in the Reuters collection has a TOPICS tag that is used to delimit a list of zero or more TOPICS categories for the document. Since the clustering algorithms used in this paper have very limited handling of multi-topic documents, the experiments are confined to single-topic documents.

3.2 Performance Measures
Since clustering is used here as a topic discovery tool, the evaluation approach focuses on the overall quality of the generated clusters and extracted topics.

¹ The Reuters-21578, Distribution 1.0 text collection is available from David D. Lewis’s professional home page, http://www.research.att.com/∼lewis.

F-Measure. In order to assess the quality of the generated clusters, an evaluation criterion called the F-measure [8] is used. The F-measure allows us to evaluate how well a clustering technique is working by comparing the clusters it produced to known classes. The F-measure combines the precision and recall ideas from information retrieval. Each cluster is treated as if it were the result of a query and each class as if it were the desired set of documents for a query. Then, the recall and precision of that cluster are calculated for each given class. For each hand-labeled topic T in the document set, it is assumed that a cluster X* corresponding to that topic is formed. To find X*, precision, recall and F-measure for each cluster with respect to the topic in question are calculated. For any topic T and cluster X:

N1 = number of documents judged to be of topic T in cluster X
N2 = number of documents in cluster X
N3 = number of documents judged to be of topic T in the entire collection
Precision(X, T) = N1 / N2
Recall(X, T) = N1 / N3

$$F = \frac{2PR}{P + R}$$

where F = F-Measure, P = Precision, and R = Recall. The cluster with the highest F-Measure is considered to be X*, and that F-Measure becomes the system’s score for topic T. The overall F-Measure for a given clustering is the weighted average of the F-Measure for each topic T.

$$\text{Overall F-Measure} = \frac{\sum_{T \in M} \left( |T| \times F(T) \right)}{\sum_{T \in M} |T|} \qquad (10)$$

where M is the set of hand-labeled topics, |T| is the number of documents judged to be of topic T, and F(T) is the F-Measure for topic T. Notice that for a cluster to reach the maximum precision value of 1.0 (i.e. no penalty) with respect to a given manual topic, all of its documents need to be classified as belonging to that topic. Moreover, for a cluster to reach the maximum recall value of 1.0 with respect to a topic, all the documents belonging to that topic should exist in that cluster. To reach the maximum precision, the cluster may in fact consist of only one document, since that document is classified as belonging to the given topic. That is, in a clustering where each document is put into a separate cluster, each cluster will reach the maximum value of the precision for some given topic. However, their recall value will be very low.
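The computation of Equation 10 can be sketched from a topic-by-cluster contingency table. The function name `overall_f_measure` and the table layout are assumptions of this sketch; it also assumes single-topic documents and non-overlapping clusters, so that N2 and N3 are simply column and row sums.

```c
#include <assert.h>

/* Overall F-measure of Equation 10.
 * counts[t * n_clusters + x] = number of documents of topic t in cluster x
 * (so N1 is a table entry, N2 a column sum, and N3 a row sum). */
double overall_f_measure(const int *counts, int n_topics, int n_clusters)
{
    assert(n_topics > 0 && n_clusters > 0);
    double weighted = 0.0;
    long total = 0;
    for (int t = 0; t < n_topics; t++) {
        long n3 = 0;                               /* |T| */
        for (int x = 0; x < n_clusters; x++)
            n3 += counts[t * n_clusters + x];
        double best = 0.0;                         /* F of the best cluster X* */
        for (int x = 0; x < n_clusters; x++) {
            long n1 = counts[t * n_clusters + x], n2 = 0;
            if (n1 == 0)
                continue;                          /* P = R = 0, so F = 0 */
            for (int u = 0; u < n_topics; u++)
                n2 += counts[u * n_clusters + x];
            double p = (double)n1 / n2, r = (double)n1 / n3;
            double f = 2.0 * p * r / (p + r);
            if (f > best)
                best = f;
        }
        weighted += (double)n3 * best;
        total += n3;
    }
    return total > 0 ? weighted / total : 0.0;
}
```

For example, with two topics of five documents each and the table {4, 1; 0, 5}, the first topic scores F = 8/9 against the first cluster and the second topic scores F = 10/11 against the second, giving an overall value of 89/99.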
On the other hand, in a clustering where all documents are put into one cluster, this cluster will reach the maximum recall value of 1.0 with respect to each topic, but the precision value will be very low. Therefore, combining both precision and recall in computing the F-measure is a better way of evaluating a clustering.

Topic Accuracy. In order to assess the quality of the extracted terms in representing the topics of the clusters, an evaluation metric was developed. Since each topic is represented by a vector of (term, weight) pairs, the accuracy of each term in representing a given topic is evaluated. The topic accuracy is computed as the average accuracy of the l highest-weighted terms representing this topic. The overall topic accuracy for all generated topics is measured by averaging all of the topic accuracies.
The accuracy A(i, j) of term j in topic i is calculated as shown in Equation 11. It is the ratio of the weight that term j exhibits in cluster i to the sum of the weights that term j exhibits in each cluster. The value of A(i, j) reflects the significance of term j in distinguishing topic i from the rest of the topics in the collection.

$$A(i, j) = \frac{w(i, j)}{\sum_{k=1}^{C} w(k, j)} \qquad (11)$$

where w(i, j) is the weight of term j in cluster i, and C is the total number of clusters. A(i, j) takes a value between 0 and 1. The overall topic accuracy is computed as shown in Equation 12.

$$\text{Overall Topic Accuracy} = \frac{1}{C} \sum_{i=1}^{C} \frac{1}{l} \sum_{j=1}^{l} A(i, j) \qquad (12)$$
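Equations 11 and 12 can be sketched as follows. The function name `overall_topic_accuracy`, the dense weight matrix, and the simple repeated-maximum selection of the l highest-weighted terms are assumptions of this sketch; a term whose weights sum to zero contributes an accuracy of 0 here, a convention the text does not specify.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Overall topic accuracy (Equation 12). w[i * n_terms + j] is the weight of
 * term j in cluster i; each topic is taken to be the l highest-weighted
 * terms of its cluster. */
double overall_topic_accuracy(const double *w, int n_clusters, int n_terms, int l)
{
    assert(n_clusters > 0 && l > 0 && l <= n_terms);
    char *used = malloc((size_t)n_terms);
    assert(used);
    double total = 0.0;
    for (int i = 0; i < n_clusters; i++) {
        const double *wi = w + (size_t)i * n_terms;
        double acc = 0.0;
        memset(used, 0, (size_t)n_terms);
        for (int pick = 0; pick < l; pick++) {
            int best = -1;                 /* next highest-weighted unused term */
            for (int j = 0; j < n_terms; j++)
                if (!used[j] && (best < 0 || wi[j] > wi[best]))
                    best = j;
            used[best] = 1;
            double col = 0.0;              /* sum of w(k, best) over all clusters */
            for (int k = 0; k < n_clusters; k++)
                col += w[(size_t)k * n_terms + best];
            if (col > 0.0)
                acc += wi[best] / col;     /* A(i, j), Equation 11 */
        }
        total += acc / l;                  /* accuracy of topic i */
    }
    free(used);
    return total / n_clusters;
}
```

With two clusters whose weights over three terms are {4, 1, 0} and {0, 1, 3} and l = 2, each topic has one term it owns outright (accuracy 1) and one shared equally (accuracy 0.5), so both topic accuracies and the overall value are 0.75.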
3.3 Experimental Results
Different document sets sampled from the data collection mentioned earlier were set up to test the prototype system. Tables 1 to 4 show the average values of the performance measures for experiments performed on 10 different document collections of 300 documents each and another 10 collections of 500 documents each. The results are expressed in terms of the average number of clusters (which should be compared to the average number of manual topics), the average F-measure, and the average topic accuracy. The results are shown before and after the aggregation is performed. Before the aggregation, the results refer to the clusterings produced by two of the original clustering techniques, while after the aggregation they refer to the combined clustering. When comparing the results of different clusterings, the number of clusters has to be taken into account. For example, the same value of a performance measure such as the F-measure or the topic accuracy for a number of clusters closer to the desired number of topics is considered an overall improvement.

By observing the results, it is noticed that after the aggregation is performed, the number of clusters becomes closer to the desired number of topics. In Table 1, the average number of clusters was brought from 41.1 to 23.1, which is closer to the average desired number of clusters of 26.4. The average topic accuracy of the extracted topics is 0.91 after aggregation, compared to 0.82 and 0.78 for the individual clusterings. The average F-measure was 0.62 and 0.53 before the aggregation and 0.59 after it. This is slightly lower than the higher of the two original values, but it is obtained with a closer estimate of the total number of clusters.

In Table 2, the average value for the desired number of topics is 26.4. After the aggregation, the average number of clusters was brought from 87.8 and 44.3 to 36.1. The overall topic accuracy becomes 0.75, compared to 0.54 and 0.68 before the aggregation. The average F-measure is 0.39, compared to 0.26 and 0.36 before the aggregation.

In Table 3, the average value for the desired number of topics is 33. After the aggregation, the average number of clusters was brought from 63.1 to 32.5. The overall topic accuracy becomes 0.88, compared to 0.78 and 0.72 before the aggregation. The average F-measure is 0.58, compared to 0.59 and 0.48 before the aggregation.

In Table 4, the average value for the desired number of topics is 33. After the aggregation, the average number of clusters was brought from 119.8 and 60.2 to 44.4. The overall topic accuracy becomes 0.78, compared to 0.51 and 0.66 before the aggregation. The average F-measure is 0.36, compared to 0.21 and 0.32 before the aggregation.

Table 1. Aggregation of Incremental and Partitional clusterings for 10 different sets of 300 documents each
Avg. No. of Manual Topics
Incremental Clustering: Avg. No. of Clusters, Avg. F-Measure, Avg. Topic Accuracy
Partitional Clustering: Avg. No. of Clusters, Avg. F-Measure, Avg. Topic Accuracy
Combined Clustering: Avg. No. of Clusters, Avg. F-Measure, Avg. Topic Accuracy

Table 2. Aggregation of Hierarchical and Partitional clusterings for 10 different sets of 300 documents each
Avg. No. of Manual Topics
Hierarchical Clustering: Avg. No. of Clusters, Avg. F-Measure, Avg. Topic Accuracy
Partitional Clustering: Avg. No. of Clusters, Avg. F-Measure, Avg. Topic Accuracy
Combined Clustering: Avg. No. of Clusters, Avg. F-Measure, Avg. Topic Accuracy
4 Conclusion and Future Work
In this paper, a novel approach for topic extraction from text was proposed. The approach is based on the aggregation of clusters produced by different clustering techniques. The approach uses a topic extraction technique in which the attributes with the greatest weights towards the uncovered cluster structure are used to describe the topic of each cluster. A topic accuracy measure was introduced and used along with the F-measure for evaluation. As shown by the results of the experiments, the quality of the clustering and of the extracted topics was improved after the aggregation was performed. In future work, handling of multi-topic documents will also be investigated, with fuzzy clustering as a possible avenue. Moreover, we would like to extend the aggregation technique to combine more than two clusterings at a time and to look at possibilities for further improving the performance of the aggregation technique. For larger document collections, the clustering time of the agglomerative algorithm will be dramatically higher as a consequence of its O(N^2) clustering time. Faster techniques will have to be used instead. Moreover, for initialization of the K-means algorithm for larger document collections, a technique based on the refinement of the initial conditions, as proposed in [15], can be used.
References
1. V. L. Brailovsky. A probabilistic approach to clustering. Pattern Recognition Letters, 12:193–198, 1991.
2. E. Backer. Computer-Assisted Reasoning in Cluster Analysis. Prentice Hall, 1995.
3. D. Merkl. Text classification with self-organizing maps: Some lessons learned. Neurocomputing, 21:61–77, 1998.
4. D. Mladenic. Personal webwatcher: Implementation and design. Technical Report IJS-DP-7472, Department of Intelligent Systems, J. Stefan Institute, Slovenia, 1996.
5. M. W. Berry, Z. Drmac, and E. R. Jessup. Matrices, vector spaces, and information retrieval. Society for Industrial and Applied Mathematics Review, 41(2):335–362, 1999.
6. M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
7. I. Khan, D. Blight, R. D. McLeod, and H. C. Card. Categorizing web documents using competitive learning: An ingredient of a personal adaptive agent. In Proceedings of the 1997 IEEE International Conference on Neural Networks, volume 1, pages 96–99, 1997.
8. B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In S. Chaudhuri and D. Madigan, editors, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 16–22, San Diego, California, USA, August 1999.
9. A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
10. D. Cheung, B. Kao, and J. Lee. Discovering user access patterns on the world wide web. Knowledge-Based Systems, 10:463–470, 1998.
11. T. Yan, H. Jacobsen, H. Garcia-Molina, and U. Dayal. From user access patterns to dynamic hypertext linking. In Proceedings of the 5th International WWW Conference, May 1996.
12. J. Hartigan. Clustering Algorithms. Wiley, New York, 1975.
13. B. Mirkin. Concept learning and feature selection based on square-error clustering. Machine Learning, 35:25–39, 1999.
14. M. Perkowitz and O. Etzioni. Towards adaptive web sites: Conceptual framework and case study. Artificial Intelligence, 118:245–275, 2000.
15. P. Bradley and U. Fayyad. Refining initial points for k-means clustering. In Proceedings of the 15th International Conference on Machine Learning, pages 91–99, 1998.
Genetic Algorithms for Continuous Problems
James R. Parker
Department of Computer Science, University of Calgary
2500 University Dr. N.W., Calgary, Alberta, Canada, T2N 1N4
Abstract. The single bit mutation and one point crossover operations are most commonly implemented on a chromosome that is encoded as a bit string. If the actual arguments are real numbers, this implies a fixed point encoding and decoding each time an argument is updated. A method is presented here for applying these operators to floating point numbers directly, eliminating the need for bit strings. The result accurately models the equivalent bit string operations, and is faster overall. Moreover, it provides a better facility for the application of genetic algorithms to continuous optimization problems. As an example, two multimodal functions are used to test the operators, and an adaptive GA in which size and range are varied is tested.
1 Introduction
Many methods have been devised to optimize the value of a function in one or more parameters. These methods employ some figure of merit that determines how good the optimization is, then optimize (usually minimize) this value by changing the parameters repeatedly. The most straightforward approach is to choose new values of the parameters by changing them in the direction that most reduces the value of the figure of merit; these are commonly referred to as steepest descent or gradient descent algorithms. Though this works well for functions with a single minimum, it has the unpleasant property of becoming trapped in local minima on other functions. To improve the chances of locating the one global minimum, these methods are sometimes run several times from several different starting points, and the best result obtained is taken as the global minimum.

A genetic algorithm [4,5,6] is a stochastic optimization technique, essentially a biased random walk through the parameter space. Specifically, a genetic algorithm uses the analogy of selective pressure on a population of living organisms; such a population adapts to its environment (an optimization) by taking advantage of random variations to individuals. Some variations will improve the individual, and these will survive and reproduce, while other variations are negative, and will not be propagated for very many generations. In a genetic algorithm an individual is represented by a bit string which is the concatenation of all of the parameters to the objective function. The selective pressure is that of minimizing the objective function, and each individual has a single numeric fitness value associated with it, that being the objective function applied to that individual’s bit string. Smaller values are more fit to survive and reproduce. Variation is introduced into the population in two principal ways: mutation and crossover, these being analogs of real biological events. Mutation is the random alteration of one of the bits in a bit string, which amounts to varying one of the parameters. Crossover is an interchange of parts of two individuals; a bit position in the two strings is selected, and the upper portion of one string is concatenated with the lower portion of the other, and vice versa. This amounts to using some of the parameters of one individual with the remaining parameters of the other, while modifying the one where the concatenation takes place. Both mutation and crossover result in new individuals to be evaluated, and either selected for survival and reproduction or not. As part of a population, individuals reproduce and carry their selective advantages (parameter values) into the succeeding generations. The size of the population influences how well the genetic algorithm performs, as does the rate of mutation and crossover, the means of selecting individuals for reproduction, and dozens of other variables.

The use of bit strings to represent parameters places a limit on the accuracy with which the solution can be found. When the parameters are in reality floating point numbers, the bit string representation amounts to a transformation to fixed point. In addition, there is a serious overhead involved in encoding and decoding the parameters to and from bit string form. At least one encode and one decode must be performed for each member of the population at each iteration, and this accounts for some of the reputation that genetic algorithms have for being slow.

R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 176-184, 2002. Springer-Verlag Berlin Heidelberg 2002
What will be done here is to eliminate the need for bit strings in a simple and straightforward manner, speeding up the technique without loss of the precise analogy. The mutation and crossover operators will be replaced by versions that operate directly on floating point numbers; these will be tested to ensure that they behave precisely as do the bit string versions and to measure their execution performance. Finally, they will be applied to a practical problem that uses floating point parameters.
2 Optimizing Functions of Real Parameters
The two operations that could be called the basis of genetic algorithms are the single bit mutation and the one point crossover. Together they provide the bulk of the variability in the parameter sets that is needed for the selection process to do its work. Both operations are applied to bit strings, which are the encoded form of the actual parameter. In the case where the parameters are symbolic, integer, or even fixed point this representation makes a great deal of sense, and is evocative of the original biological analogy. For floating point (real) parameters the bit string representation requires a good deal of time spent converting to and from bit string form. Each time the objective function is evaluated the parameter must be decoded (converted into real form). Initially, each parameter must be encoded (converted to bit string form) so that the operators may be applied. In addition, some flexibility is lost - some work has been done
with dynamic changing of the ranges and size of the bit string as the optimization proceeds, and this also requires encoding and decoding. Nonetheless, genetic algorithms have been used to minimize continuous real functions of many parameters and, like simulated annealing, tend not to get trapped in local minima. In order to fit a function to a sampled surface, a measure of goodness of fit is minimized. The figure of merit employed here to determine the goodness of the fit is the χ² value, which gets smaller as the fit improves. An example introduced by Bohachevsky [1] can be used to illustrate how to fit a function using a genetic algorithm. The function to be minimized is:

$$f(x, y) = x^2 + 2y^2 - 0.3\cos(3\pi x) - 0.4\cos(4\pi y) + 0.7 \qquad (1)$$
This function is an effort to model the sort of convoluted surface that a fitness function might achieve as a worst case. As can be seen in Figure 1a, this function has many local minima and one well defined global minimum at (0,0). Any downhill method could get caught in one of these local valleys and never reach (0,0). To test the genetic algorithm we start at (1,1) and let the algorithm proceed. For this specific purpose we use a very simple implementation. Figure 1b shows some example paths followed by the algorithm, where the dark lines indicate the steps that it followed. This plot illustrates why the genetic algorithm can be considered a random walk. Of course, the bit string implementation implies the use of fixed, rather than floating, point parameter values, having a precision generally much lower than that of a double precision floating point number.

The idea of using real numbers directly has been discussed, but generally in the context of hybrid algorithms, and the operators used are not the same; they are statistical analogies. For example, Davis [4] suggests real number mutation, which is the replacing of a real number in a chromosome with a randomly selected real number. This is not a single bit mutation; indeed, it could change all of the bits in that number. Real number creep is also not a traditional operator, but is an additional one that can prove useful in some circumstances but which many purists eschew as being not general or robust. What will be proposed here are two operators that are precise analogs of the bit string operations, but which apply to real numbers. Some of the advantages are obvious for continuous optimization problems; there are also some interesting by-products.

    /* Implement a single bit mutation on the real value *b.
     * n is the bit position counted from the high-order bit (n = 0);
     * size is the number of bits in the equivalent bit string. */
    void fmut (float *b, int n, float xmax, float xmin, int size)
    {
        double m, v;

        v = (double)(xmax-xmin)/(double)(1<<size);  /* value of the low bit */
        m = (double)(1<<(size-n-1));                /* weight of bit n, in units of v */
        if ( (int)((*b-xmin)/(v*m)+0.000001) % 2 )
            *b = *b - v*m;                          /* bit is set: clear it */
        else
            *b = *b + v*m;                          /* bit is clear: set it */
    }

Algorithm 1 – Single bit mutation (in C)
Fig. 1. (a) Bohachevsky's function; it is hard to find a global minimum because of all of the local ones. (b) The paths that the genetic algorithm takes to find the minimum
3 Single Bit Mutation
Without loss of generality, the single bit mutation operator will be defined as one which reverses a specified bit in the encoded representation. Consider a 2-bit representation of the range 0.0 - 1.0; the values it is possible to represent are:

0.0  = 00
0.25 = 01
0.50 = 10
0.75 = 11

A bit in the low bit position has a value V = 0.25 associated with it; more generally,

$$V = \frac{D_{max} - D_{min}}{2^n} \qquad (2)$$

where Dmax is the largest value in the range, Dmin is the smallest, and n is the number of bits in the representation. In the example, V = (1.0 - 0.0)/2^2. Each bit in the encoded bit string representation corresponds to a power of two multiplied by V. The low bit is 2^0 * V, the next is 2^1 * V, and so on. It follows that a single bit reversal on a parameter X at bit position n (where n now denotes a bit position rather than the string length) subtracts 2^n * V from X if bit n is set (1), or adds 2^n * V to X if bit n is clear (0). Bit n of the fixed point bit string representation of X is set if the bit string, shifted right n bits, has its low bit set; the corresponding floating point test considers the expression (X - Dmin)/(V * 2^n). If the integer part of this result has the low bit set, then the bit corresponding to V * 2^n is set in the encoding. This argument leads to the single bit mutation function seen in Algorithm 1. Note that the calculation of V would normally be done globally only once, and all values of V * 2^n could also be pre-computed. This leaves the operation count at one floating divide, one floating multiply, two floating subtractions and an integer division.
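The argument above can be checked exhaustively for a small string size. The following sketch is a double-precision variant written for illustration; the names `mutate_real` and `agrees_with_bit_strings` are hypothetical, and, unlike fmut, the bit position here is counted from the low-order bit, following the 2^n * V convention of this section.

```c
#include <assert.h>

/* Flip the bit of weight 2^pos * V in the real value x, where pos = 0 is the
 * low-order bit of the equivalent size-bit string over [dmin, dmax). */
double mutate_real(double x, int pos, double dmin, double dmax, int size)
{
    assert(size > 0 && size < 31 && pos >= 0 && pos < size);
    double v = (dmax - dmin) / (double)(1L << size);  /* value of the low bit */
    double w = v * (double)(1L << pos);               /* weight of bit pos */
    long q = (long)((x - dmin) / w + 1e-6);           /* x shifted right pos bits */
    return (q % 2) ? x - w : x + w;                   /* clear if set, else set */
}

/* Returns 1 if, for every bit string of the given size and every bit
 * position, mutating the decoded real value gives the same result as
 * flipping the bit in the string and then decoding. */
int agrees_with_bit_strings(double dmin, double dmax, int size)
{
    double v = (dmax - dmin) / (double)(1L << size);
    for (long code = 0; code < (1L << size); code++)
        for (int pos = 0; pos < size; pos++) {
            double expect = dmin + v * (double)(code ^ (1L << pos));
            double got = mutate_real(dmin + v * (double)code, pos,
                                     dmin, dmax, size);
            double diff = got > expect ? got - expect : expect - got;
            if (diff > 1e-12)
                return 0;
        }
    return 1;
}
```

In the 2-bit example, mutating 0.25 (01) at position 1 gives 0.75 (11), and at position 0 gives 0.0 (00).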
4
James R. Parker
One Point Crossover
The crossover operation is a little more complex, but is based on the same ideas. Consider the situation where an argument has a size of 10 bits and a range from 0.0 to 1.0. Two instances of this argument, X and Y, are to be crossed over at bit 4; this situation is:

Argument   Bit Value    Real Value   Crossed Bit Value   Resulting Real
X          0110101010   .416016      0110010011          .393555
Y          0010010011   .143555      0010101010          .166016
This is accomplished by taking the high 4 bits of X and adding to them the low 6 bits of Y (and vice versa). In real terms, the low 6 bits can represent values up to $(2^{6}-1) \cdot V$, and the high 4 bits represent values from $2^{6} \cdot V$ upwards. Thus, the high 4 bits of the real value X are given by:

$$X_h = \left\lfloor \frac{X - D_{min}}{V \times 2^{n}} \right\rfloor \times V \times 2^{n} \tag{3}$$

where the exponent n counts the low-order bits to be discarded (6 in this example, which keeps the high 4 bits). This effectively masks out the low 6 bits by truncation and then shifts the value left by the proper amount. The low 6 bits are obtained by simply subtracting this from X:

$$X_l = X - X_h \tag{4}$$
Now do the same for Y, and finally let X = Xh+Yl and Y = Yh+Xl, as seen in the code for fcross1 in Algorithm 2.

    /* Perform a 1 point crossover */
    void fcross1 (float *s1, float *s2, int n, float xmin, float xmax, int size)
    {
        double xl, xh, yl, yh;
        double v, j;

        v = (double)(xmax-xmin)/(double)(1<<size);      /* value of the low bit */
        j = (double)(1 << (size-n));                    /* weight of the lowest kept bit */
        xh = (int)((*s1-xmin)/(j*v)+0.000001) * (j*v);  /* high n bits of X */
        xl = (*s1) - xh;                                /* low bits of X (plus xmin) */
        yh = (int)((*s2-xmin)/(j*v)+0.000001) * (j*v);  /* high n bits of Y */
        yl = (*s2) - yh;                                /* low bits of Y (plus xmin) */
        xh = xh + yl;                                   /* X keeps its high bits, takes Y's low bits */
        yh = yh + xl;                                   /* and vice versa */
        *s1 = (float)xh;
        *s2 = (float)yh;
    }

Algorithm 2 – Code for a one-point crossover
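The worked example in the table above (10 bits, range 0.0 to 1.0, crossover at bit 4) can be reproduced with a short check. This is a sketch, not the author's test program: `cross_real` restates Algorithm 2 in double precision, and `cross_bits` is a hypothetical integer counterpart added here only for comparison.

```c
#include <assert.h>

/* One-point crossover on doubles: keep the high n bits of each argument
   and swap the remaining low (size - n) bits. */
static void cross_real(double *x, double *y, int n,
                       double dmin, double dmax, int size)
{
    double v, w, xh, xl, yh, yl;

    assert(size > 0 && size < 31 && n >= 0 && n <= size);
    v = (dmax - dmin) / (double)(1 << size);       /* value of the low bit */
    w = v * (double)(1 << (size - n));             /* weight of the lowest kept bit */
    xh = (double)(long)((*x - dmin) / w + 1e-6) * w;
    yh = (double)(long)((*y - dmin) / w + 1e-6) * w;
    xl = *x - xh;                                  /* low bits, still offset by dmin */
    yl = *y - yh;
    *x = xh + yl;
    *y = yh + xl;
}

/* The same crossover carried out on the integer bit strings. */
static void cross_bits(unsigned *a, unsigned *b, int n, int size)
{
    unsigned low = (1u << (size - n)) - 1u;        /* mask of the low bits */
    unsigned ta = (*a & ~low) | (*b & low);
    unsigned tb = (*b & ~low) | (*a & low);

    *a = ta;
    *b = tb;
}
```

Here 0110101010 and 0010010011 are the integers 426 and 147, so the real arguments are 426/1024 and 147/1024; both routines should arrive at 403/1024 (.393555) and 170/1024 (.166016).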
5
Evaluation of the Methods
The single bit mutation function has been tested on over 2.5 million mutations using various experimental protocols. For example, a program was written that generates random whole-number ranges, bit string sizes, and values, and then creates a bit string from the generated values. All possible one bit mutations were carried out, both on the bit string and on the real value, and the results were compared. The worst case error, the difference between the two values, was 0. For this same experiment the worst case error for the one point crossover was also 0. It must be noted that the mutations and crossovers were carried out on double precision numbers that could be represented exactly using the bit string form. That the floating point chromosome representation can achieve all legal real values in the parameter range can be viewed as an advantage.

Speed of execution is another important issue, but a more difficult one to evaluate. There is a very good chance that some other programmer could code a faster bit string module than the one used in the evaluation. Still, some evaluation must be done and, after all, all of the code was written by the same person. For the execution profile done on a SPARC5 workstation, an example that performed 2,500,000 crossovers had the following performance (CPU time):

Method         Time for crossovers
Real numbers   6.42 seconds
Bit strings    79.76 seconds

For mutations, 2.5 million trials were performed, each trial mutating every bit of the parameter in turn. Thus, the total number of mutations was 80 million. The result was:

Method         Time for mutations
Real numbers   114.38 seconds
Bit strings    33.4 seconds

In both cases the size of the parameter was set at 32 bits, and no special optimizations were performed. When a table was used to perform mutations the time was reduced to 77.02 seconds. However, it is not in the computation of the bit mutation or crossover that the great savings of time can be expected. After a new generation is determined, the parameters must be decoded from bit string form to fixed point form so that the objective function can be evaluated. The decoding alone, for 80 million mutations, requires 1150 seconds; the use of floating point parameters eliminates the need for any decoding step, and this component of the execution time is saved by using the floating point operators.
6
Dynamic Sizes and Ranges
When using real arguments in the way described thus far the notion of chromosome length is a fiction. The size and range associated with any argument are only of importance when a mutation or crossover is being performed. It is therefore a simple task to change these, or possibly even make them parameters themselves.

Increasing the size of the bit string has the effect of decreasing V, which means that the precision of the search is improved. There is no significant implementation overhead involved in using large sizes up to the maximum precision of a floating point number. Of course, when using a string representation there are only certain numbers that can be represented. As the string size increases, the number of representable numbers also increases, doubling each time the size increases by one. An arbitrary floating point number x is mapped to a representable one z using the relationship

$$z = \left\lfloor \frac{x}{v} \right\rfloor v \tag{5}$$

We call this operation phasing, and it is done whenever the population is initialized using random floating point values.

Changing the ranges while performing an optimization can be done if it can be determined with some reliability that the argument involved does not have an optimal value in the outlying regions of the range. Decreasing the range while maintaining a constant size has the effect of refining the step size (decreasing V again) and increasing the precision of the result while improving the speed with which the optimal value can be achieved. To modify the range in a bit string implementation requires a decoding of the argument, a modification of the range, and then an encoding at the new range. The real implementation does not require a change in the encoding, and achieves a high accuracy simulation of the bit string results provided that the ranges have acceptable numerical properties.

That size and range can be changed at any time with little overhead means that this can be used to advantage while performing an optimization. Consider the problem of finding the global minimum of Bohachevsky's function, defined above.
Figure 2 shows the average of 100 trials of both the basic genetic algorithm and an adaptive variant in finding this minimum. The standard algorithm uses a population of 100 and a size of 10 bits, with a range of -1 to +1 in each parameter. The adaptive method starts with a string size of 2 bits, and adds a bit every 25 iterations. In addition, we use a dynamic region with a given range centered about each point instead of a fixed extent domain for the parameters. The size of this domain decreases by 1% every iteration (it is multiplied by 0.99). It is clear that the adaptive version finds the minimum in fewer function evaluations. It must be mentioned that the changes to the program to implement the adaptive algorithm were trivial, as no strings were actually used in the implementation. Using the same kind of graph as Figure 2, it was determined that, when using the adaptive algorithm, we should always start using 2-bit string lengths, increasing the strings by 1 (virtual) bit each 30 iterations (of 10 individuals), and changing the range by a factor of 0.99 each iteration. This same scheme was then used to find the global minimum of the Ackley function [9, 10]:
$$f(x) = -20\, e^{-0.2\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^{2}}} - e^{\frac{1}{n}\sum_{i=1}^{n} \cos 2\pi x_i} + 20 + e \tag{6}$$

Fig. 2 Comparison of the standard genetic algorithm and the adaptive one in attempting to minimize Bohachevsky's function. The adaptive method is faster
The graph of Figure 3 shows that the adaptive algorithm finds the global minimum (at 0,0) much more quickly than the standard algorithm, with no modifications from the implementation used to produce Figure 2 except the change in objective function.
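The adaptive scheme described above (start with 2 virtual bits, add one bit every 30 iterations, multiply the range by 0.99 each iteration) needs very little code, because only the size and range used by the operators change. The sketch below is an illustration under assumptions: the type `Schedule`, the functions `adaptive_schedule` and `step_size`, and the cap `max_size` are hypothetical names introduced here, not part of the paper's implementation.

```c
#include <assert.h>

/* Virtual string length and search range in force at a given iteration. */
typedef struct {
    int    size;        /* current number of (virtual) bits */
    double half_width;  /* half the width of the range centred on a point */
} Schedule;

static Schedule adaptive_schedule(int iteration, int start_size,
                                  int grow_every, int max_size,
                                  double start_half_width, double shrink)
{
    Schedule s;
    int i;

    assert(iteration >= 0 && grow_every > 0);
    s.size = start_size + iteration / grow_every;   /* one more bit per block */
    if (s.size > max_size)                          /* assumed cap, not from the paper */
        s.size = max_size;
    s.half_width = start_half_width;
    for (i = 0; i < iteration; i++)                 /* shrink once per iteration */
        s.half_width *= shrink;
    return s;
}

/* Step size V implied by the schedule: range width divided by 2^size. */
static double step_size(Schedule s)
{
    assert(s.size >= 0 && s.size < 31);
    return 2.0 * s.half_width / (double)(1 << s.size);
}
```

A call such as `adaptive_schedule(t, 2, 30, 30, 1.0, 0.99)` would give the size and range to pass to the mutation and crossover operators at iteration `t`; the population values themselves need no re-encoding.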
7
Conclusions and Further Work
A method for modeling the effects of single bit mutations and one point crossovers using real arguments has been described. It has a high similarity to the equivalent bit string operations, and is not a statistical approximation; moreover, it executes at an acceptably high speed. This method has been used in a stellar photometry system [2,3,7], and will be used in future in applications to computer vision and image processing. The method allows the size and domain of the genetic algorithm to be modified on the fly, producing an adaptive algorithm that appears to perform floating point parameter optimizations much more quickly than the standard algorithm. The method could also be easily applied to multiple mutation and crossover, and to other operators used in genetic algorithms.

Fig. 3 Comparison of the standard genetic algorithm and the adaptive one in attempting to minimize the Ackley function
References
1. Bohachevsky, I.O., Johnson, M.E. and Stein, M.L., “Generalized simulated annealing for function optimization”, Technometrics, Vol. 28, Pp. 209-217, 1986.
2. Stetson, P.B., “DAOPHOT: A computer program for crowded-field stellar photometry”, Pub. A. S. P., Vol. 191, Pp. 191-222, 1987.
3. Parker, J.R., “Algorithms for Image Processing and Computer Vision”, John Wiley & Sons, New York, 1997.
4. Davis, L. (ed), “Handbook of Genetic Algorithms”, Van Nostrand Reinhold, New York NY, 1991.
5. Goldberg, D.E., “Genetic Algorithms, Optimization, and Machine Learning”, Addison-Wesley, Reading MA, 1989.
6. Holland, J.H., “Adaptation in Natural and Artificial Systems”, University of Michigan Press, Ann Arbor, MI, 1975.
7. Groisman, G. and Parker, J.R., “Computer Assisted Photometry Using Simulated Annealing”, Computers in Physics, Vol. 7 No. 1, Pp. 87-96, Jan/Feb 1993.
8. Mitchell, M., “An introduction to genetic algorithms”, The MIT Press, Cambridge, Mass., 1996.
On the Role of Contextual Weak Independence in Probabilistic Inference
Cory J. Butz and Manon J. Sanscartier
Department of Computer Science, University of Regina, Regina, SK, S4S 0A2, Canada
Abstract. Previous experimental results have clearly demonstrated the effectiveness of utilizing context-specific independence (CSI) in probabilistic inference. However, CSI is a special case of a more general independence called contextual weak independence (CWI). In this paper, we show how CWI can be utilized for more efficient probabilistic inference. These results are quite significant as they suggest that CWI may play an important role in probabilistic inference.
1
Introduction
In practice, probabilistic inference would not be feasible without making independency assumptions. Directly specifying a joint probability distribution is not always possible, as one would have to specify $2^{n}$ entries for a distribution over n binary variables. However, Bayesian networks [2,4] have become a basis for designing probabilistic expert systems as the conditional independence (CI) assumptions encoded in a Bayesian network allow for a joint distribution to be indirectly specified as a product of conditional probability tables (CPTs). More importantly, perhaps, this factorization can lead to computationally feasible inference in some applications. Nevertheless, this approach to probabilistic inference is rather limited since it is based on a very strict type of independence, the probabilistic conditional independence. It is well-known that the notion of conditional independence is too restrictive to capture independencies that only hold in certain contexts. This kind of contextual independency was formalized as context-specific independence (CSI) by Boutilier et al. [1]. The important point is that Zhang and Poole [7] have empirically demonstrated that CSI can significantly speed up inference. At the same time, Wong and Butz [6] emphasized that CSI is a special case of a more general contextual independency called contextual weak independence (CWI).

In this paper, we show that CWI may be more useful than CSI in probabilistic inference. While the notion of CI can factorize a joint distribution as a product of CPTs, the notion of CSI can refine the CPTs themselves. Since CSI is a special case of CWI, the notion of CWI can further refine the CPTs. We explicitly demonstrate in Section 4 that this refinement can reduce the number of multiplications and additions needed for probabilistic inference.

Finally, it is worth mentioning that although this paper focuses on inference using CWI, we take advantage of the union product operator developed by Zhang and Poole [7].

R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 185–194, 2002. © Springer-Verlag Berlin Heidelberg 2002
This paper is organized as follows. In Section 2, we briefly review probabilistic inference in Bayesian networks. In Section 3, we illustrate the usefulness of CSI in inference. In Section 4, we show how more efficient probabilistic inference can be achieved in a CWI approach using independencies that would go unnoticed in a CSI approach. The conclusion is presented in Section 5.
2
Bayesian Networks
Consider a finite set $U = \{A_1, A_2, \ldots, A_n\}$ of discrete random variables, where each variable $A \in U$ takes on values from a finite domain $V_A$. We may use capital letters, such as A, B, C, for variable names and lowercase letters a, b, c to denote specific values taken by those variables. Sets of variables will be denoted by capital letters such as X, Y, Z, and assignments of values to the variables in these sets (called configurations or tuples) will be denoted by lowercase letters x, y, z. We use $V_X$ in the obvious way. We shall also use the short notation p(a) for the probabilities p(A = a), $a \in V_A$, and p(z) for the set of variables Z = {A, B} = AB meaning p(Z = z) = p(A = a, B = b) = p(a, b), where $a \in V_A$, $b \in V_B$. Let p be a joint probability distribution (jpd) [2] over the variables in U and X, Y, Z be subsets of U. We say Y and Z are conditionally independent given X, if given any $x \in V_X$, $y \in V_Y$, then for all $z \in V_Z$,

$$p(y \mid x, z) = p(y \mid x), \quad \text{whenever } p(x, z) > 0. \tag{1}$$
For convenience we write Eq. (1) as p(Y | X, Z) = p(Y | X). Based on the conditional independence (CI) assumptions encoded in the Bayesian network in Fig. 1, the jpd p(A, B, C, D, E) can be factorized as

$$p(A, B, C, D, E) = p(A) \cdot p(B) \cdot p(C|A) \cdot p(D|A, B) \cdot p(E|A, C, D). \tag{2}$$

Using the CPTs p(D|A, B) and p(E|A, C, D) shown in Fig. 2, we conclude this section with an example of probabilistic inference. The distribution p(A, B, C, E) can be computed from Eq. (2) as

$$\begin{aligned}
p(A, B, C, E) &= \sum_{D} p(A, B, C, D, E) \\
&= \sum_{D} p(A) \cdot p(B) \cdot p(C|A) \cdot p(D|A, B) \cdot p(E|A, C, D) \\
&= p(A) \cdot p(B) \cdot p(C|A) \cdot \sum_{D} p(D|A, B) \cdot p(E|A, C, D).
\end{aligned} \tag{3}$$
Computing the product p(D|A, B) · p(E|A, C, D) of the two distributions in Fig. 2 requires 32 multiplications. Marginalizing out variable D from this product requires 16 additions. The resulting distribution can be multiplied with p(A) · p(B) · p(C|A) to obtain our desired distribution p(A, B, C, E).
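The operation counts quoted above can be confirmed by instrumenting a direct implementation. The sketch below assumes all five variables are binary; the type `OpCount` and the function `sum_out_d` are illustrative names, not from the paper. The counts depend only on the table sizes, not on the probabilities stored in them.

```c
#include <assert.h>

/* Counters for the arithmetic performed. */
typedef struct { int mults; int adds; } OpCount;

/* Sum out D from p(D|A,B) * p(E|A,C,D) for binary variables.
   pd[a][b][d]    = p(D=d | A=a, B=b)
   pe[a][c][d][e] = p(E=e | A=a, C=c, D=d)
   out[a][b][c][e] receives the factor that remains once D is removed. */
static OpCount sum_out_d(double pd[2][2][2], double pe[2][2][2][2],
                         double out[2][2][2][2])
{
    OpCount n = {0, 0};
    int a, b, c, d, e;

    for (a = 0; a < 2; a++)
        for (b = 0; b < 2; b++)
            for (c = 0; c < 2; c++)
                for (e = 0; e < 2; e++) {
                    double sum = 0.0;
                    for (d = 0; d < 2; d++) {
                        double term = pd[a][b][d] * pe[a][c][d][e];
                        n.mults++;                  /* one product per (a,b,c,d,e) */
                        if (d == 0) {
                            sum = term;
                        } else {
                            sum += term;            /* one addition per (a,b,c,e) */
                            n.adds++;
                        }
                    }
                    out[a][b][c][e] = sum;
                }
    return n;
}
```

There are 2^5 = 32 configurations of (A, B, C, D, E), hence 32 multiplications, and 2^4 = 16 entries in the result, each formed with a single addition.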
Fig. 2. The CPTs p(D|A, B) and p(E|A, C, D) in Eq. (2)
3
Inference with Context-Specific Independence
The Bayesian network factorization of p(A, B, C, D, E) in Eq. (2) only reflects conditional independencies p(y|x, z) = p(y|x) which hold for all $x \in V_X$. In some situations, however, the conditional independence may only hold for certain specific values in $V_X$. Consider again the CPT p(D|A, B) redrawn in Fig. 3 (i). Although variables D and B are not conditionally independent given A, it can be seen in Fig. 3 (ii,iii) that D and B are independent in context A = 0, that is,

$$p(D = d \mid A = 0, B = b) = p(D = d \mid A = 0).$$
(i)
A  B  D  p(D|A, B)
0  0  0  0.3
0  0  1  0.7
0  1  0  0.3
0  1  1  0.7
1  0  0  0.6
1  0  1  0.4
1  1  0  0.8
1  1  1  0.2

(ii)
A  B  D  p(D|A = 0, B)
0  0  0  0.3
0  0  1  0.7
0  1  0  0.3
0  1  1  0.7

A  B  D  p(D|A = 1, B)
1  0  0  0.6
1  0  1  0.4
1  1  0  0.8
1  1  1  0.2

(iii)
A  D  p(D|A = 0)
0  0  0.3
0  1  0.7
Fig. 3. Variables D and B are conditionally independent in context A = 0
Similarly, for the CPT p(E|A, C, D) redrawn in Fig. 4 (i), it can be seen in Fig. 4 (ii,iii) that variables E and D are independent given C in context A = 0, while variables E and C are independent given D in context A = 1, i.e.,

$$p(E = e \mid A = 0, C = c, D = d) = p(E = e \mid A = 0, C = c)$$

and

$$p(E = e \mid A = 1, C = c, D = d) = p(E = e \mid A = 1, D = d).$$
This kind of contextual independency was formalized as context-specific independence (CSI) by Boutilier et al. [1] as follows. Let X, Y, Z, C be pairwise disjoint subsets of U and $c \in V_C$. We say Y and Z are conditionally independent given X in context C = c, if

$$p(y \mid x, z, c) = p(y \mid x, c), \quad \text{whenever } p(x, z, c) > 0.$$
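For a small CPT the definition can be tested directly. The following sketch checks whether D is independent of B in a given context A = a, using the values of p(D|A, B) from Fig. 3 (i); the function name `csi_holds` and the tolerance parameter are assumptions made for illustration.

```c
#include <assert.h>

/* Returns 1 if D is independent of B in the context A = a, i.e. if
   p(D=d | A=a, B=b) is the same for every value b; returns 0 otherwise.
   cpt[a][b][d] holds p(D=d | A=a, B=b) for binary A, B and D. */
static int csi_holds(double cpt[2][2][2], int a, double tol)
{
    int b, d;

    assert(a == 0 || a == 1);
    for (d = 0; d < 2; d++)
        for (b = 1; b < 2; b++) {
            double diff = cpt[a][b][d] - cpt[a][0][d];
            if (diff > tol || diff < -tol)
                return 0;               /* some row differs: no CSI here */
        }
    return 1;
}
```

With the table of Fig. 3 (i) the check should succeed for A = 0 and fail for A = 1, matching the discussion above.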
In order to utilize the above three context-specific independencies for more efficient probabilistic inference, Zhang and Poole [7] generalized the standard product operator · as the union product operator ⊙. The union product p(Y, X) ⊙ q(X, Z) of functions p(Y, X) and q(X, Z) is the function on YXZ defined as

$$p(y, x) \odot q(x, z) = \begin{cases}
p(y, x) \cdot q(x, z) & \text{if both } p(y, x) \text{ and } q(x, z) \text{ are defined} \\
p(y, x) & \text{if } p(y, x) \text{ is defined and } q(x, z) \text{ is undefined} \\
q(x, z) & \text{if } p(y, x) \text{ is undefined and } q(x, z) \text{ is defined} \\
\text{undefined} & \text{if both } p(y, x) \text{ and } q(x, z) \text{ are undefined.}
\end{cases}$$

Note that ⊙ is commutative and associative [7]. (It should be mentioned that Zhang and Poole [7] also pointed out that the notion of CSI can be applied in the problem of constructing a Bayesian network [5].) The union product operator allows for a single CPT to be horizontally partitioned into more than one CPT, which, in turn, exposes the contextual independencies. Returning to the factorization in Eq. (2), the CPT p(D|A, B) can
The use of CSI leads to more efficient probabilistic inference. Computing p(A, B, C, E) from Eq. (6) involves

$$\begin{aligned}
p(A, B, C, E) &= \sum_{D} p(A) \cdot p(B) \cdot p(C|A) \odot p(D|A = 0) \odot p(D|A = 1, B) \odot p(E|A = 0, C) \odot p(E|A = 1, D) \\
&= p(A) \cdot p(B) \cdot p(C|A) \odot p(E|A = 0, C) \odot \sum_{D} p(D|A = 0) \odot p(D|A = 1, B) \odot p(E|A = 1, D).
\end{aligned} \tag{7}$$
Computing the union product p(D|A = 0) ⊙ p(D|A = 1, B) ⊙ p(E|A = 1, D) requires 8 multiplications. Next, 8 additions are required to marginalize out variable D. Eight more multiplications are required to compute the union product of the resulting distribution with p(E|A = 0, C). The resulting distribution can be multiplied with p(A) · p(B) · p(C|A) to give p(A, B, C, E). The important point in this section is that computing p(A, B, C, E) from the CSI factorization in Eq. (7) required 16 fewer multiplications and 8 fewer additions compared to the respective number of computations needed to compute p(A, B, C, E) from the CI factorization in Eq. (3).
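The four-case definition of the union product translates directly into code once "undefined" is given a representation. The sketch below is one possible encoding, not the implementation of Zhang and Poole: the type `Entry`, its `defined` flag and the optional multiplication counter are assumptions introduced for illustration.

```c
#include <assert.h>

/* A partial function value: `defined` is 0 where the function has no entry
   for the configuration under consideration. */
typedef struct { int defined; double value; } Entry;

/* Union product of two entries that refer to the same configuration.
   If mults is not NULL it is incremented whenever a real multiplication
   is carried out. */
static Entry union_product(Entry p, Entry q, int *mults)
{
    Entry r;

    if (p.defined && q.defined) {
        r.defined = 1;
        r.value = p.value * q.value;
        if (mults)
            (*mults)++;
    } else if (p.defined) {
        r = p;                          /* q undefined: copy p through */
    } else if (q.defined) {
        r = q;                          /* p undefined: copy q through */
    } else {
        r.defined = 0;                  /* both undefined */
        r.value = 0.0;
    }
    return r;
}
```

Because a partitioned table such as p(D|A = 0) has no entries for configurations with A = 1, its union product with p(D|A = 1, B) only copies values and costs no multiplications, which is where the savings counted above come from.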
4
Inference with Contextual Weak Independence
Since CSI is a special case of contextual weak independence (CWI) [6], any computational savings achieved in a CSI approach will also be achieved in a CWI approach. In addition, we show in this section that more efficient probabilistic inference can be achieved in a CWI approach using independencies that would go unnoticed in a CSI approach. Consider another jpd p′(A, B, C, D, E) which also satisfies the conditional independencies encoded in the Bayesian network in Fig. 1,

$$p'(A, B, C, D, E) = p'(A) \cdot p'(B) \cdot p'(C|A) \cdot p'(D|A, B) \cdot p'(E|A, C, D). \tag{8}$$

The two CPTs p′(D|A, B) and p′(E|A, C, D) are shown in Fig. 5 (i) and Fig. 6 (i), respectively. In the CPT p′(D|A, B), there are no context-specific independencies holding in p′(D|A = 0, B) nor in p′(D|A = 1, B). Similarly, in the CPT p′(E|A, C, D), there are no context-specific independencies holding in p′(E|A = 0, C, D) nor in p′(E|A = 1, C, D). This means that no refinement of the BN factorization in Eq. (8) is possible in a CSI approach. Thereby, computing p′(A, B, C, E) from Eq. (8) in a CSI approach involves

$$\begin{aligned}
p'(A, B, C, E) &= \sum_{D} p'(A) \cdot p'(B) \cdot p'(C|A) \cdot p'(D|A, B) \cdot p'(E|A, C, D) \\
&= p'(A) \cdot p'(B) \cdot p'(C|A) \cdot \sum_{D} p'(D|A, B) \cdot p'(E|A, C, D).
\end{aligned} \tag{9}$$
Computing $\sum_{D} p'(D|A, B) \cdot p'(E|A, C, D)$ requires 64 multiplications and 32 additions. Unlike the definition of CSI [1], the definitions of CWI (given below) and the union product operator ⊙ do not require a CPT to be horizontally partitioned as a dichotomy (see Fig. 3 and 4). On the contrary, by the definition of ⊙, the CPT p′(D|A, B) in Eq. (8) can be written as

$$p'(D|A, B) = p'_1(D|A = 0, B) \odot p'_2(D|A = 0, B) \odot p'_1(D|A = 1, B) \odot p'_2(D|A = 1, B), \tag{10}$$
(i)
A  B  D  p'(D|A, B)
0  0  0  0.3
0  0  1  0.7
0  1  0  0.3
0  1  1  0.7
0  2  2  0.1
0  2  3  0.9
0  3  2  0.1
0  3  3  0.9
1  0  0  0.6
1  0  1  0.4
1  1  0  0.8
1  1  1  0.2
1  2  2  0.2
1  2  3  0.8
1  3  2  0.3
1  3  3  0.7

(ii)
A  B  D  p'_1(D|A = 0, B)
0  0  0  0.3
0  0  1  0.7
0  1  0  0.3
0  1  1  0.7

A  B  D  p'_2(D|A = 0, B)
0  2  2  0.1
0  2  3  0.9
0  3  2  0.1
0  3  3  0.9

A  B  D  p'_1(D|A = 1, B)
1  0  0  0.6
1  0  1  0.4
1  1  0  0.8
1  1  1  0.2

A  B  D  p'_2(D|A = 1, B)
1  2  2  0.2
1  2  3  0.8
1  3  2  0.3
1  3  3  0.7

(iii)
A  D  p'_1(D|A = 0)
0  0  0.3
0  1  0.7

A  D  p'_2(D|A = 0)
0  2  0.1
0  3  0.9
Fig. 5. Variables D and B are weakly independent in context A = 0
as illustrated in Fig. 5 (i,ii). Variables D and B are conditionally independent in context A = 0 in both $p'_1(D|A = 0, B)$ and $p'_2(D|A = 0, B)$, as depicted in Fig. 5 (iii). Thus, Eq. (10) can be refined as

$$p'(D|A, B) = p'_1(D|A = 0) \odot p'_2(D|A = 0) \odot p'_1(D|A = 1, B) \odot p'_2(D|A = 1, B). \tag{11}$$

Similarly, the CPT p′(E|A, C, D) in Eq. (8) can be written as

$$p'(E|A, C, D) = p'_1(E|A = 0, C, D) \odot p'_2(E|A = 0, C, D) \odot p'_1(E|A = 1, C, D) \odot p'_2(E|A = 1, C, D), \tag{12}$$

as illustrated in Fig. 6 (i,ii). Variables E and D are conditionally independent given C in context A = 0 in both $p'_1(E|A = 0, C, D)$ and $p'_2(E|A = 0, C, D)$, as depicted in Fig. 6 (iii). In addition, E and C are conditionally independent given D in context A = 1 in both $p'_1(E|A = 1, C, D)$ and $p'_2(E|A = 1, C, D)$, as shown in Fig. 6 (iii). These independencies can be used to refine Eq. (12) as

$$p'(E|A, C, D) = p'_1(E|A = 0, C) \odot p'_2(E|A = 0, C) \odot p'_1(E|A = 1, D) \odot p'_2(E|A = 1, D). \tag{13}$$
A  D  E  p'_2(E|A = 1, D)
1  2  2  0.4
1  2  3  0.6
1  3  2  0.1
1  3  3  0.9

C  E  p'_1(E|A = 0, C)
0  0  0.1
0  1  0.9
1  0  0.8
1  1  0.2
Fig. 6. Variables E and D are weakly independent given C in context A = 0, while E and C are weakly independent given D in context A = 1
This type of contextual independency was formalized as contextual weak independence (CWI) by Wong and Butz [6] as follows. Let X, Y, Z, C be pairwise disjoint subsets of U and $c \in V_C$. We say Y and Z are weakly independent given X in context C = c, if both of the following two conditions are satisfied: (i) there exists a maximal disjoint compatibility class [3] $\pi = \{t_i, \ldots, t_j\}$ in the relation θ(X, Y, C = c) ◦ θ(X, Z, C = c), and (ii) given any $x \in V_X^{\pi}$, $y \in V_Y^{\pi}$, then for all $z \in V_Z^{\pi}$, p(y | x, z, c) = p(y | x, c), whenever p(x, z, c) > 0, where θ(W) denotes the equivalence relation induced by the set W of variables, ◦ denotes the composition operator, and $V_W^{\pi}$ denotes the set of values for W appearing in π.

Unlike the notion of CSI, the notion of CWI can refine the Bayesian network factorization of p′(A, B, C, D, E) in Eq. (8). By substituting Eqs. (11) and (13) into Eq. (8), the factorization of p′(A, B, C, D, E) in a CWI approach is

$$\begin{aligned}
p'(A, B, C, D, E) = {} & p'(A) \cdot p'(B) \cdot p'(C|A) \odot p'_1(D|A = 0) \odot p'_2(D|A = 0) \odot p'_1(D|A = 1, B) \odot p'_2(D|A = 1, B) \\
& \odot p'_1(E|A = 0, C) \odot p'_2(E|A = 0, C) \odot p'_1(E|A = 1, D) \odot p'_2(E|A = 1, D).
\end{aligned} \tag{14}$$
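Computing p′(A, B, C, E) from Eq. (14) involves

$$\begin{aligned}
p'(A, B, C, E) = {} & \sum_{D} p'(A) \cdot p'(B) \cdot p'(C|A) \odot p'_1(D|A = 0) \odot p'_2(D|A = 0) \odot p'_1(D|A = 1, B) \odot p'_2(D|A = 1, B) \\
& \odot p'_1(E|A = 0, C) \odot p'_2(E|A = 0, C) \odot p'_1(E|A = 1, D) \odot p'_2(E|A = 1, D) \\
= {} & p'(A) \cdot p'(B) \cdot p'(C|A) \odot p'_1(E|A = 0, C) \odot p'_2(E|A = 0, C) \\
& \odot \sum_{D} p'_1(D|A = 0) \odot p'_2(D|A = 0) \odot p'_1(D|A = 1, B) \odot p'_2(D|A = 1, B) \odot p'_1(E|A = 1, D) \odot p'_2(E|A = 1, D).
\end{aligned} \tag{15}$$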
In this case, only 32 multiplications and 16 additions are required to compute the distribution to be multiplied with p′(A) · p′(B) · p′(C|A), as opposed to the 64 multiplications and 32 additions needed in the CSI factorization in Eq. (9). The main point in this section is that computing p′(A, B, C, E) from the CWI factorization in Eq. (15) required 32 fewer multiplications and 16 fewer additions compared to the respective number of computations needed to compute p′(A, B, C, E) in the CSI factorization in Eq. (9).
5
Conclusion
Recently, it has been empirically demonstrated that CSI can lead to more efficient probabilistic inference than can be obtained using CI alone [7]. At the same time, it has been shown in [6] that CSI is a special case of CWI. This means that any computational savings achieved in a CSI approach will also be achieved in a CWI approach. In addition, as shown in Section 4, more efficient probabilistic inference can be obtained in a CWI approach when compared to a CSI approach. We are currently conducting more thorough experiments to support the encouraging results in this paper.
References
1. Boutilier, C., Friedman, N., Goldszmidt, M., Koller, D.: Context-specific independence in Bayesian networks, Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, 115–123, 1996.
2. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Francisco (1988)
3. Preparata, F. and Yeh, R.: Introduction to Discrete Structures. Addison-Wesley, Don Mills, Ontario (1973)
4. Wong, S. K. M., Butz, C. J., Wu, D.: On the implication problem for probabilistic conditional independency. IEEE Trans. Syst. Man Cybern. SMC-A 30(6) (2000) 785–805
5. Wong, S. K. M., Butz, C. J.: Constructing the dependency structure of a multiagent probabilistic network. IEEE Trans. Knowl. Data Eng. 13(3) (2001) 395–415
6. Wong, S. K. M., Butz, C. J.: Contextual weak independence in Bayesian networks, Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 670–679, 1999.
7. Zhang, N. and Poole, D.: On the role of context-specific independence in probabilistic inference, Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1288–1293, 1999.
A Structural Characterization of DAG-Isomorphic Dependency Models
S. K. M. Wong, Dan Wu, and Tao Lin
Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada, S4S 0A2
{wong,danwu}@cs.uregina.ca
Abstract. Graphical models have been extensively used in probabilistic reasoning for representing conditional independency (CI) information. Two of the best-known models are undirected graphs (UGs) and directed acyclic graphs (DAGs). Given a set of CIs, it would be desirable to know whether this set can be perfectly represented by a UG or DAG. A necessary and sufficient condition using axioms has been found for a set of CIs that can be perfectly represented by a UG, while a negative result has been shown for DAGs, i.e., there does not exist a finite set of axioms which can characterize a set of CIs having a perfect DAG. However, this does not exclude other possible ways for such a characterization. In this paper, by studying the relationship between CIs and factorizations of a joint probability distribution, we show that there does exist such a characterization for DAGs in terms of the structure of the given set of CIs. More precisely, we demonstrate that if the given set of CIs satisfies certain constraints, then it has a perfect DAG representation.
1
Introduction
Bayesian networks [6] (BNs) have been well established as a mechanism for processing uncertain information. The success of BNs relies on the utilization of conditional independency (CI) information to factorize a joint probability distribution (jpd). Various graphical models have been developed for representing CIs. Two of the best-known models are undirected graphs (UGs) and directed acyclic graphs (DAGs). One of the advantages of these graphical models is that they provide a convenient graphical method to infer new CIs that are logical consequences of an input set. These methods are referred to as cutset separation and d-separation for UGs and DAGs [2,6], respectively. More importantly, perhaps, these two graphical representations are perfect maps for particular kinds of CIs. This means both the DAG and the UG provide a faithful graphical representation of certain categories of CIs. An important question naturally arises: Is it possible to use axioms to characterize these CIs that can be faithfully represented by UGs or DAGs? It was shown in [7] that one can indeed find a finite set of axioms to characterize those CIs that have a UG as a perfect map. Unfortunately, the answer to the above question is negative for those CIs that can be faithfully represented by a DAG [4], namely, no such finite set of axioms exists.

In designing a probabilistic reasoning system, very often one may have to construct the DAG of a BN from a given set of CIs. In these situations, it is important to know a priori whether the input CIs can be faithfully represented by a DAG. A procedure was suggested in [10] to test if a given set of CIs has a DAG as a perfect map. Obviously, this is not a characterization of CIs. In this paper, we propose a structural characterization for those CIs that can be faithfully represented by the DAG of a BN. We show that if the input CIs can be arranged in a certain hierarchical structure, then they can be faithfully represented by a DAG. Our results also reveal the intrinsic relationship between DAGs and various factorizations of a jpd.

The paper is organized as follows. We review pertinent notions in Sect. 2. In Sect. 3, we discuss the intrinsic relationship between factorizations of a jpd and equivalence classes of DAGs. We study the use of CIs to factorize a jpd in Sect. 4. In Sect. 5, we introduce our hierarchical characterization of CIs. The conclusion is presented in Sect. 6.

R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 195–209, 2002. © Springer-Verlag Berlin Heidelberg 2002
2
Background
We first introduce the notion of hypergraph. A hypergraph is a pair (N, S), where N is a finite set of nodes (variables) and S is a set of edges (hyperedges) which are arbitrary subsets of N [1,8]. If the nodes are understood, we will use S to denote the hypergraph (N, S). An ordinary undirected graph (without self-loops) is, of course, a hypergraph whose every edge is of size two. We say an element $S_i$ in S is a twig if there exists another element $S_j$ in S, distinct from $S_i$, such that $(\cup(S - \{S_i\})) \cap S_i = S_i \cap S_j$. We call any such $S_j$ a branch for the twig $S_i$. A hypergraph S is a hypertree [5,8] if its elements can be ordered, say $S_1, S_2, \ldots, S_N$, so that $S_i$ is a twig in $\{S_1, S_2, \ldots, S_i\}$, for i = 2, ..., N. We call any such ordering a tree (hypertree) construction ordering for S. Given a tree construction ordering $S_1, S_2, \ldots, S_N$, we can choose, for i from 2 to N, an integer j(i) such that 1 ≤ j(i) ≤ i − 1 and $S_{j(i)}$ is a branch for $S_i$ in $\{S_1, S_2, \ldots, S_i\}$. We call a function j(i) that satisfies this condition a branching for S and $S_1, S_2, \ldots, S_N$. For example, let $N = \{A_1, A_2, \ldots, A_6\}$. Consider a hypergraph $S = \{S_1 = \{A_1, A_2, A_3\}, S_2 = \{A_1, A_2, A_4\}, S_3 = \{A_2, A_3, A_5\}, S_4 = \{A_5, A_6\}\}$. This hypergraph is a hypertree, as there exists a tree construction ordering, $S_3, S_1, S_2, S_4$. Furthermore, the branching function for this ordering is j(1) = 3, j(2) = 1, j(4) = 3.

We assume readers are familiar with the basic notions of Bayesian networks to the extent covered in [6]. We thus will quickly go through the notions that will be used in the paper. We use p(U) to represent a joint probability distribution (jpd) over a finite set U = {A, B, ...} of variables. Letters toward the end of the alphabet represent sets of variables, while letters at the beginning of the alphabet represent single variables. We call p(V), V ⊂ U, a marginal (distribution) of p(U) and p(X|Y) a conditional (distribution). By XY, we mean X ∪ Y. By definition of conditional probability,

$$p(X|Y) = \frac{p(XY)}{p(Y)}, \quad \text{whenever } p(Y) > 0,$$

we thus say that in the above expression, the denominator p(Y) is absorbed by the numerator p(XY) to yield the conditional p(X|Y). A conditional p(X|Y) is reduced to a marginal if Y = ∅. Let X, Y, Z be three disjoint sets of variables. We say X and Y are conditionally independent (CI) given Z, denoted I(X, Z, Y), if

$$p(XYZ) = \frac{p(XZ) \cdot p(YZ)}{p(Z)}, \quad \text{whenever } p(Z) > 0.$$
The CI I(X, Z, Y ) is full, if XY Z = U , otherwise, it is embedded. The context of a CI I(X, Z, Y ) is XY Z. We will use the term “CI” to denote either a full CI or an embedded CI when no confusion arises. We say X and Y are unconditional independent, denoted I(X, ∅, Y ), if p(XY ) = p(X) · p(Y ). A Bayesian network (BN) is a tuple (D, P ), where (a) D =< U, E > is a directed acyclic graph (DAG), as a qualitative part, with U = {X1 , . . . , Xn } as the nodes (variables) of DAG and E = {< Xi , Xj > | Xi , Xj ∈ U } as the set of directed edges of D; (b) P = {p(Xi |pa(Xi )) | Xi ∈ U } is a set of conditionals as quantitative part, where pa(Xi ) denotes the parents of node Xi in DAG D, such that, p(U ) =
n
p(Xi |pa(Xi )),
i=1
and we call the above expression a BN factorization. This is the conventional definition of BN [6]. We will use the term BN and DAG interchangeably if no confusion arises. A topological ordering of the nodes in U is an ordering of nodes in U , such that, if there is a (directed) path from node Xi to node Xj , then Xj appears after Xi in the ordering. A DAG may have many different topological orderings. Similar to BN, a Markov network (MN) is a tuple (H, P ), where (a) H = {h1 , . . . , hm } is a hypertree, as a qualitative part, defined on the set U = h1 h2 . . . hm of variables with h1 , . . . , hm being the hypertree construction ordering; and (b) P = {p(hi ) | hi ∈ H} is a set of marginals, as quantitative part, where p(hi ) is a marginal over nodes in hi , such that, p(U ) =
and we call the above expression a MN factorization. We will use the term MN and hypertree interchangeably if no confusion arises. It is noted that a MN can also be represented by a (chordal) DAG [11]. In this paper, we refer to a dependency model as a set of CI statements. A jpd certainly defines a dependency model because we can enumerate all the CI statements that this jpd satisfies.
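The twig and construction-ordering definitions given earlier in this section can be checked mechanically on small examples. The following is a minimal sketch in Python, not part of the original paper; the function names `branches_for`, `is_construction_ordering` and `is_hypertree` are illustrative assumptions, and the brute-force search over all orderings is only meant for tiny hypergraphs.

```python
from itertools import permutations

def branches_for(s_i, others):
    """Return every S_j in `others` that is a branch for S_i (empty if S_i is not a twig)."""
    union_rest = frozenset().union(*others)
    overlap = union_rest & s_i
    return [s_j for s_j in others if s_j != s_i and overlap == (s_i & s_j)]

def is_construction_ordering(ordering):
    """Check that S_i is a twig in {S_1, ..., S_i} for i = 2, ..., N."""
    return all(branches_for(ordering[i], ordering[:i]) for i in range(1, len(ordering)))

def is_hypertree(edges):
    """Brute force over all orderings (illustrative only, exponential in the number of edges)."""
    return any(is_construction_ordering(list(p)) for p in permutations(edges))

# The hypergraph from the example above.
S1 = frozenset({"A1", "A2", "A3"})
S2 = frozenset({"A1", "A2", "A4"})
S3 = frozenset({"A2", "A3", "A5"})
S4 = frozenset({"A5", "A6"})
ordering = [S3, S1, S2, S4]

# A triangle of pairwise-overlapping edges, used here as a non-hypertree example.
triangle = [frozenset("AB"), frozenset("BC"), frozenset("AC")]
```

For the ordering S3, S1, S2, S4 the branches found are S3 for S1, S1 for S2, and S3 for S4, which agrees with the branching j(1) = 3, j(2) = 1, j(4) = 3 stated in the text.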
Given a set C of CIs and a graph G, if every CI that is logically implied by C can be graphically verified in G, then we call G an I-map of C. If every CI that can be graphically verified in G is logically implied by C, then we call G a D-map of C. If G is both an I-map and a D-map of C, then we call G a perfect map of C, or say that G faithfully represents C. Given an arbitrary set C of CIs, a necessary and sufficient condition for C to have a perfect map that is an undirected graph (UG) is that C satisfies the set of five axioms elaborated in [7].

A DAG is equivalent to a special set of CIs, called a causal input list (CIL) (also called a boundary strata in [6]). In other words, the DAG is a perfect map of the CIL. For each topological ordering X1, X2, ..., Xn of a DAG, there is an associated CIL, namely, {I(Xi, pa(Xi), X1X2...Xi−1 − pa(Xi)); i = 1, ..., n}. By the definition of CI, the CIL implies p(Xi|X1X2...Xi−1) = p(Xi|pa(Xi)), which gives rise to the BN factorization. In the special case that pa(Xi) = X1X2...Xi−1, the BN factorization reduces to the chain rule factorization, and we call it a trivial BN factorization. Since a DAG may have multiple topological orderings, it has multiple CILs, but all the CILs of the same DAG are equivalent. There exists a set of sound and complete inference axioms for CILs, namely, the semi-graphoid (SG) axioms [6,11].

An MN, along with its associated hypertree H, is equivalent to a set of conflict-free full CIs [11], denoted CF(H). In other words, the hypertree H is a perfect map of CF(H). There also exists a set of sound and complete inference axioms for full CIs [11].
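To make the construction of a CIL concrete, the following minimal Python sketch lists the statements I(Xi, pa(Xi), X1...Xi−1 − pa(Xi)) for a given topological ordering. It is an illustration rather than the authors' procedure: the function name `causal_input_list`, the dictionary encoding of parents, and the small DAG with edges A→B, B→C, B→D are assumptions made for the example.

```python
def causal_input_list(parents, ordering):
    """Return the CIL as triples (X_i, pa(X_i), predecessors of X_i minus pa(X_i))."""
    cil = []
    for i, node in enumerate(ordering):
        pa = frozenset(parents.get(node, ()))
        preds = frozenset(ordering[:i])
        if not pa <= preds:
            # A parent appears after its child, so this is not a topological ordering.
            raise ValueError(f"{ordering} is not a topological ordering")
        cil.append((node, pa, preds - pa))
    return cil

# Hypothetical encoding of a DAG with edges A->B, B->C, B->D (node -> list of parents).
dag = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"]}
cil = causal_input_list(dag, ["A", "B", "C", "D"])
```

Each triple (X, Z, Y) in the result is read as the CI I(X, Z, Y); a different topological ordering of the same DAG, such as A, B, D, C, produces a different but equivalent list.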
3 BN Factorization and Equivalent Class of DAGs
It is well known that for a given DAG D0, there may exist a set of equivalent DAGs {D1, D2, ..., Dn} such that all of them possess the same CI information [9]. A graphical method has been developed [3] to obtain all equivalent DAGs for any given BN by studying the reversibility of directed edges in a DAG. In this section, we give an algebraic explanation of equivalent DAGs using the BN factorization.

Example 1. Consider the DAG D0 shown in Fig. 1 (i) and all its equivalent DAGs, D1, D2, D3, shown in Fig. 1 (ii). The BN factorization for the DAG in Fig. 1 (i) is:

$$p(ABCD) = p(A) \cdot p(B \mid A) \cdot p(C \mid B) \cdot p(D \mid B). \tag{1}$$

The equation in (1) can also be equivalently written as:

$$p(ABCD) = p(A) \cdot \frac{p(BA)}{p(A)} \cdot \frac{p(CB)}{p(B)} \cdot \frac{p(DB)}{p(B)} = \frac{p(BA) \cdot p(CB) \cdot p(DB)}{p(B) \cdot p(B)}. \tag{2}$$
Fig. 1. (i) A DAG D0. (ii) All its equivalent DAGs, D1, D2, D3

Let us take a close look at (2). Equation (1) can be obtained from (2) by: (i) absorbing one denominator p(B) by the numerator p(DB) to yield p(D|B); (ii) absorbing the other denominator p(B) by the numerator p(CB) to yield p(C|B); (iii) factorizing the numerator p(BA) by the chain rule as p(BA) = p(A) · p(B|A). The absorption of the denominators in (2) can be viewed as assigning to each denominator p(X) a numerator p(Y) such that X ⊆ Y. This assignment is not unique, however. Equation (2) can also be equivalently written by assigning (absorbing) the denominators in different ways, as shown below:

$$p(ABCD) = p(BA) \cdot \frac{p(CB)}{p(B)} \cdot \frac{p(DB)}{p(B)} = p(B) \cdot p(A \mid B) \cdot p(C \mid B) \cdot p(D \mid B), \tag{3}$$

$$p(ABCD) = p(CB) \cdot \frac{p(BA)}{p(B)} \cdot \frac{p(DB)}{p(B)} = p(C) \cdot p(B \mid C) \cdot p(A \mid B) \cdot p(D \mid B), \tag{4}$$

$$p(ABCD) = p(DB) \cdot \frac{p(BA)}{p(B)} \cdot \frac{p(CB)}{p(B)} = p(D) \cdot p(B \mid D) \cdot p(A \mid B) \cdot p(C \mid B), \tag{5}$$
where (3), (4) and (5) correspond to the equivalent DAGs D1, D2 and D3 in Fig. 1 (ii), respectively. It is worth mentioning that the CILs for the four DAGs shown in Fig. 1 are equivalent. This example indicates that the factorization of a jpd p(U) in terms of numerators and denominators is intrinsic to a BN, and that all equivalent DAGs can be derived from it algebraically by trying all possible assignments (absorptions) of denominators under which every denominator is absorbed.
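The idea of trying all possible absorptions can be sketched in a few lines of Python. This is an illustrative enumeration under our own encoding, not an algorithm from the paper: the function name `absorptions` is hypothetical, each term p(X) is represented by the set X, every numerator is assumed to absorb at most one denominator, and the numerators are assumed to be distinct sets.

```python
from itertools import permutations

def absorptions(numerators, denominators):
    """Enumerate distinct ways to give each denominator p(X) its own numerator p(Y), X ⊆ Y.

    Each result is a pair (conditionals, leftover): conditionals holds (Y - X, X) for every
    absorbed pair, i.e. the factor p(Y - X | X); leftover holds the unabsorbed numerators.
    """
    results = set()
    for chosen in permutations(range(len(numerators)), len(denominators)):
        if all(denominators[k] <= numerators[j] for k, j in enumerate(chosen)):
            conditionals = frozenset(
                (numerators[j] - denominators[k], denominators[k])
                for k, j in enumerate(chosen)
            )
            leftover = frozenset(
                numerators[j] for j in range(len(numerators)) if j not in chosen
            )
            results.add((conditionals, leftover))
    return results

# Numerators and denominators of equation (2).
nums = [frozenset("BA"), frozenset("CB"), frozenset("DB")]
dens = [frozenset("B"), frozenset("B")]
ways = absorptions(nums, dens)
leftovers = {"".join(sorted(y)) for _, left in ways for y in left}
```

For equation (2) the enumeration yields three assignments, distinguished by which numerator is left unabsorbed. Leaving p(BA) unabsorbed corresponds to both (1) and (3), depending on how p(BA) is expanded by the chain rule, while leaving p(CB) or p(DB) unabsorbed corresponds to (4) and (5).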
4 Conditional Independencies and Factorization
The question of characterizing a set of CIs which has a perfect map DAG has been studied in [4], which reports the negative result that no finite set of axioms can characterize the sets of CIs that have a perfect map DAG. However, this result does not exclude other, alternative means of characterization. Verma and Pearl [10] have designed an algorithm to test whether a set of CIs has a perfect map DAG. However, this algorithm is not a characterization. Before we tackle the problem of characterization, we first study in this section the relationship between a set of CIs and a jpd factorization.

Example 2. Recall Example 1 in the previous section. The CIL (for the topological ordering A, B, C, D) of the DAG shown in Fig. 1 (i) is:

$$I(A, \emptyset, \emptyset), \quad I(B, A, \emptyset), \quad I(C, B, A), \quad I(D, B, AC), \tag{6}$$
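each of which implies, by the definition of CI, that:

$$p(A) = \frac{p(A\emptyset) \cdot p(\emptyset)}{p(\emptyset)}, \tag{7}$$

$$p(AB) = \frac{p(AB) \cdot p(A\emptyset)}{p(A)}, \tag{8}$$

$$p(ABC) = \frac{p(AB) \cdot p(BC)}{p(B)}, \tag{9}$$

$$p(ABCD) = \frac{p(BD) \cdot p(ABC)}{p(B)}. \tag{10}$$

Substituting (9) into (10), we obtain:

$$p(ABCD) = \frac{p(BD) \cdot \left( \frac{p(AB) \cdot p(BC)}{p(B)} \right)}{p(B)} \tag{11}$$

$$= \frac{p(BD) \cdot p(AB) \cdot p(BC)}{p(B) \cdot p(B)}. \tag{12}$$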
The above equation (12) is exactly the same as (2). In other words, the CIL in (6) implies the intrinsic factorization of a BN in terms of numerators and denominators. (In fact, any CIL of any DAG equivalent to D0 implies the same factorization (12).) Based on the above example, one may already sense the nested substitution of factorizations and conjecture that, given a set C of CIs (not necessarily a CIL), if a jpd can be factorized using the CIs in C in terms of numerators and denominators and, at the same time, all denominators can be absorbed, then this jpd defines a BN factorization. The following example refutes this conjecture.

Example 3. Let C = {I(X, ∅, Y), I(X, ZW, Y)}. The CI I(X, ∅, Y) implies the following factorization:

$$p(XY) = p(X) \cdot p(Y), \tag{13}$$

and the CI I(X, ZW, Y) implies the following factorization:

$$p(XYZW) = \frac{p(XZW) \cdot p(ZWY)}{p(ZW)}. \tag{14}$$

Equations (13) and (14) can be combined as shown below:

$$p(XYZW) = \frac{p(XY)}{p(X) \cdot p(Y)} \cdot \frac{p(XZW) \cdot p(ZWY)}{p(ZW)} = \frac{p(XY) \cdot p(XZW) \cdot p(ZWY)}{p(X) \cdot p(Y) \cdot p(ZW)} = p(Y \mid X) \cdot p(X \mid ZW) \cdot p(ZW \mid Y). \tag{15}$$

It can be easily verified that (15) is not a BN factorization.
This example indicates that an input set of CIs may imply a factorization of a jpd in terms of numerators and denominators in which all denominators can be absorbed, so that the jpd is factorized as a product of marginals and conditionals, and yet this factorization is not a BN factorization. In the following section, we will show that the conjecture does hold when the input set of CIs satisfies certain structural constraints.
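One way to see why (15) fails is to look at the conditioning structure of its factors. The Python sketch below is an illustrative check, not the paper's formal definition: it assumes each factor p(head | given) is encoded as a pair of strings of single-letter variables, and it tests only that every variable occurs in exactly one head and that the "given → head" relation is acyclic. The function name `is_bn_like` is hypothetical.

```python
def is_bn_like(factors):
    """factors: list of (head, given) strings, one pair per term p(head | given)."""
    heads = [v for head, _ in factors for v in head]
    if len(heads) != len(set(heads)):
        return False  # some variable appears in more than one head
    # parents[v] = conditioning variables of the factor whose head contains v
    parents = {v: set(given) for head, given in factors for v in head}
    if any(set(given) - set(parents) for _, given in factors):
        return False  # a conditioning variable never appears as a head
    # Repeatedly remove variables all of whose parents were already removed.
    remaining = set(parents)
    while remaining:
        ready = {v for v in remaining if not (parents[v] & remaining)}
        if not ready:
            return False  # every remaining variable waits on another one: a cycle
        remaining -= ready
    return True

# Equation (15): p(Y|X) * p(X|ZW) * p(ZW|Y)
eq15 = [("Y", "X"), ("X", "ZW"), ("ZW", "Y")]
# Equation (1): p(A) * p(B|A) * p(C|B) * p(D|B)
eq1 = [("A", ""), ("B", "A"), ("C", "B"), ("D", "B")]
```

Under this encoding, (15) is rejected because X conditions Y, Y conditions ZW, and ZW conditions X, so no ordering of the variables is compatible with all three factors, whereas (1) is accepted.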
5 A Structural Characterization: Hierarchical Conditional Independence
In this section, we present a structural characterization of CIs which have a perfect map DAG. We begin our discussion with some pertinent definitions. Since an MN (and its associated hypertree H) is equivalent to a set of conflict-free full CIs, namely, CF(H), as mentioned in Sect. 2, we will use the terms MN, hypertree H, and conflict-free full CIs, i.e., CF(H), interchangeably when no confusion arises.

Definition 1. Let H be a hypergraph defined on R = A1A2...Am. The context of H, denoted CT(H), is the set of nodes on which H is defined, i.e., CT(H) = R.

Definition 2. Let H0, H1, ..., Hl be a set of hypertrees. We call Hj a descendant of Hi if CT(Hj) ⊆ CT(Hi). If Hj is a descendant of Hi, then we call Hi an ancestor of Hj. We call Hj a child of Hi if Hj is a descendant of Hi and there does not exist an Hk such that CT(Hj) ⊆ CT(Hk) ⊆ CT(Hi). If Hj is a child of Hi, then we call Hi a parent of Hj.

We are now ready to define the notion of hierarchical CI, which describes the structural constraints mentioned above.

Definition 3. Given a set C of CIs, we can group the CIs in C according to their respective contexts, i.e., C = {C0, C1, ..., Cl}, where Ci ⊆ C and all the CIs in Ci have the same context. This set C is called a hierarchical CI (HCI) if it satisfies the following conditions:
(i) Each Ci is a conflict-free set of full CIs with respect to the context of Ci and thus has a corresponding MN whose associated hypertree is denoted Hi, i.e., CF(Hi) = Ci.
(ii) H = {H0, H1, ..., Hl} is a tree hierarchy of hypertrees such that CT(Hi) ⊆ CT(H0), i = 1, ..., l, where H0 is the root of the hierarchy.
(iii) If Hj is a child of Hi, then there exists a hyperedge hi ∈ Hi such that CT(Hj) ⊆ hi.
(iv) If Hj and Hk are two distinct children of Hi, then CT(Hj) ⊆ h′ and CT(Hk) ⊆ h″, where h′ and h″ are distinct hyperedges of Hi, and i, j, k are distinct.

Since H = {H0, H1, ..., Hl} completely characterizes the notion of HCI, we will use H to denote the HCI in the following if no confusion arises. Our purpose in defining the notion of HCI is to use its structure to facilitate the factorization of a jpd.

Fig. 2. An HCI H

Example 4. Consider the HCI H = {H0 = {ABC, BCD}, H1 = {AB, AC}, H2 = {BD, CD}} shown in Fig. 2. It can be easily verified that H satisfies the above definition of HCI. H0 implies the following factorization:

$$p(ABCD) = \frac{p(ABC) \cdot p(BCD)}{p(BC)}. \tag{16}$$

Similarly, H1 and H2 imply the following factorizations, respectively:

$$p(ABC) = \frac{p(AB) \cdot p(AC)}{p(A)}, \tag{17}$$

$$p(BCD) = \frac{p(BD) \cdot p(CD)}{p(D)}. \tag{18}$$

By substituting (17), (18) for the numerators p(ABC), p(BCD) in (16), respectively, we obtain the following:

$$p(ABCD) = \frac{p(AB) \cdot p(AC) \cdot p(BD) \cdot p(CD)}{p(A) \cdot p(D) \cdot p(BC)}. \tag{19}$$
Fig. 3. An HCI H

Fig. 4. The perfect map DAG of the HCI H in Fig. 3

Note that in (19) the denominator p(BC) can never be absorbed, so (19) can never be turned into a product of conditionals and marginals. Therefore, this HCI cannot have a perfect map DAG. The above example indicates that the structure of an HCI facilitates the factorization of a jpd but does not necessarily define a BN factorization. Let us take a look at another similar example.

Example 5. Consider the HCI H = {H0 = {ABC, BCDE, DEF}, H1 = {AB, AC}, H2 = {BC, CDE}, H3 = {DF, EF}} shown in Fig. 3. H0 implies the following factorization:

$$p(ABCDEF) = \frac{p(ABC) \cdot p(BCDE) \cdot p(DEF)}{p(BC) \cdot p(DE)}. \tag{20}$$

Similarly, H1, H2 and H3 imply the following factorizations, respectively:

$$p(ABC) = \frac{p(AB) \cdot p(AC)}{p(A)}, \tag{21}$$

$$p(BCDE) = \frac{p(BC) \cdot p(CDE)}{p(C)}, \tag{22}$$

$$p(DEF) = \frac{p(DF) \cdot p(EF)}{p(F)}. \tag{23}$$

By substituting (21), (22), (23) for the numerators p(ABC), p(BCDE), p(DEF) in (20), respectively, we obtain the following:

$$p(ABCDEF) = \frac{p(AB) \cdot p(AC) \cdot p(BC) \cdot p(CDE) \cdot p(DF) \cdot p(EF)}{p(A) \cdot p(BC) \cdot p(C) \cdot p(DE) \cdot p(F)}. \tag{24}$$

Unlike (19) in Example 4, in which the denominator p(BC) cannot be absorbed, all the denominators in (24) can be absorbed to yield the following factorization:

$$p(ABCDEF) = \frac{p(AB)}{p(A)} \cdot \frac{p(AC)}{p(C)} \cdot \frac{p(CDE)}{p(DE)} \cdot \frac{p(DF)}{p(F)} \cdot p(EF) \tag{25}$$

$$= p(B \mid A) \cdot p(A \mid C) \cdot p(C \mid DE) \cdot p(D \mid F) \cdot p(EF). \tag{26}$$
It can be easily verified that (26) is a BN factorization. Moreover, the DAG associated with the BN defined by (26), shown in Fig. 4, is a perfect map of the HCI H. This is because the HCI implies the BN factorization in (26), which in turn implies the CIL of the DAG. On the other hand, the CIL of the DAG in Fig. 4 implies the BN factorization in (26), which in turn implies the HCI. Therefore, the HCI is equivalent to the CIL of the DAG. Since we know the DAG is a perfect map of its CIL, it follows that this DAG is also a perfect map of the HCI H.

Remark 1. In contrast to Example 4, the above example indicates that, for an HCI, if we can absorb all denominators after all substitutions, then the resulting factorization, which is a product of marginals and conditionals, is a BN factorization such that the DAG associated with the BN is a perfect map of the HCI.

Consider an MN whose associated hypertree is H = {h1, h2}, with the following MN factorization:

$$p(U) = \frac{p(h_1) \cdot p(h_2)}{p(h_1 \cap h_2)}, \tag{27}$$
where the superscripts in p^1, p^2 are used to indicate that the factorizations are with respect to the hyperedges h1, h2, respectively. Note that we can absorb all denominators in (29) and (31) to obtain (28) and (30), respectively, because p(h1) and p(h2) are BN factorizations. By substituting (29), (31) for the numerators p(h1), p(h2) in (27), respectively, we obtain p(U) =
Theorem 1. Equation p(U) in (32) defines a BN factorization if and only if all the denominators in (32) can be absorbed.

Proof. It is obvious that if p(U) in (32) defines a BN factorization, all the denominators of (32) must have been absorbed. We now prove the “if” part. Without loss of generality, assume the denominator p(h1 ∩ h2) in (32) is absorbed by some numerator p(W), h1 ∩ h2 ⊆ W, in the factorization of p(h2), i.e., p(U) =
We will show that equation (33) is in fact a BN factorization in concordance with a CIL. The BN factorization in (28) has an associated DAG, denoted D_h1. Let CIL1 represent the CIL for a topological ordering of the nodes in D_h1, i.e., h1. If we follow this topological ordering for D_h1, i.e., we follow CIL1 to factorize p(U), we obtain:

$$p(U) = p^1(\cdot \mid \cdot) \cdots p^1(\cdot \mid \cdot) \cdot \frac{1}{p(h_1 \cap h_2)} \cdot \frac{p^2(W)}{1} \cdot \frac{p^2(\cdot) \cdots p^2(\cdot)}{p^2(\cdot) \cdots p^2(\cdot)}. \tag{34}$$

As indicated in Sect. 3, any possible absorption of the denominators yields a BN factorization which belongs to the same equivalence class. Therefore, since the numerator p(W) in (34) was reserved for absorbing p(h1 ∩ h2), the partial expression in (34), i.e.,

$$\frac{p^2(W)}{1} \cdot \frac{p^2(\cdot) \cdots p^2(\cdot)}{p^2(\cdot) \cdots p^2(\cdot)},$$

defines a BN factorization for p(h2), with an associated DAG D_h2 and a CIL, denoted CIL2, corresponding to a topological ordering of the nodes in D_h2, i.e., h2. For any CI

$$I(X_i, pa(X_i), \{X_1, \ldots, X_{i-1}\} - pa(X_i)) \in CIL_2, \tag{35}$$

where Xi ∉ W, coupling it with the CI I(h1 − h2, h1 ∩ h2, h2 − h1) yields, by the SG axioms [6,11], that

$$I(X_i, pa(X_i), h_1). \tag{36}$$

This means that the original CI in (35), which produces p(Xi|pa(Xi)) as a factor in the factorization of p(h2), can also produce the same factor in the factorization of p(U). Thus, equation (34) turns out to be

$$p(U) = p^1(\cdot \mid \cdot) \cdots p^1(\cdot \mid \cdot) \cdot \frac{1}{p(h_1 \cap h_2)} \cdot \frac{p^2(W)}{1} \cdot p^2(\cdot \mid \cdot) \cdots p^2(\cdot \mid \cdot). \tag{37}$$

Lastly, we consider the term (1 / p(h1 ∩ h2)) · (p^2(W) / 1) in the above equation, which yields p(W − h1 ∩ h2 | h1 ∩ h2); this is because I(W − h1 ∩ h2, h1 ∩ h2, h1 − h2). Now, looking at the global picture, we follow the CIs in CIL1, then follow the CI I(W − h1 ∩ h2, h1 ∩ h2, h1 − h2), and finally follow the CIs in CIL2, which can be extended to hold in p(U) as explained in (35), (36). We thus obtain the following BN factorization of p(U):
which is produced by the CIL CIL1 ∪ {I(W − h1 ∩ h2, h1 ∩ h2, h1 − h2)} ∪ {I(Xi, pa(Xi), h1) | Xi ∉ W and Xi ∈ h2}.

Remark 2. Theorem 1 says that if we have a binary MN factorization (as shown in (27)), each of its two numerators can be substituted by a BN factorization (as shown in (28), (30)), and the denominator of the MN factorization can be absorbed, then the MN factorization after all substitutions defines a BN factorization (as shown in (38)). It is worth mentioning that if there does not exist a BN factorization for p(h1) or p(h2), we can treat (28) or (30) as a trivial BN factorization and the proof still holds.

Corollary 1. The DAG associated with (38) is a perfect map of C = CIL1 ∪ {I(h1 − h2, h1 ∩ h2, h2 − h1)} ∪ CIL2.

Proof. It has been shown in the proof of Theorem 1 that the set C of CIs implies the BN factorization in (38). On the other hand, equation (38) implies C, since each step in deriving equation (38) in the proof of Theorem 1 is reversible.

Recall that in (28), (30), the BN factorizations p(h1), p(h2) are defined over the variables in h1, h2, respectively. This can be relaxed by requiring two BN factorizations defined over subsets of h1, h2, i.e.,
Theorem 2. Equation p(U ) in (46) defines a BN factorization if and only if all the denominators in (46) can be absorbed.
Corollary 2. The DAG associated with (46) is a perfect map of C = CIL1 ∪ {I(h1 − h2, h1 ∩ h2, h2 − h1)} ∪ CIL2, where CILi is a CIL for the DAG associated with p(hi), i = 1, 2.

Theorem 2 and Corollary 2 can be proved using the same technique that was employed in the proof of Theorem 1. Theorem 2 and Corollary 2 only deal with a binary MN factorization, but they can be generalized to deal with an arbitrary MN factorization. Consider an MN factorization as follows: p(U) =
where h′i ⊆ hi, be BN factorizations, each of which defines a BN, with associated DAGs D1, ..., Dn and CILs CIL1, ..., CILn, respectively.

Theorem 3. The factorization of p(U) obtained by substituting the above BN factorizations (equation (48)) for the numerators p(h1), ..., p(hn) in (47) is a BN factorization if and only if all denominators in (47) can be absorbed after the substitution. In that case, the DAG associated with the resulting BN factorization is a perfect map of CIL1 ∪ ... ∪ CILn ∪ CF(H).

Remark 3. Theorem 3 says that if we have an MN factorization, each of its numerators can be substituted by a BN factorization (or a trivial BN factorization) and, furthermore, the denominators of the MN factorization can be absorbed after all substitutions, then this MN factorization after the substitutions defines a BN factorization.
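The condition that all denominators can be absorbed after substitution can be phrased as a bipartite matching problem: every denominator p(X) needs its own numerator p(Y) with X ⊆ Y. The Python sketch below is one possible way to test this, using a standard augmenting-path matching; it is not taken from the paper. The function name `can_absorb_all` and the encoding of each term as a string of single-letter variables are assumptions, and the two test inputs are the terms we obtain by writing out the substitutions of Examples 4 and 5.

```python
def can_absorb_all(numerators, denominators):
    """True if each denominator p(X) can be matched to a distinct numerator p(Y), X ⊆ Y."""
    nums = [frozenset(n) for n in numerators]
    dens = [frozenset(d) for d in denominators]
    owner = {}  # numerator index -> index of the denominator currently assigned to it

    def try_assign(d, seen):
        # Augmenting-path step: find a numerator for denominator d, re-assigning if needed.
        for j, num in enumerate(nums):
            if dens[d] <= num and j not in seen:
                seen.add(j)
                if j not in owner or try_assign(owner[j], seen):
                    owner[j] = d
                    return True
        return False

    return all(try_assign(d, set()) for d in range(len(dens)))

# Example 4 after substituting (17), (18) into (16).
example4 = (["AB", "AC", "BD", "CD"], ["A", "D", "BC"])
# Example 5 after substituting the factorizations of H1, H2, H3 into that of H0.
example5 = (["AB", "AC", "BC", "CDE", "DF", "EF"], ["A", "BC", "C", "DE", "F"])
```

As Example 3 shows, passing this test alone does not make an arbitrary product a BN factorization; the theorems above apply it to factorizations obtained from an MN factorization whose numerators are replaced by BN factorizations.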
Recall the definition of HCI. The key feature of an HCI is that it is composed of a tree hierarchy of MNs, i.e., a tree hierarchy of MN factorizations, on which the substitution can be easily done and Theorem 3 can be recursively applied starting from the leaf nodes of the hierarchy. This structural feature of HCI gives rise to the following theorem.

Consider an HCI H = {H0, ..., Hn}, where H0 is the root of H and is defined over the variable set U = CT(H0). H0 defines an MN factorization of p(U), and each Hi in H, i = 1, ..., n, defines an MN factorization of p(CT(Hi)), i.e.,

$$p(U) = \frac{p(\cdot) \cdot p(\cdot) \cdots p(\cdot)}{p(\cdot) \cdots p(\cdot)} = p(\cdot \mid \cdot) \cdot p(\cdot \mid \cdot) \cdots p(\cdot \mid \cdot), \tag{49}$$

$$p(CT(H_i)) = \frac{p(\cdot) \cdots p(\cdot)}{p(\cdot) \cdots p(\cdot)} = p(\cdot \mid \cdot) \cdots p(\cdot \mid \cdot). \tag{50}$$

By substituting the equations in (50) according to the hierarchical structure of H, we obtain: p(U) =
Theorem 4. Equation p(U) in (51) defines a BN factorization if and only if all the denominators in (51) can be absorbed. The DAG associated with this BN factorization is a perfect map of H.

Remark 4. Theorem 4 characterizes a DAG-isomorphic dependency model, namely, an HCI satisfying the condition of Theorem 4.
6 Conclusion
In this paper, by studying the factorization of a jpd, we have suggested a hierarchical characterization of those CIs that can be faithfully represented by a DAG. Our results indicate that an effective and efficient method for testing DAG-isomorphic dependency models can be developed based on Theorem 4. We would like to emphasize that redundant CIs should be removed before applying any testing method; otherwise, one may draw erroneous conclusions. The method for removing redundant CIs is discussed in a separate paper [12].

Acknowledgement. The authors are grateful to Milan Studeny for giving us Example 3.
References
1. C. Berge. Graphs and Hypergraphs. North-Holland, Amsterdam, 1976.
2. E. Castillo, J. Manuel Gutierrez, and A. Hadi. Expert Systems and Probabilistic Network Models. Springer, 1997.
3. D. M. Chickering. A transformational characterization of equivalent Bayesian network structures. In Eleventh Conference on Uncertainty in Artificial Intelligence, pages 87–98. Morgan Kaufmann Publishers, 1995.
4. D. Geiger. The non-axiomatizability of dependencies in directed acyclic graphs. Technical Report R-83, UCLA Cognitive Systems Laboratory, 1987.
5. F. V. Jensen. Junction tree and decomposable hypergraphs. Technical report, JUDEX, Aalborg, Denmark, 1988.
6. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Francisco, California, 1988.
7. J. Pearl and A. Paz. Graphoids: Graph-based logic for reasoning about relevance relations. Technical Report R-53-L, University of California, 1985.
8. G. Shafer. An axiomatic study of computation in hypertrees. School of Business Working Papers 232, University of Kansas, 1991.
9. T. Verma and J. Pearl. Equivalence and synthesis of causal models. In Sixth Conference on Uncertainty in Artificial Intelligence, pages 220–227. GE Corporate Research and Development, 1990.
10. T. Verma and J. Pearl. An algorithm for deciding if a set of observed independencies has a causal explanation. In Eighth Conference on Uncertainty in Artificial Intelligence, pages 323–330, 1992.
11. S. K. M. Wong and C. J. Butz. Constructing the dependency structure of a multi-agent probabilistic network. IEEE Transactions on Knowledge and Data Engineering, 30(6):395–415, 2000.
12. S. K. M. Wong, Tao Lin, and Dan Wu. Construction of a non-redundant cover for conditional independencies. In The Fifteenth Canadian Conference on Artificial Intelligence, 2002.
Construction of a Non-redundant Cover for Conditional Independencies

S. K. M. Wong, Tao Lin, and Dan Wu
Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2
{wong,taolin,danwu}@cs.uregina.ca

Abstract. To design a probabilistic reasoning system, it might be necessary to construct the graphical structure of a probabilistic network from a given set of conditional independencies (CIs). It should be emphasized that certain redundant CIs must be removed before applying construction methods and algorithms. In this paper, we first discuss how to remove redundant CIs from a given set of CIs with the same context, which results in a reduced cover for the input set of CIs, and then we suggest an algorithm to remove redundancy from an arbitrary input set of CIs. The set of CIs resulting from this ‘cleaning’ procedure is unique. Conflicting CIs can then be easily identified and removed, if necessary, so as to construct the desired graphical structure.
1 Introduction
Probabilistic networks [3,6,8] have become an established framework for probabilistic reasoning with uncertain knowledge. A probabilistic network is specified by a graphical structure (a Bayesian directed acyclic graph (DAG) or an acyclic hypergraph) together with the corresponding set of conditional probability tables. One may directly construct the graphical structure of a probabilistic network from the known causal relationships among the domain variables [2]. In many applications, however, it might be necessary to construct the Bayesian DAG or acyclic hypergraph from an input set G of conditional independencies (CIs) in order to faithfully represent the independence information implied by G. A number of methods and algorithms [5,9,11,12] have been proposed for such construction. If the input CIs are over the same context and satisfy the conflict-free condition [5], then an acyclic hypergraph can be constructed. Verma [9] also suggested an algorithm to construct a Bayesian DAG for a given set of CIs, if one exists. It is also reported in [13] that a Bayesian DAG can be characterized under certain conditions. In these construction methods and algorithms, however, it is often implicitly assumed that redundant CIs have already been eliminated from the input set of CIs; otherwise an apparent contradiction may arise. Consider, for instance, a set G of CIs with the same context ABCDE as follows: G = {E ⇒⇒ B | ACD, EAB ⇒⇒ C | D}.

R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 210–220, 2002. © Springer-Verlag Berlin Heidelberg 2002
It seems that an acyclic hypergraph cannot be constructed because G is not conflict-free. This is not the case, however, because G has an equivalent set G′ of CIs satisfying the conflict-free condition, namely: G′ = {E ⇒⇒ B | ACD, EA ⇒⇒ BC | D}. Therefore, it is crucial to remove redundant CIs from the input set of CIs before constructing the desired graphical structure. Removing redundant CIs from a set G normally requires computing its closure explicitly, and the time complexity of computing the closure of G is exponential.

In this paper, we propose a feasible method to remove redundant CIs from the input set of CIs. Our method puts the given CIs into a better form for the construction of the graphical structure. That is, if there does not exist a graphical structure that faithfully represents all input CIs, then we can conveniently decide which CIs have to be eliminated in order to construct a graphical structure that faithfully represents the remaining CIs. There are two key steps in our procedure. First, we remove redundant CIs with the same context, and then we apply the Semi-graphoid inference axioms to further ‘clean’ the remaining CIs. It should be emphasized that our method can be practical because it does not require computing the closure of a given set of CIs with the same context. Another advantage of the proposed method is that, in the resulting set of CIs, we can identify and remove conflicting CIs, if necessary, so as to construct the desired graphical structure.

The paper is organized as follows. The basic concepts are defined in Section 2. In Section 3, we introduce the notion of reduced CIs and describe our algorithm for removing redundant CIs with the same context. In Section 4, we show how the reduced CIs can be further simplified by applying the contraction inference axiom. The conclusion is presented in Section 5.
2 Definitions
Let U be a finite set of variables {A1, A2, ..., Am} and let the domain of each variable Ai be dom(Ai). A probability network P over U is a probability distribution of the form P : dom(A1) × dom(A2) × ... × dom(Am) → [0, 1], where each dom(Ai), i = 1, ..., m, is a finite set. We call the union of all variables in U, denoted by U = A1A2...Am, the universal context of the probability network P. An expression g is called a conditional independence statement (CI), denoted by:

$$g : X \Rightarrow\Rightarrow Y \mid Z, \tag{1}$$

where X ≠ φ, Y ≠ φ and Z ≠ φ are disjoint subsets of U. The union of X, Y and Z, namely XYZ, is called the context of g over the finite set U. The left-hand side X in Expression (1), denoted by LHS(g), is called the key of the CI g.
For a set G of CIs, we denote all keys of G by LHS(G). The CI X ⇒⇒ Y | Z is said to hold for a probability network P if

$$P(Y \mid XZ) = P(Y \mid X). \tag{2}$$
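Condition (2) can be checked numerically on a small joint table. The sketch below is illustrative only: the helper names `marginal` and `ci_holds`, the three binary variables, and the probability values are all assumptions made for the example. To avoid dividing by zero, the check uses the equivalent product form P(XYZ) · P(X) = P(XY) · P(XZ).

```python
from itertools import product

def marginal(joint, names, keep):
    """Sum a joint table {assignment tuple: probability} down to the variables in `keep`."""
    idx = [names.index(v) for v in keep]
    out = {}
    for assignment, prob in joint.items():
        key = tuple(assignment[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

def ci_holds(joint, names, x, y, z, tol=1e-12):
    """Check X =>=> Y | Z, i.e. P(Y|XZ) = P(Y|X), via P(XYZ) P(X) = P(XY) P(XZ)."""
    xyz = x + y + z
    p_xyz = marginal(joint, names, xyz)
    p_x = marginal(joint, names, x)
    p_xy = marginal(joint, names, x + y)
    p_xz = marginal(joint, names, x + z)
    for assignment, prob in p_xyz.items():
        val = dict(zip(xyz, assignment))
        kx = tuple(val[v] for v in x)
        kxy = tuple(val[v] for v in x + y)
        kxz = tuple(val[v] for v in x + z)
        if abs(prob * p_x[kx] - p_xy[kxy] * p_xz[kxz]) > tol:
            return False
    return True

# Hypothetical chain A -> B -> C over binary variables, built as P(a) P(b|a) P(c|b).
names = ["A", "B", "C"]
p_a = {0: 0.3, 1: 0.7}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}
joint = {
    (a, b, c): p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
    for a, b, c in product((0, 1), repeat=3)
}
```

In this table B ⇒⇒ C | A holds by construction, since C depends on A only through B, whereas A ⇒⇒ C | B does not hold for the chosen numbers.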
We also say a probability network P satisfies a set G of CIs if all CIs in G hold in P. In the following discussion, we distinguish two cases: if the context of a CI X ⇒⇒ Y | Z is exactly the universal context U, that is, U = XYZ, then we say this CI is a full CI over the universal context U. Otherwise we say this CI is embedded over the universal context U. We also say an embedded CI X ⇒⇒ Y | Z over U is a full CI under its own context U′ = XYZ. That is, whether a given CI is embedded or full depends on the context of interest.

A set G of CIs logically implies the CI X ⇒⇒ Y | Z, written G |= X ⇒⇒ Y | Z, if X ⇒⇒ Y | Z holds in every probability network that satisfies all CIs in G. That is, X ⇒⇒ Y | Z is logically implied by G if there is no counterexample probability network in which all the CIs in G hold but X ⇒⇒ Y | Z does not. An inference axiom is a rule stating that if certain CIs hold for a probability network P, then P must satisfy certain other CIs. A CI g is derivable from G if it can be obtained from G by applying inference axioms. Given a set G of CIs and a set of inference axioms, the closure of G, written G+, is the smallest set containing every CI derivable from G. A set G′ of CIs is said to be equivalent to G if any CI in one set is derivable from the other. We refer to G′ as a cover of G. There exists a set of inference axioms for CIs, called the Semi-graphoid axioms, as follows:

symmetry: X ⇒⇒ Y | Z if and only if X ⇒⇒ Z | Y;
decomposition: If X ⇒⇒ Y | ZW, then X ⇒⇒ Y | Z and X ⇒⇒ Y | W;
weak union: If X ⇒⇒ Y | ZW, then XW ⇒⇒ Y | Z;
contraction: If X ⇒⇒ Y | Z and XZ ⇒⇒ Y | W, then X ⇒⇒ Y | ZW.

It is known that the Semi-graphoid axioms are sound but not generally complete [14]; thus, applying them may not derive all CIs logically implied by a given set of CIs. However, it is important to employ them in probability networks, because they are complete for a set of CIs that can be faithfully represented by a Bayesian network or a Markov network. Furthermore, the Semi-graphoid axioms imply a set of sound and complete axioms [10] for a set of CIs with the same context Ui, as follows:

reflexivity: If Y ⊆ X, then X ⇒⇒ Y | Ui − XY;
transitivity: If X ⇒⇒ Y | Ui − XY and Y ⇒⇒ Z | Ui − XZ, then X ⇒⇒ Z − Y | Ui − X(Z − Y);
augmentation: If Z ⊆ W and X ⇒⇒ Y | Ui − XY, then WX ⇒⇒ ZY | Ui − WXYZ;
union: If X ⇒⇒ Y | Ui − XY and X ⇒⇒ Z | Ui − XZ, then X ⇒⇒ YZ | Ui − XYZ;
decomposition: If X ⇒⇒ Y | Ui − XY and X ⇒⇒ Z | Ui − XZ, then X ⇒⇒ Y ∩ Z | Ui − X(Y ∩ Z), X ⇒⇒ Y − Z | Ui − X(Y − Z), and X ⇒⇒ Z − Y | Ui − X(Z − Y).
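The four Semi-graphoid axioms can be written as small functions on triples (X, Y, Z) standing for X ⇒⇒ Y | Z. The following Python sketch is an illustration under that assumed encoding, with hypothetical function names; it applies the axioms step by step to the set G = {E ⇒⇒ B | ACD, EAB ⇒⇒ C | D} from the Introduction and derives the CI EA ⇒⇒ BC | D. It shows one direction of the claimed equivalence only and is not the removal algorithm of this paper.

```python
def ci(x, y, z):
    """Encode X =>=> Y | Z as a triple of frozensets of single-letter variables."""
    return (frozenset(x), frozenset(y), frozenset(z))

def symmetry(g):
    """X =>=> Y | Z  gives  X =>=> Z | Y."""
    x, y, z = g
    return (x, z, y)

def decomposition(g, keep):
    """X =>=> Y | ZW  gives  X =>=> Y | Z, where Z = keep."""
    x, y, zw = g
    keep = frozenset(keep)
    if not keep <= zw:
        raise ValueError("keep must be a subset of the right-hand part")
    return (x, y, keep)

def weak_union(g, move):
    """X =>=> Y | ZW  gives  XW =>=> Y | Z, where W = move."""
    x, y, zw = g
    move = frozenset(move)
    if not move <= zw:
        raise ValueError("move must be a subset of the right-hand part")
    return (x | move, y, zw - move)

def contraction(g1, g2):
    """X =>=> Y | Z and XZ =>=> Y | W  give  X =>=> Y | ZW."""
    x, y, z = g1
    xz, y2, w = g2
    if y != y2 or xz != x | z:
        raise ValueError("premises do not match the contraction axiom")
    return (x, y, z | w)

g1 = ci("E", "B", "ACD")
g2 = ci("EAB", "C", "D")

step1 = weak_union(g1, "A")          # EA =>=> B | CD
step2 = decomposition(step1, "D")    # EA =>=> B | D
step3 = symmetry(step2)              # EA =>=> D | B
step4 = symmetry(g2)                 # EAB =>=> D | C
step5 = contraction(step3, step4)    # EA =>=> D | BC
derived = symmetry(step5)            # EA =>=> BC | D
```

The result `derived` equals `ci("EA", "BC", "D")`, the second CI of the conflict-free set mentioned in the Introduction.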
In this paper, we only discuss sets of CIs that are complete under the Semi-graphoid axioms. A dependency basis [5] is used to summarize a set G_Ui of full CIs over the context Ui that all have the same key. Given a set G_Ui of CIs with the same context Ui and X ⊆ Ui, the dependency basis of X, written Dep(X), is defined as follows: Dep(X) = {W1, W2, ..., Wm}, where X ⇒⇒ Wi | Ui − XWi is derivable from G_Ui, X ∩ Wi = φ for i = 1, 2, ..., m, and {X, W1, W2, ..., Wm} forms a partition of Ui. The usefulness of Dep(X) lies in the fact that any variable set Y such that X ⇒⇒ Y | Ui − XY is derivable from G_Ui is a union of some elements of Dep(X). On the other hand, for each nonempty proper subset W of Wi (1 ≤ i ≤ m), the CI X ⇒⇒ W | Ui − XW is not in G⁺_Ui.
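The union property of the dependency basis gives a simple membership test once Dep(X) is known. The Python sketch below assumes Dep(X) has already been computed by some other means; the function name `is_union_of_blocks` and the example basis for X = {A} over the context ABCDE are hypothetical.

```python
def is_union_of_blocks(dep_basis, y):
    """True if the nonempty variable set y is exactly a union of some blocks of Dep(X)."""
    y = frozenset(y)
    inside = [frozenset(w) for w in dep_basis if frozenset(w) <= y]
    covered = frozenset().union(*inside)
    return bool(y) and covered == y

# Hypothetical dependency basis Dep(A) = {B, CD, E} over the context ABCDE.
dep_a = ["B", "CD", "E"]
```

With this basis, `is_union_of_blocks(dep_a, "BE")` is true, so A ⇒⇒ BE would be derivable, while `is_union_of_blocks(dep_a, "BC")` is false because C is a proper part of the block CD.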
3 Reduced CIs with the Same Context
In this section, we show that for any set GUi of full CIs with the same context Ui, there exists a unique set of reduced CIs, written G−Ui, that is a cover of GUi. An algorithm is suggested to yield the reduced cover of GUi without computing its closure explicitly. For any CI X ⇒⇒ W | Ui − XW in this section, since the context is known and fixed as Ui, we may omit the part Ui − XW to the right of “|” for simplicity.
Given a set GUi of full CIs over the context Ui, G+Ui is the closure of GUi derived by the complete axioms over the context Ui. A full CI X ⇒⇒ W in G+Ui is said to be
(1) trivial: if there is an X′ ⊂ X such that X′ ⇒⇒ W ∈ G+Ui;
(2) left-reducible: if there is a W′ ⊂ W such that X ⇒⇒ W′ ∈ G+Ui is non-trivial;
(3) right-reducible: if there is a W′ ⊂ W such that X ⇒⇒ W′ ∈ G+Ui is non-trivial;
(4) transferable: if there is an X′ ⊂ X such that X′ ⇒⇒ (X − X′)W ∈ G+Ui.
A full CI X ⇒⇒ W is said to be reduced if it is non-trivial, left-reduced, right-reduced and non-transferable. We refer to X as the r-key of this reduced CI. The set G−Ui of reduced CIs of GUi is defined by:
G−Ui = {X ⇒⇒ W | X ⇒⇒ W is a reduced CI in G+Ui}.
By definition, it is obvious that GUi implies G−Ui. Although it is not true that if g ∈ G−Ui then g ∈ GUi, the following proposition shows that G−Ui implies all CIs in GUi. It follows that G−Ui is a cover of GUi. We refer to G−Ui as the reduced cover of GUi.
Proposition 1 G−Ui |= GUi.
Proof: See Appendix A.
The following proposition [7] suggests an efficient method to identify all r-keys in GUi, including those not in LHS(GUi), based on the split property. We say X splits Y if and only if there are V1, V2 ∈ DEP(X) such that V1 ∩ Y ≠ φ and V2 ∩ Y ≠ φ. Y is split by a set G of CIs if there is at least one X in LHS(G) such that X splits Y.
Proposition 2 Consider a set G of full CIs over the context Ui. If there is an r-key Z not in LHS(G), then there must be an r-key X in LHS(G) such that X ⊂ Z. Besides, for a reduced CI X ⇒⇒ V, there is Y ∈ LHS(G) which splits V, and Z = X(Y ∩ V).
Proposition 2 paves the way to compute the reduced cover for a set GUi of full CIs over the context Ui. At first we take LHS(G) as the candidate set C of r-keys. If an element Y ∈ LHS(G) splits a dependent V of another element X in C, then X(Y ∩ V) is added to the candidate set C. We repeat this step until no more elements can be added. Since the number of elements of LHS(G−) is finite, this step always terminates. At that point, all r-keys are in C. What remains is to remove from C those elements that are not r-keys of GUi. Consider a dependent V ∈ DEP(X) for any X in the candidate set C; the CI g : X ⇒⇒ V is already right-reduced. If there exists an X′ ⊂ X in C that has the same dependent V, then by definition g is left-reducible, and we may delete V from DEP(X). After that, if DEP(X) is empty, then X is not an r-key. Lemma 3 in Appendix B shows that X is not an r-key if g is transferable. By checking all elements of C through these two steps, we obtain all r-keys of GUi. An algorithm to compute the reduced cover for a set of full CIs over the context Ui is given as follows:
Algorithm 1 To find the reduced cover for a set of full CIs.
Input: A set GUi of full CIs over the context Ui.
Output: G−Ui.
Step 1: Let C = LHS(G);
Step 2: For each element X of C, calculate DEP(X);
Step 3: For each pair of elements X, Y of C, if there is V ∈ DEP(X) that is split by Y, then add Z = X(Y ∩ V) to C; otherwise, go to the next step;
Step 4: For each element X of C and W ∈ DEP(X), if there is X′ ⊂ X in C such that HW ∈ DEP(X′), where H ⊂ (X − X′), then X is removed from C;
Step 5: For each element X of C and W ∈ DEP(X), if there is X′ ⊂ X in C such that W ∈ DEP(X′), then we delete W from DEP(X); if DEP(X) is empty, then X is removed from C;
Step 6: Output C and their associated remaining dependents.
The time complexity of Algorithm 1 is dominated by its Step 2 and Step 3. Let n be the number of all keys in GUi, m be the number of CIs in GUi, and u be
the number of variables in the universal context Ui. From [4], the time required for the computation of the dependency basis of all variables in U is bounded by O(m · u · log u). The number of variables in C, namely c, is less than m · n · u. Therefore, the total time required for the computation of the dependency basis for each element in C is bounded by O(n · u² · m² · log u). It follows that the time required by Algorithm 1, in the worst case, is O(n · u² · m² · log u). This is much more efficient than computing the closure of GUi, which is exponential in the number of input CIs.
Example 1 Consider a set GUi of full CIs over the context Ui = {ABCDEF} as follows:
GUi = {A ⇒⇒ E, B ⇒⇒ F, EF ⇒⇒ C, EF ⇒⇒ D}.
In the beginning, C = {A, B, EF}. We compute their dependency bases, which yields DEP(A) = {E, BCDF}, DEP(B) = {F, ACDE}, DEP(EF) = {C, D, AB}. For X = A, there is a CI B ⇒⇒ F ∈ G that splits BCDF ∈ DEP(A). Then Z = A(B ∩ BCDF) = AB is added to C. We also compute DEP(AB) = {C, D, E, F}. There exists another CI EF ⇒⇒ C ∈ G that splits BCDF ∈ DEP(A). Then Z = A(EF ∩ BCDF) = AF is put into C, and DEP(AF) = {B, C, D, E}. For X = B, A ⇒⇒ E ∈ G splits ACDE ∈ DEP(B). Then Z = B(A ∩ ACDE) = AB. Since AB is already in C, it is not necessary to add it again. There also exists another CI EF ⇒⇒ C ∈ G that splits ACDE ∈ DEP(B). Then Z = B(EF ∩ ACDE) = BE is put into C, and DEP(BE) = {A, C, D, F}. After that, no more candidates can be put into C. Through Steps 4 and 5 of the algorithm, we get the minimal cover G∗ of G as follows:
G∗Ui = {A ⇒⇒ E, B ⇒⇒ F, EF ⇒⇒ C, EF ⇒⇒ D, EF ⇒⇒ AB, AF ⇒⇒ B, AF ⇒⇒ C, AF ⇒⇒ D, BE ⇒⇒ A, BE ⇒⇒ C, BE ⇒⇒ D, AB ⇒⇒ C, AB ⇒⇒ D}.
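The steps of Algorithm 1 can be traced on Example 1 with a short sketch. Several points are assumptions of this sketch rather than statements of the paper: DEP is computed by plain block refinement instead of the algorithm of [4]; in Step 3 the splitting element Y ranges over the whole candidate set C; the set H in Step 4 is taken to be nonempty; and a CI X ⇒⇒ W is identified with X ⇒⇒ Ui − XW by symmetry, so each result is stored as a key together with an unordered pair of sets. The names `dep` and `reduced_cover` are hypothetical.

```python
def dep(x, cis, u):
    """Dependency basis of x by block refinement (assumed procedure, not that of [4])."""
    blocks = {u - x} - {frozenset()}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in cis:
            for b in list(blocks):
                part = b & rhs
                if not (b & lhs) and part and part != b:
                    blocks.remove(b)
                    blocks |= {part, b - rhs}
                    changed = True
    return blocks


def reduced_cover(cis, universe):
    u = frozenset(universe)
    g = [(frozenset(l), frozenset(r)) for l, r in cis]
    # Steps 1-2: candidates C = LHS(G), each with its dependency basis.
    basis = {l: dep(l, g, u) for l, _ in g}
    # Step 3: add Z = X(Y ∩ V) whenever Y splits a dependent V of X.
    grew = True
    while grew:
        grew = False
        for x in list(basis):
            for y in list(basis):
                for v in list(basis[x]):
                    if sum(1 for b in basis[y] if b & v) >= 2:
                        z = x | (y & v)
                        if z not in basis:
                            basis[z] = dep(z, g, u)
                            grew = True

    # Step 4: drop X if some HW is a dependent of a smaller candidate (H nonempty, assumed).
    def transferable(x):
        return any(w < blk and (blk - w) <= (x - xs)
                   for xs in basis if xs < x
                   for blk in basis[xs]
                   for w in basis[x])

    keys = [x for x in basis if not transferable(x)]
    # Step 5: drop dependents already owned by a smaller candidate; skip trivial CIs.
    cover = set()
    for x in keys:
        for w in basis[x]:
            owned = any(xs < x and w in basis[xs] for xs in keys)
            if not owned and (u - x - w):
                cover.add((x, frozenset({w, u - x - w})))
    return cover


U = "ABCDEF"
G = [("A", "E"), ("B", "F"), ("EF", "C"), ("EF", "D")]
result = reduced_cover(G, U)
```

Under these assumptions, `result` contains thirteen entries, one for each CI listed in Example 1. For instance, the entry for the key A pairs E with BCDF, which covers both A ⇒⇒ E and its symmetric form.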
For a set GUi of full CIs over the context Ui, if we want to determine whether or not GUi can be faithfully represented by an acyclic hypergraph, we may use Algorithm 1 to compute its reduced cover G−Ui directly from GUi. Its minimal cover [7] can then be obtained from G−Ui, where a cover of GUi, denoted by G∗Ui, is said to be minimal if (1) each CI in G∗Ui is a reduced CI; (2) no proper subset of G∗Ui is a cover of G∗Ui. Since GUi has a conflict-free cover if and only if its minimal cover G∗Ui is conflict-free, we may directly apply the conflict-free condition to test G∗Ui and draw the same conclusion on GUi.
4 Non-redundant Cover for Arbitrary Set of CIs
So far we have only discussed removing redundant full CIs over the same context Ui. In most cases of constructing a graphical structure, we have to deal with an arbitrary set G of CIs over different contexts Ui, 1 ≤ i ≤ k, with respect to the universal context U such that U = U1 ∪ U2 ∪ ... ∪ Uk. In this section, we discuss how to remove redundant CIs from an arbitrary set of CIs by using the Semi-graphoid axioms. The resulting set is the unique non-redundant cover for the given set of CIs and can be used to characterize whether those CIs can be faithfully represented by a Bayesian DAG. Consider a set of CIs over the universal context U. These CIs can be partitioned according to their contexts so that all CIs in each subset, namely GUi, are full CIs over the context Ui. That is, G = GU1 ∪ GU2 ∪ ... ∪ GUk, 1 ≤ i ≤ k.
Proposition 3 Consider an arbitrary set G of CIs. Let G− = G−U1 ∪ G−U2 ∪ ... ∪ G−Uk. Then G− is a cover of G.
Proof: By induction. Consider any CI g over the context Ui that is derived from GUi without applying the contraction axiom. By Proposition 1, G−Ui |= g. Assume G− |= g′ for any embedded CI g′ ∈ G+ over a context Uj ⊂ Ui. Consider a CI g ∈ G+ over the context Ui that is derived by applying contraction to g1 ∈ G+Uk and g2 ∈ G+Ui, where Uk ⊂ Ui. By assumption, G− |= g1. If g2 is derived from G−Ui without applying the contraction axiom, then we are done. Otherwise there must be another CI g3 over the context Ui which implies g2 by applying the contraction axiom, with LHS(g2) ⊂ LHS(g3). If g3 is also derived by applying the contraction axiom, then following this procedure we eventually find a gj that is derived from GUi without applying the contraction axiom, since LHS(gi−1) ⊂ LHS(gi) and the number of attributes over the context Ui is finite. □
Although we can obtain the reduced cover for each subset of CIs by Algorithm 1, a new CI can be derived by applying the contraction axiom to two reduced CIs over the contexts Ui and Uj respectively. It is possible that this new CI makes some original CIs no longer reduced. For instance, consider the two CIs {X ⇒⇒ Y | Z, XY ⇒⇒ Z | W}. Each of them is a reduced CI over its own context, {XYZ} and {XYZW} respectively. After the contraction axiom is applied, a new CI X ⇒⇒ YW | Z is derived, which makes the two original CIs reducible. A similar situation can arise when the projection axiom is applied to reduced CIs over different contexts. A reduced CI X ⇒⇒ Y | Z over the context {XYZ} could be derived from another reduced CI X ⇒⇒ Y′ | Z′ over the context {XY′Z′}, where YZ ⊂ Y′Z′, through the projection axiom, since X ⇒⇒ (Y′ ∩ YZ) | (Z′ ∩ YZ) yields X ⇒⇒ Y | Z. Therefore, we have to check
such interaction among CIs over different contexts by applying the contraction and projection axioms. Consider an arbitrary set G of CIs over the universal context U. For a CI g : X ⇒⇒ Y | Z over the context Ui = {XYZ}, if there exists another non-trivial CI g′ : X ⇒⇒ YW | ZV over the context Uj = {XYZWV}, with W ≠ φ or V ≠ φ, which is derivable from G by applying the Semi-graphoid axioms, then we say g′ is an extension of g in G, and g is the projection of g′ in G under the context Ui. We say g′ is the maximal extension of g in G if there is no extension of g′ in G. The following algorithm is introduced to remove redundant CIs in G. It can be applied to ‘clean’ an arbitrary set G of CIs before testing whether or not G can be faithfully represented by a Bayesian DAG.
Algorithm 2 To remove redundant CIs.
Input: An arbitrary set G of CIs.
Output: A set G of CIs without redundancy.
Step 1: Partition the CIs in G according to their contexts, that is, G = GU1 ∪ GU2 ∪ ... ∪ GUn;
Step 2: For each subset GUi, apply Algorithm 1 to obtain G−Ui, and let G = G−U1 ∪ G−U2 ∪ ... ∪ G−Un;
Step 3: Replace every g in G by its maximal extension;
Step 4: Go to Step 1 until no more changes can be made to G;
Step 5: Remove g from G if (G − {g}) |= g;
Step 6: Output G.
Example 2 Let G be a set of CIs as follows:
G = {A ⇒⇒ B | C, B ⇒⇒ D | CE, C ⇒⇒ E | BD, BC ⇒⇒ A | DEF, DE ⇒⇒ F | ABC}. (3)
After we group the CIs according to their contexts, it follows that G = GABC ∪ GBCDE ∪ GABCDEF, where
GABC = {A ⇒⇒ B | C},
GBCDE = {B ⇒⇒ D | CE, C ⇒⇒ E | BD},
GABCDEF = {DE ⇒⇒ F | ABC, BC ⇒⇒ A | DEF}. (4)
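The grouping in Step 1 of Algorithm 2 can be sketched as follows, using the CIs of Example 2. The triple encoding `(X, Y, Z)` for X ⇒⇒ Y | Z and the function names are assumptions made for this illustration.

```python
from collections import defaultdict


def context(ci):
    """Context of X ⇒⇒ Y | Z, i.e. the set of all variables XYZ it mentions."""
    x, y, z = ci
    return frozenset(x) | frozenset(y) | frozenset(z)


def partition_by_context(cis):
    """Group CIs so that every group holds full CIs over one common context."""
    groups = defaultdict(list)
    for ci in cis:
        groups[context(ci)].append(ci)
    return dict(groups)


# The set G of Example 2, each CI written as (X, Y, Z).
G = [("A", "B", "C"), ("B", "D", "CE"), ("C", "E", "BD"),
     ("BC", "A", "DEF"), ("DE", "F", "ABC")]
groups = partition_by_context(G)
```

For Example 2 this yields three groups, keyed by the contexts ABC, BCDE and ABCDEF.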
The CIs in each subset satisfy the conflict-free condition, so each subset can be faithfully represented by an individual acyclic hypergraph as shown in Figure 1(a). After we extend the CIs in GABC and GBCDE, we obtain another set of CIs as follows: GABCDE = {A ⇒⇒ BD | CE, B ⇒⇒ D | ACE, C ⇒⇒ E | ABD}.
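The relation between a CI and its extension can be checked syntactically, as the following sketch shows for the extended CIs above. It only tests the shape required by the definition (same key, both right-hand parts grow, context strictly larger); whether the extension is actually derivable from G by the Semi-graphoid axioms is not checked here. The function names and the triple encoding are hypothetical.

```python
def parts(ci):
    """Split a CI written as (X, Y, Z) into three frozensets."""
    x, y, z = ci
    return frozenset(x), frozenset(y), frozenset(z)


def project(ci, ctx):
    """Projection of X ⇒⇒ Y′ | Z′ onto a smaller context containing the key X."""
    x, y, z = parts(ci)
    ctx = frozenset(ctx)
    if not x <= ctx:
        raise ValueError("the key must lie inside the target context")
    return (x, y & ctx, z & ctx)


def has_extension_shape(big, small):
    """True if big = X ⇒⇒ YW | ZV and small = X ⇒⇒ Y | Z with W or V nonempty."""
    bx, by, bz = parts(big)
    sx, sy, sz = parts(small)
    return bx == sx and sy <= by and sz <= bz and (by | bz) != (sy | sz)


extended = ("A", "BD", "CE")           # A ⇒⇒ BD | CE over ABCDE
original = ("A", "B", "C")             # A ⇒⇒ B | C over ABC
projected = project(extended, "ABC")
```

Projecting A ⇒⇒ BD | CE onto the context ABC gives back A ⇒⇒ B | C, and likewise B ⇒⇒ D | ACE projects onto BCDE as B ⇒⇒ D | CE.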
Fig. 1. A collection of acyclic hypergraphs faithfully representing a set G of CIs. Part (a) is the original representation for G and Part (b) is the result of applying Algorithm 2 to remove redundant CIs in G
Hence, GABC and GBCDE in G can be replaced equivalently to yield G′ = GABCDE ∪ GABCDEF, that is,
G′ = {A ⇒⇒ BD | CE, B ⇒⇒ D | ACE, C ⇒⇒ E | ABD, BC ⇒⇒ A | DEF, DE ⇒⇒ F | ABC}.
We then check each CI in G′ one by one. Only one CI in G′, namely BC ⇒⇒ A | DEF, is found to be redundant. After this CI is removed from G′, we obtain the reduced cover G″ of G as follows:
G″ = {A ⇒⇒ BD | CE, B ⇒⇒ D | ACE, C ⇒⇒ E | ABD, DE ⇒⇒ F | ABC}.
G″ can be represented by the collection of acyclic hypergraphs shown in Figure 1(b).
It is more complicated to remove redundant CIs from an arbitrary set of CIs than from a set of full CIs with the same context, because of the interaction of CIs over different contexts through the contraction and projection axioms. In this section, we suggested an algorithm to remove redundant CIs from such a set G of CIs. It results in a unique cover which can be used to test whether or not G can be faithfully represented by a Bayesian DAG through a certain condition. Furthermore, the conflicting CIs, that is, those CIs that cannot be represented simultaneously in the same Bayesian DAG, can be identified and removed from this cover, if necessary, for the construction of a Bayesian DAG, as reported in [12,13].
5 Conclusion
In this paper, we emphasize the fact that one must first remove some redundant CIs before constructing the graphical structure of a probabilistic network from a
set of CIs. A method is suggested to remove such redundant CIs from a given set of CIs using the Semi-graphoid axioms. The resulting set of CIs is a unique cover of the input CIs. We also demonstrated that the proposed method for the removal of redundant CIs is practical and feasible, because it is not necessary to compute the closure of the input set of CIs under the Semi-graphoid axioms.
A Proof of Proposition 1
Two lemmas are needed before we prove Proposition 1.
Lemma 1 Given a set G of CIs, let I(W, X, U − XW) be a non-trivial, left- and right-reduced full CI. Then there is X′ ⊆ X such that X′ ⇒⇒ (X − X′)W is reduced.
Proof: Since X ⇒⇒ W is non-trivial, X′ ⇒⇒ (X − X′)W is non-trivial as well. If X′ ⇒⇒ (X − X′)W is left-reducible, then by definition there is an X″ ⊂ X′ such that X″ ⇒⇒ (X − X′)W, which implies X″(X − X′) ⇒⇒ W. This contradicts the assumption that X ⇒⇒ W is left-reduced. If X′ ⇒⇒ (X − X′)W is right-reducible, by definition we have X′ ⇒⇒ X1W1 and X′ ⇒⇒ X2W2, where X1X2 = X − X′ and W1W2 = W. It follows that X′X1 ⇒⇒ W1 and X′X2 ⇒⇒ W2. Therefore, X ⇒⇒ W1 and X ⇒⇒ W2. Since X ⇒⇒ W is reduced, either W1 or W2 must be W. Then X ⇒⇒ W is left-reducible because Xi ≠ φ for i = 1, 2. This is a contradiction. □
Lemma 2 Given a set G of CIs. If Z ⇒⇒ W is non-trivial and right-reduced, then there is a reduced CI X ⇒⇒ HW in G−, such that X ⊆ Z, H ⊆ Z − X and X ⇒⇒ HW |= Z ⇒⇒ W.
Proof: If Z ⇒⇒ W is left-reducible but right-reduced, then by definition there exists a left-reduced and right-reduced Z′ ⇒⇒ W in G+ such that Z′ ⊂ Z and Z′ ⇒⇒ W |= Z ⇒⇒ W. If Z′ ⇒⇒ W is non-transferable, we are done. Otherwise, by Lemma 1 there is an X′ ⊂ Z′ such that X′ ⇒⇒ (Z′ − X′)W is reduced, and X′ ⇒⇒ (Z′ − X′)W |= Z′ ⇒⇒ W |= Z ⇒⇒ W. □
Proof of Proposition 1: Consider a non-trivial CI X ⇒⇒ W in G+. If it is right-reduced, then by Lemma 2 there is X′ ⇒⇒ W′ in G− such that X′ ⇒⇒ W′ |= X ⇒⇒ W. If it is right-reducible, then by definition there are X ⇒⇒ W1 and X ⇒⇒ W − W1 in G+ such that X ⇒⇒ W1 and X ⇒⇒ W − W1 imply X ⇒⇒ W, where W1 ⊂ W. If they are right-reduced, then we are done. Otherwise we may continue the same process until we find a right-reduced CI X ⇒⇒ Wi such that X ⇒⇒ Wi |= X ⇒⇒ Wi−1 or X ⇒⇒ Wi |= X ⇒⇒ Wi−2 − Wi−1. □
B Transferable CI and R-Key
Lemma 3 Given a set G of CIs. For any W ∈ DEP(Z), if there is a reduced X ⇒⇒ HW, where X ⊂ Z and H ⊆ Z − X, then Z is not an r-key.
Proof: Assume not. Let Z − X = HH′, that is, Z = XHH′. It follows that X ⇒⇒ HW |= XH′ ⇒⇒ HW |= XH′ ⇒⇒ ZW. For any other V ∈ DEP(Z), V ≠ W, Z ⇒⇒ V |= ZW ⇒⇒ V. By transitivity, we get XH′ ⇒⇒ (V − ZW), that is, XH′ ⇒⇒ V. This means Z ⇒⇒ V is left-reducible. From XH′ ⇒⇒ HW, we get that Z ⇒⇒ W is transferable, which contradicts Z being an r-key. □
References
1. Beeri, C., Fagin, R., Maier, D., and Yannakakis, M.: On the desirability of acyclic database schemes. Journal of the ACM, vol. 30, no. 3, pp. 479-513, July 1983.
2. Heckerman, D., Geiger, D., and Chickering, D. M.: Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20:197-243, 1995.
3. Jensen, F. V.: An introduction to Bayesian networks. UCL Press, 1996.
4. Galil, Z.: An almost linear-time algorithm for computing a dependency basis in a relational database. Journal of the ACM, vol. 29, no. 1, pp. 96-102, 1982.
5. Lien, Y. E.: On the Equivalence of Database Models. Journal of the ACM, vol. 29, no. 2, pp. 333-362, 1982.
6. Neapolitan, R. E.: Probabilistic Reasoning in Expert Systems. John Wiley & Sons, Inc., 1989.
7. Ozsoyoglu, Z. M. and Yuan, L. Y.: Reduced MVDs and minimal covers. ACM Transactions on Database Systems, vol. 12, no. 3, pp. 377-394, September 1987.
8. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Francisco, 1988.
9. Verma, T. S. and Pearl, J.: Equivalence and Synthesis of Causal Models. In Sixth Conference on Uncertainty in Artificial Intelligence, pp. 220-227. Morgan Kaufmann Publishers, 1990.
10. Wong, S. K. M., Butz, C. J., and Wu, D.: On the Implication Problem for Probabilistic Conditional Independency. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 30, no. 6, pp. 785-805, November 2000.
11. Wong, S. K. M., Butz, C. J., and Xiang, Y.: A method for implementing a probabilistic model as a relational database. In Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 556-564. Morgan Kaufmann Publishers, 1995.
12. Wong, S. K. M., Lin, T., and Wu, D.: Construction of a Bayesian DAG from Conditional Independencies. In the Seventh International Symposium on Artificial Intelligence and Mathematics, 2002.
13. Wong, S. K. M., Wu, D., and Lin, T.: A Structural Characterization of DAG-Isomorphic Dependency Models. To appear in The Fifteenth Canadian Conference on Artificial Intelligence, 2002.
14. Wong, S. K. M. and Wang, Z. W.: On axiomatization of probabilistic conditional independencies. In Tenth Conference on Uncertainty in Artificial Intelligence, pp. 591-597. Morgan Kaufmann Publishers, 1994.
Using Inter-agent Trust Relationships for Efficient Coalition Formation
Silvia Breban and Julita Vassileva
University of Saskatchewan, Computer Science Department, Saskatoon, Saskatchewan S7N 5A9, Canada
{svb308,jiv}@cs.usask.ca
Abstract. We address long-term coalitions that are formed of both customer and vendor agents after evaluating their trust relationships with other agents in the system. We present a coalition formation mechanism designed at agent level and analyzed at both system and agent levels. Evaluation has been conducted to compare agent strategies (individual vs. social) and to analyze the system behavior under different circumstances. Our results show that the coalition formation mechanism is beneficial for both the system (it reaches an equilibrium state) and for the agents (their gains increase exponentially in time).
1 Introduction
Coalition formation in multi-agent systems has been seen in game theory [1] and distributed AI [2] as the mechanism of grouping agents that agree to cooperate to execute a task or achieve a goal. The goal can be common to all agents in the group, in the case of group or social rationality, or it can be specific to each agent, in the case of individual rationality. Recent research brings the coalition formation process into the electronic marketplace as a mechanism of grouping customer agents with the intent of getting desired discounts from the vendor agents in large-size transactions. In this context the term coalition means a group of self-interested agents (with no social or group rationality requirements) that are better off as parts of the group than by themselves. Coalition formation mechanisms have been shown to be beneficial for both customer and vendor agents by several studies [3, 4]. The already existing Internet communities like newsgroups, chat-rooms, and virtual cities illustrate the potential for creating large-scale economies among like-minded customers, which can be explained by the high value (or usefulness) of networks that allow group formation. Such groups are known as Group-Forming Networks (GFN). In general, the value of a network is defined [5] as the sum of different access points (users) that can be connected for a transaction for any particular access point (user) when the need arises. There are three categories of values that networks can provide: a linear value, a square value, and an exponential value. The linear value is derived from Sarnoff’s Law, which specifies that the power of a broadcasting network increases linearly in proportion to the number of its users. It characterizes networks that provide services to individual users like TV programs or news sites. The square value is derived from Metcalfe’s Law, which states that the value of a peer-to-peer network equals the square of the number of its users. It is applicable to networks that facilitate transactions such as commercial sites or telephony systems and it has been used by economists as an explanation for the fast growth of the Internet. Reed [5] finds that networks that allow group affiliation are even more powerful. According to the GFN Law that he promotes, the value of such networks grows exponentially with the number of users. As a consequence, networks that allow group formation among their components (users or agents) are expected to bring the highest benefit in the future.
R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 221-236, 2002. Springer-Verlag Berlin Heidelberg 2002
In general, coalitions present a loose organizational structure with an informal contract among agents. While in formal contracts there is an implicit formal trust in the structure and the regulations of the system that needs no explicit specification, in the context of informal contracts each agent in the group should be able to trust the other agents explicitly. No prior work addresses the issue of trust among members of the same coalition. Therefore, we propose a new approach to forming coalitions that have a long lifetime and are based on trust among agents. Our primary goal is to provide a coalition formation mechanism that takes into consideration the long-term utilities of individual agents, their preferences, and the trust they have in their partners. Secondly, the proposed mechanism is designed to accommodate large numbers of agents (thousands and millions) through minimized communication between agents and reduced complexity. Finally, for reasons of system stability and predictability our approach has two more objectives: to reduce the dynamics of the agents and to stabilize the number of coalitions in the system.
The remainder of this paper is structured as follows: the next section describes the concept of long-term coalitions that are based on trust relationships between agents.
It also presents the existing formal model that we use for representing trust at the agent level. Section 3 shows how this model is integrated in the coalition formation mechanism. Detailed description of the mechanism from an agent’s perspective is provided. Section 4 presents the evaluation we conducted at both system and agent levels. We draw conclusions and present future directions in the last section.
2 Long-Term Coalitions Formed on Trust Relationships
We address a multi-agent system composed of selfish agents that have fixed roles of either customer or vendor and trade books on the Internet. Agents can form coalitions with the intent of increasing their individual benefits. This improves their cooperation and coordination and, as a result, the efficiency of the trading system. We make use of several assumptions in our design:
• Agents have individual rationality and no social or group requirements. They try at each moment to maximize their individual long-term utility function.
• Agents have a long lifetime of repeated interactions with other agents in the system and no interdiction to interact with agents outside of their coalition.
• An agent can join or leave a coalition at any moment. This is a direct consequence of the realistic assumption that an agent’s preferences can change over time.
• Agents may have different interests in books being traded (e.g. science fiction, romance, history). They may belong to different economic categories (e.g. a customer agent that can afford to buy only books between $20 and $60 and a vendor that would not sell for less than $100 belong to different categories).
• A coalition is automatically created when an agent wants to form a new coalition with another agent; it dissolves when it is composed of only two agents and one of them leaves the coalition.
• Coalitions are formed of both vendors and customers.
• Coalitions have a long lifetime once they are created.
• Agents in the same coalition agree with a specific discount for each transaction executed.
• Agents may have to pay a penalty (cost) for leaving a coalition.
• Agents prefer to be part of the coalition in which they expect to have most future transactions.
• Membership to coalitions is global knowledge.
• Coalitions are disjoint.
The main reason for forming coalitions is to bring agents with compatible preferences closer, by nurturing vendor-customer relationships. The concept is similar to the established practice in real-life markets like Safeway™ or Sears™ that give a minimal discount to members of their clubs. For customers, the member card, and belonging to the members’ club, represents an appreciation of the trustworthiness of the store as well as a preference for dealing with that particular merchant, based on positive previous trades and satisfaction with goods purchased, their quality and price. The vendor agrees to give a specific discount to its club members as a sign of appreciation of their long-term relationship. Such a policy is motivated by the fact that establishing a friendly and trustworthy relationship with clients promises vendors more transactions in the long run and retention of customers. It is known in the field of Economics as Customer Relationship Management (CRM) [6]. CRM is also called relationship marketing or customer management. It is concerned with the creation, development, and enhancement of individualized customer relationships with carefully targeted customers and customer groups that result in maximizing their total customer lifetime value. The wide adoption of CRM in real-life market systems is being fuelled by a recognition that long-term relationships with customers lead to improved customer retention and profitability. We use it as the basis for motivating agents and their users to enter long-term coalitions. A vendor agent enters a coalition to increase its sales. It prefers to be part of the same coalition as the customer agents with whom it has most transactions and it agrees to a certain discount for each transaction inside its coalition. A customer agent knows that being part of the same coalition as some vendor agent will bring it discounts from that vendor in future transactions.
As a consequence it prefers to belong to the same coalition as the vendor agents with whom it has had most transactions because this promises to bring it most discounts and increased profit. We restrict agents to being part of one coalition at a time in order to reduce the complexity of the mechanism. In the alternative case of allowing an agent to belong to more than one coalition, any customer agent would prefer to belong to all existing coalitions in the system. Thus, all customers would want to be part of each coalition. The existence of coalitions in such a case would become futile: the effect of creating a small, comfortable environment that brings compatible agents closer to each other to
have more frequent interactions and, therefore, increased benefits for both customers and vendors would be lost. For vendor agents giving discounts to all customers would also be unacceptable. The CRM policy cannot function when a preferred group of customers is not distinguished from the others. From the point of view of an individual agent (either customer or vendor), we see the coalition formation mechanism as a decision problem: at each moment the agent faces a decision of whether to remain in the same status, form a new coalition, or leave the current coalition to join a different one. The decision should maximize the agent’s long-term utility function after taking into account important factors such as its long-term goals, incomplete knowledge about other agents, and global knowledge about the system. To help an agent make the right decision our approach uses the relationships established between the agent and other agents in the system after sharing common experiences. When faced with the decision problem of whether to join or leave a coalition, the agent has to evaluate all these relationships using some evaluation criterion. Based on this evaluation and on the natural assumption that an agent expects to have more profit from compatible agents with whom it has the best relationships, the agent prefers to be part of the same coalition as its partners with whom it has stronger relationships. In general, relationships between individuals can model different aspects of their interaction: the roles they play in the interaction, their goals, the importance that the interaction has for each of them, and the trust they have in one another [7]. In the absence of a formal contract between the agents, we find that the most appropriate aspect of a relationship is the trust that agents have in each other. 
In the context of formal contracts among agents in a group, there is an implicit trust in the structure and the regulations of the system that needs no explicit specification. In the context of informal contracts, each agent in the group should be able to explicitly trust the other agents. Trust has been thoroughly studied in e-commerce applications over the last years. Different definitions of trust can be found in [8, 9]. In our view the trust of an agent in another agent represents its belief that the other agent has similar preferences and that this will lead to many successful transactions between them in the future. For instance, a customer that has had satisfactory transactions with a certain vendor trusts that vendor to promise beneficial transactions for the future. A vendor that is satisfied with the purchases of a customer also believes that it will have positive transactions with that customer in the future. To represent trust at the agent level we use the model proposed in [8] that assigns to each trust relationship of an agent a numerical value from a set of trust quantifications. A new experience between the truster and the trusted agents has a value from a predefined set of experience classes, i.e. it is evaluated to be either positive or negative with a particular strength. It leads to an update of the agents’ trust according to a transition function defined between different states of trust. We briefly present here the formal model described in [8]. Given a set of experience classes E and a set of trust quantifications T, a mapping for the transition from one trust value t to another trust value trust(e, t) can be defined as:
trust : E × T → T
trust(e, t) = d * t + (1 − d) * e
We consider the case in which E = [-1, 1]. If an experience e is evaluated as a positive one it is assigned a positive value from E+; if e is a negative experience it takes a negative value from E-. We consider T = [-1, 1]. Parameter d ∈ [0.5, 1] is an inflation rate used to model the fact that older experiences become less important over time, while the most recent experience is the most relevant (since the agents’ preferences may change over time). In this trust function, after each new experience e, the existing trust value t is multiplied by d and the impact of e is added, normalized so that the result fits in the desired interval T. Based on this representation and on the set of discrete time values when experiences take place, Time = N+ (the set of natural numbers), a trust evolution function evol is inductively defined in [8]. This function is used by an agent when it has to update its trust in another agent at each step from the Time set:
evol : E × Time → T
evol(e, 0) = 0
evol(e0 e1 … ei, i + 1) = trust(ei, evol(e0 e1 … ei−1, i))
The definition of the trust evolution specifies that the initial trust for step 0 is set to a neutral value 0. At each step i+1 the trust is updated based on the previous trust (from step i) and the current experience ei according to the trust function defined above. We use this formal model of subjective trust to represent inter-agent relationships. The following section describes how we integrate it in the agent reasoning mechanism about coalition formation.
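The trust update and its inductive evolution can be written in a few lines. This is a minimal sketch of the two formulas above; the default inflation rate d = 0.8 and the argument checks are assumptions made for illustration, not values taken from [8].

```python
from typing import Iterable


def trust(e: float, t: float, d: float = 0.8) -> float:
    """One trust transition: trust(e, t) = d * t + (1 - d) * e."""
    if not 0.5 <= d <= 1:
        raise ValueError("d is expected to lie in [0.5, 1]")
    if not (-1 <= e <= 1 and -1 <= t <= 1):
        raise ValueError("experiences and trust values are expected in [-1, 1]")
    return d * t + (1 - d) * e


def evol(experiences: Iterable[float], d: float = 0.8) -> float:
    """Trust after a sequence of experiences, starting from the neutral value 0."""
    t = 0.0
    for e in experiences:
        t = trust(e, t, d)
    return t


# One fully positive experience followed by one fully negative experience, with d = 0.5.
after_two = evol([1.0, -1.0], d=0.5)
```

Because each update is a convex combination of the old trust and the new experience, the result always stays inside T = [-1, 1]. With d = 0.5 the two experiences above first raise the trust to 0.5 and then lower it to -0.25, which shows how the most recent experience dominates for small d.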
3 Coalition Formation Mechanism
We refer to a system of multiple personal assistant agents trading books in an open electronic market. Both the set of customer agents and the set of vendor agents may have size variations over time. We assume that the electronic market provides system matchmaker agents that are responsible for finding suitable partners to interact when a need arises. Agents can form coalitions to improve their individual benefits. The general scheme for coalition formation is shown in Fig. 1. It consists of two important steps. The first step is the interaction between a customer and a vendor. We use the term interaction to denote any attempt to execute a transaction between two agents. This step starts with a negotiation for a specific price, it continues with an evaluation of the interaction, and it ends with an update of the trust that each agent has in the other agent. The second step is named Coalition Reasoning Mechanism. It develops in each of the two agents’ reasoning mechanism after the interaction is finished. In this step an agent performs a re-evaluation of its status of belonging to the most profitable coalition or not. It also decides what action to take if any is needed. We present each component phase of this general scheme in more detail. As mentioned above, before a transaction between a customer and a vendor is executed, the two agents go through a bilateral negotiation phase to agree on a certain price. We use the negotiation mechanism developed by Mudgal [10]. It consists of an iterative process in which the two agents make offers and counteroffers based on the preferences of their users and on the reply of their opponent; it can end with an agreement in which case the interaction is successful or with a rejection from one of
the agents in which case the interaction is unsuccessful. The users' preferences play a crucial role in the result of the negotiation (either agreement or rejection). They take into consideration the minimum acceptable price for vendors and the maximum affordable price for customers, the subjective importance of money, the urgency of the current goal of selling or purchasing a certain product, the behavior towards risk, and the time constraints for executing the transaction. These preferences are relevant for establishing a possible compatibility or incompatibility between the agents. For instance, if the vendor has a minimum acceptable price p1 for a certain product, the customer has a maximum affordable price p2 for that product, and p1 > p2, the two agents will never reach an agreement when negotiating on that product. We consider such cases, which lead to a rejection on either side, to be negative evidence for updating the trust that the agents have in one another, since they reveal a certain level of incompatibility between their preferences in ranges of prices.
[Figure 1: flowchart. INTERACTION step: NEGOTIATION ends in agreement or rejection; after an agreement, APPLY DISCOUNT (if in the same coalition); EVALUATE EXPERIENCE (E+ after agreement, E- after rejection); UPDATE TRUST. COALITION REASONING MECHANISM step: CLASSIFICATION OF TRUST RELATIONSHIPS, then DECISION MAKING with three outcomes: REMAIN IN THE SAME STATUS, FORM A NEW COALITION, LEAVE COALITION AND JOIN ANOTHER.]
Fig. 1. General scheme for the coalition formation mechanism
When the negotiation terminates with an agreement, the agents agree on a price. If they belong to the same coalition, a certain discount is applied to this price as shown in Fig. 1. After the negotiation is finished, both the customer and the vendor have to evaluate the interaction. We consider a successful interaction as a positive experience because reaching an agreement between two agents is a direct consequence of similar interests in books and compatible preferences of the users. As mentioned above unsuccessful interactions reveal a possible incompatibility between the agents. We consider this as negative evidence for their belief that they will have successful transactions in the future. Positive experiences are assigned values from E+, while negative experiences are evaluated in the negative subset of experience classes E-. The last phase that takes place in the interaction step is the trust update. When the new experience is evaluated, the trust that the two agents have in each other is updated according to the trust evolution function defined in the previous section. Each agent stores in a vector the representation of its trust relationships with all agents in
the system with whom it has ever interacted. The agent’s relationships with the other agents are null according to the definition of the trust evolution function. A relationship is represented by the name of the agent to be trusted and a specific value from the set of trust quantifications T. The update of trust closes the interaction between the vendor and the customer. It also triggers the second step of the coalition formation, namely the Coalition Reasoning Mechanism (shown in Fig. 1). This consists of two parts: first the agent evaluates all its trust relationships with agents with whom it had previous interactions, and second, it makes a decision of whether it has sufficient trust to engage in an action of joining or leaving a coalition or it remains in the same status as before. The evaluation of trust relationships consists in classifying them and finding the best one. This can be done using different agent strategies: individually oriented and socially oriented. With the individually oriented strategy – that we denote with ind – an agent prefers to be in the same coalition with the agent in whom it has most trust. With the socially oriented strategy the current agent prefers the coalition that it trusts most. Trust in a specific coalition can be calculated as a function of the trust in individual agents from that coalition in which only agents that have a history of experiences with the current agent are significant. We consider the special case of agents out of coalitions as forming coalition 0. We present two possible functions to calculate the relationship established between an agent and a coalition. The first function calculates the relationship as the summative trust in all agents from that coalition. This leads to an agent strategy that we denote by soc1. The second function computes the trust of an agent in a coalition as the number of all agents in the coalition with whom the agent has positive relationships (i.e. 
the number of trustworthy agents). It leads to a different agent strategy that we name soc2. To decide what action is most profitable at each moment, an agent has to know the coalition it belongs to at the current moment, its trust relationships with other agents, and the coalitions to which these agents belong. We consider the coalition membership of each agent in the system to be public global knowledge. Our solution for the decision problem of an agent using the ind strategy can be described in pseudocode as a rule-based algorithm. The algorithm ensures that the current agent belongs to the same coalition as its most trusted partner. It first finds this agent. If they are in the same coalition, the current agent does not change its status. If it is in a different coalition, the agent leaves its coalition. In this latter case, if the most trusted agent is in a coalition the current agent joins it; otherwise they form a new coalition.

Find Ak - the most trusted agent by Ai
if (Ai and Ak in the same coalition)
    Ai does not change its status
elseif (Ai in a coalition) AND (Ak is not in Ai's coalition) then
    Ai leaves current coalition
    if (Ak out of coalitions) then
        Ai forms a new coalition with Ak
    elseif (Ak in a coalition)
        Ai joins Ak's coalition

The decision-making is the same for both socially oriented strategies. We present a solution for it in the form of a rule-based algorithm that can be used by an agent using
the soc1 or soc2 strategies after it calculates its trust in coalitions. The agent first finds the most trusted coalition. If it belongs to this coalition no action is needed; if it is in a different coalition it leaves its current coalition. As a result the current agent is out of coalitions. It then forms a new coalition with the agents it trusts most from among those outside any coalition, or it joins the most trusted coalition if one exists.

Find coalition k most trusted by Ai
if (Ai in coalition k) AND (k different than coalition 0)
    Ai does nothing
elseif (Ai in a coalition) AND (k different than coalition of Ai)
    Ai leaves current coalition
    if (k is coalition 0)
        Ai forms new coalition
    elseif (k is not coalition 0)
        Ai joins coalition k

We analyze the three agent strategies presented in this section and compare their effects on the system and on the individual agents in the following section.
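The three strategies (ind, soc1, soc2) and their decision rules can be sketched as follows. This is a hedged illustration, not the authors' implementation: the dictionaries `trust_in` (agent name to trust value) and `coalition_of` (agent name to coalition id), the action strings, and the tie-breaking by dictionary order are all assumptions made here. The sketch also assumes that an agent outside any coalition (coalition 0) may form or join a coalition directly, a case the pseudocode leaves open.

```python
from collections import defaultdict
from typing import Dict, List

NO_COALITION = 0  # the paper's "coalition 0": agents outside any coalition


def coalition_trust(trust_in: Dict[str, float], coalition_of: Dict[str, int],
                    strategy: str) -> Dict[int, float]:
    """Trust in each coalition, using only agents the current agent has a history with."""
    scores: Dict[int, float] = defaultdict(float)
    for agent, t in trust_in.items():
        c = coalition_of.get(agent, NO_COALITION)
        if strategy == "soc1":      # summative trust in the coalition's agents
            scores[c] += t
        elif strategy == "soc2":    # number of positively trusted agents
            scores[c] += 1.0 if t > 0 else 0.0
        else:
            raise ValueError("strategy must be 'soc1' or 'soc2'")
    return dict(scores)


def _move(mine: int, target: int, form_action: str) -> List[str]:
    """Shared rule: stay, or (leave if needed) then form or join."""
    if mine == target and mine != NO_COALITION:
        return ["stay"]
    actions = ["leave"] if mine != NO_COALITION else []
    actions.append(form_action if target == NO_COALITION else f"join:{target}")
    return actions


def decide_ind(me: str, trust_in: Dict[str, float],
               coalition_of: Dict[str, int]) -> List[str]:
    """ind: end up in the same coalition as the single most trusted agent."""
    if not trust_in:
        return ["stay"]
    partner = max(trust_in, key=trust_in.get)
    return _move(coalition_of.get(me, NO_COALITION),
                 coalition_of.get(partner, NO_COALITION), f"form:{partner}")


def decide_soc(me: str, trust_in: Dict[str, float],
               coalition_of: Dict[str, int], strategy: str) -> List[str]:
    """soc1 / soc2: end up in the most trusted coalition."""
    scores = coalition_trust(trust_in, coalition_of, strategy)
    if not scores:
        return ["stay"]
    best = max(scores, key=scores.get)
    return _move(coalition_of.get(me, NO_COALITION), best, "form")
```

The two socially oriented strategies can disagree on the same data: a coalition holding one strongly trusted agent may win under soc1, while a coalition holding more agents with small positive trust wins under soc2.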
4 Evaluation
We have developed a simulation prototype of the proposed coalition formation mechanism in Java. We ran 54 sets of experiments with different configurations of parameters. Each set of experiments consisted of 100 trials over which the results were averaged. In our simulation all agents used the same coalition formation mechanism. Our goal was to evaluate the mechanism at system and agent levels. For the first part we investigated the number of coalitions in the system, the overall dynamics, and how these factors evolve in time. The evolution of the number of coalitions is relevant for reasons of predictability, while the system dynamics (calculated as the sum of the number of coalitions visited by each agent) is important in establishing whether the system reaches an equilibrium state or not. For the agent evaluation we focused on the individual gains of the customer agents (calculated as the average of all benefits obtained from discounted transactions by each customer). The experiments were intended to compare the different agent strategies described in the previous section (ind, soc1, and soc2). Some variables involved in the design of the mechanism were set constant for all experiments: the inflation rate of trust (d = 0.5), the evaluation of positive experiences (0.2), the evaluation of negative experiences (-0.2), and the discount rate (5% of the value of the books being traded). The parameters under investigation are summarized in Table 1. Note that the number of vendor agents was set to 100 or 1000, while the number of customer agents was set to 100, 1000, or 10000. Instead of time, we use the number of interactions between agents as the parameter accounting for the evolution of the different factors, since the time periods in which a certain number of interactions happen can vary significantly.
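As a worked consequence of the constants fixed above, and assuming the update rule trust(e, t) = d * t + (1 - d) * e from Section 2 with initial trust 0, the trust after n consecutive experiences of the same value e has a simple closed form. The derivation below is an illustration added here, not part of the reported evaluation; t_n is shorthand for the trust after n experiences.

```latex
\begin{align*}
t_0 &= 0, \qquad t_n = d\,t_{n-1} + (1 - d)\,e \\
t_n - e &= d\,(t_{n-1} - e) = d^{\,n}\,(t_0 - e) = -e\,d^{\,n} \\
t_n &= e\,(1 - d^{\,n}) \\
d = 0.5,\; e = 0.2: \quad & t_1 = 0.1,\quad t_2 = 0.15,\quad t_3 = 0.175,\quad \lim_{n \to \infty} t_n = 0.2
\end{align*}
```

Because every update is a convex combination of the previous trust and an experience valued 0.2 or -0.2, the trust in any single agent stays strictly between -0.2 and 0.2 under these settings, and the most recent experiences dominate its value.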
Table 1. Simulation parameters and their values
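Number of customers: 100, 1000, 10000
Number of vendors: 100, 1000
Number of interactions: 1 to 1,000,000
Agent strategy: ind, soc1, soc2
Setup: simple, prob, costs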
We considered three different setups for the coalition formation mechanism: a simple one (denoted as simple), a setup that accounts for the increased likelihood that agents interact more often with members of their own coalition (denoted as prob), and a setup in which agents pay a cost for leaving a coalition (called costs). In the prob setup the implementation is designed to choose with a 0.6 probability a vendor from the same coalition as the customer agent interested in buying a book and with a 0.4 probability a vendor that is not in the same coalition as the customer. In the other setups the probability of choosing an agent from the same coalition is 0.5. The costs setup takes into account the realistic assumption that a customer has to spend time and effort to find better vendors when leaving a coalition, while a vendor that leaves a coalition might face a decrease in sales by losing its former clients. For vendors the costs are seen as a threshold of trust: given that trust is represented on a scale from -1 to 1, a vendor leaves a coalition only if the trust in the new coalition it wants to join exceeds the trust in the current coalition by 0.2. A customer pays a constant penalty (10 price units) for leaving a coalition to cover expenses for searching for new coalitions.

We first plot the results that show the evolution of the number of coalitions (see Fig. 2). On the X-axis of each graph the number of interactions is represented on a logarithmic scale from 1 to 1,000,000; the Y-axis represents the number of coalitions. Graphs a, c, and e (on the left) show configurations with 100 vendors for the simple, prob and costs setups, on a scale from 1 to 100. Graphs b, d, and f (on the right) show configurations with 1000 vendors for the simple, prob and costs setups, on a scale from 1 to 1000.
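The vendor selection rule of the prob setup and the leaving costs of the costs setup can be sketched as below. This is an assumption-laden illustration rather than the authors' code: the function names, the `vendor_coalition` mapping, and the fallback when one of the two vendor pools is empty are hypothetical. Only the probabilities (0.6, or 0.5 in the other setups), the 0.2 trust threshold, and the 10 price unit penalty come from the text.

```python
import random
from typing import Dict, Optional

NO_COALITION = 0               # agents outside any coalition
VENDOR_TRUST_THRESHOLD = 0.2   # costs setup: required trust advantage of the new coalition
CUSTOMER_LEAVE_PENALTY = 10.0  # costs setup: price units paid by a customer that leaves


def pick_vendor(customer_coalition: int, vendor_coalition: Dict[str, int],
                p_same: float, rng: random.Random) -> Optional[str]:
    """Pick a vendor from the customer's coalition with probability p_same.

    Assumption: if one of the two pools is empty, the other pool is used.
    """
    same = [v for v, c in vendor_coalition.items()
            if c == customer_coalition and c != NO_COALITION]
    other = [v for v, c in vendor_coalition.items()
             if c != customer_coalition or c == NO_COALITION]
    if same and (not other or rng.random() < p_same):
        return rng.choice(same)
    return rng.choice(other) if other else None


def vendor_may_leave(trust_new: float, trust_current: float,
                     threshold: float = VENDOR_TRUST_THRESHOLD) -> bool:
    """Costs setup: a vendor leaves only for a sufficiently more trusted coalition."""
    return trust_new - trust_current > threshold
```

A call such as `pick_vendor(1, {"a": 1, "b": 2}, 0.6, random.Random())` corresponds to the prob setup; passing 0.5 instead corresponds to the simple and costs setups.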
The results in this figure show that as we increase the number of interactions among agents in the system, the number of coalitions first grows, reaches a peak, and then starts to decrease. In most cases when costs are involved the decrease in the number of coalitions is barely observable. The meaning of this behavior is that at the beginning agents form coalitions and after a while the coalitions start merging (faster when no costs are involved and slower when costs are involved). We focus on analyzing the peak values, which reflect the formation of coalitions, and the values reached after 1 million interactions, which reflect the merging rate of the coalitions. The peak values are reached between 100 and 10000 interactions (faster for fewer vendors and for fewer customers). They range from 32 to 99 in configurations with 100 vendors and from 90 to 954 in configurations with 1000 vendors. Comparing the setups, we observe that the peak values are higher in the prob setup, medium in the costs setup, and lower in the simple setup (some exceptions apply). We notice that these values are similar for the three agent strategies. In conclusion, the peak values are limited by the number of vendor agents, increase with the number of customer agents, and depend on the setup and, slightly, on the strategy.
As we increase the number of interactions to 1 million, the number of coalitions drops to small values shown in Table 2. These values range from 1 to 99 in configurations with 100 vendors and from 1 to 909 in configurations with 1000 vendors. Table 2. The values of number of coalitions after 1 million interactions
We note an obvious difference between the evolution of the number of coalitions in the costs setup and in the other two setups. While in the simple and the prob setups the number of coalitions drops in all configurations to small values before 1 million interactions, in the costs setup it drops in only 2 out of 18 configurations (100 customers with 100 and 1000 vendors using soc1). In the other 16 configurations of the costs setup the number of coalitions drops very slowly until 1 million interactions. The fact that introducing costs for agents leaving a coalition leads to a pronounced decrease in the merging rate of coalitions can be explained by agents becoming less willing to leave a coalition after they join it. We expect them, however, to start merging after more interactions. The prob setup also delays the formation of coalitions and their merging speed, but to a much lower degree than the costs setup. Comparing the agent strategies we notice that for the simple and prob setups in all configurations with soc2 the number of coalitions drops faster to 1; with ind it reaches 1 in fewer cases; with soc1 it never reaches 1. In the costs setup the number of coalitions never drops to 1, but to small values. To conclude our analysis of the number of coalitions, we note that it has a predictable and controllable evolution over time. The number of coalitions is limited by the number of vendor agents in the system; this led us to use different scales for configurations with 100 and 1000 vendors. Another interesting observation is that the curves for configurations with the same proportion between the number of vendors and the number of customers are the same (e.g. 100 customers with 100 vendors and 1000 customers with 1000 vendors, as well as 1000 customers with 100 vendors and 10000 customers with 1000 vendors). In these cases the proportion between the existing number of coalitions and their upper limit is the same in all sample points.
A useful consequence can be drawn from this: when dealing with large numbers of vendors and of customers we can easily simulate the experiments for smaller numbers with the same proportion between the two numbers and scale up the results. There are small differences among the three agent strategies. Significant is that with soc2 the number of coalitions drops faster to 1, with ind it reaches 1 in fewer cases, while with soc1 it never reaches 1, although the number of coalitions seems to stabilize at small values. When agents join the coalition with the highest number of trusted agents (soc2 strategy) and no costs are involved the number of coalitions tends to drop faster to 1.
Fig. 2. Number of coalitions for: a) 100 vendors simple; b) 1000 vendors simple; c) 100 vendors prob; d) 1000 vendors prob; e) 100 vendors costs; f) 1000 vendors costs
Our second evaluation factor is the system dynamics defined as the sum of coalitions visited by each agent. In Fig. 3 we display similar plots: a, c, and e (on the left) for configurations with 100 vendors using the simple, prob and costs setup; b, d, and f (on the right) for configurations with 1000 vendors using the simple, prob and costs setup. The system dynamics is represented in thousands on the Y-axis (from 0 to 200) as a function of the number of interactions shown on a logarithmic scale from 1 to 1,000,000. Generally, the dynamics is insignificant for small values of the number
of interactions (from 1 to 1000). It shows a slow increase when the interactions grow to 10,000 and then to 100,000. When they grow further (to 1 million), the dynamics shows an exponential increase in several cases and stabilizes in the others. The exponential increase is observable in configurations using the ind and the soc1 strategies with 1000 and 10000 customers in the simple and prob setups. In configurations using the soc2 strategy the dynamics stabilizes between 100,000 and 1 million interactions in the simple and the costs setups for all configurations. In the remaining cases the system dynamics has a linear increase. For easier comparison, we show in Table 3 the highest dynamics reached after 1 million interactions. They range from 0 to 183 (for 100 vendors) and from 2 to 172 (for 1000 vendors).
Table 3. The values of system dynamics after 1 million interactions
Comparing the values of the system dynamics for different setups we observe that they are higher in the prob setup, medium in the simple setup, and lower in the costs setup for all configurations. In the costs setup the dynamics is drastically reduced compared to the other two setups. The prob setup brings an increase in the dynamics in 15 out of 18 cases compared to the simple setup. Therefore, introducing costs stops agents from moving from one coalition to another, while considering higher probabilities for agents to interact within the same coalition increases their dynamics. Comparing the agent strategies, we observe that soc2 results in the lowest dynamics, followed by ind and, lastly, by soc1 (some exceptions apply). We also note that the system dynamics has similar values for configurations with 100 and 1000 vendors, but it has higher values when the number of customers is increased. This means that customer agents tend to move more from one coalition to another, since they account for the overall dynamics more than the vendor agents. To conclude the analysis of the system dynamics we observe that the increase in the system dynamics is related to the merge of coalitions: they both start between 1000 and 10000 interactions and last the same period of time. The evolutions of the number of coalitions and of the systems dynamics are similar for all cases. Small differences occur as a consequence of delays that appear when taking into account diverse realistic conditions, such as increased probabilities or costs. Overall, our system analysis shows that in all conditions the system is predictable (in number of existing coalitions) and that it reaches a stable state (in the overall dynamics) after a certain number of interactions that depends on the characteristics of the environment.
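The system dynamics metric analyzed above (the sum, over all agents, of the number of coalitions each agent has visited) can be sketched as follows. The `history` structure is a hypothetical bookkeeping choice, and counting every join (including a return to a previously visited coalition) is an assumption, since the text does not say whether repeat visits count.

```python
from typing import Dict, List


def system_dynamics(history: Dict[str, List[int]]) -> int:
    """Sum over all agents of the number of coalitions each one has joined.

    history maps an agent name to the ordered list of coalition ids it joined.
    """
    return sum(len(joined) for joined in history.values())
```

A flat value of this sum over a range of interactions means that no agent changed coalition in that range, which is the sense in which the dynamics is said to stabilize.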
Fig. 3. System dynamics for: a) 100 vendors simple; b) 1000 vendors simple; c) 100 vendors prob; d) 1000 vendors prob; e) 100 vendors costs; f) 1000 vendors costs
The individual gain, evaluated at the agent level of our analysis, is the average over customers of the sum of all discounts that each customer receives. We plot it in Fig. 4 as a function of the number of interactions shown on a logarithmic scale. It ranges from 0 to 15 thousand. The general trend of the curves is to grow very slowly from 1 to 10,000 interactions, after which they rise linearly until 100,000 interactions and exponentially until 1 million interactions. The configurations with 100 customers have higher values than those with 1000 or 10000 customers because after a given number of interactions the overall discounts are similar but divided among fewer
customer agents. We observe that the values of the individual gain are higher for the prob setup, medium for the simple setup, and lower for the costs setup.
Fig. 4. Individual gain for: a) 100 vendors simple; b) 1000 vendors simple; c) 100 vendors prob; d) 1000 vendors prob; e) 100 vendors costs; f) 1000 vendors costs
There is a small increase in the individual gain of customers in the prob setup compared to the simple setup. This is due to the higher probability of agents to interact with members of the same coalition that increases the customer’s chances to get discounts. In the costs setup the individual gain has reduced values compared to the simple setup. This can be explained by the fact that the costs that agents pay when leaving a coalition are subtracted from their total benefits.
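The individual gain and the effect of leaving costs on it can be sketched as below. The 5% discount rate and the 10 price unit penalty are taken from the experimental settings; the function names and the representation of a customer as a pair (prices of discounted transactions, number of times it left a coalition) are assumptions made for this illustration.

```python
from typing import Iterable, List, Tuple

DISCOUNT_RATE = 0.05   # 5% of the value of the books being traded
LEAVE_PENALTY = 10.0   # price units per departure, costs setup only


def customer_gain(discounted_prices: Iterable[float], leaves: int = 0,
                  charge_costs: bool = False) -> float:
    """Total discount earned by one customer, minus leaving costs if enabled."""
    gain = sum(DISCOUNT_RATE * p for p in discounted_prices)
    if charge_costs:
        gain -= LEAVE_PENALTY * leaves
    return gain


def individual_gain(customers: List[Tuple[List[float], int]],
                    charge_costs: bool = False) -> float:
    """Average gain over all customers."""
    if not customers:
        return 0.0
    total = sum(customer_gain(prices, leaves, charge_costs)
                for prices, leaves in customers)
    return total / len(customers)
```

With the same transactions, enabling `charge_costs` can only lower the average, which matches the reduced gains reported for the costs setup.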
Comparing the values of the individual gains for different agent strategies we note that they are higher for soc2, medium for ind, and lower for soc1 (some exceptions apply). An explanation for this is that when the agents use the soc2 strategy the number of coalitions drops faster to its lower limit, which increases a customer's chances to interact with a vendor from the same coalition and get the desired discount. The results for the individual gain reveal once again that soc2 is the most beneficial strategy, while soc1 brings lower benefits for customers. We observe that the individual gain is directly related to the merging of coalitions: it is inversely related to the number of coalitions. In all strategies the gains become higher only after the coalitions start to merge (after 10,000 interactions). Three distinct behaviors are apparent:

(1) In all cases when the number of coalitions stabilizes at 1, the system dynamics stabilizes and the individual gain increases exponentially. This is observable in most cases that use the soc2 strategy and some with ind in the simple and the prob setups. It is a direct consequence of agents finding compatible partners faster and stabilizing in certain coalitions.

(2) In the remaining cases from the simple and the prob setups (all with soc1 and several with ind), the number of coalitions is still dropping (slowly) to its lower limit and the dynamics is increasing (linearly or exponentially), while the individual gain is increasing more slowly. This means that it is harder for the agents to find compatible partners and the most profitable coalitions. We expect that in these cases the system will also reach equilibrium in the number of coalitions and in the overall dynamics shortly after 1 million interactions.

(3) When costs are involved (for all agent strategies), the number of coalitions drops very slowly from its peak values and the overall dynamics seems to stabilize after insignificant increases, but the individual gain is low. This can be explained by the fact that agents are more reluctant to leave their coalitions and join different ones, which drastically delays both the drop in the number of coalitions and the dynamics (which hardly increases). We expect that the number of coalitions will drop and the dynamics will increase and then stabilize, but much more slowly.

Overall, our results show that the proposed coalition formation mechanism is beneficial for the customer agents and for the system. It ensures exponential benefits over time for the customers in all strategies. The mechanism leads to a predictable behavior of the system since the number of coalitions drops quickly to small values (limited by the number of agent categories) for all strategies. It also brings stability to the system since the overall dynamics reaches an equilibrium state for the soc2 strategy. Although the system dynamics increases exponentially for the other two strategies (i.e. ind and soc1), we expect it to stabilize after a larger number of interactions. The explanation of this behavior is that when most agents belong to the same coalition as the partners with whom they share similar interests and preferences, they stop moving from one coalition to another. This leads to stabilization in the number of coalitions and in the system dynamics, as well as to an exponential increase in the individual gains of customers. Soc2 is the best strategy for reasons of stability (least dynamics) and utility (best gain). Ind is better than the soc1 strategy. Costs for leaving a coalition reduce the dynamics and the gain drastically, but also delay the drop in the number of coalitions. Increasing the probability that agents have interactions inside their own coalitions leads to an increase in the dynamics as well as in the gain.
5 Conclusion
In this paper, we proposed and evaluated a coalition formation mechanism that takes into consideration trust relationships between agents. We showed that this mechanism brings stability to the system (in the number of coalitions and in the overall dynamics) and provides the customer agents with increased benefits over time. The mechanism requires little communication between the agents, which makes it scalable to large numbers of agents and interactions. Future work includes investigating the proposed coalition formation mechanism under more realistic circumstances, such as setups in which agents use different coalition strategies, more types of goods are traded in the system, and agents are free to belong to several coalitions at a time.
References
1. O. Shehory, S. Kraus. A Kernel-oriented model for coalition formation in general environments: Implementation and Results. Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, 1996, pp. 130-144.
2. O. Shehory, K. Sycara. Multi-agent Coordination through Coalition Formation. Proceedings of International Workshop on Agent Theories, Architectures, and Languages (ATAL97), Providence, RI, 1997, pp. 135-146.
3. K. Lermann, O. Shehory. Coalition Formation for Large Scale Electronic Markets. Proceedings of the Fourth International Conference on Multiagent Systems ICMAS'2000, Boston, July 2000, pp. 216-222.
4. M. Tsvetovat, K. Sycara. Customer Coalitions in the Electronic Marketplace. Proceedings of the Fourth International Conference on Autonomous Agents, Barcelona, 2000, pp. 263-264.
5. D. P. Reed. That Sneaky Exponential: Beyond Metcalfe's Law to the Power of Community Building. Context magazine, Spring 1999.
6. Customer Relationships Management, available on line at http://www.crm.com/
7. J. Vassileva. Goal-Based Autonomous Social Agents Supporting Adaptation and Teaching in a Distributed Environment. Proceedings of ITS'98, San Antonio, Texas. LNCS No 1452, Springer Verlag: Berlin, 1998, pp. 564-573.
8. C. Jonker, J. Treur. Formal Analysis of Models for the Dynamics of Trust based on Experiences. Autonomous Agents, Deception, Fraud and Trust in Agent Societies, Seattle, 1999, pp. 81-94.
9. Ganzaroli, Y. Tan, W. Thoen. The Social and Institutional Context of Trust in Electronic Commerce. Autonomous Agents, Deception, Fraud and Trust in Agent Societies, Seattle, 1999, pp. 65-76.
10. Mudgal, J. Vassileva. Bilateral Negotiation with Incomplete and Uncertain Information: A Decision-Theoretic Approach Using a Model of the Opponent. Proceedings of the 4th International Workshop on Cooperative Information Agents (CIA IV), Boston, July 2000, pp. 107-118.
Using Agent Replication to Enhance Reliability and Availability of Multi-agent Systems
Alan Fedoruk and Ralph Deters
Department of Computer Science, University of Saskatchewan, Saskatoon, Canada
{amf673,ralph}@cs.usask.ca
http://bistrica.usask.ca/madmuc/
Abstract. Dependability is an important issue in the successful deployment of software artifacts, since users expect system services to be highly available and reliable. But unlike mainstream computer science, where dependability is a central design issue, the multi-agent research community is just beginning to realize its importance. One possible approach for increasing the reliability and availability of a multi-agent system is by replicating its agents. In this paper, the concept of transparent agent replication and its major challenges are investigated. In addition, an implementation of transparent agent replication for the FIPA-OS framework is presented and evaluated.
1 Introduction
A major reason for the significant gap between development of multi-agent systems (MAS) and deployment is the dependability problem. Much of the experimentation with MASs is done using a closed and reliable agent environment, which does not need to handle faults. When multi-agent systems are deployed in an open environment (where agents from various organizations interact in the same MAS and the systems become distributed over many hosts and communicate over public networks), more attention must be paid to fault-tolerance. In an open system, agents may be malicious or poorly designed, hosts may get overloaded or fail, and network connections may be slow or fail altogether. For a system to avoid failures it must be able to cope with these types of faults [10,13]. Incorporating redundant copies of system components, either through hardware or software, is widely used in engineering and in software systems to improve fault-tolerance. The idea is simple: if one component fails, there will be another ready to take over. Replicating agents within a MAS provides the benefit of fault tolerance, increasing reliability and availability, but it also raises problems with agent communication, read/write consistency, state synchronization, results synthesis, and increased work loads. The goal of this paper is to motivate the use of replication as a general technique to increase fault tolerance and thus improve the availability and reliability of MASs. The remainder of this paper is structured as follows: Section 2 describes faults and failures as they relate to MASs and to agent replication, and related work is summarized; Section 3 presents agent replication, describing what it can be used for, the problems that it introduces and some methods for dealing with those problems; Section 4 describes an implementation of proxies using FIPA-OS [2] and reports on one application of that implementation; Section 5 presents conclusions and identifies further work.

R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 237–251, 2002. © Springer-Verlag Berlin Heidelberg 2002
2 Multi-agent Systems, Faults and Failures
The four essential characteristics of a multi-agent system are: the agents are autonomous; there is no single point of control; the system, and the agents, interact with a dynamic environment; and the agents are social, that is, they interact with each other and may form relationships. Agents are distinguished from other software entities by these characteristics: they have some degree of autonomy [5], they have their own thread of control, and they are situated in some environment.

A failure occurs when the system produces results that do not meet the specified requirements of the system [12]. A fault is defined to be a defect within an agent or framework component of a MAS which may lead to a failure. Faults can be grouped into five categories as shown in Table 1 [20]. When a fault occurs in an agent or a component of the MAS, interdependencies between agents may cause the fault to spread throughout the system in unpredictable ways.

Table 1. Fault Types

Fault Type                    Description
Program bugs                  Errors in programming that are not detected by system testing.
Unforeseen states             Omission errors in programming. The programming does not handle a particular state. System testing does not test for this state.
Processor faults              System crash or a shortage of system resources.
Communication faults          Slow downs, failures or other problems with the communication links.
Emerging unwanted behaviour   System behaviour which is not predicted. Emerging behaviour may be beneficial or detrimental.

Several approaches to fault-tolerance in MASs are documented in the literature; they focus on different aspects of fault-tolerance but none explores the possibilities of transparent replication. Hanson and Kephart [11] present methods for combating maelstroms in MASs. Maelstroms are chain reactions of messages that continue cycling until the system fails. The proposed technique is specific to this type of fault. Klein [13] developed sentinels, agents that observe the running system and take action when faults are observed. The design goals of this system are to take the burden of fault-tolerance off the agent developer and have the agent infrastructure provide the fault-tolerance. Klein’s work focuses on observing agent behaviour, diagnosing the possible fault and taking appropriate remedial action. Toyama and Hager [18] divide system robustness into two categories, ante-failure and post-failure. Ante-failure robustness is the traditional method: systems resist failure in the face of faults. Post-failure robustness deals with recovery after failure. Schneider [17] creates a method for improving fault-tolerance with mobile agents using the TACOMA mobile agent framework. The work focuses on ensuring that replicated agents are not tainted by faulty processors or malicious entities as agents move from host to host. Kumar et al. [14] present a methodology for using persistent teams to provide redundancy for system-critical broker agents within a MAS. The teams are similar to the replicate groups presented here, but replicate group members all perform the same task while team members do not. In this paper transparent replication will be the main focus. It builds on the teamwork approach presented in [14].
3 Agent Replication
3.1 Definition of Agent Replication
Agent replication is the act of creating one or more duplicates of an agent in a multi-agent system. Each of these duplicates is capable of performing the same task as the original agent. The group of duplicate agents is referred to as a replicate group and the individual agents within the replicate group are referred to as replicates. Once a replicate group is created in a MAS, if that replicate group is visible to the rest of the MAS, there are several ways that agents can interact with it:

– An agent can send requests to each replicate in turn until it receives an appropriate reply.
– An agent can send requests to all replicates and select one of the replies it receives, or synthesize them.
– An agent can pick one of the replicates based on a particular criterion (speed, reliability, etc.) and interact only with that agent.

All of the replication schemes presented here are variations on these ideas.

Related work can be found in literature dealing with object group replication [8], N-Version Voting [4], and distributed systems and distributed databases [9,6,3]. Agent replication is an extension of object group replication, and many of the same challenges are faced when using agent replication. Distributed systems and databases provide some of the basic techniques for solving transaction, state and concurrency issues. Agent replication differs from object replication and from the replication used in distributed systems or databases by the nature of the agents; agents are autonomous and situated, while processes, objects and database entities are not.

There are two basic types of agent replication: heterogeneous and homogeneous. In heterogeneous replication, the replicates are functionally equivalent, but individual replicates may have been implemented separately. In homogeneous replication, replicates are exact copies of the original agent: the replicates are not only functionally equivalent, but are copies of the same code. In addition to these two main groupings, agents can be either deterministic or non-deterministic, reactive or proactive. Replicates may be situated in different environments with differing processor speeds, loads and reliability, differing network connectivity, and differing resources.

A system using homogeneous replication with deterministic agents will not see any benefit from agent replication when the agents encounter a program error or an unforeseen state; all of the agents in the replicate group will encounter the same fault. If the agents are non-deterministic, or the percepts received from their environments are different, then not all of the replicates will be in the same state, so they may not all encounter the program bug or unforeseen state fault, and in this case, replication will increase fault-tolerance. A system using either homogeneous or heterogeneous replication will see increased fault tolerance to communications and processor faults if members of the replicate group are running on more than one processor.

If heterogeneous replication is used, fault-tolerance to bugs and unforeseen states will be increased. Two separately programmed agents may not have the same bugs, so a heterogeneous replicate group may be able to continue functioning even if some of the replicates encounter a bug and fail. This concept is an extension of work done on N-Version voting [4]. This property can be exploited when testing new versions of an agent: a replicate group can be created that consists of both current-version and new-version agents. If the new agents encounter a bug and fail, the current-version agents are still in place and the system will continue to operate.

3.2 Key Challenges Raised by Agent Replication
There are four key challenges raised by agent replication: replicated messages, read and write consistency, state synchronization and increased workload.

The formation of a replicate group will always mean increased system overhead. Each replicate will use some system resources and extra communication is needed to manage state synchronization and communication issues.

If an agent wishes to communicate with another agent that has been replicated, three challenges arise. First, how will the agent know which member of the group to communicate with? Second, if the agent communicates with all members of the group, how will it process all of the replies? Third, if the replicate group needs to initiate communication, which one of the replicates will initiate it, and if all initiate it, how will the agent they communicate with deal with multiple messages?

All agents interact with their environment [16], and having multiple agents means multiple interactions with the environment. That interaction can be viewed in terms of database reads and writes. This allows us to approach the challenge of keeping the environment consistent using work done on distributed, multi-user database management systems. The same challenges faced in distributed databases arise here: how will the database remain consistent, and how can it be ensured that processes reading the data receive consistent data?
When a reactive agent receives a message, it will be processed in one of three ways: without any reference to the agent’s state; referencing the agent state; or referencing and altering the agent state. To avoid inconsistencies between replicates, it is necessary to coordinate them. This can be done by locking, using exclusive (X) or shared (S) locks, or by time-stamping [6].

Agent state and state synchronization are central issues in agent replication. A group of replicates must have the ability to define, store, set and synchronize their states. What comprises an agent state is context dependent, but in general, the state is the knowledge, rules, goals, and code of the agent. For homogeneous agents, sharing state is relatively straightforward; however, for heterogeneous agents a higher-level definition of state must be created which can be interpreted by the different agents. To help manage states, the concept of a transaction, from database research, can be used. A transaction takes an agent from one consistent state to another. The type of transaction management, nested or flat, depends on the domain of the MAS and will not be discussed in this paper. See [15] for a discussion of nested transactions.

3.3 Transparent Replication and Replicate Group Proxies
Replication techniques that hide both the fact of replication and its implementation from the rest of the MAS will be called transparent replication. When transparent replication is used, other agents do not know that they are interacting with a replicate group. One method of implementing transparent replication is via replicate group proxies. The replicate group proxy acts as an interface between the replicates and the rest of the MAS. Proxies provide three important functions: they make a replicate group appear to be a single entity, they control communication between a replicate group and the MAS, and they control execution and state management of a replicate group. Fig. 1 illustrates a heterogeneous replicate group (B Proxy, B0, B1, B2, B3, B4), the agents’ environment and another agent, A, interacting with the replicate group. A sees only B Proxy and not any of the replicates.
Fig. 1. Replicate Group Structures and Proxies (diagram labels: agent A; B Proxy; replicates B0, B1, B2, B3, B4 forming the replicate group; Environment)
The proxy handles all communication between replicates and other agents in the MAS and handles interaction between replicates and the agent environment. If all of the replicate agents are running simultaneously, the proxy would have to select which replicate to communicate with, and results synthesis and read and write consistency would still be a problem. To deal with these issues one of two management strategies is used: hot-standby or cold-standby. In hot-standby, the proxy selects one of the replicates as the active replicate. The other replicates are placed in a dormant mode. Periodically, the state of the active replicate is transferred to each of the dormant replicates. If the active replicate fails, the proxy will detect the failure and select a new replicate to be active. As the new active replicate has been getting state updates, it can start processing immediately, resuming where the previous agent left off. The proxy is handling communication for the replicate group, and any messages are now routed to the new active agent. In cold-standby, the proxy selects one of the replicates as the active replicate and the other replicates are again dormant. The difference is that the state of the active replicate is stored in the proxy, but not transferred to the dormant replicates. When the active replicate fails, the proxy will detect the failure, select a new replicate to be active, and transfer the current state to the new active replicate. Cold-standby will have slower switch-over times but will have less overhead while the system is running.
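The difference between the two standby strategies can be made concrete with a small sketch. The code below is illustrative only and is not taken from the replication server described in this paper: the class names `Replicate`, `ReplicateGroupProxy` and `StandbyDemo` are hypothetical, agent state is reduced to a string, failure detection is reduced to an `alive` flag, and the hot-standby transfer happens on every update, where the text describes it as periodic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a replicate; not a FIPA-OS class. */
class Replicate {
    final String name;
    String state = "";      // last state this replicate has been given
    boolean alive = true;   // cleared to simulate a failed agent or host

    Replicate(String name) {
        this.name = name;
    }
}

/** Hypothetical proxy that fronts one replicate group. */
class ReplicateGroupProxy {
    enum Mode { HOT_STANDBY, COLD_STANDBY }

    private final Mode mode;
    private final List<Replicate> replicates;
    private Replicate active;
    private String storedState = ""; // copy of the active state kept by the proxy

    ReplicateGroupProxy(Mode mode, List<Replicate> members) {
        this.mode = mode;
        this.replicates = new ArrayList<>(members);
        this.active = this.replicates.get(0); // assumption: first member starts as active
    }

    /** Called whenever the active replicate reports a new consistent state. */
    void onStateUpdate(String newState) {
        active.state = newState;
        storedState = newState;
        if (mode == Mode.HOT_STANDBY) {
            // Hot-standby: push the state to every dormant replicate.
            for (Replicate r : replicates) {
                if (r != active) {
                    r.state = newState;
                }
            }
        }
    }

    /** Routes a message to the active replicate, failing over first if it is down. */
    String deliver(String message) {
        if (!active.alive) {
            failOver();
        }
        return active.name + " handled '" + message + "' with state '" + active.state + "'";
    }

    private void failOver() {
        for (Replicate r : replicates) {
            if (r.alive) {
                if (mode == Mode.COLD_STANDBY) {
                    // Cold-standby: state is transferred only at switch-over time.
                    r.state = storedState;
                }
                active = r;
                return;
            }
        }
        throw new IllegalStateException("no reachable replicate left in the group");
    }
}

class StandbyDemo {
    public static void main(String[] args) {
        List<Replicate> group = new ArrayList<>();
        group.add(new Replicate("B0"));
        group.add(new Replicate("B1"));
        ReplicateGroupProxy proxy =
            new ReplicateGroupProxy(ReplicateGroupProxy.Mode.COLD_STANDBY, group);
        proxy.onStateUpdate("s1");
        group.get(0).alive = false; // simulate failure of the active replicate
        // prints: B1 handled 'help' with state 's1'
        System.out.println(proxy.deliver("help"));
    }
}
```

In this sketch the caller of `deliver` never learns which replicate answered, which is the transparency property described above. The only difference between the two modes is when the dormant replicates receive state: on every update in hot-standby, and only inside `failOver` in cold-standby.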
4 Implementing Agent Replication with Proxies
4.1 A Replication Server
The purpose of the replication server implementation is to apply the transparent agent replication technique with proxies and obtain a measure of its effectiveness in improving the system failure rate, to gauge the added complexity and resource usage incurred by replication, and to provide an infrastructure for further experimentation. The FIPA-OS [2] agent toolkit was chosen as a platform for the implementation since it enjoys a large developer and user community, it implements the FIPA [1] standard, and it is open source Java code. The replication server implements transparent replication with the following features:

– communication proxy;
– hot or cold standby replication;
– homogeneous and heterogeneous replication;
– state synchronization within a replicate group.
The replication server functions as follows:

1. A replicate group is created.
2. Agents are created within the replication server which created the replicate group, in another replication server or as a separate process. See Fig. 2.
3. Agents register with a replicate group; agents are placed in the role of either dormant or active.
4. Periodically, the replication server checks that the active agent in each replicate group is still reachable, and if the active agent is deemed unreachable, a new active agent is chosen from the remaining replicates.
5. Periodically, the active agent will send its state to the replication server. The replication server will, in turn, store this state and distribute it to the other replicates in the group.
6. All messages going to and from a replicate group are funneled through the replicate group message proxy.

The replication server, RepServer, is implemented as a standard FIPA-OS agent. A running MAS may have many RepServer agents. Each RepServer agent consists of one or more replicate groups, RepGroup, and provides replicate group management services for those replicate groups. Each RepGroup consists of a list of agents that make up the group, a message proxy agent, a reference to the currently active agent and a stack of past agent states. The list of agents consists of AgentIDs, a flag to indicate whether or not the last contact with this agent was successful, and a time-stamp to indicate when this agent state was last updated. When a group is created, the message proxy registers with the platform Agent Management Service (AMS) and Directory Facilitator (DF) agents, and the members of the replicate group do not.

When state information is sent to the replication server the state is placed on top of a stack within the appropriate RepGroup. Currently, the most recent state is always used for synchronizing replicates. However, having a stack of states allows other policies to be implemented: if the current state leads to a failure, it could be advantageous to start the new active agent with one of the previous states. The stack of states can be moved to persistent storage and be used for recovery if the RepServer fails.
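The bookkeeping just described (an agent list with a contact flag and time-stamp, a reference to the active agent, and a stack of states) can be sketched as follows. This is a hedged illustration, not the authors' RepGroup class: the names `RepGroupSketch`, `Member`, `recordPing`, `checkActive` and `rollBack` are hypothetical, states are plain strings, the message proxy is omitted, and the rule that the first registrant becomes active is an assumption made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical bookkeeping for one replicate group; not the authors' RepGroup class. */
class RepGroupSketch {
    /** One entry in the group's agent list. */
    static class Member {
        final String agentId;
        boolean lastContactOk = true; // outcome of the most recent ping
        long stateUpdatedAt = 0L;     // time-stamp of the last state update

        Member(String agentId) {
            this.agentId = agentId;
        }
    }

    private final List<Member> members = new ArrayList<>();
    private final Deque<String> stateStack = new ArrayDeque<>(); // newest state on top
    private Member active;

    /** Step 3 (assumed policy): the first registrant becomes active, later ones stay dormant. */
    void register(String agentId) {
        Member m = new Member(agentId);
        members.add(m);
        if (active == null) {
            active = m;
        }
    }

    /** Records the outcome of a ping for the given agent. */
    void recordPing(String agentId, boolean reachable) {
        for (Member m : members) {
            if (m.agentId.equals(agentId)) {
                m.lastContactOk = reachable;
            }
        }
    }

    /** Step 4: keep the active agent if reachable, otherwise pick the first reachable member. */
    String checkActive() {
        if (active != null && active.lastContactOk) {
            return active.agentId;
        }
        active = null;
        for (Member m : members) {
            if (m.lastContactOk) {
                active = m;
                break;
            }
        }
        return active == null ? null : active.agentId;
    }

    /** Step 5: the active agent's state goes on top of the stack. */
    void pushState(String state) {
        stateStack.push(state);
        if (active != null) {
            active.stateUpdatedAt = System.currentTimeMillis();
        }
    }

    /** Default policy: synchronize replicates from the most recent state. */
    String currentState() {
        return stateStack.peek();
    }

    /** Alternative policy: discard a suspect newest state and fall back to the one before it. */
    String rollBack() {
        if (stateStack.size() > 1) {
            stateStack.pop();
        }
        return stateStack.peek();
    }
}

class RepGroupDemo {
    public static void main(String[] args) {
        RepGroupSketch group = new RepGroupSketch();
        group.register("A0");
        group.register("A1");
        group.pushState("s1");
        group.pushState("s2");
        group.recordPing("A0", false);            // the active agent stops answering
        System.out.println(group.checkActive());  // prints: A1
        System.out.println(group.rollBack());     // prints: s1
    }
}
```

Keeping the states on a stack is what makes the `rollBack` policy cheap: restarting a new active agent from an earlier state only requires popping the suspect entry.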
Fig. 2. Replication Architecture (diagram labels: RepServer0, RepServer1; RepGroupA, RepGroupB, RepGroupC; proxies A, B, C; replicates A0, A1, A2, B0, B1, B2)
A replication server accepts requests to perform the following functions: create a replicate group, register an agent with an existing replicate group, and create an agent. For each replicate group, the replication server will periodically ping the active agent, and if it does not respond, a new active agent will be selected. Currently, the first agent that is found to be reachable is chosen as the next active agent. Other policies for choosing the next active agent can be implemented, such as choosing the most up-to-date agent or choosing the fastest agent.

State synchronization is driven by the active agent. The agent decides when its state has changed, when it is consistent, and when it should be pushed to the replication server. The replication server then distributes the state to the other replicates in the group as appropriate. In the case of hot-standby, the replicates are kept as up to date as possible; in cold-standby, the replicates are only updated when they become active.

Local Post Office. FIPA-OS uses RMI for intra-platform message passing and CORBA to pass messages between platforms. A standard FIPA-OS agent has three components: a Task Manager (TM), a Conversation Manager (CM), and a Message Transport System (MTS). Normally messages are created in tasks under the control of the TM, passed to the CM, then passed to the MTS. The MTS handles the RMI processing to deliver the message to the MTS in the receiving agent. For agents that exist in the same process, the LocalPostOffice allows the agent to bypass the MTS and directly access the Conversation Manager of the receiving agent. Messages are created in tasks and passed to the CM. The CM determines if the receiving agent is reachable via the LocalPostOffice. If it is, the message is delivered directly to the CM in the receiving agent. Using the LocalPostOffice allows the agent to use local method calls rather than remote method calls, preserving bandwidth and processing resources, and lowering message passing overhead. See Fig. 3 and Fig. 4.

Distributed Platforms in FIPA-OS. FIPA-OS agent platforms can be implemented in one of three ways: as a single non-distributed platform with all platform components and agents on a single host; as a single distributed platform with platform components and agents on more than one host; and as multiple platforms, each either distributed or non-distributed, on one or more hosts. For the application described here the first two configurations were used.

4.2 Application
A simplified version of the I-HELP MAS [19] is used as a test bed for the replication server. I-HELP is a peer help system currently used at the University of Saskatchewan. I-HELP is susceptible to failures due to individual agents failing, particularly the matchmaker agent, and its brittleness requires full-time system administration [7]. I-HELP allows student users to find appropriate peer helpers. Each user of the system has a personal agent that maintains a small database of information about its user, such as topics the user is competent in, the identity of the user and whether or not the user is currently online and willing to help others. Each user communicates with their personal agent via a user interface agent. A matchmaker agent in the system maintains a list of all personal agents within the system. When a user initiates a request for help, the request is sent to the matchmaker agent, which broadcasts it to all of the known agents and then assembles the replies, which are returned to the requesting agent. See Fig. 6.

This application is implemented in two versions. The first uses standard FIPA-OS agents with no replication. The second uses replicated agents for personal and matchmaker agents. As this is a closed and reliable environment, the replicated agents have been given a built-in fault rate to simulate an open environment. For the testing done on a single platform, all platform services and agents are run on a Sun SunFire 3800 with 4 UltraSparcIII CPUs running at 750 MHz and 8 GB of RAM, running Solaris 2.8. When using a distributed platform, the platform services and the matchmaker agent are run on the SunFire 3800 and the other agents are run on three 800 MHz, 512 MB PCs running Windows 2000.

The evaluation consists of six tests:

1. The replicated version of the system ran with 1 matchmaker agent, 2 user interface agents and 8 personal agents. Each personal agent was replicated, using 10 replicates. The system was run with the fault rate of the personal agent varying from 5 failures per 100 messages to 90 failures per 100 messages.
2. The non-replicated version of the system ran with 1 matchmaker agent, 2 user interface agents and a variable number of personal agents, from 2 to 64.
3. The replicated version of the system ran with 1 matchmaker agent, 2 user interface agents and a variable number of personal agents, from 2 to 64. Each replicate group consists of three agents.
4. The replicated version of the system ran with 1 matchmaker agent, 2 user interface agents and a variable number of personal agents, from 2 to 64. Each replicate group consists of three agents. In this case the local post office method of message passing for agents in the same process was used.
5. The replicated version of the system ran with 1 matchmaker agent, 2 user agents, 4 personal agents, with variable numbers of replicates in each replicate group, ranging from 3 to 15.
6. A distributed platform was created with personal agents on two Windows 2000 hosts, a personal agent and user interface agents on a third Windows host.

Fig. 3. Local Post Office (diagram: Agent1 and Agent2, each with a Task Mgr., Conversation Mgr. and MTS; the Conversation Mgrs. are linked via LocalPostOffice and the MTSs via RMI)

/** LocalPostOffice V1.1, November 20, 2001 */
package localpostoffice;

// Import the fipa-os classes
import fipaos.agent.*;
import fipaos.mts.*;
import fipaos.ont.fipa.*;
import java.util.*;

/**
 * LocalPostOffice implements a local message delivery system. If an agent
 * is running within the same process, this will deliver messages without
 * going outside of the process.
 */
public class LocalPostOffice {
    /** List of agents in this process. */
    private Hashtable _started_agents;

    /** Constructor. Set the _started_agents list. */
    public LocalPostOffice(Hashtable started_agents) {
        _started_agents = started_agents;
    }

    /** deliver a message */
    public void sendMessage(Message msg) throws Exception {
        ACL acl = msg.getACL();
        if (_started_agents.containsKey(acl.getReceiverAID().getName())) {
            // Receiver runs in this process: hand the message to its Conversation Manager.
            ((FIPAOSAgent) _started_agents.get(acl.getReceiverAID().getName()))
                .getCM().receiveMessage(msg);
        } else {
            // Receiver is not local, so this post office cannot deliver the message.
            throw new Exception();
        }
    }
} // LocalPostOffice

Fig. 4. Local Post Office

Fig. 5. Simplified I-Help MAS (diagram labels: MatchMaker, PersAgents, UIAgents)

1. User enters help request; UI Agent passes request to PersAgent.
2. PersAgent forwards request to the MatchMaker.
3. MatchMaker forwards help request to all PersAgents.
4. MatchMaker assembles replies.
5. MatchMaker forwards reply to PersAgent, which forwards it to the UI Agent.

Fig. 6. Flow of Requests in I-HELP
For these experiments, where a fault rate was used, it was set at 10 faults per 100 message arrivals. In all experiments, the measured variable was the time required for the system to respond to a request. If no response is returned, a failure is assumed. This provides a measure of both fault tolerance and overhead. For each configuration, a number of requests were performed and the median response time was recorded, which limits the influence of variation in underlying system or network load. All tests were done using hot-standby replication. In these tests, each non-replicated agent ran in its own process. To provide direct comparability, each replicated agent was run in a separate replication server in its own process.

4.3 Discussion
Experiment 1 uses the results from Test 1 and illustrates the effectiveness of replication at varying fault rates. For lower values of the fault rate (< 50%), using 10 replicates reduced the failure rate to near zero. Even with a higher fault rate of 50%, 10 replicates reduced the failure rate to 0.0009765. Assuming agent failures to be independent events, the probability of a failure is equal to the product of the probabilities of failure of the individual agents. At very high fault rates (90%) several agents can fail before one gets any data loaded (the UIAgent registered), so measuring the failure rate is difficult. However, with enough replicates, the failure rate will be reduced, and as shown by Experiment 3, adding more replicates is not costly.

Experiment 2 uses the results from Tests 2, 3 and 4 and measures the overhead incurred by using replication. Test 2 (see Fig. 7) isolates the performance effects of the underlying FIPA-OS platform. The results show a linear increase in the response time as more personal agents are added to the system. This result is expected, as the number of messages passed increases directly with the number of personal agents in the system. Test 3 (see Fig. 7) is used to determine the amount of overhead replication adds to a system. The results show an increase in the response time over the non-replicated case; however, the response time still increases linearly with the number of personal agents in the system. This increase can be accounted for by the message proxy setup (the use of a message proxy doubles the number of messages that the agent platform must deliver) and by the extra processing required for state synchronization. Test 4 repeats Test 3, using the LocalPostOffice to improve message passing performance. The results (see Fig. 7) are similar to Test 3, but show improved performance, that is, reduced response times.

Experiment 3 uses the results from Test 5 (see Fig. 8) and illustrates the effect of increasing the replicate group size on system performance. The observed increase in response time, as the replicate group size increases, is very small.

Experiment 4 uses the results of Test 6 and determines what impact, if any, distribution has on performance. Using 4 and 8 personal agents, request response times were nearly identical to those measured in the non-distributed case. This shows that the load of the system can be spread around without negatively impacting performance.
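The independence argument used for Experiment 1 can be written out explicitly. The following is a worked sketch under two assumptions that go beyond the text: every replicate fails with the same probability p, and a request is lost only when all n replicates in the group fail. The symbols p, n, ε and P_group are introduced here for illustration.

```latex
P_{\text{group}} = \prod_{i=1}^{n} p_i = p^{\,n}
\qquad \text{(independent failures, equal fault rates } p_i = p\text{)}

p = 0.5,\; n = 10: \quad P_{\text{group}} = 0.5^{10} = \tfrac{1}{1024} \approx 0.00098

p = 0.9,\; n = 10: \quad P_{\text{group}} = 0.9^{10} \approx 0.35

P_{\text{group}} \le \varepsilon \iff n \ge \frac{\ln \varepsilon}{\ln p},
\qquad \text{e.g. } p = 0.9,\ \varepsilon = 0.01 \;\Rightarrow\; n \ge 44
```

The first value agrees with the 0.0009765 quoted for a 50% fault rate, since 1/1024 = 0.0009765625. Under the same model, the last line indicates why a group of 10 is not enough at a 90% fault rate, and roughly how large a group would be needed to bring the failure rate below 1%.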
Fig. 7. Response Time vs. Number of Personal Agents (plot: response time in seconds against number of PersAgents, N; series: No Replication, Replication Standard Comm, Replication Local PO)
Fig. 8. Response Time vs. Number of Replicates per Group, 4 Personal Agents (plot: response time in seconds against number of replicants per group, N)
5 Conclusion and Future Work
This paper introduced the topic of agent replication and examined the issues associated with using agent replication in a multi-agent system. The main issues identified are agent communication, read/write consistency, state synchronization and system overhead. Transparent replication using proxies was introduced as one method to deal with the issues raised. The approach was shown to increase fault tolerance, hence reliability and availability of a MAS, with a reasonable increase in overhead; transparent replication is a promising technique. Transparent replication was tested in a closed environment, but the test application was constructed with built-in faults, to simulate an open environment.
In future, this research will focus on three areas. First, move the proxy into the agent and allow any of the agents to take on the role of the proxy. This would remove the single point of failure. In the event the agent acting as a proxy fails, agent teamwork and cooperation techniques for deciding which of the remaining agents would take over as a proxy would be used. Second, allow active replication. Currently the replication techniques described only allow one replicate to be active at any given time—except for special cases where multiple agents can run independently. Techniques for allowing more than one replicate to be active at once need to be developed. By using decentralized two-phase locking we hope to allow for multiple active replicates without risking state inconsistencies. Third, investigate ways to keep agents that will interact closer together. In the I-HELP application, keeping a user interface agent, the personal agent and a copy of the matchmaker on a local machine will improve the reliability, availability and performance of the system.
6 Software
Full source code for all applications discussed is available for download from http://www.cs.usask.ca/grads/amf673/RepServer.
References

1. FIPA foundation for intelligent physical agents. http://www.fipa.org/, 2001.
2. FIPA-OS. http://fipa-os.sourceforge.net/, 2001.
3. G. R. Andrews. Multithreaded, Parallel, and Distributed Programming. Addison-Wesley, 2000.
4. A. Avizienis. The n-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, pages 1491–1501, Dec. 1985.
5. K. S. Barber and C. E. Martin. Agent autonomy: Specification, measurement, and dynamic adjustment. In Proceedings of the Autonomy Control Software Workshop at Autonomous Agents 1999, pages 8–15, 1999.
6. T. Connolly and C. Begg. Database Systems: A Practical Approach to Design, Implementation, and Management, Second Edition. Addison-Wesley, 1999.
7. R. Deters. Developing and deploying a multi-agent system. In Proceedings of the Fourth International Conference on Autonomous Agents, 2000.
8. P. Felber. A Service Approach to Object Groups in CORBA. PhD thesis, École Polytechnique Fédérale de Lausanne, 1998.
9. R. Guerraoui and A. Schiper. Software-based replication for fault tolerance. Computer, 30(4):68–74, Apr. 1997.
10. S. Hägg. A sentinel approach to fault handling in multi-agent systems. In Proceedings of the Second Australian Workshop on Distributed AI, in conjunction with the Fourth Pacific Rim International Conference on Artificial Intelligence (PRICAI’96), Cairns, Australia, August 1996.
11. J. E. Hanson and J. O. Kephart. Combatting maelstroms in networks of communicating agents. In Proceedings of the Sixteenth National Conference on Artificial Intelligence and Eleventh Conference on Innovative Applications of Artificial Intelligence. AAAI Press/MIT Press, 1999.
12. J. Musa, A. Iannino and K. Okumoto. Software Reliability, Measurement, Prediction, Application. McGraw Hill Book Company, 1987.
13. M. Klein and C. Dellarocas. Exception handling in agent systems. In O. Etzioni, J. P. Müller, and J. M. Bradshaw, editors, Proceedings of the Third International Conference on Autonomous Agents (Agents’99), pages 62–68, Seattle, WA, USA, 1999. ACM Press.
14. H. J. Levesque, S. Kumar, and P. R. Cohen. The adaptive agent architecture: Achieving fault-tolerance using persistent broker teams. In Proceedings, Fourth International Conference on Multi-Agent Systems, July 2000.
15. J. E. B. Moss. Nested Transactions: An Approach to Reliable Distributed Computing. MIT Press, 1985.
16. S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, 1999.
17. F. B. Schneider. Towards fault-tolerant and secure agentry. In Proceedings of 11th International Workshop of Distributed Algorithms, Sept. 1997.
18. K. Toyama and G. D. Hager. If at first you don’t succeed. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pages 3–9. AAAI Press/MIT Press, 1997.
19. J. Vassileva, G. McCalla, R. Deters, D. Zapata, C. Mudgal, and S. Grant. A multi-agent approach to the design of peer-help environments. In Proceedings of AIED’99, 1999.
20. D. N. Wagner. Liberal order for software agents? An economic analysis. Journal of Artificial Societies and Social Simulation, 3(1), 2000.
An Efficient Compositional Semantics for Natural-Language Database Queries with Arbitrarily-Nested Quantification and Negation

Richard Frost and Pierre Boulos

School of Computer Science, University of Windsor, Ontario, Canada
{richard,boulos}@uwindsor.ca
Abstract. A novel and efficient implementation of a compositional semantics for a small natural-language query processor has been developed. The approach is based on a set-theoretic version of Montague Semantics in which sets that are constructed as part of denotations of negative constructs are represented by enumerating the members of their complements with respect to the universe of discourse. The semantics accommodates arbitrarily-nested quantifiers and various forms of negation, nouns, transitive and intransitive verbs, conjunction, and disjunction. Queries containing the word “is” and passive verb constructs can be evaluated. However, the approach for these two constructs is ad hoc and inefficient. The approach has been implemented in a syntax-directed evaluator, constructed as an executable specification of an attribute grammar, with a user-independent speech-recognition front-end.
1 Introduction
There are many advantages to providing users with natural-language interfaces to data sources. In particular, when speech-recognition technology is used, the ability to phrase queries in some form of pseudo natural language is almost a necessity, as it is very difficult to “speak” a language such as SQL. Natural-language processors can be constructed in two ways: by translation to a formal language such as SQL, or by direct interpretation by an evaluator based on some form of compositional semantics. The latter approach has a number of advantages: 1) information concerning sub-phrases of the query, such as cost and size, can be presented to the user in an intelligible form before the query is processed, 2) for query debugging purposes, the user can ask for the value of sub-phrases to be presented before the whole query is evaluated, and 3) the subset of natural language can be readily extended if the evaluator has a modular structure based on the compositional semantics.

R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 252–267, 2002. © Springer-Verlag Berlin Heidelberg 2002

Richard Montague was one of the first to develop a compositional semantics for English. Montague’s approach is ideally suited as a basis for building natural-language query processors as it is highly orthogonal; the denotations of words and phrases apply in many syntactic contexts. Also, there is a one-to-one correspondence between the rules defining syntactic constructs and the rules stating how the meanings of phrases are constructed from the meanings of their constituents. This correspondence can be readily implemented in a syntax-directed evaluator constructed as an executable specification of an attribute grammar, as discussed in section 8.

However, direct implementation of Montague semantics is impractical, as discussed in section 2. A set-theoretic version of the first-order classical subset of Montague’s approach results in an efficient implementation, as shown in section 4. However, in the set-theoretic semantics, a simplistic conversion of Montague’s treatment of negation does not work, as discussed in section 5. One solution to the problem of negation requires all entities in the universe of discourse to be enumerated and is therefore impractical for large finite universes and useless for infinite universes. As shown in section 6, this problem can be overcome by representing sets that are denoted by constructs involving negation by enumerating the members of their complements with respect to the universe of discourse, and by redefining the set operators to accommodate sets that are represented in this way.

The proposed semantics is orthogonal and highly compositional, and there is a one-to-one correspondence between the rules governing the syntactic structure of queries and sub-queries, and the rules determining how the meanings of queries and sub-queries are computed from the meanings of their components. The denotations of queries, and of any syntactically well-formed component of a query, are entity sets, booleans, or functions in a function space constructed over entity sets and booleans. The semantics can be implemented efficiently in the sense that answers to queries containing arbitrarily-nested quantifiers and constructs involving negation can be computed without having to enumerate all entities in the universe of discourse. The semantics is also highly modular in that no intermediate logical form of a query or sub-query is computed.
The denotation of each syntactically well-formed component of a query is computed independently of the denotations of other components. This differs from many other approaches to handling negation, in which an intermediate logical form is created and then manipulated to either remove negation or minimize reference to the universe of discourse. A number of advantages derive from this modularity: 1) the query processor can be implemented as a syntax-directed evaluator, thereby facilitating construction and modification, 2) information on, or answers to, sub-queries can be presented to users with the advantages discussed earlier, and 3) processing of the query can be easily partitioned and the evaluation of sub-queries can be distributed across a network of data sources.

The new approach is being implemented in a syntax-directed evaluator, constructed as an executable specification of an attribute grammar, with a user-independent speech-recognition front-end. When complete, the system will correctly answer queries such as the following, without having to enumerate all entities in the universe of discourse.

    does every thing that orbits no moon orbit no planet
    no moon that orbits mars orbits phobos
    a non moon orbits sol
    does something orbits no thing
    does sol orbit a non moon
    sol orbits no moon

The approach can also handle queries involving the word “is” and passive verb constructs. For example:

    every thing that orbits no moon and is not a sun spins
    a moon that was discovered by hall does not orbit jupiter
    no moon is a planet or a person
    not every moon that was discovered by hall orbits a planet that is not orbited by phobos

However, the mechanism for dealing with “is” and passive verbs is ad hoc, and in some cases highly inefficient. Current work is directed at developing a method to integrate an orthogonal denotation of the word “is” into an efficient extension of the new semantics.

Other researchers have used Montague Semantics in natural-language processing. A comprehensive treatment of a wide range of negative constructs in natural language has been undertaken by Iwanska [6], who has developed a single polymorphic negation operator which forms part of a set of boolean algebras for thirteen syntactic categories. Iwanska’s work deals with a significantly harder problem than that discussed in this paper. Iwanska was trying to accommodate all forms of negation and automatically create a semantic representation of a sentence given as input in order to determine if subsequent sentences were entailed by it. The problem of efficiently evaluating the semantics of queries with respect to a single interpretation (i.e. data stored in relations) was not an immediate concern. In some ways, this paper can be regarded as describing an approach for the efficient implementation of some of the forms of Iwanska’s operator.

Approaches for accommodating negation in deductive databases have been developed [10], and a great deal of work has been done to deal with negation in logic programming [1,9]. The notion of representing sets by enumeration of the members of their complements, as an efficient mechanism for accommodating negation in a modular higher-order functional semantics, is novel. The relationship of this work to the work of other researchers is discussed further in section 9.
2 Montague Semantics
In the early seventies, Richard Montague [8] developed an approach to the interpretation of natural language in which he claimed that languages such as English are formal languages whose syntax and semantics, at any point in time for an individual speaker, can be precisely defined. Each well-formed syntactic construct denotes a function in a function space constructed over a set of entities, the boolean values True and False, and a set of states, each of which is a pair consisting of a possible world and a point in time. Montague proposed that there is a one-to-one correspondence between the rules defining how composite syntactic constructs are built from simpler forms and the semantic rules defining how the denotations of such composite constructs are computed from the denotations of their constituents. He used a higher-order modal intensional logic as an intermediate form to facilitate explanation of his proposed semantics, but made it clear that this logic was not an essential part of the theory.

Ignoring intensional aspects, which involve more complex apparatus, nouns such as “planet” and intransitive verbs such as “spins” denote predicates over the set of entities, i.e. characteristic functions of type entity → bool (where x → y denotes the type of functions whose input is a value of type x and whose output is of type y). Quantifiers such as “a” and “every” denote higher-order functions of type (entity → bool) → (entity → bool) → bool. For example, the quantifier “every” denotes the function λp λq ∀x (p x) implies (q x). According to the rules proposed by Montague, the phrase “every planet spins” denotes the value of the expression

    (λp λq ∀x (p x) implies (q x)) planet spins
    => (λq ∀x (planet x) implies (q x)) spins
    => ∀x (planet x) implies (spins x)

According to Montague, proper nouns do not denote entities directly. Rather, they denote functions defined in terms of entities. For example, the proper noun “mars” denotes the function λp p the_entity_mars. Constructs of the same syntactic category denote functions of the same semantic type. For example, the phrases “mars” and “every planet”, which are of the same syntactic category and may be exchanged in many contexts without affecting syntactic properties, both denote functions of type (entity → bool) → bool.

Montague’s approach is highly orthogonal. In many cases, words that appear in differing syntactic contexts denote a single polymorphic function, thereby avoiding the need to assign different meanings in these different contexts. For example, the word “and”, which can be used to conjoin nouns, verbs, term-phrases, etc., denotes the polymorphic function λg λf λx (g x) & (f x), which fits in all of these contexts. A comprehensive introduction to Montague’s theory is given in Dowty, Wall and Peters [2].

Owing to the fact that Montague’s approach is compositional and highly orthogonal, and that there is a one-to-one correspondence between the rules defining syntactic constructs and the semantic rules stating how the meanings of phrases are constructed from the meanings of their constituents, the first-order, non-intensional, non-modal part of it is well suited as a basis for the construction of simple natural-language front-ends to database systems.

Unfortunately, Montague Semantics cannot be used directly in a database-query processor without incurring unacceptable inefficiencies. For example, a direct implementation of the denotation of the word “every” above would require all entities in the database application to be examined in order to evaluate the denotation of the phrase “every planet spins”. This problem of inefficiency can be overcome if Montague’s approach is modified so that phrases denoting characteristic functions of sets are implemented as denoting the sets themselves, and all other denotations are modified accordingly. For example, in the modified approach the word “every” denotes the function λs λt s ⊆ t. Before we describe this approach in detail, we introduce a notation that is used throughout the remainder of the paper.
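The efficiency argument above can be made concrete with a small sketch, written in the Miranda notation that the paper introduces in section 3. It contrasts a direct, predicate-based reading of “every” with the set-based reading. The universe of discourse, the predicate names, and the helper imp are assumptions made for illustration; they are not taken from the authors' program.

```miranda
|| Hypothetical universe of discourse and predicates (illustration only)
entities = ["phobos", "deimos", "mars", "earth", "sol", "hall"]

planet_p x = member ["mars", "earth"] x
spins_p x  = member ["mars", "earth", "phobos", "deimos"] x

|| logical implication (the name imp is an assumption)
imp a b = ~a \/ b

|| direct reading of "every": tests every entity in the universe
every_p p q = and [imp (p x) (q x) | x <- entities]

|| set-based reading: examines only the members of the first set
every_s s t = and [member t x | x <- s]

answer_p = every_p planet_p spins_p
answer_s = every_s ["mars", "earth"] ["mars", "earth", "phobos", "deimos"]
```

Both answers agree, but every_p visits all six entities, whereas every_s visits only the two planets.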
3 Notation
Rather than define the set-theoretic semantics in the notation of lambda calculus, we have chosen to use the notation of the higher-order functional programming language Miranda [12], which is closely related to the lambda calculus. This enables definitions of denotations to be readily checked for syntax and type errors and, as shown in sections 4 to 7, illustrates the ease with which the semantics can be implemented. It should be noted that the proposed semantics can be implemented in any programming language. However, it is particularly straightforward in languages that support the partial application of higher-order functions, as discussed below. This use of a higher-order functional language to define denotations is consistent with the Scott-Strachey approach to the denotational semantics of programming languages [11]. The following explains the notation of Miranda:

– f = e defines f to be a constant-valued function equal to the expression e.
– f a1 ... an = e can be loosely read as defining f to be a function of n arguments whose value is the expression e. However, Miranda is a fully higher-order language: functions can be passed as parameters and returned as results. Every function of two or more arguments is actually a higher-order function, and the correct reading of f a1 ... an = e is that it defines f to be a higher-order function which, when partially applied to input i, returns a function f' a2 ... an = e', where e' is e with i substituted for a1.
– The notation for function application is simply juxtaposition, as in f x rather than f (x).
– Function application is left associative. For example, f x y is parsed as (f x) y, meaning that the result of applying f to x is a function which is then applied to y. Round brackets are used to override the left-associative order of function application. For example, the evaluation of f (x y) requires x to be applied to y, and then f to be applied to the result.
– Round brackets with commas are used to create tuples, e.g. (x, y) is a binary tuple. Square brackets and commas are used to create lists, e.g. [x, y, z].
– t1 → t2 is the type of functions whose input is of type t1 and whose output is of type t2.

In addition, when defining denotations in Miranda, we use d_w to represent the denotation of the word w. In cases where a word w may have more than one denotation depending on the syntactic use, we extend d_w to make it clear which denotation applies. Also, we use "s" to represent the entity associated with the proper noun s. We use <<wp>> to represent the denotation of the word or phrase wp in non-code text, and e1 => e2 to indicate that e2 is an evaluated form of e1 that has been obtained by applying one or more of Miranda’s evaluation rules.
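The points about partial application and left-associative application can be illustrated with a short sketch. The functions add3, add_ten and double are hypothetical and serve only to show the notation.

```miranda
|| add3 is a hypothetical three-argument function used only for illustration
add3 :: num -> num -> num -> num
add3 x y z = x + y + z

|| partial application: add3 10 is itself a function of two arguments
add_ten :: num -> num -> num
add_ten = add3 10

|| juxtaposition is left associative: add3 10 3 4 parses as ((add3 10) 3) 4
seventeen      = add3 10 3 4
also_seventeen = add_ten 3 4

|| round brackets override the default grouping
double x    = 2 * x
twenty_four = double (add_ten 1 1)
```

The same mechanism is what allows a quantifier applied to a noun to yield a function that is later applied to a verb phrase.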
4 An Implementation of a Set-Theoretic Version of a Small Sub-set of Montague Semantics
As discussed above, according to Montague, nouns and intransitive verbs denote predicates over the set of entities. In the modified semantics, nouns and intransitive verbs denote sets of entities. These sets can be represented by lists in Miranda:

    d_moon   = ["deimos", "phobos"]
    d_planet = ["earth", "mars"]
    d_spins  = ["mars", "earth", "phobos", "deimos"]

In the modified semantics, proper nouns (names) are implemented as functions which take a list as input and which return the boolean value True if the list contains the entity related to the proper noun, and False otherwise. For example, assuming that the function member has been defined appropriately:

    d_sol s   = member s "sol"
    d_mars s  = member s "mars"
    d_earth s = member s "earth"
    etc.

Accordingly, <<mars spins>> => True, owing to the fact that application of d_mars to d_spins returns the value True because "mars" is a member of the list denoted by d_spins.

Quantifiers are implemented as higher-order functions which take a list as input and which return a function of type list → bool as output. For example, assuming that the functions subset and intersection have been defined appropriately:

    d_every s t = subset s t
    d_a s t     = intersection s t ~= []
    d_no s t    = intersection s t = []

Accordingly, <<every moon spins>> => True, owing to the fact that partial application of the higher-order function d_every to d_moon returns the function f such that f t = subset ["deimos", "phobos"] t. Application of f to d_spins returns True because ["deimos", "phobos"] is a subset of ["mars", "earth", "phobos", "deimos"].
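The text assumes that subset and intersection “have been defined appropriately” without giving them. One possible pair of definitions is sketched below; they are assumptions rather than the authors' own code, and they rely on member from Miranda's standard environment, which takes the list first and the element second.

```miranda
|| Possible helper definitions (assumed, not taken from the paper)
subset :: [*] -> [*] -> bool
subset s t = and [member t x | x <- s]

intersection :: [*] -> [*] -> [*]
intersection s t = [x | x <- s; member t x]

|| Denotations as given in the text
d_moon  = ["deimos", "phobos"]
d_spins = ["mars", "earth", "phobos", "deimos"]
d_every s t = subset s t
d_no s t    = intersection s t = []

every_moon_spins = d_every d_moon d_spins
no_moon_spins    = d_no d_moon d_spins
```

With these helpers, every_moon_spins is True and no_moon_spins is False, matching the reading of the two quantifiers.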
In the modified semantics, transitive verbs do not denote relations directly, though each is associated with one particular relation. A transitive verb is implemented as a function whose argument is a predicate on sets. When it is applied to a particular predicate, it returns a set of entities as result. An entity is in the result set if the predicate is true of the entity’s image under the associated relation. For example:

    d_orbits p = [x | (x,image_x) <- collect orbit_rel; p image_x]

The definition of d_orbits uses a Miranda programming construct called a list comprehension. List comprehensions give a concise syntax for a general class of iterations over lists. The syntax is adapted from an analogous notation used in set theory called a “set comprehension”. The general form of a list comprehension is:

    [ body | qualifiers ]

where each qualifier is either a generator, of the form var <- exp, or else a filter, which is a boolean expression used to restrict the ranges of the variables introduced by the generators. When two or more qualifiers are present they are separated by semicolons. The following is a simple example of the use of a list comprehension in the definition of a function which takes a number and returns a list of all its factors.

    factors n = [ i | i <- [1..n]; n mod i = 0 ]

The collect function used in the definition of the denotations of transitive verbs can be defined using recursion and two list comprehensions:

    collect [] = []
    collect ((x,y):t) = (x, y:[e2 | (e1, e2) <- t; e1 = x])
                          : collect [(e1, e2) | (e1, e2) <- t; e1 ~= x]

Application of collect to the relation orbit_rel returns the following result, in which entities are paired with their image under the relation orbit_rel:

    [("deimos", ["mars"]), ("phobos", ["mars"]),
     ("mars", ["sol"]), ("earth", ["sol"])]

In general, the images returned by collect could be lists of more than one element. Examples of the use of d_orbits are given below.
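As a small worked illustration of the section 4 definitions, the following sketch assembles them into one script and evaluates a few phrases involving “orbits”. The helper definitions of subset and intersection and the names of the three results are assumptions; collect is written with a where clause but computes the same pairs as the definition in the text.

```miranda
orbit_rel = [("deimos","mars"), ("phobos","mars"), ("mars","sol"), ("earth","sol")]

collect [] = []
collect ((x,y):t) = (x, y:same) : collect rest
                    where
                    same = [e2 | (e1,e2) <- t; e1 = x]
                    rest = [(e1,e2) | (e1,e2) <- t; e1 ~= x]

d_orbits p = [x | (x,image_x) <- collect orbit_rel; p image_x]

|| assumed helpers
subset s t       = and [member t x | x <- s]
intersection s t = [x | x <- s; member t x]

d_moon   = ["deimos", "phobos"]
d_planet = ["earth", "mars"]
d_mars s = member s "mars"
d_every s t = subset s t
d_no s t    = intersection s t = []

things_orbiting_mars   = d_orbits d_mars
every_moon_orbits_mars = d_every d_moon (d_orbits d_mars)
no_planet_orbits_mars  = d_no d_planet (d_orbits d_mars)
```

Here d_orbits d_mars selects the entities whose image contains "mars", so both quantified phrases evaluate to True.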
5 A Problem with Negation
A problem with the modified semantics is that the denotation of the word “no” only works in some syntactic contexts, and fails in others, as illustrated below. Note that in these examples, round brackets have been introduced to ensure the correct binding of function application when the denotations of words are used to compute the denotation of the whole phrase. We discuss later how the order of function application can be readily built into a syntax-directed evaluator constructed as an executable attribute grammar. In the following examples, we have also chosen to expand the denotations in an order which makes for easier reading. In all cases, the order respects the binding of function application but not necessarily the order in which Miranda’s “lazy” evaluator would compute the answer.

The denotation of “no moon orbits sol” is given by the following. It is correct with respect to the denotations of “moon”, “orbits”, “sol”, etc. given earlier:

    <<no moon orbits sol>>
    => (<<no>> <<moon>>) (<<orbits>> <<sol>>)
    => (d_no d_moon) (d_orbits d_sol)
    => (d_no d_moon) [x | (x,image_x) <- collect orbit_rel; d_sol image_x]
    => (d_no d_moon) ["mars", "earth"]
    => intersection ["deimos", "phobos"] ["mars", "earth"] = []
    => True

The denotation of “sol orbits no moon” is:

    <<sol orbits no moon>>
    => <<sol>> (<<orbits>> (<<no>> <<moon>>))
    => d_sol (d_orbits (d_no d_moon))
    => d_sol [x | (x,image_x) <- collect orbit_rel; (d_no d_moon) image_x]
    => d_sol [x | (x,image_x) <- collect orbit_rel;
                  (intersection d_moon image_x) = []]
    => d_sol [x | (x,image_x) <- collect orbit_rel;
                  (intersection ["deimos", "phobos"] image_x) = []]
    => d_sol ["deimos", "phobos", "mars", "earth"]
    => member ["deimos", "phobos", "mars", "earth"] "sol"
    => False

This is not the expected answer. The reason for the failure is that when collect is applied to orbit_rel it generates the following relation:

    [("deimos", ["mars"]), ("phobos", ["mars"]),
     ("mars", ["sol"]), ("earth", ["sol"])]

Owing to the fact that the images of "deimos", "phobos", "earth" and "mars" have empty intersections with the list ["deimos", "phobos"], the meaning of the sub-expression “orbits no moon” is computed to be ["deimos", "phobos", "mars", "earth"]. This list does not include "sol", and consequently the evaluation of <<sol orbits no moon>> returns the incorrect result False. The underlying cause is that the relation orbit_rel does not mention those entities that do not orbit anything. The collect function therefore cannot identify and return those entities whose images are empty under the associated relation.

A possible solution is to modify collect so that it generates a pair for every entity in the universe of discourse. For example, if the universe of discourse contains the entities "phobos", "deimos", "mars", "earth", "sol", "hall", and no others, then collect applied to orbit_rel as defined earlier would return the following relation:

    [("deimos", ["mars"]), ("phobos", ["mars"]), ("mars", ["sol"]),
     ("earth", ["sol"]), ("sol", []), ("hall", [])]

This solves the problem with respect to the denotation of “no”, but reintroduces an inefficiency whenever a denotation of a transitive verb is used, irrespective of whether or not negation is involved. The worst-case complexity is O(es^tvs), where es is the number of entities in the universe of discourse and tvs is the maximum depth of nested transitive verbs used in conjunction with the quantifier “no”. This inefficiency is not as bad as with a direct implementation of Montague semantics, which results in exponential behavior with respect to the combined depth of nested quantifiers and transitive verbs, but it is still impractical for all but those applications with a relatively small number of entities.
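The “possible solution” described above is given in prose only. A minimal sketch of it follows, assuming a hypothetical list all_entities and a variant collect_all that pairs every entity with its possibly empty image. The names collect_all and d_orbits_all are illustrative and do not appear in the paper, and the order of the pairs produced follows all_entities rather than the relation.

```miranda
|| assumed universe of discourse
all_entities = ["phobos", "deimos", "mars", "earth", "sol", "hall"]

orbit_rel = [("deimos","mars"), ("phobos","mars"), ("mars","sol"), ("earth","sol")]

|| pairs EVERY entity with its image, which may be the empty list
collect_all r = [(e, [y | (x,y) <- r; x = e]) | e <- all_entities]

d_orbits_all p = [x | (x,image_x) <- collect_all orbit_rel; p image_x]

intersection s t = [x | x <- s; member t x]

d_moon   = ["deimos", "phobos"]
d_sol s  = member s "sol"
d_no s t = intersection s t = []

sol_orbits_no_moon = d_sol (d_orbits_all (d_no d_moon))
```

The phrase now evaluates to True, but every use of a transitive verb walks the whole universe, which is the inefficiency the text goes on to quantify.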
6 A Solution – Represent “Negative” Sets by Enumerating the Elements of Their Complements
The proposed method for accommodating negation is based on the notion that a set can be represented in two ways: explicitly by enumerating all of its members, or implicitly by enumerating all of the members of its complement. In cases where a set is computed as the denotation of a phrase that involves a negation, it is often more efficient to represent it using its complement. To implement this approach, we need to introduce a new type set, which can be defined in Miranda as follows, where [string] is the type list of strings and string is a synonym for the type list of characters:

    set ::= SET [string] | COMP [string]

The following are two examples of objects of type set. The first example represents the set whose members are "phobos" and "deimos". The second example denotes the set of all entities in the universe of discourse except "phobos" and "deimos", i.e. the set of “non moons”.

    SET ["phobos", "deimos"]
    COMP ["phobos", "deimos"]

To determine the cardinality of a set we define the function cardinality in terms of the cardinality of the set of all entities in the universe of discourse, where # computes the length of a list, and all_entities denotes the set of all entities in the universe of discourse.

    cardinality (SET s)  = #s
    cardinality (COMP s) = #all_entities - (#s)

Operators on sets are redefined as follows:

    c_member (SET s) e  = member s e
    c_member (COMP s) e = ~(member s e)

    c_union (SET s)  (SET t)  = SET (union s t)
    c_union (SET s)  (COMP t) = COMP (t -- s)
    c_union (COMP s) (SET t)  = COMP (s -- t)
    c_union (COMP s) (COMP t) = COMP (intersection s t)

    c_intersection (SET s)  (SET t)  = SET (intersection s t)
    c_intersection (SET s)  (COMP t) = SET (s -- t)
    c_intersection (COMP s) (SET t)  = SET (t -- s)
    c_intersection (COMP s) (COMP t) = COMP (union s t)

    c_subset (SET s)  (SET t)  = subset s t
    c_subset (SET s)  (COMP t) = (t -- s) = t
    c_subset (COMP s) (SET t)  = subset (all_entities -- s) t
    c_subset (COMP s) (COMP t) = subset t s

Here ++ is list addition, -- is list subtraction, and union is defined as follows:

    union as bs = as ++ (bs -- as)

As shown later, evaluation of the denotation of the phrase “non moon that spins” would result in the following operation:

    c_intersection (COMP ["phobos", "deimos"])
                   (SET ["mars", "earth", "phobos", "deimos"])
    => SET ["mars", "earth"]

In only one case in the above definitions, i.e. the third line of the definition of c_subset, is it necessary to refer to the set of all entities. This is where we need to determine if a set represented by an enumeration of the members of its complement is a subset of a set that is represented by an explicit enumeration of its members. (This computation occurs in the evaluation of the denotation of phrases such as “every thing that orbits no moon spins”.) Fortunately, this part of the definition can be replaced by the following, which refers only to the cardinality of the set of all entities and not to the entities themselves:

    c_subset (COMP s) (SET t) = (#(union t s) = #all_entities)
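The equivalence of the two formulations of the COMP/SET case of c_subset can be checked on small inputs. The sketch below assumes a hypothetical six-entity universe and duplicate-free lists; the function names subset_by_enumeration and subset_by_counting are illustrative only. It also reproduces the “non moon that spins” computation.

```miranda
string == [char]
set ::= SET [string] | COMP [string]

|| assumed universe of discourse
all_entities = ["phobos", "deimos", "mars", "earth", "sol", "hall"]

intersection s t = [x | x <- s; member t x]
union s t        = s ++ (t -- s)
subset s t       = and [member t x | x <- s]

c_intersection (SET s)  (SET t)  = SET (intersection s t)
c_intersection (SET s)  (COMP t) = SET (s -- t)
c_intersection (COMP s) (SET t)  = SET (t -- s)
c_intersection (COMP s) (COMP t) = COMP (union s t)

|| the two versions of the (COMP s) (SET t) case of c_subset
subset_by_enumeration s t = subset (all_entities -- s) t
subset_by_counting s t    = (#(union t s)) = (#all_entities)

non_moon_that_spins
  = c_intersection (COMP ["phobos", "deimos"])
                   (SET ["mars", "earth", "phobos", "deimos"])
```

The counting version never inspects the members of the universe, only its size, which is what keeps the revised semantics independent of the number of entities.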
Redefinition of nouns and quantifiers is straightforward:

    d_moon   = SET ["deimos", "phobos"]
    d_planet = SET ["earth", "mars"]
    d_spins  = SET ["earth", "deimos", "mars", "phobos"]
    d_thing  = COMP []

    d_sol s  = c_member s "sol"
    d_mars s = c_member s "mars"

    d_every s t = c_subset s t
    d_a s t     = cardinality (c_intersection s t) > 0
    d_no s t    = cardinality (c_intersection s t) = 0

The denotation of each transitive verb is redefined so that it begins by applying the predicate given as argument to the empty set, representing the empty image of all entities that do not appear on the left-hand side of the associated relation. If the predicate succeeds, then the result is returned in the form of a complement. If the predicate fails, the result returned is the same as that returned by the original definition of the denotation of the transitive verb. For example, the modified definition of the denotation of “orbits” is:

    d_orbits p = COMP (firsts_of orbit_rel -- result), if p (SET []) = True
               = SET result, otherwise
                 where
                 result = [x | (x,image_x) <- collect orbit_rel; p image_x]
                 firsts_of [] = []
                 firsts_of ((x,y):rest)
                   = x : firsts_of [(a,b) | (a,b) <- rest; a ~= x]

    orbit_rel = [("deimos", "mars"), ("phobos", "mars"),
                 ("mars", "sol"), ("earth", "sol")]

    collect [] = []
    collect ((x,y):t) = (x, SET (y:[e2 | (e1, e2) <- t; e1 = x]))
                          : collect [(e1, e2) | (e1, e2) <- t; e1 ~= x]

In the revised semantics, the denotation of “orbits no moon” is:

    <<orbits no moon>>
    => <<orbits>> (<<no>> <<moon>>)
    => d_orbits (d_no d_moon)
    => COMP (firsts_of orbit_rel -- result)
       where result = [x | (x,image_x) <- collect orbit_rel;
                           (d_no d_moon image_x)]
    => COMP (firsts_of orbit_rel -- result)
       where result = [x | (x,image_x) <- collect orbit_rel;
                           cardinality (c_intersection
                             (SET ["deimos", "phobos"]) image_x) = 0]
    => COMP (["deimos", "phobos", "mars", "earth"] --
             ["deimos", "phobos", "earth", "mars"])
    => COMP []

This means that everything in the universe of discourse “orbits no moon”. Evaluation of “sol orbits no moon” now returns the expected answer:

    <<sol>> <<orbits no moon>>
    => d_sol (COMP [])              from above
    => c_member (COMP []) "sol"
    => True

In order to simplify the coding of denotations of transitive verbs, the common parts of such definitions can be abstracted into a higher-order function make_denotation_of_tv defined as follows:

    make_denotation_of_tv r p
      = COMP (firsts_of r -- result), if p (SET []) = True
      = SET result, otherwise
        where
        result = [x | (x,image_x) <- collect r; p image_x]
        firsts_of [] = []
        firsts_of ((x,y):rest)
          = x : firsts_of [(a,b) | (a,b) <- rest; a ~= x]

This function can now be used to define the denotations of various transitive verbs. For example:

    d_orbits     = make_denotation_of_tv orbit_rel
    d_discovered = make_denotation_of_tv discover_rel
    discover_rel = [("hall", "phobos"), ("hall", "deimos")]
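The pieces of this section can be put together in a single script. The following is a sketch under stated assumptions: the universe all_entities is hypothetical, the helper intersection is not given in the paper, and firsts_of is written here in terms of collect rather than with the recursive definition in the text (it yields the same list of distinct left-hand entities).

```miranda
string == [char]
set ::= SET [string] | COMP [string]

|| assumed universe; only its size is used below
all_entities = ["phobos", "deimos", "mars", "earth", "sol", "hall"]

intersection s t = [x | x <- s; member t x]
union s t        = s ++ (t -- s)

cardinality (SET s)  = #s
cardinality (COMP s) = #all_entities - (#s)

c_member (SET s) e  = member s e
c_member (COMP s) e = ~(member s e)

c_intersection (SET s)  (SET t)  = SET (intersection s t)
c_intersection (SET s)  (COMP t) = SET (s -- t)
c_intersection (COMP s) (SET t)  = SET (t -- s)
c_intersection (COMP s) (COMP t) = COMP (union s t)

collect [] = []
collect ((x,y):t) = (x, SET (y:same)) : collect rest
                    where
                    same = [e2 | (e1,e2) <- t; e1 = x]
                    rest = [(e1,e2) | (e1,e2) <- t; e1 ~= x]

|| distinct left-hand entities of a relation (variant of the paper's firsts_of)
firsts_of r = [x | (x,image_x) <- collect r]

make_denotation_of_tv r p = COMP (firsts_of r -- result), if p (SET [])
                          = SET result, otherwise
                            where
                            result = [x | (x,image_x) <- collect r; p image_x]

orbit_rel = [("deimos","mars"), ("phobos","mars"), ("mars","sol"), ("earth","sol")]
d_orbits  = make_denotation_of_tv orbit_rel

d_moon   = SET ["deimos", "phobos"]
d_sol s  = c_member s "sol"
d_a s t  = cardinality (c_intersection s t) > 0
d_no s t = cardinality (c_intersection s t) = 0

orbits_no_moon       = d_orbits (d_no d_moon)
sol_orbits_no_moon   = d_sol orbits_no_moon
no_moon_orbits_sol   = d_no d_moon (d_orbits d_sol)
a_moon_orbits_a_moon = d_a d_moon (d_orbits (d_a d_moon))
```

Note that all_entities is consulted only through its length in cardinality, so none of the four example evaluations enumerates the universe.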
7 The Denotation of Other Words and Constructs Commonly Used in Database Queries
The revised, more efficient, semantics results in some loss of polymorphism and thereby a reduced orthogonality. For example, the word “and” now needs different denotations when used in different syntactic contexts such as “mars and venus”, and “planet and moon”.

    d_and_pn s t  = g where g x = s x & t x
                    || when used to conjoin proper nouns
    d_and_niv s t = c_intersection s t
                    || when used to conjoin nouns and verb phrases
    d_and_tv s t  = g where g x = c_intersection (s x) (t x)
                    || when used to conjoin transitive verbs

The use of multiple denotations, though not an attractive solution, can be readily implemented by assigning different syntactic categories to the different uses of the word “and”. The different meanings can then be assigned by the syntax-directed evaluator, as discussed later. The word “or” can be treated in a similar manner.

We define the denotation of the word “that”, when used in constructs such as “moon that orbits mars”, to be equivalent to the denotation of the word “and” when used to conjoin common nouns and verb phrases.

    d_that = d_and_niv

The denotation of “non” is straightforward:

    d_non (SET s)  = COMP s
    d_non (COMP s) = SET s

The word “not” has two denotations depending on the syntactic context: d_not_tp is the denotation when “not” is used to qualify proper nouns, as in “not mars”, and d_not_niv is the denotation when “not” is used to qualify intransitive verbs, as in “not spin”.

    d_not_tp tp = n_tp where n_tp s = ~(tp s)
    d_not_niv   = d_non

The denotations of “who” and “which” are straightforward:

    d_which s t = c_intersection s t
    d_who       = d_which
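The text states that “or” can be treated in a similar manner to “and” but gives no definitions. One way this might look is sketched below, mirroring the three-way split used for “and”. The names d_or_pn, d_or_niv and d_or_tv are assumptions, as is the choice of \/ and c_union as the disjunctive counterparts.

```miranda
string == [char]
set ::= SET [string] | COMP [string]

intersection s t = [x | x <- s; member t x]
union s t        = s ++ (t -- s)

c_member (SET s) e  = member s e
c_member (COMP s) e = ~(member s e)

c_union (SET s)  (SET t)  = SET (union s t)
c_union (SET s)  (COMP t) = COMP (t -- s)
c_union (COMP s) (SET t)  = COMP (s -- t)
c_union (COMP s) (COMP t) = COMP (intersection s t)

|| hypothetical denotations of "or", one per syntactic context
d_or_pn s t  = g where g x = s x \/ t x            || proper nouns
d_or_niv s t = c_union s t                         || nouns and verb phrases
d_or_tv s t  = g where g x = c_union (s x) (t x)   || transitive verbs

d_sol s  = c_member s "sol"
d_mars s = c_member s "mars"

moon_or_planet    = d_or_niv (SET ["deimos", "phobos"]) (SET ["earth", "mars"])
sol_or_mars_spins = d_or_pn d_sol d_mars (SET ["earth", "mars"])
```

As with “and”, the evaluator would need a separate syntactic category for each use so that the right denotation is selected.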
Example applications of these denotations:

    d_which (d_non d_moon) d_spins
    => SET ["mars", "earth"]

    d_which (d_non d_moon) (d_not_niv d_spins)
    => COMP ["deimos", "phobos", "mars", "earth"]

A simplistic approach for dealing with the denotation of passive verb constructs is to treat phrases of the form “is verb by” as atomic. That is, such phrases have a single non-divisible denotation which is obtained by inverting the associated relation. For example:

    d_is_orbited_by = make_denotation_of_tv (invert orbit_rel)

This approach is not ideal as it does not accommodate phrases such as “was discovered and named by”.

Developing an efficient denotation for the word “is” is problematic within the functional semantics that we have developed. This is a consequence of the fact that in a purely functional framework it is impossible to “look inside” a function. It is therefore impossible to determine, for example, that "mars" is the entity associated with the definition of <<mars>> without actually applying this denotation. Therefore the value of the phrase “mars is mars” can only be determined by some application of the denotation of “mars”. One approach is to apply <<mars>> to all entities in the universe of discourse and to return a set containing the single entity "mars". The resulting inefficient denotation for “is” can be defined as follows:

    d_is p = SET [e | e <- es; p (SET [e])]
             where (SET es) = all_entities

The denotation of “mars is mars” is as follows:

    <<mars is mars>>
    => <<mars>> (<<is>> <<mars>>)
    => <<mars>> (SET [e | e <- es; d_mars (SET [e])])
       where (SET es) = all_entities
    => <<mars>> (SET ["mars"])
    => c_member (SET ["mars"]) "mars"
    => True

This denotation of “is” also works in phrases such as “mars is a planet” and “no moon is mars”. However, it is very inefficient even with applications involving a small number of entities, and is useless for applications with an infinite universe of discourse.

A Miranda program has been constructed that includes all of the definitions given in this paper. The program can answer questions such as the following:

    every (thing $that (orbits (no moon))) (orbits (no planet))
    no (moon $that (orbits mars)) (orbits phobos)
    a (non moon) (orbits sol)
    something (orbits (no thing))
    sol (orbits (a (non moon)))
    sol (orbits (no moon))
    not (every moon) (is_orbited_by phobos)
    every (thing $that (orbits (no moon)) $and2 (is (not1 (a person)))
           $and2 (is (not1 (a sun)))) spins
    a (moon $that (was_discovered_by hall)) (does (not2 (orbit earth)))
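The passive-verb denotation earlier in this section relies on a function invert whose definition is not shown. A plausible one-line definition, offered as an assumption rather than as the authors' code, swaps the components of every pair in a relation:

```miranda
|| assumed definition: swap the two components of each pair
invert :: [(*,**)] -> [(**,*)]
invert r = [(y, x) | (x, y) <- r]

orbit_rel = [("deimos","mars"), ("phobos","mars"), ("mars","sol"), ("earth","sol")]

orbited_by_rel = invert orbit_rel
```

Passing orbited_by_rel to make_denotation_of_tv would then pair "mars" with its moons and "sol" with its planets, which is the image needed for “is orbited by”.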
8 Embedding the Modified Semantics in a Syntax-Directed Evaluator with a Speech-Recognition Front-End
Owing to the one-to-one correspondence between the syntax and semantic rules, the proposed semantics can be readily implemented in a syntax-directed evaluator. This has been done and a query processor, which is constructed as an executable specification of an attribute grammar, has been added to a “Speechweb” of hyperlinked distributed language processors which are accessible through remote natural-language speech interfaces [3]. The approach to negation is currently being added to this system. A text-based interface to the system is also available and can be accessed at: www.cs.uwindsor.ca/users/r/richard/miranda/wage_demo.html
9 Relationship to other Work
Higher-order functional programming has been used before in the construction of natural-language processors based on Montague Semantics. For example, Frost and Launchbury [5] illustrated the ease with which a set-theoretic version of a small first-order subset of Montague Semantics can be implemented in Miranda, and Lapalme and Lavier [7] have shown how a larger part of Montague Semantics can be implemented in a pure higher-order functional programming language.

Iwanska [6] has developed a comprehensive semantics for a wide range of negative constructs in natural language, and has built a system called UNO based on this semantics. UNO takes natural-language statements as input and can subsequently determine if they entail other natural-language statements. UNO is not a query processor and does not return answers to queries but rather converts natural-language statements to a logical form in which the denotations of phrases involving negation are determined through application of operators from a boolean algebra. Iwanska indicated that the denotations of certain constructs are set complements but did not develop a calculus for efficiently evaluating queries with respect to a single interpretation in the form of a relational database.

Reiter [10] has developed a method for dealing with negation in queries expressed in a form of relational calculus, when evaluated with respect to a Horn-clause knowledge base under the closed-world assumption. In Reiter’s approach, the closed-world evaluation of arbitrary queries is reduced to the open-world evaluation of a set of atomic queries. However, this approach requires all entities in the universe of discourse to be enumerated when a negative atomic query is evaluated. Reiter was not concerned with the evaluation of natural-language queries but rather with the evaluation of relational queries with respect to a deductive database consisting of data and rules expressed as Horn clauses.
Much work has been carried out on the concept of negation and on the concept of “negation as failure to prove” [1] in logic programming. The representation
of sets by use of their complements was first suggested by Naish [9] in the context of logic programming. However, that work was more concerned with the management of negation within the context of the SLD resolution computation mechanism, and in particular delays the evaluation of “not” until it is ground.
Acknowledgements The authors would like to thank Mr. Stephen Karamatos and Dr. Walid Saba for their help with this work.
References

1. Clark, K. L. (1978) Negation as Failure. In H. Gallaire and J. Minker (editors), Logic and Databases. New York: Plenum Press.
2. Dowty, D. R., Wall, R. E. and Peters, S. (1981) Introduction to Montague Semantics. D. Reidel Publishing Company, Dordrecht, Boston, Lancaster, Tokyo.
3. Frost, R. A. and Chitte, S. (1999) A new approach for providing natural-language speech access to large knowledge bases. Proceedings of the Pacific Association of Computational Linguistics Conference PACLING ‘99, University of Waterloo, August 1999, 82–89.
4. Frost, R. A. (1995) W/AGE The Windsor Attribute Grammar Programming Environment. Schloss Dagstuhl International Workshop on Functional Programming in the Real World.
5. Frost, R. A. and Launchbury, E. J. (1989) Constructing natural language interpreters in a lazy functional language. The Computer Journal, Special edition on Lazy Functional Programming, 32(2) 108–121.
6. Iwanska, L. (1992) A General Semantic Model of Negation in Natural Language: Representation and Inference. Doctoral Thesis, Computer Science, University of Illinois at Urbana-Champaign.
7. Lapalme, G. and Lavier, F. (1990) Using a functional language for parsing and semantic processing. Publication 715a, Departement d’informatique et recherche operationelle, Universite de Montreal.
8. Montague, R. (1974) In Formal Philosophy: Selected Papers of Richard Montague, edited by R. H. Thomason. Yale University Press, New Haven CT.
9. Naish, L. (1986) Negation and Control in Prolog. Lecture Notes in Computer Science 238. Springer-Verlag.
10. Reiter, R. (1978) Deductive question-answering in relational databases. In H. Gallaire and J. Minker (editors), Logic and Databases. New York: Plenum Press.
11. Stoy, J. E. (1977) Denotational Semantics: The Scott-Strachey Approach to Programming Language Theory. MIT Press, Cambridge (Mass.)
12. Turner, D. (1985) A lazy functional programming language with polymorphic types. Proc. IFIP Int. Conf. on Functional Programming Languages and Computer Architecture, Nancy, France. Springer Lecture Notes 201.
Text Summarization as Controlled Search Terry Copeck, Nathalie Japkowicz, and Stan Szpakowicz School of Information Technology & Engineering University of Ottawa, Ontario, Canada {terry,nat,szpak}@site.uottawa.ca
Abstract. We present a framework for text summarization based on the generate-and-test model. A large set of summaries is generated for all plausible values of six parameters that control a three-stage process that includes segmentation and keyphrase extraction, and a number of features that characterize the document. Quality is assessed by measuring the summaries against the abstract of the summarized document. The large number of summaries produced for our corpus dictates automated validation and fine-tuning of the summary generator. We use supervised machine learning to detect good and bad parameters. In particular, we identify parameters and ranges of their values within which the summary generator might be used with high reliability on documents for which no author's abstract exists.
1 Introduction
Text summarization consists of reducing a document to a smaller précis. Its goal is to include in that précis the most important facts in the document. Summarization can be performed by constructing a new document from elements not necessarily present in the original. Alternatively it can be done by extracting from a text those elements, usually sentences, best suited for inclusion. Our work takes the second approach.

This paper describes a partially automated search engine for summary building. The engine has two components. A Summary Generator produces a summary of a text as directed by a variety of parameter settings. A Generation Controller evaluates summaries of documents in a training corpus in order to determine which parameter settings are best to use in generation. While summary generation is fully automated, the controller was applied in a semiautomatic way. We plan to automate the entire system in the near future.

Summary generation in our system uses publicly-available text processing programs to perform each of its three main stages. More than one program is available for each stage. This alone produces 18 possible combinations of programs. Because many of the programs accept or require additional parameters themselves, evaluating the set of possible inputs manually is not feasible.

Generation control is achieved through supervised machine learning that associates the summary generation parameters with a measure of the generated summary’s quality. Training data consist of a set of features characterizing the document together with the parameter settings used to generate a summary and its rating. Features used to characterize a document are counts of its characters, words, sentences and paragraphs and other syntactic elements, such as connectives and proper nouns. We also count how many bigrams, content bigrams and content phrases occur once, twice, three times, and four or more times. A list appears in Section 4.2.

R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 268-280, 2002. Springer-Verlag Berlin Heidelberg 2002
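[Figure 1: block diagram. The text to be summarized and a set of parameter values enter the SUMMARY GENERATOR (segmenters: Segmenter, TextTiling, C99; keyphrase extractors: Kea, Extractor, NPSeeker; matching: Exact, Stem). The summary and its parameter values, with the author abstract and document features, pass to the GENERATION CONTROLLER, where the Summary Assessor produces a rating and the Parameter Assessor returns the best parameters through a future feedback loop.]

Fig. 1. Overall System Architecture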
Our summary is rated against the abstract, which we treat as the gold standard summary because a) the author of a document is well-suited to produce a summary of its contents; and b) for many documents the abstract is the only item approximating a summary which is available. There are few options.
2 Overall Architecture
The overall architecture of the system appears in Figure 1. The Summary Generator, detailed in Figure 2, takes the document to be summarized and one or more sets of parameter values and outputs one summary per set. These are fed into the Generation Controller along with the parameter values used to produce them and features of the text summarized. The Generation Controller has two subcomponents. The Summary Assessor rates each summary with respect to the document abstract. The Parameter Assessor takes all the sets of parameter values accumulated during summary generation together with the summary ratings and uses the ratings to determine the best and worst combinations of parameters.
[Figure 2: the text passes through three stages, SEGMENTING (Segmenter, TextTiling, C99), KEYPHRASE EXTRACTION (Kea, Extractor, NPSeeker) and MATCHING (Exact, Stem), yielding ranked segments and then the summary.]

Fig. 2. Summary Generator Architecture
3 The Summary Generator
We generate summaries on the assumption that a summary composed of adjacent, thematically related sentences should convey more information than one whose sentences are chosen independently and likely bear no relation to their neighbors.

Let’s first discuss generation methodology. A text to be summarized is broken down into a series of segments, sequences of sentences which talk about the same topic. Segments are then rated by identifying keyphrases in the document and ranking each segment on its overall number of matches for these phrases. The requested number of sentences is then selected from one or more of the top-ranked segments and presented to the user in document order.

Lacking domain semantic knowledge or corpus statistics, document analysis based on segments and key phrases at least takes advantage of the statistically-motivated shallow semantics used in segmenters and keyphrase extractors. To see if our general assumption holds regardless of the algorithm applied at each of the three main stages in summary generation, we designed a system with alternative modules for each stage and, where possible, used for these tasks programs which are well known to the research community because they are in the public domain or made freely available.
Three text segmenters were used: Segmenter from Columbia University (Kan et al. 1998), Hearst's (1997) TextTiling program, and the C99 program (Choi 2000). Three programs were used to pick out keywords and keyphrases from the body of a text: Waikato University's Kea (Witten et al. 1999), Extractor from the Canadian NRC's Institute for Information Technology (Turney 2000), and NPSeeker, a program developed at the University of Ottawa. Once a list of keyphrases has been assembled, the next step is to count matches. The obvious way is to look for exact ones, but there are other ways, and arguments for using them. Consider the word total. It is used in total up the bill, the total weight and compute a grand total. Each matches exactly but is a different part of speech. Exact matching can be less exact than wanted. Should totalling, totals and totalled, all verbs, be considered to match total? If so, it is a small step to the case where the key, rather than the match, is suffixed—the token is totalling rather than total. Counting hits based on agreement between the root of a keyword and the root of a match is called stem matching. Kea contains modules which perform both exact and stem matching. We repackaged them as standalone programs. Figure 2 shows graphically how eighteen different combinations of programs can be invoked to summarize a text. The particular segmenter, keyphrase extractor and type of matching to use are three of the parameters required to generate a summary. Three others must be specified: the number of sentences to include in the summary, the number of keyphrases to use in ranking sentences and segments, and the minimum number of matches to require in a segment. The Summary Generator constructs a summary by taking the most highly rated sentences from the segment with the greatest average number of keyphrases, so long as each sentence has at least the number of instances of keyphrases specified by hitcount. 
When this is no longer true, the generator moves to the next most highly rated segment, taking sentences until the required number has been accumulated.
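The selection procedure just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system's actual code: segments are assumed to arrive as lists of (position, text) pairs, matching is a plain case-insensitive exact substring count (stem matching is omitted), and the function names `count_matches` and `generate_summary` are hypothetical.

```python
from typing import List, Sequence, Tuple

Sentence = Tuple[int, str]  # (position in the document, sentence text)


def count_matches(text: str, keyphrases: Sequence[str]) -> int:
    """Count exact, case-insensitive occurrences of every keyphrase in the text."""
    lowered = text.lower()
    return sum(lowered.count(kp.lower()) for kp in keyphrases)


def generate_summary(
    segments: List[List[Sentence]],
    keyphrases: Sequence[str],
    sentcount: int,
    hitcount: int,
) -> List[Sentence]:
    """Pick up to `sentcount` sentences, best segment first, each with >= `hitcount` matches."""

    def average_hits(segment: List[Sentence]) -> float:
        return sum(count_matches(text, keyphrases) for _, text in segment) / len(segment)

    # Rank non-empty segments by their average number of keyphrase matches per sentence.
    ranked = sorted((seg for seg in segments if seg), key=average_hits, reverse=True)

    chosen: List[Sentence] = []
    for segment in ranked:
        by_hits = sorted(segment, key=lambda s: count_matches(s[1], keyphrases), reverse=True)
        for sentence in by_hits:
            if len(chosen) == sentcount:
                break
            if count_matches(sentence[1], keyphrases) < hitcount:
                break  # threshold no longer met: move on to the next best segment
            chosen.append(sentence)
        if len(chosen) == sentcount:
            break

    return sorted(chosen)  # present the selected sentences in document order
```

If no further sentence passes the threshold, this sketch simply returns fewer sentences than requested; how the real generator handles that case is not stated in the text.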
4 The Generation Controller

4.1 Summary Assessment Component

Summary assessment requires establishing how effectively a summary represents the original document. This task is generally considered hard (Goldstein et al. 1999). Assessment is highly subjective: different people are likely to rate a summary differently. Summary assessment is also goal-dependent: how will the summary be used? More domain knowledge can be presumed in a specialized audience than in the general public and an effective summary must reflect this. Finally, and perhaps most important when many summaries are produced automatically, assessment is very labor-intensive: the assessor needs to read the full document and all summaries. It is difficult to recruit volunteers for such pursuits and very costly to hire assessors.

These considerations led us to focus on automating summary assessment. Although the procedure we settled on is fairly straightforward, the assumptions it is based on can be questioned. It is not clear whether the abstract can really serve as the gold standard for a summary. For example, we take abstracts in journal papers to be of high quality (Mittal et al. 1999). Yet in a corpus of academic papers, 24% of abstract keyphrases do not appear literally in the body (std. dev. 13.3). We are however unable to find an alternative source for a large number of summaries.
The procedure is as follows: a list of keyphrases is constructed from each abstract, where a keyphrase is a sequence of tokens between stop words¹. Such keyphrases are distinguished from those extracted during summarization by calling them key phrases in abstract or KPiAs. Summaries are rated on their number of KPiAs, with weight given both to the absolute number and to coverage of the total set of KPiAs. The rating procedure builds a set of KPiAs appearing in the summary, assigning a weight of 1.0 to each new KPiA added and 0.5 to each duplicating a KPiA already in the set. A summary’s score is thus:

\[ \sum 1.0 \cdot \mathit{KPiA}_{\mathit{unique}} + 0.5 \cdot \mathit{KPiA}_{\mathit{duplicate}} \qquad (1) \]
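Formula (1), together with the normalization described in the next paragraph, might be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the tokenizer, the tiny stop list used in the usage note, and the function names are hypothetical, and each sentence is represented simply as the list of KPiA instances found in it. The "nearly best" denominator applies formula (1) to the sentences with the most KPiA instances, which is one reading of the approximation the text describes.

```python
import re
from typing import Iterable, List, Set


def extract_kpias(abstract: str, stop_words: Set[str]) -> List[str]:
    """Split text on stop words, ignoring punctuation; each run of other tokens is a KPiA."""
    tokens = re.findall(r"[A-Za-z0-9']+", abstract)
    phrases, current = [], []
    for token in tokens:
        if token.lower() in stop_words:
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(" ".join(current))
    return phrases


def kpia_score(instances: Iterable[str]) -> float:
    """Formula (1): 1.0 for each new KPiA, 0.5 for each repeat of one already seen."""
    seen, score = set(), 0.0
    for instance in instances:
        key = instance.lower()
        score += 0.5 if key in seen else 1.0
        seen.add(key)
    return score


def normalized_rating(summary: List[List[str]], document: List[List[str]]) -> float:
    """Divide the summary score by the score of the document's top sentences by KPiA count."""
    raw = kpia_score(k for sentence in summary for k in sentence)
    top = sorted(document, key=len, reverse=True)[: len(summary)]
    best = kpia_score(k for sentence in top for k in sentence)
    return raw / best if best else 0.0
```

For example, `extract_kpias("The Senate confirmed Judge Thomas by a narrow vote.", {"the", "by", "a"})` splits the sentence at the three assumed stop words and keeps the runs of tokens between them.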
This value is normalized for summary length and document KPiA coverage by dividing it by the score of a summary composed of an equal number of sentences with the highest KPiA counts in the document. The weighting of both KPiA count and coverage affects identification of the highest rated sentences in a text. In some circumstances a slightly higher overall score could be achieved by picking sentences with more unique and fewer absolute KPiA instances. Computing the highest possible score for a document running to hundreds of sentences is computationally quite expensive; it involves looking at all possible combinations of sentcount sentences. For normalization, simply totaling the counts for the sentences with the most KPiA instances to get a ‘nearly best’ score was deemed an adequate approximation.

4.2 Parameter Assessment Component

The Parameter Assessor identifies the best and worst outcomes from the parameter sets used in the Summary Generator so generation can be tuned to give best results. The task can be construed as supervised machine learning. Input is a vector of values describing each summary: six parameters used to generate it, features characterizing the document summarized and the summary’s KPiA rating. The parameters are:

• the segmenter (Segmenter, TextTiling, C99)
• the keyphraser (Extractor, NPSeeker, Kea)
• the match type (exact or stemmed match)
• key count, the number of keyphrases used in generating the summary
• sent count, the number of sentences appearing in the summary
• hit count, the minimum number of matches for a sentence to be in the summary
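The size of the search space these six parameters define can be made concrete with a short enumeration. The sketch below is illustrative: the three categorical value lists come from the text, while the function names and the numeric values passed for key count, sent count and hit count in the usage note are hypothetical, since the values actually searched are not listed here.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

SEGMENTERS = ("Segmenter", "TextTiling", "C99")
KEYPHRASERS = ("Extractor", "NPSeeker", "Kea")
MATCH_TYPES = ("exact", "stem")


def program_combinations() -> List[Tuple[str, str, str]]:
    """All ways of choosing one segmenter, one keyphraser and one match type."""
    return list(product(SEGMENTERS, KEYPHRASERS, MATCH_TYPES))


def parameter_settings(
    keycounts: Sequence[int],
    sentcounts: Sequence[int],
    hitcounts: Sequence[int],
) -> List[Dict[str, object]]:
    """Every full six-parameter setting for the given numeric value lists."""
    names = ("segmenter", "keyphraser", "matchtype", "keycount", "sentcount", "hitcount")
    return [
        dict(zip(names, values))
        for values in product(
            SEGMENTERS, KEYPHRASERS, MATCH_TYPES, keycounts, sentcounts, hitcounts
        )
    ]
```

Calling `program_combinations()` yields the eighteen program combinations mentioned in Section 3; `parameter_settings([5, 10], [5], [1, 2])`, with made-up numeric values, multiplies that by four.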
Also included are twenty features that describe the document being summarized (Table 1). These features are all counts: the number of characters, words, sentences and paragraphs in the document; and the number of other syntactic elements, such as KPiAs, connectives and proper nouns. We also count content phrases (substrings between stoplist entries), bigrams (word pairs) and content bigrams. We record how many of these latter three primitive lexical items occur once, twice, three times, or more than three times in the document.
¹ A 980-word stop list was used, the union of five publicly-available lists: Oracle 8 ConText, SMART, Hyperwave, and lists from the University of Kansas and Ohio State University.
Output is a set of rules identifying which summary characteristics, including parameter settings, yield good or bad ratings. These rules tell us which values to use and which to avoid in future work with the same type of documents.

Table 1. Document features

FEATURE    DESCRIPTION
Chars      # of characters in the document
words      # of words in the document
sents      # of sentences in the document
paras      # of paragraphs in the document
kpiacnt    # of KPiA instances
conncnt    # of (Marcu) connectives
pncnt      # of PNs, acronyms
contcnt    # of content phrases
cphr       # of 1-instance content phrases
cphr2      # of 2-instance content phrases
cphr3      # of 3-instance content phrases
           # of 4-or-more instance content phrases
           # of 1-instance content bigrams
           # of 2-instance content bigrams
           # of 3-instance content bigrams
cbig4      # of 4-or-more instance content bigrams
           # of 1-instance bigrams
abig2      # of 2-instance bigrams
abig3      # of 3-instance bigrams
abig4      # of 4-or-more instance bigrams

5 Examples
Before discussing how document features and parameter settings are assessed, we will illustrate the operation of our system by summarizing a document provided as training data for the DUC competition. The training data included a summary written by an experienced editor to be taken as a measure of high quality, or “gold standard”, for each document. This reference summary filled the role of the abstract in our system: KPiAs extracted from it were used to rate our automatically-generated summaries. We present a reference summary and two summaries produced by our system, one rated good, the other bad. The keyphrases extracted from each will also be shown; KPiAs for the reference summary, and those extracted by the keyphraser used to generate the good or bad summary. To make the examples as informative as possible, we show generated summaries of five sentences, which approximates the length of the reference summary; based on 10 keywords, enough to show differences between each keyphraser’s output; and matched exactly, so each instance is clearly visible in the text. This fixes settings for three of the six parameters needed to produce a summary. The document summarized is a 1991 Wall Street Journal article on the appointment of Judge Clarence Thomas to the U.S. Supreme Court. The reference summary and the KPiAs extracted from it appear in Table 2 below. Examination quickly shows that KPiAs, phrases identified by separating text on stopwords and ignoring punctuation, describe the abstract in a comprehensive rather than a condensed way. The WSJ article contains 20; they are underlined in the text. Although this simple methodology produces a number of substrings which are not phrases and misses others which are, it does identify many legitimate phrases.
Table 2. The reference summary of WSJ911016-0124 and its KPiAs
Yesterday the Senate confirmed Judge Thomas by a narrow 52 to 48 vote. There is much speculation as to how the excruciating ordeal of the hearings might affect the decisions of the associate justice concerning such issues as abortion, sexual harassment, separation of powers, free press, and the rights of accused. A conservative legal commentator who earlier supported the nominee now says the Senate should have rejected Thomas since the accusations against him were not disproved and the Supreme Court must be above suspicion. Thomas himself told the Judiciary Committee that nothing could give him back his good name.
abortion sexual harassment separation Senate confirmed Judge Thomas accused A conservative legal Judiciary Committee excruciating ordeal earlier supported associate justice suspicion Thomas rejected Thomas Supreme Court speculation accusations decisions nominee Senate affect rights narrow issues vote
Table 3. A good summary: WSJ911016-0124.tee10-5-2.sum and its keyphrases

5 Yesterday evening, the Senate confirmed Judge Thomas by a narrow 52 to 48 vote -- one of the closest in the history of the court.
6 After a hard-fought victory, however, an unsettling question remains: How will the excruciating ordeal of the past two weeks influence Judge Thomas's legal views and decisions as an associate justice on the Supreme Court?
8 Even if he hadn't gone through the agony of the past two weeks, Judge Thomas would have been a crucial addition to the court, which is closely divided on such extraordinarily important issues as abortion, religious expression, and free speech.
21 Others are concerned that the Supreme Court as an institution will be diminished for having Judge Thomas on the bench, even though Prof. Hill's allegations of misconduct were not proven.
22 Conservative legal commentator Bruce Fein, who had been a Thomas supporter, says the Senate should have rejected the nominee because "the Supreme Court has got to be an institution that's above suspicion.

Keyphrases: * Judge Thomas, * court, * justice, * Supreme Court, rights, sexual harassment, hearings, * abortion, Congress, * nominee
Our program produced the good summary in Table 3 using the TextTiling segmenter and the Extractor keyphraser and requiring that sentences in the summary have at least two keyphrase matches. The presentation is highly annotated. Black boxed numbers indicate the sentence order in the original document, and show that sentences were taken from two segments: one encompassing sentences 5 & 6 ... 8, and a second containing sentences 21 & 22. This suggests that some degree of locality, one of our design objectives, has been achieved. Within each sentence KPiAs appear in italics and keyphrases found by Extractor in bold (words appearing in both a KPiA
and a keyphrase are bold italic). Keyphrase counts are used to pick sentences for the summary, while KPiA counts are used to rate this summary in terms of the reference summary. Since each sentence in the summary has at least two boldface phrases, the hit threshold requirement of two keyphrase matches per sentence has been properly applied (selection shifts to the next best segment when no other sentence in the current segment passes the threshold). Six of the ten keyphrases appear a total of 15 times in the summary; these six are starred in the list on the right. Underlined words appear in more than one keyphrase (e.g., court and Supreme Court) and are counted once for each occurrence.

Table 4. A poor summary: WSJ911016-0124.cke10-5-2.sum and its keyphrases

45 "We have never had such sustained public scrutiny of such sensitive private matters" involving a Supreme Court nominee, observes Robert Katzmann, a political scientist at the Brookings Institution.
46 The cloud over Judge Thomas isn't likely to dissipate quickly, especially because the Senate was ultimately unable to determine whether he or his accuser was telling the truth.
47 As Judge Thomas helps decide controversial cases over the years, the specter of Anita Hill's lurid allegations will almost certainly be evident.
50 Here's a look at issues on which legal experts think Judge Thomas's thinking may be affected: SEXUAL HARASSMENT.
51 The bitter battle over Judge Thomas's treatment of Anita Hill has pushed a relatively obscure case called Franklin vs. Gwinnett County (Ga.) into the limelight because it involves sexual harassment.

Keyphrases: * Thomas, * Judge Thomas, court, Justice, * Senate, * bitter, * Supreme Court, * nominee, experience, Congress
Scanning the five sentences in the summary, we find a total of eleven (3, 3, 1, 1, 3) italicized substrings. These eleven are instances of KPiAs appearing in the reference summary written by the human editor. The generated summary is rated using formula (1), and that rating is normalized by dividing it by the rating for the summary composed of the same number of sentences with the highest counts of KPiAs in the document. This summary achieves 84% of the best rating possible.

The poor summary appearing in Table 4 results when the C99 segmenter and the Kea keyphraser are used with the same two-hits-per-sentence threshold. Locality is again observed; the summary’s sentences may all come from one segment, spanning the seven-sentence subsequence 45-47 ... 50 & 51. Once again six keyphrases are found in the summary, accounting for 13 instances. This time however only three KPiAs appear: Supreme Court, Senate, and issues. As a result this summary achieves only 24% of the rating of the document’s five highest-count sentences. Another factor which might contribute to lower ratings is the shorter length of the poor summary. It totals 125 words; the good summary’s 161 words are almost 30% longer².

² While length alone is not a good criterion on which to pick sentences for a summary, it is also undeniable that the more words there are in selected sentences, the more opportunity to match entries in the list of KPiAs.
6 The Training Procedure
The C5.0 decision tree and rule classifier (Quinlan 2002) chosen for parameter assessment uses state-of-the-art techniques and produces rules which, though often numerous, are individually easy to interpret. C5.0 decision trees can be used to predict likely values when these are missing. This proved crucial for our entry in the 2001 Document Understanding Conference (DUC) sponsored by the U.S. National Institute of Standards and Technology (NIST), where documents must be summarized without abstracts and therefore without KPiAs.

Our procedure for DUC was to build decision trees from the training corpus. We then used these trees to find the parameter settings that maximize the missing summary rating value for each document in the test corpus. C5.0 requires a discrete classification value, in this case the normalized KPiA rating. KPiA ratings for the training corpus ranging from 0.0 to 3.5 were therefore mapped to the five values [verybad, bad, medium, good, verygood]. We performed discretization with the k-means clustering technique (McQueen 1967). This technique discovers groupings in a set of data points, assigning the same discrete value to each element within a group and different values to different groups.

Descriptions of documents in the test set were input to C5.0 with all permutations of parameter settings save the missing summary KPiA rating. The classifier then predicted the likely summary rating for each permutation of settings. We manually scanned its output to find which set of parameter values produces the best rating for each document, which thus differs for different documents. These optimal settings were submitted to the Summary Generator to produce the 308 single and 120 multiple document summaries sent to NIST. Multi-document summaries were generated by summarizing the single text produced by concatenating all documents on a topic.

Table 5. Ratings achieved across the parameter setting space

VERYBAD    BAD       MEDIUM    GOOD      VERYGOOD    Total
61,303     47,833    52,134    50,647    34,611      246,528
25%        19%       21%       21%       14%         100%
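The discretization step of this section, which maps continuous KPiA ratings onto five labels with k-means, can be sketched in one dimension. This is a generic Lloyd-style k-means written for illustration, not the tool the authors used: the quantile-based initialization, the iteration limit, the assumption of a non-empty input, and the function names are all choices made for the example.

```python
from typing import List, Sequence

LABELS = ["verybad", "bad", "medium", "good", "verygood"]


def kmeans_1d(values: Sequence[float], k: int, iterations: int = 100) -> List[float]:
    """Cluster one-dimensional values into k groups; return the sorted cluster centers."""
    ordered = sorted(values)
    # Assumed initialization: start the centers at evenly spaced quantiles of the data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters: List[List[float]] = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Recompute each center as the mean of its cluster; keep it if the cluster is empty.
        updated = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def discretize(rating: float, centers: Sequence[float]) -> str:
    """Label a rating by its nearest center; assumes five centers sorted in ascending order."""
    nearest = min(range(len(centers)), key=lambda i: abs(rating - centers[i]))
    return LABELS[nearest]
```

With centers learned from the training ratings, `discretize` assigns every rating the label of its nearest center, so all elements of a group receive the same class value.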
Ten-fold cross validation on the training data shows 16.3% classification error (std. dev. .01). The 5x5 confusion matrix further indicates 80% of misclassifications were to the adjacent value in the scale. This suggests the rules are reasonably trustworthy. The test run showed that they are productive, and that the ratings they seek to maximize are well-founded. 576 different parameter settings were applied to each of 428 single or multiple documents in the DUC test set generating 246,528 summaries. Table 5 shows the absolute and proportionate ratings of these summaries. It is immediately evident the chosen features characterize the document well enough, and parameterization of summarization is powerful enough, to produce output spanning the whole range of rating values and spanning it fairly evenly. Summaries are not all good or bad. On a scale of one to five, the average rating is 2.79—a bit worse than medium. However the scan of the output identified a verygood summary for 336 of the test documents and a good summary for the other 92. These optimal summaries, which were the ones submitted to DUC, have an average rating of 4.78. Machine learning thus improved summary quality by two ranks.
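The figures quoted in this paragraph can be checked directly from Table 5. The short calculation below uses only numbers stated in the text; the variable names are chosen for the example, and mapping the five labels onto the integers 1 to 5 follows the "scale of one to five" mentioned above.

```python
# Counts per rating class, as given in Table 5.
COUNTS = {"verybad": 61303, "bad": 47833, "medium": 52134, "good": 50647, "verygood": 34611}
SCALE = {"verybad": 1, "bad": 2, "medium": 3, "good": 4, "verygood": 5}

total = sum(COUNTS.values())  # should equal 576 settings x 428 documents
percentages = {label: round(100 * count / total) for label, count in COUNTS.items()}

# Average rating over every generated summary.
mean_all = sum(SCALE[label] * count for label, count in COUNTS.items()) / total

# Average rating of the submitted summaries: 336 verygood and 92 good.
mean_submitted = (336 * SCALE["verygood"] + 92 * SCALE["good"]) / (336 + 92)

improvement = mean_submitted - mean_all
```

The overall mean comes to about 2.79 and the mean of the submitted summaries to about 4.785, which the text reports as 4.78; the difference is just under two ranks.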
The large number of continuous³ attribute variables and the inherent difficulty of the problem led C5.0 to produce over 6300 rules relating document features and parameter settings to summary quality as measured by KPiA rating. Such a large number of rules also militated for the use of decision trees. Individual rules were not inspected manually in the course of preparing our submission to DUC. It is however central to our work to learn which features best characterize a document and which parameter settings produce highly-rated summaries. Discussion of the most significant rules in the classification tree is therefore warranted.

Rule 1: (48415/21735, lift 2.1)
    keyphrase = n
    -> class verybad [0.551]

Rule 11: (1376/446, lift 2.6)
    sents <= 748
    paras > 248
    -> class verybad [0.676]

Rule 13: (755/126, lift 3.2)
    pncnt > 178
    cphr2 > 60
    cbig4 <= 9
    keycount <= 5
    hitcount > 2
    -> class verybad [0.832]

Rule 19: (516/190, lift 2.5)
    sents > 53
    cphr3 <= 1
    abig4 > 4
    sentcount <= 10
    -> class verybad [0.631]

Fig. 3. Negative rules discovered by C5.0
7 Identifying Good and Bad Attributes
Figures 3 and 4 respectively show the negative and positive rules in the decision tree that classify the greatest number of cases into the verybad and verygood ratings at the ends of the scale. These rules make overt the association between parameter settings and rating outcomes, a relationship used to identify the settings that give the best outcomes. Although C5.0 rules use the familiar antecedent → consequent format, their qualifiers require explanation. The slashed n/m pair in parentheses following each rule label indicates how many training examples the rule covered (n) and how many of these fall in the class predicted by the rule (m). Lift is the rule’s estimated accuracy4 3
4
C5.0 tends to produce many rules from continuous data because every distinct numeric value can figure in a rule. (n-m+1) / (n+2). The training set contained 144,806 cases.
Terry Copeck et al.
divided by the relative frequency of the predicted class. The square-bracketed value after the rule outcome tells us the confidence with which this prediction is made. Negative rules that show attributes and values to avoid can be as important as positive ones showing attributes and values to prefer. Rule 1, covering 1/3 of all cases and predicting 1/6 of all outcomes⁵, is by far the most important rule of all and it is negative. Rule 1 says very simply that the NPSeeker keyphrase extractor does not produce good results⁶. Rule 4 is the most important positive rule. It suggests that very good summaries of six sentences or more (sentcount) result when the Kea keyphraser (keyphrase) extracts more than three keyphrases (keycount) from short documents (sents no greater than 53 sentences).

Rule 4: (9885/5433, lift 3.5)
    sents <= 53
    keyphrase = k
    keycount > 3
    sentcount > 5
    -> class verygood [0.450]

Rule 6: (6669/3452, lift 3.7)
    cphr <= 106
    abig3 > 0
    keyphrase = e
    -> class verygood [0.482]

Rule 12: (1257/386, lift 5.3)
    sents <= 53
    pncnt > 12
    cphr <= 269
    abig2 > 13
    abig2 <= 72
    keyphrase = e
    keycount > 5
    sentcount > 10
    hitcount <= 2
    -> class verygood [0.693]

Fig. 4. Positive rules discovered by C5.0
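The rule qualifiers can be checked numerically. The sketch below applies the footnoted accuracy estimate, (n-m+1) / (n+2), to the n/m pairs printed in Figs. 3 and 4 and compares the result with the square-bracketed confidence of each rule. One assumption should be stated plainly: the values agree when m is read as the number of covered cases that fall outside the predicted class, and that is the reading used in this sketch. The function names are illustrative only.

```python
# (n, m, lift, bracketed confidence) as printed in Figs. 3 and 4.
RULES = {
    1: (48415, 21735, 2.1, 0.551),
    11: (1376, 446, 2.6, 0.676),
    13: (755, 126, 3.2, 0.832),
    19: (516, 190, 2.5, 0.631),
    4: (9885, 5433, 3.5, 0.450),
    6: (6669, 3452, 3.7, 0.482),
    12: (1257, 386, 5.3, 0.693),
}

def estimated_accuracy(n, m):
    """Accuracy estimate from the footnote: (n - m + 1) / (n + 2)."""
    return (n - m + 1) / (n + 2)

def implied_class_frequency(accuracy, lift):
    """Lift = accuracy / class frequency, hence frequency = accuracy / lift."""
    return accuracy / lift

for rule, (n, m, lift, reported) in RULES.items():
    acc = estimated_accuracy(n, m)
    freq = implied_class_frequency(acc, lift)
    print(f"Rule {rule}: estimate {acc:.3f}, reported {reported:.3f}, "
          f"implied class frequency {freq:.2f}")
```

Under this reading, each computed estimate rounds to the bracketed value shown in the figures. The implied class frequency is only approximate, because lift is printed to one decimal place.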
These two rules involve only a single document feature despite document features accounting for 20 of the 26 attributes in each summary description. That situation changes with higher numbered rules. Positive Rule 6 says that the Extractor keyphraser (keyphrase) gives good results when there are 3-instance bigrams (abig3) and few content phrases (cphr). Since a keyphraser is likely to select just such syntactic elements as bigrams, the two conditions make sense together. So does the next negative rule, Rule 11, which states that documents with no more than 748 sentences but more than 248 paragraphs do not summarize well. Since such documents have quite short paragraphs (~ 3 sentences), this also makes sense.
⁵ 48,415 / 144,806 is approximately 1/3; 21,735 / 144,806 is approximately 1/6.
⁶ Later experiments showed that our current application of NPSeeker has an inherent tendency to lower ratings.
From this point on however it becomes difficult to draw conclusions from the attributes and values that figure in rules. This does not suggest the rules are ill-founded. They may in fact efficiently partition the training data space, but they and the preponderance of the 6000 rules remaining do so in a more complicated way than people can easily follow.
Two summarizer parameters do not figure in any important rule, negative or positive: segmenter (the segmentation technique used), and matchtype (match exactly or with stemmed keyphrases). If confirmed, this finding suggests two different conclusions, one less and one more likely: (a) good summaries might be produced without involving two of the three stages currently used in summary generation; (b) the choice among the available segmenters and matching methods does not affect the quality of summaries, so that the fastest segmenter and quicker matching can be used.
8 Conclusions and Future Work
The ‘average coverage’ measure used by NIST in DUC-2001 scored our summaries of single documents third best, while our summaries of multiple documents were ranked towards the end of twelve systems. The first result may be accounted for in part by the fact that our system adjusts its operation automatically to best suit the document submitted. Insofar as the various ‘black box’ segmenters, keyphrasers etc. are better suited to one document than another and this information communicates through the machine learning loop, the TS program has the capacity to pick the ‘black box’ most suited to a document.
There may be an inherent cause of our poorer results in summarizing multiple documents. The strategy of favoring locality in sentence selection presumes that the author of the source document has followed an orderly plan in presenting his message and attempts to take advantage of his organization. Multiple documents taken together cannot have a single overarching order. The research community recognizes this by having a separate category for multiple document summaries. A system preferring locality in sentence extraction almost guarantees that a short summary will be chosen from just one or two documents in the set. Pertinent information in other texts (and there almost always is some) will be ignored. In such a case picking the globally highest-ranked sentences might be a better practice, even though it would run a high risk of incoherence with almost every sentence coming from a different document.
This paper presents a framework for text summarization based on the generate-and-test model. We generate a large set of summaries by enumerating all plausible values of six parameters, and assess their quality by measuring them against the document abstract. Supervised machine learning identifies effective parameter values, which could be used with some confidence to summarize new documents without abstracts.
We speculate that the summarizer's reliability would increase as the training corpus domain and that of new documents became more similar. At least four issues should be considered in the near future.
• Feedback loop: Manual operation meant in this experiment that parameters were assessed only once. Useful parameter values should be kept, new parameter values added, and the process repeated until no improvement results.
• Improve summary description and rating: Different documents may require different methodologies, for example long versus short texts, or dense versus colloquial ones. More comprehensive description of the document should be included in the assessment process. Alternative measures of summary quality are also desirable.
• Full automation: At present parameter assessment is linked to summary generation manually. This makes runs time-consuming and impedes implementing the feedback loop mentioned earlier. An important goal is to automate the system fully so it can run on its own.
• Integrate other techniques/methodologies into the Summary Generator: Our current system only experimented with a restricted number of methodologies and techniques. Once the system has been fully automated nothing will prevent us from trying a variety of new approaches to text summarization that can then be assessed automatically until an optimal one has been devised.
Acknowledgements
This work has been supported by the Natural Sciences and Engineering Research Council of Canada and by our University.
References
1. Choi, F.: Advances in domain independent linear text segmentation. In Proceedings of ANLP/NAACL-00 (2000).
2. Hearst, M.: TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics (1997) 23(1) 33-64.
3. Goldstein, J., Kantrowitz, M., Mittal, V., Carbonell, J.: Summarizing Text Documents: Sentence Selection and Evaluation Metrics. In Proceedings of SIGIR-99 (1999) 121-128.
4. Kan, M.-Y., Klavans, J., McKeown, K.: Linear Segmentation and Segment Significance. In Proceedings of WVLC-6 (1998) 197-205.
5. Klavans, J., McKeown, K., Kan, M.-Y., Lee, S.: Resources for Evaluation of Summarization Techniques. In Proceedings of the 1st International Conference on Language Resources and Evaluation (1998).
6. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (1967) (1) 281-297.
7. Mittal, V., Kantrowitz, M., Goldstein, J., Carbonell, J.: Selecting Text Spans for Document Summaries: Heuristics and Metrics. In Proceedings of AAAI-99 (1999) 467-473.
8. Quinlan, J.R.: C5.0: An Informal Tutorial. www.rulequest.com/see5-unix.html. Rulequest Research (2002).
9. Turney, P.: Learning algorithms for keyphrase extraction. Information Retrieval (2000) 2(4) 303-336.
10. Witten, I.H., Paynter, G., Frank, E., Gutwin, C., Nevill-Manning, C.: KEA: Practical automatic keyphrase extraction. In Proceedings of DL-99 (1999) 254-256.
QUANTUM: A Function-Based Question Answering System
Luc Plamondon¹ and Leila Kosseim²
Abstract. In this paper, we describe our Question Answering (QA) system called QUANTUM. The goal of QUANTUM is to find the answer to a natural language question in a large document collection. QUANTUM relies on computational linguistics as well as information retrieval techniques. The system analyzes questions using shallow parsing techniques and regular expressions, then selects the appropriate extraction function. This extraction function is then applied to one-paragraph-long passages retrieved by the Okapi information retrieval system. The extraction process involves the Alembic named entity tagger and the WordNet semantic network to identify and score candidate answers. We designed QUANTUM according to the TREC-X QA track requirements; therefore, we use the TREC-X data set and tools to evaluate the overall system and each of its components.
1 Introduction
We describe here our Question Answering (QA) system called QUANTUM, which stands for QUestion ANswering Technology of the University of Montreal. The goal of QUANTUM is to find a short answer to a natural language question in a large document collection. The current version of QUANTUM addresses short, syntactically well-formed questions that require factual answers. By factual answers, we mean that they should be found directly in the document collection or using lexical semantics, as opposed to answers that would require world-knowledge, deduction or combination of facts. QUANTUM’s answer to a question is a list of five ranked suggestions. Each suggestion is a 50-character snippet of a document in the collection, along with the source document number. The five suggestions are ranked from 1 to 5, the best suggestion being at rank 1 (see Fig. 1 for an example). QUANTUM also has the ability to detect that a question does not have an answer in the document collection. In that case, QUANTUM outputs an empty suggestion (with NIL as the document number) and ranks it
Work performed while at the Université de Montréal.
R. Cohen and B. Spencer (Eds.): AI 2002, LNAI 2338, pp. 281–292, 2002. © Springer-Verlag Berlin Heidelberg 2002
Luc Plamondon and Leila Kosseim
according to its likelihood. Those features correspond to the TREC-X QA track requirements [1] where QUANTUM recently participated [2]. We shall introduce QUANTUM’s architecture and its function-based classification of questions. Then, we shall evaluate its overall performance as well as the performance of its components. The TREC-X metric used to measure performance is called Mean Reciprocal Rank (MRR). For each question, we compute a score that is the reciprocal of the rank at which the correct answer is found: the score is respectively 1, 1/2, 1/3, 1/4 or 1/5 if the correct answer is found in the suggestion at rank 1, 2, 3, 4 or 5. Of course, the score is 0 if the answer does not appear in any of the 5 suggestions. The average of this score over all questions gives the MRR.
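The MRR definition above translates directly into a few lines of code. The sketch below is illustrative only: the function names are ours, and the rank of the first correct suggestion is assumed to be already known for each question (an integer from 1 to 5, or `None` when none of the five suggestions is correct).

```python
def reciprocal_rank(rank):
    """Score for one question: 1/rank for ranks 1..5, 0 if no correct answer."""
    if rank is None:
        return 0.0
    if rank not in (1, 2, 3, 4, 5):
        raise ValueError("rank must be 1..5 or None")
    return 1.0 / rank

def mean_reciprocal_rank(ranks):
    """Average of the per-question scores over all questions."""
    if not ranks:
        raise ValueError("at least one question is required")
    return sum(reciprocal_rank(r) for r in ranks) / len(ranks)

# Four hypothetical questions: correct at rank 1, at rank 2, not found, at rank 5.
print(mean_reciprocal_rank([1, 2, None, 5]))
```

For these four hypothetical questions the per-question scores are 1, 1/2, 0 and 1/5, so the MRR is their mean, 1.7 / 4.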
are kept in Edinburgh Castle - together with jewel
kept in the Tower of London as part of the British
treasures in Britain’s crown jewels. He gave the K
the crown jewel settings were kept during the war.
Fig. 1. Example of a question and its corresponding QUANTUM output. Each of the five suggestions includes a rank, the number of the document from which the answer was extracted and a 50-character snippet of the document containing a candidate for the answer. A NIL document number means that QUANTUM suggests the answer is not present in the document collection. Here, the correct answer is found at rank 2
2 Components of Questions and Answers
Before we describe QUANTUM, let us consider the question How many people die from snakebite poisoning in the US per year? (question # 302 of the TREC-9 QA track) and its answer. As shown in Fig. 2, the question is decomposed into three parts: a question word, a focus and a discriminant, and the answer has two parts: a candidate and a variant of the question discriminant. The focus is the word or noun phrase that influences our mechanisms for the extraction of candidate answers (whereas the discriminant, as we shall see in Sect. 3.3, influences only the scoring of candidate answers once they are extracted). The identification of the focus depends on the selected extraction mechanism; thus, we determine the focus with the syntactic patterns we use during question analysis. Intuitively, the focus is what the question is about, but we may
Q: How many | people | die from snakebite poisoning in the U.S. per year?
   (question word | focus | discriminant)
A: About 10 people | die a year from snakebites in the United States.
   (candidate | variant of question discriminant)
Fig. 2. Example of question and answer decomposition. The question is from TREC-9 (# 302) and the answer is from the TREC document collection (document LA0823900001)
not need to identify one in every question if the chosen mechanism for answer extraction does not require it. The discriminant is the remaining part of a question when we remove the question word and the focus. It contains the information needed to pick the right candidate amongst all. It is less strongly bound to the answer than the focus is: pieces of information that make up the question discriminant could be scattered over the entire paragraph in which the answer appears, or even over the entire document. In simple cases, the information is found as is; in other cases, it must be inferred from the context or from world-knowledge. We use the term candidate to refer to a word or a small group of words, from the document collection, that the system considers as a potential answer to the question. In the context of TREC-X, a candidate is seldom longer than a noun phrase or a prepositional phrase.
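The decomposition just defined can be written down as a small data structure. The sketch below is a hedged illustration, not QUANTUM's code: the `Decomposition` class and `decompose` function are hypothetical names, and the question word and focus are supplied by hand, whereas the system finds them with its analysis patterns. The discriminant is computed exactly as defined in the text, as what remains once the question word and the focus are removed.

```python
from dataclasses import dataclass

@dataclass
class Decomposition:
    question_word: str
    focus: str          # may be empty when the extraction mechanism needs none
    discriminant: str

def decompose(question, question_word, focus=""):
    """Remove the question word and the focus; the rest is the discriminant."""
    rest = question
    for part in (question_word, focus):
        if part:
            head, found, tail = rest.partition(part)
            if found:
                rest = head + " " + tail
    # Normalize whitespace left behind by the removals.
    return Decomposition(question_word, focus, " ".join(rest.split()))

d = decompose(
    "How many people die from snakebite poisoning in the US per year?",
    question_word="How many",
    focus="people",
)
print(d.discriminant)
```

For the example question, the remaining discriminant is "die from snakebite poisoning in the US per year?", matching the decomposition in Fig. 2.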
3 System Architecture
In order to find an answer to a question, QUANTUM performs 5 steps: question analysis, passage retrieval and tagging, candidate extraction and scoring, expansion to 50 characters, and NIL insertion (for no-answer questions). Let us describe these steps in detail.
3.1 Question Analysis
To analyze the question, we use a tokenizer, a part-of-speech tagger and a noun-phrase chunker (NP-chunker). These general purpose tools were developed at the RALI laboratory for purposes other than question analysis. A set of about 40 hand-made analysis patterns based on words and on part-of-speech and noun-phrase tags are applied to the question to select the most appropriate function for answer extraction. The function determines how the answer should be found in the documents; for example, a definition is not extracted through the same means as a measure or a time. Table 1 shows the 11 functions we have defined and implemented, along with TREC-X question examples (details on the answer
patterns shown in the table are given in Sect. 3.2 and 3.3). Each function triggers a search mechanism to identify candidates in a passage based on the passage’s syntactic structure or the semantic relations of its component noun phrases with the question focus. More formally, we have C = f(ρ, ϕ), where f is the extraction function, ρ is a passage, ϕ is the question focus and C is the list of candidates found in ρ. Each element of C is a tuple (c_i, d_i, s_i), where c_i is the candidate, d_i is the number of the document containing c_i, and s_i is the score assigned by the extraction function.
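The formal definition C = f(ρ, ϕ) can be illustrated with a toy extraction function. Everything in the sketch below is an assumption made for illustration: the single regular expression stands in for the hand-made analysis patterns (which in the real system also use part-of-speech and noun-phrase tags), the `cardinality` rule is a simplified guess at what such a function might do, and the passage text, document number and constant score are invented.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

class Passage(NamedTuple):
    doc_id: str   # number of the source document
    text: str     # the passage rho

class Candidate(NamedTuple):
    text: str     # c_i, the candidate answer
    doc_id: str   # d_i, the document containing c_i
    score: float  # s_i, the score assigned by the extraction function

def cardinality(passage: Passage, focus: str) -> List[Candidate]:
    """Toy rule: a number followed, within two words, by the focus."""
    pattern = re.compile(
        r"\b(\d[\d,]*)\s+(?:\w+\s+){0,2}?" + re.escape(focus),
        re.IGNORECASE,
    )
    return [Candidate(m.group(1), passage.doc_id, 1.0)
            for m in pattern.finditer(passage.text)]

# Hypothetical analysis pattern: word-based only, one function.
ANALYSIS_PATTERNS = [
    (re.compile(r"^How many (?P<focus>.+?) (?:are|is|do|does|did)\b",
                re.IGNORECASE), cardinality),
]

def analyze(question: str) -> Optional[Tuple]:
    """Select an extraction function and a focus, or None if nothing matches."""
    for pattern, function in ANALYSIS_PATTERNS:
        match = pattern.search(question)
        if match:
            return function, match.group("focus")
    return None

function, focus = analyze("How many Great Lakes are there?")
passage = Passage("DOC-1", "There are 5 Great Lakes in North America.")
print(function(passage, focus))
```

The point of the sketch is the shape of the interface: question analysis yields a function and a focus, and the function maps a passage to a list of (candidate, document number, score) tuples that later stages can rescore.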
Table 1. Extraction functions, examples of TREC-X questions and samples of answer patterns. Hypernyms and hyponyms are obtained using WordNet, named entities are obtained using Alembic and NP tags are obtained using an NP-chunker. When we mention the focus in an answer pattern, we also imply other close variants or a larger NP headed by the focus

Function                Example of question and sample of answer patterns
definition(ρ, ϕ)        Q: What is an atom? (ϕ = atom)
                        A: , A: () A: is
specialization(ρ, ϕ)    Q: What metal has the highest melting point? (ϕ = metal)
                        A:
cardinality(ρ, ϕ)       Q: How many Great Lakes are there? (ϕ = Great Lakes)
                        A:
measure(ρ, ϕ)           Q: How much fiber should you have per day? (ϕ = fiber)
                        A: A: of
attribute(ρ, ϕ)         Q: How far is it from Denver to Aspen? (ϕ = far)
                        A: Various patterns
person(ρ)               Q: Who was the first woman to fly across the Pacific Ocean?
                        A:
time(ρ)                 Q: When did Hawaii become a state?
                        A: