1 = (κ • μ)(w2), which violates the RecR condition.

5 Conclusion

In this paper, we presented a general revision model for epistemic states using plausibility measures; this model generalizes Spohn's and Dubois and Prade's results on revision in ordinal conditional functions.
ECAI 2008, M. Ghallab et al. (Eds.), IOS Press, 2008. © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-361
Structure Learning of Markov Logic Networks through Iterated Local Search

Marenglen Biba, Stefano Ferilli, and Floriana Esposito
Department of Computer Science, University of Bari, Italy; email: {biba,ferilli,esposito}@di.uniba.it

Abstract. Many real-world applications of AI require both probability and first-order logic to deal with uncertainty and structural complexity. Logical AI has focused mainly on handling complexity, and statistical AI on handling uncertainty. Markov Logic Networks (MLNs) are a powerful representation that combines Markov Networks (MNs) and first-order logic by attaching weights to first-order formulas and viewing these as templates for features of MNs. State-of-the-art structure learning algorithms for MLNs maximize the likelihood of a relational database by performing a greedy search in the space of candidates. This can lead to suboptimal results, because these approaches cannot escape local optima. Moreover, due to the combinatorially explosive space of potential candidates, these methods are computationally prohibitive. We propose a novel algorithm for learning the structure of MLNs, based on the Iterated Local Search (ILS) metaheuristic, that explores the space of structures through a biased sampling of the set of local optima. The algorithm focuses the search not on the full space of solutions but on a smaller subspace defined by the solutions that are locally optimal for the optimization engine. We show through experiments in two real-world domains that the proposed approach improves accuracy and learning time over the existing state-of-the-art algorithms.
1 Introduction
Traditionally, AI research has fallen into two separate subfields: one that has focused on logical representations, and one on statistical ones. Logical AI approaches such as logic programming, description logics, classical planning, symbolic parsing, and rule induction tend to emphasize handling complexity. Statistical AI approaches such as Bayesian networks, hidden Markov models, Markov decision processes, statistical parsing, and neural networks tend to emphasize handling uncertainty. However, intelligent agents must be able to handle both in real-world applications. The first attempts to integrate logic and probability in AI date back to the works in [1, 8, 19]. Later, several authors began using logic programs to compactly specify Bayesian networks, an approach known as knowledge-based model construction [26]. Recently, in the burgeoning field of statistical relational learning [7], several approaches for combining logic and probability have been proposed, such as probabilistic relational models [17], Bayesian logic programs [10], relational dependency networks [18], and others. All these approaches combine probabilistic graphical models with subsets of first-order logic (e.g., Horn clauses). In this paper we focus on Markov logic [22], a powerful representation that has finite first-order logic and probabilistic graphical models as special cases. It extends first-order logic by attaching weights to formulas, providing the full expressiveness of graphical models and first-order logic in finite domains, and remaining well defined in many infinite domains [22, 25]. Weighted formulas are viewed as templates for constructing MNs, and in the infinite-weight limit Markov logic reduces to standard first-order logic. Markov logic avoids the assumption of i.i.d. (independent and identically distributed) data made by most statistical learners, by using the power of first-order logic to compactly represent dependencies among objects and relations.

Learning an MLN consists of structure learning (learning the logical clauses) and weight learning (setting the weight of each clause). In [22] structure learning was performed through ILP methods [13], followed by a weight learning phase in which maximum pseudo-likelihood [2] weights were learned for each learned clause. State-of-the-art algorithms for structure learning are those in [11, 16], where learning of MLNs is performed in a single step using weighted pseudo-likelihood as the evaluation measure during structure search. However, these algorithms follow systematic search strategies that can lead to local optima and prohibitive learning times. The algorithm in [11] performs a beam search in a greedy fashion, which makes it very susceptible to local optima, while the algorithm in [16] works in a bottom-up fashion, trying to consider fewer candidates for evaluation. Even though it considers fewer candidates, after initially scoring all candidates this algorithm attempts to add them one by one to the MLN, thus changing the MLN at almost every step, which greatly slows down the computation of the optimal weights. Moreover, neither of these algorithms can benefit from parallel architectures.

We propose an approach based on the Iterated Local Search (ILS) metaheuristic that samples the set of local optima and performs a search in the sampled space. We show that, through a simple parallelism model such as independent multiple walks, ILS achieves improvements over the state-of-the-art algorithms. The paper is organized as follows: Section 2 introduces MNs and MLNs, Section 3 describes learning approaches for MLNs, Section 4 introduces stochastic local search methods, and Section 5 presents the ILS metaheuristic for MLN structure learning. We present the experiments in Section 6 and conclude in Section 7.
2 Markov Networks and Markov Logic Networks
An MN (also known as a Markov random field) is a model for the joint distribution of a set of variables X = (X1, X2, ..., Xn) ∈ χ [5]. It is composed of an undirected graph G and a set of potential functions. The graph has a node for each variable, and the model has a potential function φk for each clique in the graph. A potential function is a non-negative real-valued function of the state of the corresponding clique. The joint distribution represented by an MN is given by

P(X = x) = \frac{1}{Z} \prod_{k} \phi_k(x_{\{k\}})

where x_{k} is the state of the kth clique (i.e., the state of the variables that appear in that clique). Z, known as the partition function, is given by

Z = \sum_{x \in \chi} \prod_{k} \phi_k(x_{\{k\}})

MNs are often conveniently represented as log-linear models, with each clique potential replaced by an exponentiated weighted sum of features of the state, leading to

P(X = x) = \frac{1}{Z} \exp\left( \sum_{j} w_j f_j(x) \right)
A feature may be any real-valued function of the state. We will focus on binary features, fj(x) ∈ {0, 1}. In the most direct translation from the potential-function form, there is one feature corresponding to each possible state x_{k} of each clique, with its weight being log φk(x_{k}). This representation is exponential in the size of the cliques. However, a much smaller number of features (logical functions of the state of the clique) can be specified, allowing for a more compact representation than the potential-function form, particularly when large cliques are present. MLNs take advantage of this.

A first-order knowledge base (KB) can be seen as a set of hard constraints on the set of possible worlds: if a world violates even one formula, it has zero probability. The basic idea in Markov logic is to soften these constraints: when a world violates one formula in the KB it is less probable, but not impossible. The fewer formulas a world violates, the more probable it is. Each formula has an associated weight that reflects how strong a constraint it is: the higher the weight, the greater the difference in log probability between a world that satisfies the formula and one that does not, other things being equal.

An MLN [22] L is a set of pairs (Fi, wi), where Fi is a formula in first-order logic and wi is a real number. Together with a finite set of constants C = {c1, c2, ..., cp}, it defines an MN M_{L,C} as follows:

1. M_{L,C} contains one binary node for each possible grounding of each predicate appearing in L. The value of the node is 1 if the ground predicate is true, and 0 otherwise.
2. M_{L,C} contains one feature for each possible grounding of each formula Fi in L. The value of this feature is 1 if the ground formula is true, and 0 otherwise. The weight of the feature is the wi associated with Fi in L.

Thus there is an edge between two nodes of M_{L,C} iff the corresponding ground predicates appear together in at least one grounding of one formula in L. An MLN can be viewed as a template for constructing MNs. The probability distribution over possible worlds x specified by the ground MN M_{L,C} is given by
P(X = x) = \frac{1}{Z} \exp\left( \sum_{i=1}^{F} w_i n_i(x) \right)

where F is the number of formulas in the MLN and n_i(x) is the number of true groundings of Fi in x. As formula weights increase, an MLN increasingly resembles a purely logical KB, becoming equivalent to one in the limit of all infinite weights.

In this paper we focus on MLNs whose formulas are function-free clauses and assume domain closure (it has been proven that no expressiveness is lost), ensuring that the generated MNs are finite. In this case, the groundings of a formula are formed simply by replacing its variables with constants in all possible ways.
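To make the template view concrete, the following minimal Python sketch computes the distribution above for a hypothetical two-atom domain. The bit-tuple encoding of worlds and the single hand-grounded formula are illustrative assumptions of this sketch, not Alchemy's actual representation.

```python
import itertools, math

def world_distribution(weighted_features):
    """P(X = x) = exp(sum_i w_i * n_i(x)) / Z over all worlds x."""
    worlds = list(itertools.product([0, 1], repeat=2))  # (Smokes(A), Cancer(A))
    score = lambda x: math.exp(sum(w * f(x) for w, f in weighted_features))
    Z = sum(score(x) for x in worlds)                   # partition function
    return {x: score(x) / Z for x in worlds}

# n_1(x): number of true groundings of Smokes(A) => Cancer(A), here 0 or 1
formula = lambda x: 1 if (not x[0] or x[1]) else 0
for world, p in world_distribution([(1.5, formula)]).items():
    print(world, round(p, 3))   # the violating world (1, 0) is least probable
```

Note how the world violating the soft constraint keeps nonzero probability, unlike in a purely logical KB.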
3 Structure and Parameter Learning of MLNs
A first-order knowledge base (KB) is a set of sentences or formulas in first-order logic [6]. Formulas are constructed using four types of symbols: constants, variables, functions, and predicates. Constant symbols represent objects in the domain of interest. Variable symbols range over the objects in the domain. Function symbols represent mappings from tuples of objects to objects. Predicate symbols represent relations among objects in the domain, or attributes of objects. A term is any expression representing an object in the domain; it can be a constant, a variable, or a function applied to a tuple of terms. An atomic formula or atom is a predicate symbol applied to a tuple of terms. A ground term is a term containing no variables. A ground atom or ground predicate is an atomic formula all of whose arguments are ground terms. Formulas are recursively constructed from atomic formulas using logical connectives and quantifiers. A positive literal is an atomic formula; a negative literal is a negated atomic formula. A KB in clausal form is a conjunction of clauses, a clause being a disjunction of literals. A definite clause is a clause with exactly one positive literal (the head, with the negative literals constituting the body). A possible world or Herbrand interpretation assigns a truth value to each possible ground predicate.

Inductive Logic Programming (ILP) systems learn clausal KBs from relational databases, or refine existing KBs [13]. Hypotheses are constructed through refinement operators that add or remove literals from clauses. In the learning from interpretations setting of ILP, the examples are databases, and the system searches for clauses that are true in them. For example, CLAUDIEN [4], starting with a trivially false clause, repeatedly forms all possible refinements of the current clauses by adding literals, and adds to the KB those that satisfy a minimum accuracy and coverage criterion. In the learning from entailment setting, the system searches for clauses that entail all positive examples of some relation and no negative ones. For example, FOIL [21] learns each definite clause by starting with the target relation as the head and greedily adding literals to the body.

MN weights have traditionally been learned using iterative scaling [5]. However, maximizing the likelihood (or posterior) using a quasi-Newton optimization method like L-BFGS has recently been found to be much faster [23]. Regarding structure learning, the authors in [5] induce conjunctive features by starting with a set of atomic features (the original variables), conjoining each current feature with each atomic feature, adding to the network the conjunction that most increases likelihood, and repeating. The work in [15] extends this to the case of conditional random fields, which are MNs trained to maximize the conditional likelihood of a set of outputs given a set of inputs. The first attempt to learn MLNs was that of [22], where the authors used the CLAUDIEN system to learn the clauses of MLNs and then learned the weights by maximizing pseudo-likelihood. In [11] another method was proposed that combines ideas from ILP and feature induction of MNs. This algorithm, which performs a beam or shortest-first search in the space of clauses guided by a weighted pseudo-likelihood (WPLL) measure [2], outperformed that of [22].
Recently, in [16] a bottom-up approach was proposed in order to reduce the search space. This algorithm uses a propositional MN learning method to construct template networks that guide the construction of candidate clauses. In this way, it generates fewer candidates for evaluation. Even though it evaluates fewer candidates, after initially scoring all candidates the algorithm attempts to add them one by one to the MLN, thus changing the MLN at almost every step, which greatly slows down the computation of the WPLL. For every candidate structure, in both [11, 16] the parameters that optimize the WPLL are set through L-BFGS, which approximates the second derivative of the WPLL by keeping a running finite-sized window of previous first derivatives.

Regarding weight learning, as pointed out in [11], a potentially serious problem that arises when evaluating candidate clauses using the WPLL is that the optimal (maximum WPLL) weights need to be computed for each candidate. Since this involves numerical optimization, and needs to be done millions of times, it could easily make the algorithm too slow. In [15, 5] the problem is addressed by assuming that the weights of previous features do not change when testing a new one. Surprisingly, the authors in [11] found this to be unnecessary when using the very simple approach of initializing L-BFGS with the current weights (and a zero weight for the new clause). Although in principle all weights could change as the result of introducing or modifying a clause, in practice this is very rare. Second-order, quadratic-convergence methods like L-BFGS are known to be very fast if started near the optimum [23]. This is what happened in [11]: L-BFGS typically converges in just a few iterations, sometimes one. We use the same approach for setting the parameters that optimize the WPLL.
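The warm-start trick is easy to reproduce with any quasi-Newton library. The sketch below uses SciPy's L-BFGS-B; neg_wpll and neg_wpll_grad are hypothetical callables standing in for the (negated) WPLL of the candidate MLN and its gradient, which the papers above compute from the relational database.

```python
import numpy as np
from scipy.optimize import minimize

def refit_weights(neg_wpll, neg_wpll_grad, current_weights):
    """Re-optimize the WPLL after a new clause is added, initializing
    L-BFGS with the current weights and a zero weight for the new clause.
    Started near the optimum, it typically converges in a few iterations."""
    w0 = np.append(current_weights, 0.0)   # zero weight for the new clause
    result = minimize(neg_wpll, w0, jac=neg_wpll_grad, method="L-BFGS-B")
    return result.x
```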
4 Iterated Local Search
Many widely known and high-performance local search algorithms make use of randomized choice in generating or selecting candidate solutions for a given combinatorial problem instance. These algorithms are called stochastic local search (SLS) algorithms [9] and represent one of the most successful and widely used approaches for solving hard combinatorial problems. Many "simple" SLS methods are obtained from other search methods by just randomizing the selection of candidates during search, such as Randomized Iterative Improvement (RII) and Uninformed Random Walk. Many other SLS methods combine "simple" SLS methods in order to exploit the strengths of each during search; these are known as hybrid SLS methods [9]. ILS is one such metaheuristic, because it can easily be combined with other SLS methods.

One of the simplest and most intuitive ideas for addressing the fundamental issue of escaping local optima is to use two types of SLS steps: one for reaching local optima as efficiently as possible, and the other for effectively escaping local optima. ILS methods [9, 14] exploit this key idea, and essentially use these two types of search steps alternately to perform a walk in the space of local optima w.r.t. the given evaluation function. The search process starts from a randomly selected element of the search space. From this initial candidate solution, a locally optimal solution is obtained by applying a subsidiary local search procedure. Then each iteration of the algorithm consists of three major steps: first, a perturbation method is applied to the current candidate solution s, yielding a modified candidate solution s'; next, a subsidiary local search is performed from s' until a local optimum s'' is obtained; in the last step, an acceptance criterion is used to decide from which of the two local optima, s or s'', the search process is continued.
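The loop just described can be written as a generic, domain-independent skeleton. The following sketch is our own illustration (not code from [9, 14]); the subsidiary procedures are passed in as functions, mirroring the perturbation / local search / acceptance cycle.

```python
def iterated_local_search(initial, local_search, perturb, accept,
                          evaluate, max_no_improve=2):
    # reach a first local optimum from the starting candidate
    s = local_search(initial)
    best, no_improve = s, 0
    while no_improve < max_no_improve:
        s_perturbed = perturb(s)            # jump away from the current optimum
        s_new = local_search(s_perturbed)   # descend to another local optimum
        if evaluate(s_new) > evaluate(best):
            best, no_improve = s_new, 0
        else:
            no_improve += 1
        s = accept(s, s_new)                # e.g. keep the better of the two
    return best
```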
Algorithm 1 Structure Learning
Input: P: set of predicates, MLN: Markov Logic Network, RDB: Relational Database
  CLS = all clauses in MLN ∪ P
  LearnWeights(MLN, RDB)
  BestScore = WPLL(MLN, RDB)
  repeat
    BestClause = SearchBestClause(P, MLN, BestScore, CLS, RDB)
    if BestClause ≠ null then
      add BestClause to MLN
      Score = WPLL(MLN, RDB)
      Gain = Score − BestScore
      BestScore = Score
    end if
  until BestClause = null or Gain ≤ minGain for two consecutive steps
  return MLN
The algorithm can terminate after a number of steps without improvement, or simply after a fixed number of steps. The choice of the components of the ILS has a great impact on the performance of the algorithm.

As pointed out in [9], there are three good reasons to consider applying SLS algorithms instead of systematic algorithms. The first is that many problems are of a constructive nature and their instances are known to be solvable. In these situations, the goal of any search algorithm is to find a solution, rather than just to decide whether a solution exists. This holds in particular for optimization problems, where the actual problem is to find a solution of sufficiently high quality. Therefore, the main advantage of a complete systematic algorithm (the ability to detect that a given problem instance has no solution) is not relevant for finding solutions to solvable instances. Secondly, in most application scenarios the time to find a solution is limited. In these situations, systematic algorithms often have to be aborted after the given time has been exhausted, which renders them incomplete. This is problematic for the many systematic optimization algorithms that search through spaces of partial solutions without computing complete solutions early in the search: if such an algorithm is aborted prematurely, usually no solution candidate is available, while in the same situation SLS algorithms typically return the best solution found so far. Thirdly, algorithms for real-time problems should be able to deliver reasonably good solutions at any point during their execution. For optimization problems this typically means that run time and solution quality should be positively correlated; for decision problems, one could guess a solution when a timeout occurs, where the accuracy of the guess should increase with the run time of the algorithm. This so-called anytime property is usually very difficult to achieve, but in many situations the SLS paradigm is naturally suited to devising anytime algorithms.

In general, it is not straightforward to decide whether to use a systematic or an SLS algorithm for a given task. Systematic and SLS algorithms can be considered complementary to each other. SLS algorithms are advantageous in many situations, particularly if reasonably good solutions are required within a short time, if parallel processing is used, and if knowledge about the problem domain is rather limited. In other cases, when time constraints are less important and some knowledge about the problem domain can be exploited, systematic search may be a better choice.
Structure learning of MLNs is a hard optimization problem due to the large space to be explored; SLS methods are thus suitable for finding high-quality solutions in a short time. Moreover, one of the key advantages of SLS methods is that they can greatly speed up learning through parallel processing, where speedups proportional to the number of CPUs can be achieved [9]. We exploit this feature in our ILS algorithm by running multiple independent ILS walks on separate CPUs.
5 Generative Structure Learning of MLNs through ILS
In this section we describe the ILS metaheuristic tailored to the problem of learning the structure of MLNs. Algorithm 1 iteratively adds the best clause to the current MLN until two consecutive steps have not produced improvement (other stopping criteria could also be applied). Algorithm 2 performs an iterated local search to find the best clause to add to the MLN. It starts by randomly choosing a unit clause CL_C in the search space. Then it performs a greedy local search to efficiently reach a local optimum CL_S. At this point, a perturbation method is applied, leading to a neighbor CL'_C of CL_S, and then a greedy local search is applied to CL'_C to reach another local optimum CL'_S. The accept function decides whether the search must continue from the previous local optimum CL_S or from the newly found local optimum CL'_S (accept can perform a random walk or iterative improvement in the space of local optima).

Careful choice of the various components of Algorithm 2 is important to achieve high performance. The clause perturbation operator (flipping the sign of literals, removing literals, or adding literals) has the goal of jumping to a different region of the search space, from which the search restarts at the next iteration. Perturbations can be strong or weak: if the jump lands near the current local optimum, the subsidiary local search procedure LocalSearchII may fall back into the same local optimum or enter a region with the same value of the objective function (a plateau); if the jump is too far, LocalSearchII may take too many steps to reach another good solution. In our algorithm we use only strong perturbations, i.e., we always restart from unit clauses (in future work we intend to adapt the nature of the perturbation dynamically). For the procedure LocalSearchII we chose an iterative improvement approach, in order to balance intensification (greedily increasing solution quality by exploiting the evaluation function) and diversification (the randomness induced by strong perturbations, which avoids search stagnation). The accept function always accepts the best solution found so far.
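For illustration, the perturbation operators named above could look as follows. The string encoding of literals and the uniform choice among the three weak moves are assumptions of this sketch; recall that the algorithm evaluated in this paper actually uses only the strong perturbation, i.e., restarting from a unit clause.

```python
import random

def weak_perturbation(clause, predicates):
    """Flip the sign of a literal, remove one, or add one.
    Literals are hypothetical strings such as "Smokes(x)" or "!Cancer(x)"."""
    literals = list(clause)
    move = random.choice(["flip", "remove", "add"])
    if move == "flip":
        i = random.randrange(len(literals))
        lit = literals[i]
        literals[i] = lit[1:] if lit.startswith("!") else "!" + lit
    elif move == "remove" and len(literals) > 1:
        literals.pop(random.randrange(len(literals)))
    else:
        literals.append(random.choice(predicates))
    return tuple(literals)

def strong_perturbation(predicates):
    """The strong perturbation used here: restart from a random unit clause."""
    return (random.choice(predicates),)
```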
6 Experiments

6.1 Datasets
We carried out experiments on two publicly available databases: the UW-CSE database used by [11, 22, 16] (available at http://alchemy.cs.washington.edu/data/uw-cse) and the Cora dataset originally labeled by Andrew McCallum. Both are standard relational datasets and are used for two important relational tasks: Cora for entity resolution and UW-CSE for social network analysis. For Cora we used a cleaned version from [24], with five splits for cross-validation. The published UW-CSE dataset consists of 15 predicates divided into 10 types. Types include publication, person, course, etc. Predicates include Student(person), Professor(person), AdvisedBy(person1, person2), TaughtBy(course, person, quarter), Publication(paper, person), etc. The dataset contains a total of 2673 tuples (true ground atoms, with the remainder assumed false). The Cora dataset consists of 1295 citations of 132 different computer science papers, drawn from the Cora Computer Science Research Paper Engine. The task is to predict which citations refer to the same paper, given the words in their author, title, and venue fields. The labeled data also specify which pairs of author, title, and venue fields refer to the same entities. We performed experiments for each field in order to evaluate the ability of the model to deduplicate fields as well as citations. Since the number of possible equivalences is very large, we used the canopies found in [24] to make this problem tractable.

Algorithm 2 SearchBestClause
Input: P: set of predicates, MLN: Markov Logic Network, BestScore: current best score, CLS: list of clauses, RDB: Relational Database
  CL_C = a randomly picked clause in CLS ∪ P
  CL_S = LocalSearchII(CL_C)
  BestClause = CL_S
  repeat
    CL'_C = Perturb(CL_S)
    CL'_S = LocalSearchII(CL'_C, MLN, BestScore)
    if WPLL(BestClause, MLN, RDB) ≤ WPLL(CL'_S, MLN, RDB) then
      BestClause = CL'_S
      add BestClause to MLN
      BestScore = WPLL(CL'_S, MLN, RDB)
    end if
    CL_S = accept(CL_S, CL'_S)
  until two consecutive steps have not produced improvement
  return BestClause
6.2 Systems and Methodology
We implemented Algorithm 1 (ILS) in the Alchemy package [12]. We used the implementation of L-BFGS in Alchemy to learn maximum-WPLL weights. We compared the performance of our algorithm with the state-of-the-art algorithms for generative structure learning of MLNs: BS (Beam Search) of [11] and BUSL (Bottom-Up Structure Learning) of [16]. In the UW-CSE domain, we used the same leave-one-area-out methodology as in [22]. In the Cora domain, we performed cross-validation. For each system on each test set, we measured the conditional log-likelihood (CLL) and the area under the precision-recall curve (AUC) for all the predicates. The advantage of the CLL is that it directly measures the quality of the probability estimates produced. The advantage of the AUC is that it is insensitive to the large number of true negatives (i.e., ground atoms that are false and predicted to be false). The CLL of a query predicate is the average, over all its groundings, of the log-probability of the ground atom given the evidence. The precision-recall curve for a predicate is computed by varying the CLL threshold above which a ground atom is predicted to be true; i.e., the ground atoms whose probability of being true is greater than the threshold are taken as positive and the rest as negative. For all algorithms, we used the default parameters of Alchemy, changing only the following ones: maximum variables per clause = 5 for UW-CSE and 6 for Cora; penalization of the WPLL = 0.01 for UW-CSE and 0.001 for Cora. For L-BFGS: convergence threshold = 10^-5 (tight) and 10^-4 (loose); minWeight = 0.5 for UW-CSE for BUSL as in [16], 1 for BS as in [11], and 1 for ILS; minGain = 0.05 for ILS. For ILS we used multiple-independent-walk parallelism, assigning each instance of the algorithm to a separate CPU on a cluster of Intel Core2 Duo 2.13 GHz CPUs.
6.3 Results
After learning the structure, we performed inference on the test fold for both datasets using MC-SAT [20] with number of steps = 10000 and simulated annealing temperature = 0.5. For each experiment, all the groundings of the query predicates on the test fold were commented out (i.e., treated as unknown). MC-SAT produces probability outputs for every grounding of the query predicate on the test fold. We used these values to compute the average CLL over all the groundings and the corresponding AUC (for AUC we used the method proposed in [3]). For ILS we report the best performance in terms of CLL among ten parallel independent walks. Both CLL and AUC results (Table 1) are averaged over all predicates of the domain. Learning times are reported in Table 2. For BS in the Cora domain we are not able to report results, since structure learning with this algorithm did not finish in 45 days. BS is heavily slowed by its systematic top-down nature, which tends to evaluate a very large number of candidates. In the UW-CSE domain, BS easily gets stuck in local optima due to its greedy strategy.

Table 1. Accuracy results for all algorithms

              UW-CSE                          CORA
  Algorithm   CLL              AUC            CLL              AUC
  BS          -0.312±0.046     0.320          -                -
  BUSL        -0.074±0.014     0.431          -0.196±0.003     0.201
  ILS         -0.069±0.016     0.432          -0.102±0.003     0.225
In both domains, ILS gives the best overall results in terms of CLL and AUC. BUSL is competitive with ILS in terms of accuracy, but is much slower. Even though BUSL evaluates fewer candidates than ILS, it changes the MLN completely at each step, so calculating the WPLL becomes very expensive. In ILS this does not happen because, as in [11], at each step L-BFGS is initialized with the current weights (and a zero weight for the new clause), and it converges in a few iterations. We empirically observed that ILS is very effective in escaping local optima, and that further improvements can be achieved by dynamically adapting the strength of the perturbation operator.

Table 2. Average learning times for all algorithms (in minutes)

  Algorithm   UW-CSE   CORA
  BS          335      -
  BUSL        618      9350
  ILS         148      1597

7 Conclusion and Future Work
Markov logic networks are a powerful representation that combines first-order logic and probability. We have introduced an iterated local search algorithm for learning the structure of Markov Logic Networks. The approach is based on a biased sampling of the set of local optima, focusing the search not on the full space of solutions but on a smaller subspace defined by the solutions that are locally optimal for the optimization engine. We have shown through experiments in two real-world domains that the proposed algorithm performs better than state-of-the-art structure learning algorithms for MLNs. Future work includes implementing more sophisticated parallel models such as MPI (Message Passing Interface) or PVM (Parallel Virtual Machine), dynamically adapting the nature of perturbations in ILS, and using a Metropolis criterion in the acceptance function of ILS.
ACKNOWLEDGEMENTS

We thank Pedro Domingos and Stanley Kok for helpful discussions, Marc Sumner for help on using Alchemy, and Lilyana Mihalkova for help on BUSL.
REFERENCES

[1] F. Bacchus, Representing and Reasoning with Probabilistic Knowledge, Cambridge, MA: MIT Press, 1990.
[2] J. Besag, 'Statistical analysis of non-lattice data', The Statistician, 24, 179–195, (1975).
[3] J. Davis and M. Goadrich, 'The relationship between precision-recall and ROC curves', in Proc. 23rd ICML, pp. 233–240, (2006).
[4] L. De Raedt and L. Dehaspe, 'Clausal discovery', Machine Learning, 26, 99–146, (1997).
[5] S. Della Pietra, V. Della Pietra, and J. Lafferty, 'Inducing features of random fields', IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380–392, (1997).
[6] M. R. Genesereth and N. J. Nilsson, Logical Foundations of Artificial Intelligence, San Mateo, CA: Morgan Kaufmann, 1987.
[7] L. Getoor and B. Taskar, Introduction to Statistical Relational Learning, MIT Press, 2007.
[8] J. Halpern, 'An analysis of first-order logics of probability', Artificial Intelligence, 46, 311–350, (1990).
[9] H. H. Hoos and T. Stützle, Stochastic Local Search: Foundations and Applications, Morgan Kaufmann, San Francisco, 2005.
[10] K. Kersting and L. De Raedt, 'Towards combining inductive logic programming with Bayesian networks', in Proc. 11th Int'l Conf. on Inductive Logic Programming, pp. 118–131. Springer, (2001).
[11] S. Kok and P. Domingos, 'Learning the structure of Markov logic networks', in Proc. 22nd Int'l Conf. on Machine Learning, pp. 441–448, (2005).
[12] S. Kok, P. Singla, M. Richardson, and P. Domingos, 'The Alchemy system for statistical relational AI', Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA, http://alchemy.cs.washington.edu/, (2005).
[13] N. Lavrac and S. Dzeroski, Inductive Logic Programming: Techniques and Applications, Ellis Horwood, Chichester, UK, 1994.
[14] H. R. Lourenço, O. Martin, and T. Stützle, 'Iterated local search', in Handbook of Metaheuristics, F. Glover and G. Kochenberger (Eds.), 321–353, Kluwer Academic Publishers, Norwell, MA, USA, (2002).
[15] A. McCallum, 'Efficiently inducing features of conditional random fields', in Proc. UAI-03, pp. 403–410, (2003).
[16] L. Mihalkova and R. J. Mooney, 'Bottom-up learning of Markov logic network structure', in Proc. 24th Int'l Conf. on Machine Learning, pp. 625–632, (2007).
[17] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer, 'Learning probabilistic relational models', in Proc. 16th Int'l Joint Conf. on AI (IJCAI), pp. 1300–1307. Morgan Kaufmann, (1999).
[18] J. Neville and D. Jensen, 'Dependency networks for relational data', in Proc. 4th IEEE Int'l Conf. on Data Mining, pp. 170–177. IEEE Computer Society Press, (2004).
[19] N. Nilsson, 'Probabilistic logic', Artificial Intelligence, 28, 71–87, (1986).
[20] H. Poon and P. Domingos, 'Sound and efficient inference with probabilistic and deterministic dependencies', in Proc. 21st Nat'l Conf. on AI (AAAI), pp. 458–463. AAAI Press, (2006).
[21] J. R. Quinlan, 'Learning logical definitions from relations', Machine Learning, 5, 239–266, (1990).
[22] M. Richardson and P. Domingos, 'Markov logic networks', Machine Learning, 62, 107–136, (2006).
[23] F. Sha and F. Pereira, 'Shallow parsing with conditional random fields', in Proc. HLT-NAACL-03, pp. 134–141, (2003).
[24] P. Singla and P. Domingos, 'Entity resolution with Markov logic', in Proc. ICDM-2006, pp. 572–582. IEEE Computer Society Press, (2006).
[25] P. Singla and P. Domingos, 'Markov logic in infinite domains', in Proc. 23rd UAI, pp. 368–375. AUAI Press, (2007).
[26] M. P. Wellman, J. S. Breese, and R. P. Goldman, 'From knowledge bases to decision models', Knowledge Engineering Review, (1992).
ECAI 2008, M. Ghallab et al. (Eds.), IOS Press, 2008. © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-366
Single-peaked consistency and its complexity

Bruno Escoffier (LAMSADE, Université Paris Dauphine and CNRS; escoffier@lamsade.dauphine.fr), Jérôme Lang (IRIT, Université Paul Sabatier, Toulouse; lang@irit.fr), and Meltem Öztürk (CRIL, Université d'Artois; ozturk@cril.fr)

Abstract. A common way of dealing with the paradoxes of preference aggregation consists in restricting the domain of admissible preferences. The most well-known such restriction is single-peakedness. In this paper we focus on the problem of determining whether a given profile is single-peaked with respect to some axis, and on the computation of such an axis. This problem has already been considered in [2]; we give here a more efficient algorithm and address some related issues, such as the number of orders that may be compatible with a given profile, or the communication complexity of preference aggregation under the single-peakedness assumption.
1 Introduction
Aggregating preferences for finding a consensus between several agents is an important topic at the border between social choice and artificial intelligence. Given the preferences of a set of agents (or voters) over a set of alternatives (or candidates), preference aggregation aims at determining a collective preference relation representing as much as possible the individual preferences, whereas voting rules consist in finding a socially preferred candidate. Among the paradoxes and impossibility theorems of preference aggregation, the most famous may be the following three (in all three cases we assume that there are at least 3 alternatives):

• the Condorcet paradox [3]: a Condorcet cycle is a sequence of candidates x1, ..., xk such that for all i ≤ k−1, a majority of voters prefers xi to xi+1, and a majority of voters prefers xk to x1. Such cycles make it impossible to build a collective preference relation compatible with pairwise majority comparisons between candidates.
• Arrow's theorem [1]: any unanimous aggregation function for which the pairwise comparison between two alternatives is independent of irrelevant alternatives is dictatorial.
• Gibbard and Satterthwaite's theorem [7, 8]: any surjective and nondictatorial voting rule is manipulable.

A profile consists of a collection of preference relations over the candidates (one per voter). In the above results, any profile is admissible. However, in some contexts, voters' preferences may have a special structure restricting the domain of admissible profiles. The most well-known such restriction is single-peakedness. It assumes that there is a natural linear axis, independent of the voters, on which alternatives are positioned: one may for instance think of a left-right axis as in politics, or a numerical axis (when the voters have to decide, for instance, about an amount of money to spend). A voter has single-peaked preferences with respect to such an axis if, on each side of the "peak" (that is, the preferred candidate), his preference grows with the proximity to the peak. It is well known that Condorcet cycles cannot occur when preferences are single-peaked; therefore, one escapes from the Condorcet paradox as well as from Arrow's and Gibbard-Satterthwaite's theorems.

However, this way of escaping the paradoxes and impossibility theorems assumes that the axis on which the candidates are positioned is known in advance. In contexts where it is partially or fully unknown, one should identify it before any aggregation process is started. Therefore, we consider the problem of determining whether, given the preferences of some agents on a set of alternatives, these preferences are single-peaked with respect to some axis (which we refer to as single-peaked consistency), and if so, how one of the possible axes can be determined. This problem has been considered in [2] (as well as the problem of determining whether a profile is single-peaked w.r.t. a tree [9], which is weaker than single-peakedness w.r.t. an axis). They give an algorithm in O(mn²), where n (resp. m) is the number of candidates (resp. voters), based on a matrix representation. We give here a different algorithm, both more intuitive and more efficient, since it works in time O(mn). While the difference between O(mn) and O(mn²) is practically not very significant for standard political elections, where n is typically small, this is no longer the case when the set of alternatives (or "candidates") has a combinatorial structure, which is often the case in AI applications. A related problem is addressed by Conitzer [4]: without prior knowledge of the axis, but knowing the preference relation of one agent (which gives some incomplete information about the axis), how can we elicit as efficiently as possible the preferences of a second agent?

Single-peaked consistency is important in at least two contexts. First, some domains tend to have a single-peaked structure, but for some reason we may not know the axis: in this case, from a few votes (for instance obtained from a sample of votes), we may learn this axis. Second, in some domains it is unclear whether it is reasonable to assume single-peakedness: then, checking the single-peaked consistency of the preference relations of a few voters gives a good hint as to whether single-peakedness is reasonable.(4)

In Section 2 we define single-peaked preferences, and in Section 3 we give a constructive algorithm that checks whether a profile is single-peaked consistent and, if so, returns a compatible axis; this algorithm works in time O(mn), where m is the number of voters and n the number of candidates. In Section 4 we study a few combinatorial aspects of single-peaked preferences; in particular, we give a result on the number of axes that are compatible with a tuple of single-peaked preferences. In Section 5 we give a simple additional result on the communication complexity of preference aggregation of single-peaked preferences. Finally, we point to interesting extensions of our work.

(4) This is for instance of particular interest when alternatives are evaluated on several criteria; here, the hidden axis may be some (a priori unknown) combination of the different criteria (a projection from a multidimensional to a one-dimensional representation).
2 Single-peaked preferences
Let V = {1, ..., m} be a finite set of voters and X = {x1, ..., xn} a finite set of candidates (or alternatives), with n ≥ 3.

Definition 1 A preference relation ≻ on X is a linear order on X. The peak of a preference relation ≻ is the candidate x* = peak(≻) such that x* ≻ x for all x ∈ X \ {x*}. A profile is an m-tuple P = ⟨≻1, ..., ≻m⟩ of preference relations on X.

Definition 2 An axis O (noted by >) is a linear order on X. Given two candidates xi, xj ∈ X, a preference relation ≻ on X whose peak is x*, and an axis O, we say that xi and xj are on the same side of the peak of ≻ iff one of the following two conditions is satisfied: (1) xi > x* and xj > x*; (2) x* > xi and x* > xj. A preference relation ≻ is single-peaked with respect to an axis O if and only if for all xi, xj ∈ X such that xi and xj are on the same side of the peak x* of ≻, one has xi ≻ xj if and only if xi is closer to the peak than xj, that is, if x* > xi > xj or xj > xi > x*.

For simplicity, we sometimes write (as in Example 1) x1 x2 ... xn instead of x1 ≻ x2 ≻ ... ≻ xn or of x1 > x2 > ... > xn.

Example 1 Let X = {x1, x2, x3, x4, x5, x6} and O = (x1 > x2 > x3 > x4 > x5 > x6). The preferences x2 x3 x4 x1 x5 x6; x4 x3 x2 x5 x6 x1; and x6 x5 x4 x3 x2 x1 are single-peaked with respect to O, but not x4 x3 x5 x1 x6 x2. Indeed, x1 and x2 are on the same side of the peak (x4), but x2 is not preferred to x1 although it is closer to the peak than x1.

An interesting question is the existence of a common axis for all voters, such that the preferences of these voters are single-peaked with respect to this common axis.

Definition 3 A profile ⟨≻1, ..., ≻m⟩ is single-peaked with respect to O iff for each voter i, ≻i is single-peaked with respect to O.

Whether single-peakedness seems justified or not strongly depends on the nature of X. It is often deemed reasonable if the axis represents an objective left-right political axis such that voters' preferences are determined only from the position of the candidates on the axis, or else if X is a set of numerical values or, more generally, a set equipped with a natural ordering. Conitzer [4] considers the elicitation of single-peaked preferences. The elicitation process is all the more efficient as the amount of communication required by the process is low. This amount of communication can be measured in terms of the number of elementary queries of the form "between the candidates x and y, which one do you prefer?"
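Definition 2 translates directly into a small checking procedure. The Python sketch below is our own illustration (candidates encoded as strings, preference relations as lists from best to worst); it tests whether a preference relation is single-peaked with respect to an axis and reproduces Example 1.

```python
def is_single_peaked(pref, axis):
    """pref: candidates from most to least preferred; axis: the order O.
    Checks that, on each side of the peak, preference decreases with
    the distance to the peak along the axis (Definition 2)."""
    pos = {c: i for i, c in enumerate(axis)}
    rank = {c: i for i, c in enumerate(pref)}   # smaller rank = more preferred
    peak = pos[pref[0]]
    for x in axis:
        for y in axis:
            same_side = (pos[x] < peak and pos[y] < peak) or \
                        (pos[x] > peak and pos[y] > peak)
            x_closer = abs(pos[x] - peak) < abs(pos[y] - peak)
            if same_side and x_closer and rank[x] > rank[y]:
                return False
    return True

axis = ["x1", "x2", "x3", "x4", "x5", "x6"]
assert is_single_peaked(["x2", "x3", "x4", "x1", "x5", "x6"], axis)
assert not is_single_peaked(["x4", "x3", "x5", "x1", "x6", "x2"], axis)
```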
3 Single-peaked consistency
A very natural question is the following: given an m-voter profile, is it single-peaked with respect to some (unknown) axis? This is defined formally as follows:

Definition 4 (single-peaked consistency) A preference profile P = ⟨≻1, ..., ≻m⟩ on X is single-peaked consistent if there exists an axis O such that for all i, ≻i is single-peaked with respect to O. When P is single-peaked with respect to the axis O, we say that O is compatible with P. For every axis O, we denote by SP(O) the set of preference relations on X that are single-peaked with respect to O. For instance, if n = 3 and O = x1 > x2 > x3, then SP(O) = {x1 x2 x3, x2 x1 x3, x2 x3 x1, x3 x2 x1}.

The main problem associated with this definition is to determine whether a given profile is single-peaked consistent. We now present the main result of this article, i.e., the resolution of this problem. More precisely, we propose an algorithm working in time O(mn) which, given a profile, outputs an axis compatible with this profile if one exists, and finds a contradiction otherwise. The axis is built recursively, starting from the candidates ranked in last position by one or more voters. Indeed, we have the following easy lemma.

Lemma 1 Let x be a candidate ranked in last position by a voter i. If the axis O is compatible with ≻i, then x is either in the leftmost or in the rightmost position in O.

Proof. If x is neither in the leftmost nor in the rightmost position, then there exist a candidate y on the left of x and a candidate z on the right of x (in O). But y ≻i x and z ≻i x, a contradiction with the fact that ≻i is single-peaked with respect to O.

As a consequence of Lemma 1, in a single-peaked consistent profile at most two candidates are ranked last by at least one voter. Before giving the algorithm, we first explain in detail the first (and easiest) iteration. Let L be the set of all candidates ranked last by at least one voter. We consider the three (exhaustive) possible cases:

• |L| ≥ 3: then P is not single-peaked consistent, due to Lemma 1.
• L = {x}: we place x indifferently either in the leftmost or in the rightmost position of the axis; this choice does not create any constraint in the remainder of the construction of the axis. Indeed, the problem is equivalent to first finding an axis compatible with the profile restricted to the other candidates, and then adding x.
• L = {x1, x2}: we place x1 and x2 in the leftmost and the rightmost positions of the axis. P is compatible with an axis O if and only if it is compatible with the inverse of O; as a consequence, the choice (x1 in the leftmost or rightmost position) does not matter.

Then, the candidates of L being positioned, we iterate the process, considering the restriction of the preference relations to the other candidates. Of course, this first iteration is simple because no other candidate is already positioned on the axis. More generally, at each step of the algorithm we have a set T of candidates already positioned at the extremal positions of the axis. Without loss of generality, let T = {x1, x2, ..., xi, xj, xj+1, ..., xn} be the set of candidates already positioned on the axis under construction: we have x1 > x2 > ... > xi in the leftmost positions of the axis O, and xj > xj+1 > ... > xn in the rightmost positions. The other candidates, forming T̄ = X \ T, will be positioned between xi and xj. Then, at this iteration:

• either we find a full compatible axis and P is single-peaked consistent;
• or we find a contradiction and P is not single-peaked consistent;
• or we position one or two new candidates to the right of xi and/or to the left of xj.

The soundness of the algorithm will follow from the recursive proof of the following hypothesis: at each iteration, the axis under construction satisfies the two following properties:

• There exists a compatible axis for P if and only if there exists a compatible axis which extends the axis under construction.
• For any voter k, x1 ≺k x2 ≺k ... ≺k xi and xj ≻k xj+1 ≻k ... ≻k xn.
In particular, from the second item we deduce that the candidates in T, xi and xj excepted, are not the peak of any voter. Let us now analyze the different possible configurations. Let L be the set of candidates ranked last by at least one voter (restricted to the candidates in T̄). Based on Lemma 1, we have 3 possible cases:

1. |L| ≥ 3: contradiction, since 3 candidates would have to be either in position i+1 or in position j−1.

2. L = {x, y}: either x is in position i+1 and y in position j−1, or vice versa, or we will find a contradiction. Let us consider a voter k who ranked x last (among the candidates in T̄):

(a) x ≺k xi and x ≺k xj: this is not possible, since necessarily xi or xj is ranked worse than x by k (xi or xj was the candidate ranked last by k at the previous iteration).

(b) xi ≺k x and xj ≺k x: x being the last candidate in T̄, and since x1 ≺k x2 ≺k ... ≺k xi and xj ≻k xj+1 ≻k ... ≻k xn, any axis compatible with voter k on T̄ will be compatible on all the candidates. Having positioned the first candidates does not create any constraint: indeed, all the candidates in T̄ are ranked better than all the candidates in T by voter k. As a consequence, for voter k, having x in position i+1 and y in position j−1, or vice versa, does not matter.

(c) xi ≺k x ≺k xj ≺k y: x is necessarily in position i+1. Indeed, having x in position j−1 leads to a contradiction: x would be positioned between y and xj on the axis, but x ≺k y and x ≺k xj. Then necessarily x is in position i+1 and y in position j−1. Symmetrically, if xj ≺k x ≺k xi ≺k y, then x is necessarily in position j−1.

(d) xi ≺k x ≺k y ≺k xj (or the symmetrical case): xj is necessarily the peak of voter k (the candidate positioned immediately to its left is worse, and the candidate xj+1 (if any) positioned immediately to its right is also worse, by our recursive hypothesis); hence the candidates in T̄ are necessarily positioned between positions i and j, following the increasing order of voter k. We test whether this axis is compatible with the preferences of the other voters. If so, we have a compatible axis; otherwise we conclude that P is not single-peaked consistent.

We repeat step 2 for all voters. If case 2d occurs (for at least one voter), then the algorithm ends (either we found an axis, or a contradiction). Otherwise, either we find a contradiction (x has to be placed in two different positions) and the algorithm stops, or we position candidates x and y on the axis. To conclude, note that if we are not in case 2d, the induction hypothesis x1 ≺k x2 ≺k ... ≺k xi and xj ≻k xj+1 ≻k ... ≻k xn remains true after positioning x and y (otherwise, in case 2d, the algorithm stops).

3. L = {x}, i.e., each voter ranked x last (in T̄). Several cases may occur for voter k:

(a) x ≺k xi and x ≺k xj: as previously, this case is impossible.
(b) xi ≺k x and xj ≺k x: no constraint.
(c) xi ≺k x ≺k xj (or the inverse): x is necessarily in position i+1.

Hence, if no contradiction is obtained and no compatible axis is found, we position one or two new candidates. Steps 2 and 3 are repeated until all the candidates are positioned or a contradiction occurs. The previous analysis enables us to state the following result:

Proposition 1 Let P be a preference profile. The previous algorithm outputs an axis compatible with P if one exists, and finds a contradiction otherwise.
Example 2 Let X = {x1, x2, x3, x4, x5, x6} and consider two voters with the following preferences: x6 ≺1 x5 ≺1 x4 ≺1 x1 ≺1 x3 ≺1 x2 and x1 ≺2 x6 ≺2 x5 ≺2 x2 ≺2 x3 ≺2 x4.

• Iteration 1: the set L of worst candidates is L = {x1, x6}. T being empty, we can choose the positions of x1 and x6, for instance respectively in the leftmost and rightmost positions. Partial axis: x1 > .... > x6.
• Iteration 2: T̄ = {x2, x3, x4, x5} and L = {x5}. For voter 1, x6 ≺1 x5 ≺1 x1, hence x5 is necessarily in the fifth position of the axis. For voter 2, x1 ≺2 x5 and x6 ≺2 x5, hence for voter 2 the positioning does not matter. Partial axis: x1 > ... > x5 > x6.
• Iteration 3: T̄ = {x2, x3, x4} and L = {x2, x4}. For voter 1, x5 ≺1 x4 ≺1 x1 ≺1 x2, hence x4 is necessarily in the fourth position, and therefore x2 in the second. For voter 2, x1 ≺2 x5 ≺2 x2 ≺2 x4, hence for her the positioning does not matter. Partial axis: x1 > x2 > . > x4 > x5 > x6.
• Iteration 4: T̄ = {x3}. We verify that, with x3 in the third position, the partial axis x2 > x3 > x4 is compatible with the two votes. Then, the axis x1 > x2 > x3 > x4 > x5 > x6 is compatible with the profile constituted by the preference relations of the two voters.

Example 3 Let us consider five candidates and two voters, with x1 ≺1 x2 ≺1 x3 ≺1 x4 ≺1 x5 and x4 ≺2 x3 ≺2 x2 ≺2 x1 ≺2 x5.

• Iteration 1: L = {x1, x4}: we choose x1 > ... > x4.
• Iteration 2: T̄ = {x2, x3, x5} with L = {x2, x3}. For voter 1, x1 ≺1 x2 ≺1 x3 ≺1 x4, hence x4 is necessarily the peak of voter 1. The only possible axis is consequently x1 > x2 > x3 > x5 > x4; it is not compatible with the preference relation of the second voter. This profile is not single-peaked consistent.

Example 4 Let us consider five candidates and two voters, with x1 ≺1 x2 ≺1 x3 ≺1 x4 ≺1 x5 and x4 ≺2 x2 ≺2 x3 ≺2 x1 ≺2 x5. Iteration 1 is as in Example 3. For iteration 2: T̄ = {x2, x3, x5} with L = {x2}. For voter 1, x1 ≺1 x2 ≺1 x4, hence x2 must be immediately to the right of x1. For voter 2, x4 ≺2 x2 ≺2 x1, hence x2 must be immediately to the left of x4. Contradiction: this profile is not single-peaked consistent. Example 4 thus shows that even a 2-voter profile may fail to be consistent.

We now analyze the running time of the algorithm. At each iteration, either we find a compatible axis, or a contradiction, or we position at least one new element. Assuming that each preference relation is given in decreasing order, we find the set L of worst candidates in time O(m). Then, for each voter we do O(1) comparisons. Step 2d can possibly take longer, since we test the compatibility of an axis with the preference relations of all voters. This step is done in time O(nm) (O(n) for each voter), but it occurs at most once during the algorithm. Then, as long as this step does not occur, we have T(n, m) ≤ T(n−1, m) + O(m). This sums up to T(n, m) = O(nm), and the possible execution of step 2d still leads to T(n, m) = O(nm). Therefore:

Proposition 2 The single-peaked consistency problem can be solved in time O(nm).

Proposition 2 improves on the O(mn²) algorithm given in [2] and is established by a completely different method. Interestingly, the algorithm in [9] for computing a tree with respect to which the profile is single-peaked has similarities with ours. However, not only does it work in O(mn²), but it is designed to find a tree and does not guarantee to output an axis when one exists.
Of course, there may exist several axes compatible with a given profile (the number of such axes is the topic of the next section), and given a profile, one might be interested in finding all the axes compatible with it. It is easy to see that the method we proposed can be adapted to find all the axes compatible with a profile P: indeed, it suffices to keep, in steps 2b and 3b, all the different possibilities when several choices are possible. As we will see in the next section, there can be an exponential number of compatible axes, hence of course the running time cannot be polynomially bounded.
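As a companion to the O(mn) algorithm, the following exhaustive sketch (our own illustration, built on the is_single_peaked function given in Section 2; it enumerates all n! axes and is therefore only usable as a reference oracle on small instances) computes all compatible axes and reproduces the verdict of Example 3.

```python
from itertools import permutations

def compatible_axes(profile, candidates):
    """All axes O such that every relation in the profile is single-peaked
    w.r.t. O. Brute force: a testing oracle, not the O(mn) algorithm."""
    return [axis for axis in permutations(candidates)
            if all(is_single_peaked(pref, list(axis)) for pref in profile)]

# Example 3: x1 <1 x2 <1 x3 <1 x4 <1 x5 and x4 <2 x3 <2 x2 <2 x1 <2 x5
cands = ["x1", "x2", "x3", "x4", "x5"]
pref1 = ["x5", "x4", "x3", "x2", "x1"]   # voter 1, best to worst
pref2 = ["x5", "x1", "x2", "x3", "x4"]   # voter 2, best to worst
print(len(compatible_axes([pref1, pref2], cands)))   # 0: not consistent
```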
Example 5 Let us consider 7 candidates and two voters, with:
x4 ≺1 x3 ≺1 x5 ≺1 x6 ≺1 x2 ≺1 x1 ≺1 x7 and x5 ≺2 x6 ≺2 x4 ≺2 x3 ≺2 x2 ≺2 x7 ≺2 x1.
The modified algorithm gives the 8 compatible axes:
x4 > x3 > x2 > x1 > x7 > x6 > x5 and its reverse x5 > x6 > x7 > x1 > x2 > x3 > x4
x4 > x3 > x2 > x7 > x1 > x6 > x5 and its reverse x5 > x6 > x1 > x7 > x2 > x3 > x4
x4 > x3 > x1 > x7 > x2 > x6 > x5 and its reverse x5 > x6 > x2 > x7 > x1 > x3 > x4
x4 > x3 > x7 > x1 > x2 > x6 > x5 and its reverse x5 > x6 > x2 > x1 > x7 > x3 > x4

4 On the number of axes compatible with a profile

In Section 3, we proposed an algorithm for computing an axis compatible with a given profile, but such an axis is not necessarily unique. It is now worth giving bounds on the number of axes compatible with a given profile, as well as the prior probability that a profile is single-peaked consistent. As mentioned earlier, this set of compatible axes may be of some interest when new voters give their preferences. Obviously, the more compatible axes we have, the more likely the new profile is single-peaked consistent. On the other hand, the existence of several compatible axes may be considered as a drawback, for instance if our goal is to learn some structural information about the candidates.

In this section, we focus on the minimum and maximum numbers of axes that are compatible with a set of k distinct votes on n candidates. Let q(k, n) and Q(k, n) be these respective numbers. To begin with, remark that if P is compatible with O then P is compatible with the inverse of O (denoted by O⁻¹). Moreover, of course, the more voters (or candidates) there are, the fewer compatible axes. Hence, q and Q are even and non-increasing with k and n. First, let us deal with the case of a single axis.

Lemma 2 |SP(O)| = 2^(n−1)

Proof. Let O = x1 > x2 > ... > xn and ≻ ∈ SP(O). ≻ is fully determined by (a) its peak x_i and (b) the positions of x1, ..., x_(i−1) in the remaining n − 1 positions. Indeed, we know that xj ≻ xk for xk < xj < x* and for x* < xj < xk (where < refers to the axis O and x* denotes the peak), hence (a) and (b) suffice to describe ≻. There are C(n−1, i−1) possible positionings for x1, ..., x_(i−1) (where C denotes the binomial coefficient), therefore C(n−1, i−1) preference relations in SP(O) whose peak is x_i. To get the cardinality of SP(O), we sum C(n−1, i−1) over i, which gives 2^(n−1).

By symmetry considerations, we obtain that there exist 2^(n−1) axes compatible with a given preference relation. Hence, q(1, n) = Q(1, n) = 2^(n−1). We also know (cf. Example 4 without x5) that q(2, 4) = 0; therefore, for every k ≥ 2 and n ≥ 4 we have q(k, n) = 0. The only missing case is q(2, 3), which can easily be shown to be equal to 2.

The case of Q(k, n) is more interesting. We already know that Q(1, n) = 2^(n−1), and, by Lemma 2, Q(k, n) = 0 for k > 2^(n−1).

We now show that the maximum number of compatible axes is globally inversely proportional to the number of distinct votes. More precisely, Q(k, n) = 2^n/k when k = 2^j, 1 ≤ j ≤ n − 1 (Proposition 3). This gives bounds on Q(k, n) for the other values of k. We first show this result for k = 2^(n−1) (Lemma 3), and then some relations between the values of Q(k, n) when n and/or k change (Lemmas 4 and 5).

Lemma 3 Q(2^(n−1), n) = 2

Proof (sketch). Let O = x1 > x2 > ... > xn. Let us focus on the set of axes compatible with the 2^(n−1) preference relations (see Lemma 2) in SP(O). Let xi, xj with xi >O xj. The relation R: xj ≻ x_(j+1) ≻ ... ≻ xn ≻ x_(j−1) ≻ ... ≻ xi ≻ ... ≻ x1 is compatible with O. Any axis O′ such that xj >O′ xi >O′ xn is not compatible with R. Therefore, O is the only axis compatible with SP(O) whose rightmost element is xn. By symmetry, O⁻¹ is the only one whose rightmost element is x1. The result follows from Lemma 1.

Lemma 4 For all k, n ≥ 1, Q(k, n + 1) ≥ 2Q(k, n)

Proof. Consider a profile P of k preference relations on n candidates that are compatible with Q(k, n) axes. We extend these k relations to n + 1 candidates by positioning the new candidate x_(n+1) last in all relations. For each of the Q(k, n) axes compatible with the initial k relations, we can add x_(n+1) either as the leftmost or as the rightmost element. Therefore we obtain 2Q(k, n) distinct axes compatible with k distinct preference relations. Thus, Q(k, n + 1) ≥ 2Q(k, n).

Lemma 5 (Proof omitted) For all n ≥ 2 and all k: Q(k, n + 1) ≤ max{Q(⌈k/2⌉, n), 2Q(k, n)}.

Proposition 3 For all n ≥ 2 and all j ∈ [1, n − 1]: Q(2^j, n) = 2^(n−j)

Proof (sketch). Let j be between 1 and n − 1. By Lemma 3, Q(2^j, j + 1) = 2. Thanks to Lemma 4, we get Q(2^j, n) ≥ 2^(n−j). Using Lemma 5, we can show that it is in fact an equality.

In particular, we get that for each k between 2 and 2^(n−1), 2^(n−1)/k < Q(k, n) < 2^(n+1)/k (or, if we want tighter bounds: 2^(n−log2(k)−1) < Q(k, n) ≤ 2^(n−log2(k))).

Lemma 2 enables us to give an approximation of the probability that a randomly generated k-voter, n-candidate profile is single-peaked consistent. Suppose P is drawn randomly with uniform probability: for each voter i, the probability that a given preference relation R is the preference relation of voter i is 1/n!, the preference relations of two different voters being independent; therefore each possible profile has a probability of (1/n!)^k. From Lemma 2 we get that, given an axis O and a preference relation R, the probability that R ∈ SP(O) is 2^(n−1)/n!. Now, the probability that a k-voter profile is compatible with a fixed axis O is (2^(n−1)/n!)^k = 2^(k(n−1))/n!^k. This implies that the probability that a k-voter profile on n candidates is single-peaked consistent is smaller than (n!/2) · 2^(k(n−1))/n!^k = 2^(k(n−1))/(2 · n!^(k−1)). (The exact probability is of course lower than that, but gets asymptotically close to this upper bound when the number of voters grows.) Therefore, the probability of single-peaked consistency decreases exponentially both with the number of voters and with the number of candidates. Of course, this computation relies on the assumption that the preference relations of the voters are independent, which is arguably not very realistic; positive correlations between preference relations make the probability of single-peaked consistency decrease more slowly. Finally, note that the probability of single-peaked consistency is lower than the probability of non-occurrence of the Condorcet paradox, which has received much more attention (see e.g. [6]).
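As a quick empirical sanity check of Lemmas 2 and 3, the brute-force helpers sketched at the end of Section 3 can enumerate SP(O) and the axes compatible with it for small n. The snippet below is a toy verification under those assumptions, not part of the proofs.

```python
from itertools import permutations
# reuses is_single_peaked and compatible_axes from the sketch in Section 3

def SP(axis):
    """All preference relations (least preferred first) single-peaked w.r.t. axis."""
    return [list(p) for p in permutations(axis)
            if is_single_peaked(list(p), list(axis))]

for n in range(2, 6):
    sp = SP(tuple(range(n)))
    assert len(sp) == 2 ** (n - 1)          # Lemma 2: |SP(O)| = 2^(n-1)
    assert len(compatible_axes(sp)) == 2    # Lemma 3: only O and its reverse
```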
5 Communication complexity of the aggregation of single-peaked preferences
We end this paper with a short additional result on the communication complexity of the aggregation of single-peaked preferences. As said in Section 1, the restriction to single-peaked profiles allows for escaping the usual impossibility theorems, which means that there exist natural and satisfactory voting rules and aggregation functions under single-peakedness. First, it is well-known that, if the number of voters is odd (which we will now assume for the sake of simplicity), then the median of the peaks is the Condorcet winner, and the pairwise majority aggregation of a profile P, defined, for all x, y ∈ X, by x ≻*P y if and only if |{k | x ≻k y}| > m/2, is a linear order.

We are now interested in the communication complexity of the median voting rule and of pairwise majority aggregation for single-peaked profiles. The deterministic communication complexity of a function is the minimal quantity of information (measured in number of bits) used by a protocol that computes it. One can find a study of the communication complexity of several voting rules (without the single-peakedness restriction) in [5]. In this section, we assume that the axis O is given (and is common knowledge to all voters).

Obviously, the deterministic communication complexity of the median of peaks for single-peaked profiles is at most m⌈log n⌉, since the median of peaks can simply be computed by asking the voters to name their peak, which needs ⌈log n⌉ bits per voter. The lower bound is less obvious. It can be obtained by taking the same fooling set as in the proof of Theorem 3 in [5], and taking an axis whose median is a. This leads to the following result:

Proposition 4 The deterministic communication complexity of the median of peaks is O(m log n) and Ω(m log n). (Actually, the same bounds would hold for the nondeterministic communication complexity; see [5].)

The (deterministic) communication complexity of pairwise majority aggregation is a little less obvious but still very simple:

Proposition 5 The deterministic communication complexity of pairwise majority aggregation for single-peaked profiles is at most 2m⌈log n⌉ + 2m(n − 2).

The proof uses a protocol very similar to the one used in [4] for the elicitation of the single-peaked preferences of a voter. We start by determining the median of peaks, which needs m⌈log n⌉ bits (see above). Then we communicate the result to each voter (which requires again m⌈log n⌉ bits). After this, the voters are asked n − 2 successive pairwise comparisons, according to the following protocol, presented informally on an example. Suppose the median of peaks is x3 (the axis being x1 < x2 < x3 < x4 < ...). We set rank(x3) = 1, and we ask each voter her preference between x2 and x4. If there is a majority for x2, then x2 is the second "socially preferred candidate" and we set rank(x2) = 2. Then, we ask each voter her preference between x1 and x4, and so on. Each of these steps requires the central authority (CE) to send to each voter the information enabling her to know the two candidates she has to compare. For this, CE does not have to send the identities of the two candidates (which would require 2⌈log n⌉ bits) but only one bit, indicating whether the winner of the previous step is the "right" candidate or the "left" one (for instance, after the voters have been asked their preferences between x2 and x4, if there is a majority for x4 then CE sends the information "right" to the voters, who now know that the next comparison is between x2 and x5). Each voter sends her answer back to CE, which requires one bit per voter. Hence each iteration requires 2m bits. There are exactly n − 2 iterations, hence the protocol requires the communication of 2m⌈log n⌉ + 2m(n − 2) bits. Finally, we see easily that x ≻*P y if and only if rank(x) < rank(y), hence the protocol computes ≻*P.
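To make the bookkeeping concrete, here is a toy, centralized simulation of this protocol. The function name, the representation of voters as full rankings (most preferred first), and the uniform 2m-bit charge per comparison are our own illustrative choices; iterations where one side of the axis is exhausted are trivial and need no communication, so the tally stays within the stated bound.

```python
from math import ceil, log2

def pairwise_majority(axis, voters):
    """axis: candidates left to right; voters: rankings (most preferred first),
    each single-peaked w.r.t. axis; the number of voters m is assumed odd.
    Returns the social ranks and the number of bits exchanged."""
    n, m = len(axis), len(voters)
    bits = 2 * m * ceil(log2(n))          # peaks sent in, median broadcast back
    peaks = sorted(axis.index(v[0]) for v in voters)
    p = peaks[m // 2]                     # median peak = Condorcet winner
    rank = {axis[p]: 1}
    left, right, nxt = p - 1, p + 1, 2
    for _ in range(n - 2):                # n-2 successive pairwise comparisons
        if left < 0:                      # one side exhausted: no vote needed
            w = right
        elif right >= n:
            w = left
        else:                             # 1 direction bit + 1 answer bit per voter
            bits += 2 * m
            pro_left = sum(v.index(axis[left]) < v.index(axis[right]) for v in voters)
            w = left if pro_left > m // 2 else right
        rank[axis[w]] = nxt
        nxt += 1
        if w == left:
            left -= 1
        else:
            right += 1
    rank[axis[left] if left >= 0 else axis[right]] = n   # the last candidate
    return rank, bits

axis = ['x1', 'x2', 'x3', 'x4', 'x5']
voters = [['x3', 'x2', 'x4', 'x1', 'x5'],
          ['x4', 'x3', 'x5', 'x2', 'x1'],
          ['x2', 'x3', 'x1', 'x4', 'x5']]
print(pairwise_majority(axis, voters))   # ({'x3':1,'x2':2,'x4':3,'x1':4,'x5':5}, 36)
```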
6 Discussion
In this article we have studied some combinatorial and algorithmic aspects of reasoning with single-peaked preferences. The main contribution is an algorithm that outputs an axis compatible with a profile (when there is one) in time O(mn). We have identified the minimal and maximal numbers of axes that are simultaneously compatible with a profile (which, as a byproduct, gives an approximation of the probability of single-peaked consistency of a randomly generated profile). As a side result, we have given some simple results on the communication complexity of the aggregation of single-peaked preferences.

This work calls for further research in several directions. In particular, as said in Section 4, the probability that a profile is single-peaked decreases dramatically with the number of voters and the number of candidates. However, in many practical cases, even if not stricto sensu single-peaked, the profile can be close (with respect to some metric) to being so. For instance, in a nation-wide political election, given the very high number of voters, the profile is surely not single-peaked; it may nevertheless be approximately single-peaked. To make this precise, we need to define formal notions of "approximate single-peakedness", which are meant to measure how far a profile is from being single-peaked. Several definitions seem natural, such as (1) the minimum number of voters whose deletion gives a single-peaked profile, (2) the minimum number of candidates whose deletion gives a single-peaked profile, or (3) the minimum number of axes such that each preference relation of the profile is single-peaked with respect to at least one of them. Computing these measures of single-peakedness leads to very interesting computational problems, for which our algorithm of Section 3 can be the starting point. For instance, for (1) and (2), we can design a branch-and-bound algorithm that generalizes our algorithm. As for (3), we can modify our algorithm to produce a set of axes which covers the whole profile (i.e. such that each preference relation of the profile is compatible with at least one axis).
ACKNOWLEDGEMENTS The authors are grateful to the Project ANR-05-BLAN-0384 for its financial support.
REFERENCES
[1] K.J. Arrow, Social choice and individual values, J. Wiley, New York, 1951. 2nd edition, 1963.
[2] J. Bartholdi and M. Trick, 'Stable matching with preferences derived from a psychological model', Operations Research Letters, 5(4), 165–169, (1986).
[3] Marquis de Condorcet, Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix, Imprimerie Royale, Paris, 1785.
[4] V. Conitzer, 'Eliciting single-peaked preferences using comparison queries', in Proceedings of AAMAS-07, pp. 408–415, (2007).
[5] V. Conitzer and T. Sandholm, 'Communication complexity of common voting rules', in Proceedings of EC-05, pp. 78–87, (2005).
[6] W. Gehrlein, 'Condorcet's paradox and the likelihood of its occurrence: different perspectives on balanced preferences', Theory and Decision, 52(2), 171–199, (2002).
[7] A. Gibbard, 'Manipulation of voting schemes: A general result', Econometrica, 41, 587–601, (1973).
[8] M.A. Satterthwaite, 'Strategy proofness and Arrow's conditions: Existence and correspondence theorems for voting procedures and social welfare functions', Journal of Economic Theory, 10, 187–217, (1975).
[9] M. Trick, 'Recognizing single-peaked preferences on a tree', Mathematical Social Sciences, 17(1), 329–334, (1989).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-371
Belief Revision through Forgetting Conditionals in Conditional Probabilistic Logic Programs Anbu Yue 1 and Weiru Liu 1 Abstract. In this paper, we present a strategy for revising a conditional probabilistic logic program (PLP) when new information is received (in the form of a probabilistic formula), through the technique of variable forgetting. We first extend the traditional forgetting method to forget a conditional event in PLPs. We then propose two revision operators to revise a PLP based on our forgetting method. By revision through forgetting, the irrelevant knowledge in the original PLP is retained according to the minimal change principle. We prove that our revision operators satisfy most of the postulates for probabilistic belief revision. A main advantage of our revision operators is that a new PLP is explicitly obtained after revision, since our revision operators perform forgetting of a conditional event at the syntax level.
1 Introduction
Belief revision is concerned with how to revise an agent's current beliefs when new evidence is received, where this new evidence is assumed to have the highest priority. Any belief in the current belief set that is inconsistent with the evidence has to be weakened or omitted in order to get a revised consistent set of beliefs.

In the literature on probabilistic belief revision, most research focuses on revising a single probability distribution [5, 1, 8, 4, 3]. However, a single probability distribution is not suitable for representing imprecise probabilistic beliefs, as is the case for a conditional probabilistic logic program (PLP), where a set of probability distributions is usually associated with a PLP [13, 14]. Research on revising a set of probability distributions is reported in [16, 7], but these methods (as well as methods for revising single probability distributions) can only revise probability distributions by a certain kind of evidence, i.e., evidence that is consistent with the original distributions. Therefore, any evidence that is not fully consistent with current knowledge (beliefs) cannot be used.

The notion of forgetting (facts) (also referred to as variable forgetting) proposed in [12] has been applied (or adapted) in many logic-based reasoning techniques. For example, forgetting is used for belief merging in [11], and the relationship between forgetting and belief change is studied in [15]. Traditionally, the main focus has been on forgetting a fact in classical logics. The issue of forgetting conditional knowledge has not been investigated, whilst conditional knowledge is very important, especially in research on (logical) reasoning with conditionals [13, 14]. In this paper, we extend the method of forgetting to forget conditional events in conditional probabilistic logic programs (PLPs).
School of Electronics, Electrical Engineering and Computer Science, Queen’s University of Belfast, Belfast BT7 1NN, UK {a.yue, w.liu}@qub.ac.uk
Given a PLP P, forgetting a conditional event (ψ|φ) in P means that the fact ψ is forgotten only in the domain defined by φ. Assume that (ψ′|φ′)[l′, u′] ∈ P; the challenge is how to retain part or all of the knowledge (ψ′|φ′)[l′, u′] when φ and φ′ are inequivalent. To achieve this, we define a notion of irrelevance for conditional events, so that forgetting a conditional event will retain any irrelevant knowledge. Since any classical theory T can be represented by a PLP [13], we prove that forgetting a fact ψ in a classical theory T is equivalent to forgetting the conditional event (ψ|⊤) in the PLP that represents T.

Based on the technique of forgetting a conditional event, we propose two operators for revising PLPs by a probabilistic formula of the form (ψ|φ)[l, u]. Our revision operators satisfy most of the postulates for imprecise probabilistic belief revision. These postulates were proposed in [18] and were proved to be an extension of the Darwiche and Pearl postulates [2], Bayesian conditioning and Jeffrey's rule. Since any conditional event can be forgotten in a PLP, our revision operators do not require new evidence (information) to be consistent with the original PLP. Another advantage of these revision operators is that a new PLP is explicitly obtained as the result of revision, since forgetting a conditional event is defined at the syntax level. This is in contrast to the traditional probabilistic revision mentioned above, where a revision result is a single or a set of probability distributions (which can be seen as the models of a probabilistic knowledge base, e.g., a PLP).

This paper is organized as follows. In the next section, we briefly review probabilistic logic programming, postulates for probabilistic belief revision, and forgetting. In Section 3, we propose an approach to forgetting a conditional event in a PLP, and in Section 4, we propose two belief revision operators and give their properties. After comparing with related work in Section 5, we conclude this paper.
2 Preliminaries

2.1 Probabilistic logic programs (PLPs)
We briefly review conditional probabilistic logic programs here; see [13, 14] for details. Let Φ be a finite set of predicate symbols and constant symbols, V be a set of object variables, and B be a set of bound constants, which are values in [0,1] describing bounds of probabilities. It is required that Φ contains at least one constant symbol. We use lowercase letters a, b, ... for constants from Φ, uppercase letters X, Y for object variables, and l, u for bound constants. In Φ, there are two predicate symbols ⊤ and ⊥ which represent true and false respectively. An object term is a constant from Φ or an object variable from V. An atom is of the form p(t1, ..., tk), where p is a predicate symbol and ti is an object term. An event or formula is constructed from
a set of atoms by the logical connectives ∧, ∨, ¬ as usual, and a conditional event is of the form ψ|ϕ with events ψ and ϕ. We use Greek letters φ, ψ, ϕ for events, and α, β for conditional events. A probabilistic formula is of the form (ψ|ϕ)[l, u], which means that the probability bounds for the conditional event ψ|ϕ are l and u. We call ψ its consequent and ϕ its antecedent. A conditional probabilistic logic program (PLP) P is a set of probabilistic formulae. We use PL to denote the set of all PLPs, and F to denote the set of all probabilistic formulas. An object term, event, conditional event, probabilistic formula, or PLP is called ground iff it does not contain any object variables from V. The Herbrand universe (denoted HBΦ's counterpart HUΦ) is the set of all constants from Φ, and the Herbrand base HBΦ is a finite nonempty set of all ground atoms constructed from the predicate symbols in Φ and the constants in HUΦ. A possible world I is a subset of HBΦ s.t. ⊤ ∈ I and ⊥ ∉ I, and IΦ is the set of all possible worlds over Φ. An assignment σ maps each object variable to an element of HUΦ. It is extended to object terms by σ(c) = c for all constant symbols from Φ. That an event ϕ is satisfied by I under σ, denoted I |=σ ϕ, is defined inductively as:
• I |=σ p(t1, ..., tn) iff p(σ(t1), ..., σ(tn)) ∈ I;
• I |=σ φ1 ∧ φ2 iff I |=σ φ1 and I |=σ φ2;
• I |=σ φ1 ∨ φ2 iff I |=σ φ1 or I |=σ φ2;
• I |=σ ¬φ iff I ⊭σ φ.
An event ϕ is satisfied by a possible world I, or I is a model of ϕ, denoted I |=cl ϕ, iff I |=σ ϕ for all assignments σ. In this paper, we call the set of the models of ϕ the domain of ϕ. An event ϕ is a logical consequence of an event φ, denoted φ |=cl ϕ, iff all possible worlds that satisfy φ also satisfy ϕ.

A probabilistic interpretation Pr is a probability distribution on IΦ (i.e., as IΦ is finite, Pr is a mapping from IΦ to the unit interval [0,1] such that Σ_{I ∈ IΦ} Pr(I) = 1). The probability of an event ϕ in Pr under an assignment σ is defined as Prσ(ϕ) = Σ_{I ∈ IΦ, I |=σ ϕ} Pr(I). If ϕ is ground, we simply write Pr(ϕ). A probabilistic formula (ψ|ϕ)[l, u] is satisfied by a probabilistic interpretation Pr under an assignment σ, denoted Pr |=σ (ψ|ϕ)[l, u], iff Prσ(ϕ) = 0 or Prσ(ψ|ϕ) ∈ [l, u]. A probabilistic formula μ is satisfied by a probabilistic interpretation Pr, or Pr is a probabilistic model of μ, denoted Pr |= μ, iff Pr |=σ μ for all assignments σ. A probabilistic interpretation is a probabilistic model of a PLP P, denoted Pr |= P, iff Pr is a probabilistic model of all μ ∈ P. A PLP P is satisfiable or consistent iff a model of P exists.

A probabilistic formula (ψ|ϕ)[l, u] is a consequence of the PLP P, denoted P |= (ψ|ϕ)[l, u], iff all probabilistic models of P are also probabilistic models of (ψ|ϕ)[l, u]. A probabilistic formula (ψ|ϕ)[l, u] is a tight consequence of P, denoted P |=tight (ψ|ϕ)[l, u], iff P |= (ψ|ϕ)[l, u], and P ⊭ (ψ|ϕ)[l′, u] and P ⊭ (ψ|ϕ)[l, u′] for all l′ > l and u′ < u (l′, u′ ∈ [0, 1]). Notice that, if P |= (φ|⊤)[0, 0], then it is canonically defined that P |=tight (ψ|φ)[1, 0], where [1, 0] stands for the empty set.
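To make the semantics above concrete, here is a toy evaluator, with possible worlds encoded as frozensets of ground atoms and events as Python predicates; this encoding and all names are illustrative assumptions, not part of the formalism.

```python
def prob(Pr, event):
    """Probability of a ground event; Pr maps possible worlds to masses."""
    return sum(mass for world, mass in Pr.items() if event(world))

def satisfies(Pr, psi, phi, l, u):
    """Pr |= (psi|phi)[l, u]: vacuous if Pr(phi) = 0, else l <= Pr(psi|phi) <= u."""
    p_phi = prob(Pr, phi)
    if p_phi == 0:
        return True
    return l <= prob(Pr, lambda w: psi(w) and phi(w)) / p_phi <= u

# two possible worlds: a flying bird and a non-flying bird
Pr = {frozenset({'bird', 'fly'}): 0.99, frozenset({'bird'}): 0.01}
bird = lambda w: 'bird' in w
fly = lambda w: 'fly' in w
print(satisfies(Pr, fly, bird, 0.98, 1))   # True: Pr(fly|bird) = 0.99
```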
2.2 Probabilistic belief revision
We briefly review the postulates for revising PLPs here; see [18] for details. Given a PLP P, we define the set Bel0(P) as Bel0(P) = {(ψ|φ)[l, u] | P |= (ψ|φ)[l, u], P ⊭ (φ|⊤)[0, 0]} and call it the belief set of P. The condition P ⊭ (φ|⊤)[0, 0] is required because when P |= (φ|⊤)[0, 0], P |= (ψ|φ)[l, u] for all ψ and all [l, u] ⊆ [0, 1]. Without this condition, some counterintuitive conclusions could be inferred; for instance, (ψ|φ)[0, 0.3] and (ψ|φ)[0.9, 1] could simultaneously be beliefs of an agent if P |= (φ|⊤)[0, 0].
Each probabilistic epistemic state Ψ has a unique belief set, denoted Bel0(Ψ), which is a set of probabilistic formulae. Bel0(Ψ) is closed, i.e. Bel0(Bel0(Ψ)) = Bel0(Ψ). We call Ψ a probabilistic epistemic state of a PLP P iff Bel0(Ψ) = Bel0(P). In general, there exist many ways to define a probabilistic epistemic state; e.g., we can define a probabilistic epistemic state as the set of probability distributions that satisfy the PLP, see [18] for details. Furthermore, we have the following inference relations: Ψ |= (ψ|φ)[l, u] iff (ψ|φ)[l, u] ∈ Bel0(Ψ), and Ψ |=tight (ψ|φ)[l, u] iff Ψ |= (ψ|φ)[l, u] and for all [l′, u′] ⊂ [l, u], Ψ ⊭ (ψ|φ)[l′, u′]. We write Ψ ∧ (ψ|φ)[l, u] to represent Bel0(Ψ) ∪ {(ψ|φ)[l, u]}. Also, Ψ |= (ψ|φ)[l, u] iff P |= (ψ|φ)[l, u] when P ⊭ (φ|⊤)[0, 0].

Definition 1 A conditional event (ψ|φ) is more specific than another conditional event (ψ′|φ′), denoted (ψ|φ) ⊑ (ψ′|φ′), iff
• φ |=cl φ′ ∧ ψ′, or
• φ |=cl φ′ ∧ ¬ψ′.

A conditional event (ψ|φ) affects only the relationship (probability distributions) between φ ∧ ψ and φ ∧ ¬ψ. When (ψ|φ) ⊑ (ψ′|φ′) holds, (ψ|φ) provides detailed information about φ, which is a sub-event of φ′ ∧ ψ′ or of φ′ ∧ ¬ψ′. Therefore, (ψ|φ) is more specific than (ψ′|φ′).

Definition 2 (perpendicular) A conditional event (ψ|φ) is perpendicular to another conditional event (ψ′|φ′), denoted (ψ|φ) ⫫ (ψ′|φ′), iff (ψ|φ) ⊑ (ψ′|φ′), or (ψ′|φ′) ⊑ (ψ|φ), or |=cl ¬(φ′ ∧ φ).

The perpendicularity relation formalizes a kind of irrelevance between two conditional events. The above definition is an extension of the definition of perpendicularity in [9], in which the first condition is not required. If (ψ|φ) ⊑ (ψ′|φ′), then (ψ|φ) is more specific than (ψ′|φ′) and thus (ψ|φ) will not affect (ψ′|φ′). We know that (ψ|φ) cannot affect the probability distributions within the domain of (ψ ∧ φ) or the domain of (¬ψ ∧ φ), so if (ψ′|φ′) ⊑ (ψ|φ), then φ′ is a sub-event of (ψ ∧ φ) or of (¬ψ ∧ φ), and therefore (ψ|φ) cannot affect (ψ′|φ′). If |=cl ¬(φ′ ∧ φ), then φ and φ′ have disjoint domains, so (ψ|φ) and (ψ′|φ′) are irrelevant to each other.

Definition 3 ([18]) Let P be a PLP with epistemic state Ψ and μ = (ψ|φ)[l, u] be a probabilistic formula. The result of revising P by μ is another probabilistic epistemic state, denoted Ψ ◦ μ, where ◦ is a revision operator. Operator ◦ is required to satisfy the following postulates:
R*1 Ψ ◦ μ |= μ
R*2 Ψ ∧ μ |= Ψ ◦ μ
R*3 if Ψ ∧ μ is satisfiable, then Ψ ◦ μ |= Ψ ∧ μ
R*4 Ψ ◦ μ is unsatisfiable only if μ is unsatisfiable
R*5 Ψ ◦ μ ≡ Ψ ◦ μ′ if μ ≡ μ′
R*6 Let μ = (ψ|φ)[l, u] and Ψ ◦ μ |=tight (ψ|φ)[l′, u′]. Let μ1 = (ψ|φ)[l1, u1] and Ψ ◦ μ1 |=tight (ψ|φ)[l1′, u1′]. For any ε > 0, if |u1 − u| + |l1 − l| < ε, and both (ψ|φ)[l, u] and (ψ|φ)[l′, u′] are satisfiable, then |u1′ − u′| + |l1′ − l′| < ε.
R*7 if Ψ |= (φ|⊤)[l, u], then (Ψ ◦ μ) |= (φ|⊤)[l, u]
R*8 for all ψ′ and φ′, if (ψ|φ) ⫫ (ψ′|φ′) and Ψ ◦ μ |= (ψ′|φ′)[l, u] then Ψ |= (ψ′|φ′)[l, u].
R*1–R*5 are analogues of postulates R1–R4 in [2]. We do not have postulates corresponding to R5 and R6 in [2], since revision with conjunctions of conditional events is more complicated and is beyond the scope of this paper. R*6 is a sensitivity requirement, which says that a slight modification of the bounds of μ = (ψ|φ)[l, u] (i.e., μ1 = (ψ|φ)[l1, u1]) shall not affect the result of revision significantly. R*7 says that revising by μ = (ψ|φ)[l, u] should not affect the statements about φ (but the impreciseness of φ may be decreased). Recall that the perpendicularity condition characterizes a kind of irrelevance; R*8 says that any knowledge irrelevant to the new evidence should not be affected by the revision with this evidence. It is proved that these postulates are an extension of the modified AGM postulates and the Darwiche and Pearl postulates for iterated revision [2]. It is also proved that these postulates lead to Jeffrey's rule and Bayesian conditioning when the original PLP (probabilistic epistemic state) defines a single probability distribution.
2.3 Forgetting a fact
Given a set of ground formulas T and an atom p, forgetting p in T means obtaining another set of formulas which is weaker than T, but which retains the same conclusions that are irrelevant to p.

Let p(t) be a ground atom, and I1, I2 be two possible worlds. Define I1 ≈p(t) I2 iff I1 and I2 agree on everything except possibly on the truth value of p(t):
1. I1 and I2 have the same domain, i.e. I1 and I2 are defined on the same Herbrand base.
2. for every predicate symbol q that differs from p, and for every ground term t′, q(t′) ∈ I1 iff q(t′) ∈ I2.

Definition 4 ([12]) Let T be a set of formulae and p(t) be a ground atom. The result of forgetting p(t), denoted T′ = forget_cl(T, p(t)), is a set of formulae such that, for any possible world I′, I′ is a model of T′ iff there is a model I of T such that I ≈p(t) I′.

Proposition 1 ([12]) For any theory T and ground atom p(t), T |= forget_cl(T, p(t)).

Let ϕ be a ground formula and p(t) be a ground atom. We use ϕ⁺p(t) (resp. ϕ⁻p(t)) to denote the result of replacing every occurrence of p(t) in ϕ by ⊤ (resp. ⊥).

Proposition 2 ([12]) Let ϕ be a ground formula and p(t) be a ground atom. Suppose that the theory T = {ϕ}; then forget_cl(T, p(t)) ≡ {ϕ⁺p(t) ∨ ϕ⁻p(t)}.

Let p1(t1), ..., pn(tn) be a sequence of ground atoms. The result of forgetting p1(t1), ..., pn(tn) in T, denoted forget_cl(T, p1(t1), ..., pn(tn)), is inductively defined as forget_cl(forget_cl(T, p1(t1), ..., p(n−1)(t(n−1))), pn(tn)).

Proposition 3 ([12]) For any theory T and any ground atoms p1(t1), p2(t2), forget_cl(forget_cl(T, p1(t1)), p2(t2)) and forget_cl(forget_cl(T, p2(t2)), p1(t1)) are logically equivalent.

The above proposition indicates that the order of the sequence p1(t1), ..., pn(tn) is not important in forget_cl(T, p1(t1), ..., pn(tn)). In this paper, we write forget_cl(T, A) to represent forget_cl(T, p1(t1), ..., pn(tn)), where A = {p1(t1), ..., pn(tn)}. We also write forget_cl(T, φ) to represent forget_cl(T, Aφ), where Aφ is the set of atoms that appear in φ.
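Proposition 2 makes classical forgetting directly executable in the propositional case: forget(T, p) is equivalent to T[p/⊤] ∨ T[p/⊥]. The following sketch, assuming sympy is available, illustrates this; the function name and example formula are our own.

```python
from sympy import symbols, Or, And, Not, simplify_logic, true, false

def forget(formula, atom):
    # replace every occurrence of `atom` by true, resp. false, and disjoin
    return simplify_logic(Or(formula.subs(atom, true),
                             formula.subs(atom, false)))

p, q, r = symbols('p q r')
T = And(Or(Not(p), q), Or(p, r))     # (¬p ∨ q) ∧ (p ∨ r)
print(forget(T, p))                  # q | r : the p-free consequences survive
```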
3 Forgetting a Conditional Event
Sometimes, forgetting a fact under certain conditions is useful, for example, forgetting fact ψ when φ is given. To achieve this, we provide an approach to forgetting a conditional event (ψ|φ), which means forgetting ψ only in the domain of φ, and keeping the original knowledge that is outside the domain of φ unchanged.

Definition 5 Let [l, u] and [l′, u′] be two intervals. The closest subinterval of [l′, u′] to [l, u], denoted clb([l′, u′], [l, u]), is defined by clb([l′, u′], [l, u]) = [lb, ub], where
• if u′ < l then lb = ub = u′,
• if l′ > u then lb = ub = l′,
• otherwise, lb = max{l, l′}, ub = min{u, u′}.

Definition 6 Let P be a PLP and μ ∈ P where μ = (ψ1|φ1)[l, u]. Assume that ν = (ψ2|φ2) is a conditional event. We define forget_P(μ, ν) as:
forget_P(μ, ν) = { (φ2|φ1)[la, ua], (φ1|φ2)[lb, ub], (ψ1|φ1 ∧ ¬φ2)[l1, u1], (forget_cl(ψ1, ψ2)|φ1 ∧ φ2)[l2, u2] }
where P |=tight (φ2|φ1)[la, ua], P |=tight (φ1|φ2)[lb, ub], P |=tight (ψ1|φ1 ∧ ¬φ2)[l′, u′], P |=tight (ψ1|φ1)[l″, u″], clb([l′, u′], [l″, u″]) = [l1, u1], and P |=tight (forget_cl(ψ1, ψ2)|φ1 ∧ φ2)[l2, u2]. We define forget(P, ν) = ⋃_{μ ∈ P} forget_P(μ, ν).

When forgetting a conditional event (ψ2|φ2), the domain of the original beliefs should be divided into two parts: within the domain of φ2 and outside the domain of φ2. That is, if (ψ1|φ1)[l, u] ∈ P, then the knowledge about (ψ1|φ1) in P is implicitly contained in (ψ1|φ2 ∧ φ1) and (ψ1|φ1 ∧ ¬φ2). Intuitively, the former may be affected and the latter should be retained. Also, the knowledge about (ψ1|φ1) should be changed as little as possible. To achieve this, the knowledge about (ψ1|φ1) must be retained through the knowledge about (ψ1|φ1 ∧ ¬φ2) in the resulting PLP. In addition, the relationships (subsumption, overlap, disjointness, etc.) between the domains of φ1 and φ2 should not be affected.

Proposition 4 Let P be a PLP, and ν = (ψ|φ) be a conditional event. If P ⊭ (φ|⊤)[0, 0] then forget(P, ν) |=tight (ψ|φ)[0, 1]. If P |= (φ|⊤)[0, 0] then forget(P, ν) ≡ P, and we have that ν does not appear in forget(P, ν).

In the above proposition, P |= (φ|⊤)[0, 0] indicates that any conditional event with φ as the antecedent has no effect on the semantics of P; however, at the syntax level, ν does not appear in forget(P, ν).

Proposition 5 Let P = {(ψ1|φ1)[l1, u1], ..., (ψn|φn)[ln, un]} be a PLP, and ν = (ψ|φ) be a conditional event. Suppose that (ψ|φ) ⊑ (ψi|φi) for all i ∈ {1, ..., n}; then forget(P, ν) ≡ P.

However, if P = {(ψ1|φ1)[l1, u1], ..., (ψn|φn)[ln, un]} and (ψi|φi) ⊑ (ψ|φ) holds for i = 1, ..., n, then forget(P, (ψ|φ)) ≡ P does not hold in general. This is because forgetting a conditional event (ψ|φ) will forget not only the relationship between (φ ∧ ψ) and (φ ∧ ¬ψ), but also all statements about ψ in the domain of φ.
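The clb operation of Definition 5 is directly executable; here is a minimal sketch, with intervals represented as Python pairs (the function name is kept from the paper, the encoding is ours):

```python
def clb(lu_prime, lu):
    """Closest subinterval of [l', u'] to [l, u] (Definition 5)."""
    (lp, up), (l, u) = lu_prime, lu
    if up < l:                       # [l', u'] lies entirely below [l, u]
        return (up, up)
    if lp > u:                       # [l', u'] lies entirely above [l, u]
        return (lp, lp)
    return (max(l, lp), min(u, up))  # the intervals overlap

print(clb((0.0, 0.3), (0.5, 0.9)))   # (0.3, 0.3)
print(clb((0.2, 0.8), (0.5, 0.9)))   # (0.5, 0.8)
```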
Proposition 6 Let P = {(φ1 ∧ ··· ∧ φn|⊤)[1, 1]} and ν = (ϕ|⊤). Then for any event ψ, forget(P, ν) |= (ψ|⊤)[1, 1] iff forget_cl({φ1 ∧ ··· ∧ φn}, ϕ) |=cl ψ.

Let two theories be T1 = {φ1, ..., φn} and T2 = {φ1 ∧ ··· ∧ φn}; then T1 ≡ T2, T2 is logically equivalent to the PLP P = {(φ1 ∧ ··· ∧ φn|⊤)[1, 1]}, and forget(P, ν) is equivalent to forget_cl(T1, ϕ), where ν = (ϕ|⊤). As a consequence, forgetting facts is a special case of forgetting conditional events.

Definition 7 Let P be a PLP and its set of probabilistic models be Pr, and let ν = (ψ|φ) be a conditional event. We let Pr^ν_P be the set of probability distributions s.t. Pr′ ∈ Pr^ν_P iff there exists a Pr ∈ Pr such that
(1) Pr′(I) = Pr(I), if I ⊭ φ;
(2) Σ_{J |= φ, J ≈ψ I} Pr′(J) = Σ_{J |= φ, J ≈ψ I} Pr(J), if I |= φ;
(3) Pr′(φ ∧ φ′) = Pr(φ ∧ φ′), if there exists (ψ′|φ′)[l, u] ∈ P.
In the above definition, condition (1) means that when φ is not satisfied, nothing should be forgotten; condition (2) says that even when φ is satisfied, only the beliefs that are relevant to ψ are forgotten; condition (3) says that within the domain of φ, the probabilities of the antecedents of the probabilistic formulae in P should not be affected. Obviously, Pr ⊆ Pr^ν_P and therefore Pr^ν_P is not empty iff P is satisfiable.

Proposition 7 Let P be a PLP and ν = (ψ|φ) be a conditional event. Then Pr^ν_P is the set of probabilistic models of forget(P, ν).

Forgetting a conditional event will not introduce new knowledge.

Proposition 8 Let P be a PLP and ν = (ψ|φ) be a conditional event. Then forget(P, ν) |= P.

Example 1 Let P be given as:
P = { (fly(t)|bird(t))[0.98, 1], (bird(t)|penguin(t))[1, 1], (penguin(t)|bird(t))[0.1, 1] }
From P, it can be inferred that P |= (fly(t)|penguin(t))[0.8, 1]. When we are informed that this conclusion may be wrong, we want to revise P by forgetting ν = (fly(t)|penguin(t)). After forgetting ν from P we can get the PLP forget(P, ν). It is worth noting that, for any PLP P, any events φ and ψ, and any l, u ∈ [0, 1], the statements P |= (⊤|φ)[1, 1], P |= (φ|⊥)[l, u] and P |= (ψ|φ ∧ ψ)[1, 1] always hold. By omitting such probabilistic formulae, forget(P, ν) can be simplified as:
forget(P, ν) = { (penguin(t)|bird(t))[0.1, 1], (bird(t)|penguin(t))[1, 1], (fly(t)|bird(t) ∧ ¬penguin(t))[0.98, 1] }
In the original P, P |=tight (fly(t)|bird(t) ∧ ¬penguin(t))[0, 1]. The lower bound comes from the assumption that it is possible that all birds are penguins and all penguins cannot fly. In other words, this conclusion depends on the knowledge about (fly(t)|penguin(t)), which should be forgotten, and thus this bound is not suitable. On the contrary, it is stated in forget(P, ν) that (fly(t)|bird(t) ∧ ¬penguin(t))[0.98, 1], which retains the knowledge that a bird (which is not a penguin) can very likely fly. Letting P′ = forget(P, ν), we have P′ |= (fly(t)|penguin(t))[0, 1], which means that in P′ the knowledge about whether penguins can fly is indeed totally forgotten.
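The entailment P |= (fly(t)|penguin(t))[0.8, 1] claimed in Example 1 can be checked numerically by linear programming over the eight truth assignments to (bird, penguin, fly), assuming scipy is available; the helper names and world encoding are our own. Minimizing Pr(fly ∧ penguin) − 0.8·Pr(penguin) over all models of P and finding the minimum to be 0 establishes exactly that Pr(fly|penguin) ≥ 0.8 whenever Pr(penguin) > 0, and that the bound is tight.

```python
from itertools import product
from scipy.optimize import linprog

worlds = list(product([0, 1], repeat=3))   # (bird, penguin, fly)

def mass(pred):                            # indicator row over the 8 worlds
    return [1.0 if pred(b, p, f) else 0.0 for (b, p, f) in worlds]

A_ub, b_ub = [], []
# (fly|bird)[0.98, 1]:  Pr(fly ∧ bird) >= 0.98 Pr(bird)
A_ub.append([0.98 * x - y for x, y in zip(mass(lambda b, p, f: b),
                                          mass(lambda b, p, f: b and f))])
b_ub.append(0.0)
# (penguin|bird)[0.1, 1]:  Pr(penguin ∧ bird) >= 0.1 Pr(bird)
A_ub.append([0.1 * x - y for x, y in zip(mass(lambda b, p, f: b),
                                         mass(lambda b, p, f: b and p))])
b_ub.append(0.0)
# (bird|penguin)[1, 1]:  Pr(penguin ∧ ¬bird) = 0, and masses sum to 1
A_eq = [mass(lambda b, p, f: p and not b), [1.0] * 8]
b_eq = [0.0, 1.0]
# objective: Pr(fly ∧ penguin) - 0.8 Pr(penguin)
c = [y - 0.8 * x for x, y in zip(mass(lambda b, p, f: p),
                                 mass(lambda b, p, f: p and f))]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(res.fun)   # ~0: the lower bound Pr(fly|penguin) >= 0.8 is tight
```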
4 Belief Revision by Forgetting
In this section, we define two specific revision operators for revising a PLP P with a probabilistic formula.

Definition 8 Let P be a PLP in PL, and μ = (ψ|φ)[l, u] be a probabilistic formula in F. Let ν = (ψ|φ). We define the operator ◦₀ : PL × F → PL such that P ◦₀ μ = forget(P, ν) ∪ {μ}.

Example 2 Let P be given as in Example 1, and μ = (fly(t)|penguin(t))[0, 0] be a probabilistic formula.
P ◦₀ μ = { (fly(t)|bird(t) ∧ ¬penguin(t))[0.98, 1], (bird(t)|penguin(t))[1, 1], (penguin(t)|bird(t))[0.1, 1], (fly(t)|penguin(t))[0, 0] }
Now we can infer that P ◦₀ μ |=tight (fly(t)|bird(t))[0, 0.9], which is intuitively correct: the upper bound on the probability that a bird can fly is changed to 0.9, following the fact that some birds (penguins) cannot fly.

In the above example, we have P ◦₀ μ |=tight (fly(t)|bird(t))[0, 0.9]. This lower bound (0) means that it is possible that no birds can fly. The lower bound comes from the possibility that all birds are penguins, since P |= (penguin(t)|bird(t))[0.1, 1]. Using operator ◦₀ to revise P with (fly(t)|penguin(t))[0, 0] does not eliminate this possibility. On the other hand, since the new information that penguins cannot fly contradicts the original general knowledge that most birds can fly, it implicitly suggests that penguins are very different from typical birds. Formally, the probability of (penguin(t)|bird(t)) should be low. In fact, if we had (penguin(t)|bird(t))[0.1, 0.1] in P ◦₀ μ in the above example, we would get P′ = (P ◦₀ μ) ∪ {(penguin(t)|bird(t))[0.1, 0.1]} and P′ |=tight (fly(t)|bird(t))[0.882, 0.9], which gives much tighter and more intuitive bounds for (fly(t)|bird(t)).

This discussion suggests that sometimes the contradiction between new information (ψ|φ)[l, u] and an original PLP P implies that the antecedent φ is a special case of φ′, for any φ′ such that (ψ′|φ′)[l′, u′] ∈ P and φ′ is relevant to (ψ|φ). Here, φ′ being relevant to (ψ|φ) means that a tighter probability bound for (ψ|φ) can be inferred from P only when more knowledge about the relationship between φ and φ′ (i.e. a tighter bound for (φ′|φ) or (φ|φ′)) is provided. The above discussion leads us to define another revision operator ◦. When revising with this operator, the impreciseness of the antecedent of the new information may be decreased.

Definition 9 Let P be a PLP in PL, and μ = (ψ|φ)[l, u] be a probabilistic formula in F. Let ν = (ψ|φ). We define the operator ◦ : PL × F → PL which satisfies
(1) μ ∈ P ◦ μ,
(2) forget(P, ν) ⊆ P ◦ μ,
(3) for all (ψ′|φ′)[l′, u′] ∈ P, (φ′|φ)[la, ua] ∈ P ◦ μ and (φ|φ′)[lb, ub] ∈ P ◦ μ, where
P |=tight (ψ|φ)[l0, u0], clb([l0, u0], [l, u]) = [l1, u1],
P ∪ {(ψ|φ)[l1, u1]} |=tight (φ′|φ)[la, ua],
P ∪ {(ψ|φ)[l1, u1]} |=tight (φ|φ′)[lb, ub],
and P ◦ μ is the smallest set (with respect to set inclusion) satisfying the above conditions. Obviously, P ◦ μ |= P ◦₀ μ.
Example 3 Let P be given as in Example 1, and μ = (fly(t)|penguin(t))[0, 0] be a probabilistic formula.
P ◦ μ = { (fly(t)|bird(t) ∧ ¬penguin(t))[0.98, 1], (bird(t)|penguin(t))[1, 1], (penguin(t)|bird(t))[0.1, 0.1], (fly(t)|penguin(t))[0, 0] }
Now we have that most birds can fly, since P ◦ μ |=tight (fly(t)|bird(t))[0.882, 0.9], and this knowledge is still imprecise.

Proposition 9 Both operators ◦₀ and ◦ satisfy the postulates R*1, R*2, and R*4–R*7.

Neither operator satisfies R*3 in general. This comes from the fact that our operators retain the impreciseness of the original knowledge, whilst Ψ ∧ μ decreases the impreciseness of the original knowledge. The two operators also do not satisfy R*8 in general. Let ◦′ be any revision operator; then R*8 is equivalent to the following three separate postulates:
R*8′.1 for all ψ′ and φ′, if (ψ|φ) ⊑ (ψ′|φ′) and P ◦′ (ψ|φ)[l, u] |= (ψ′|φ′)[l′, u′] then P |= (ψ′|φ′)[l′, u′].
R*8′.2 for all ψ′ and φ′, if (ψ′|φ′) ⊑ (ψ|φ) and P ◦′ (ψ|φ)[l, u] |= (ψ′|φ′)[l′, u′] then P |= (ψ′|φ′)[l′, u′].
R*8′.3 for all ψ′ and φ′, if |=cl ¬(φ ∧ φ′) and P ◦′ (ψ|φ)[l, u] |= (ψ′|φ′)[l′, u′] then P |= (ψ′|φ′)[l′, u′].

Proposition 10 The operators ◦₀ and ◦ satisfy R*8′.1 and R*8′.3.

R*8′.2 is not satisfied by ◦₀ and ◦ because forgetting a conditional event (ψ|φ) may affect the knowledge about (ψ′|φ′) if (ψ′|φ′) ⊑ (ψ|φ).
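The tight bounds [0.882, 0.9] in Example 3 can be checked by a total-probability decomposition over the penguin and non-penguin cases; this is a verification sketch, not part of the paper's proof machinery:

Pr(fly|bird) = Pr(fly|bird ∧ ¬penguin) · Pr(¬penguin|bird) + Pr(fly|bird ∧ penguin) · Pr(penguin|bird).

With Pr(penguin|bird) = 0.1, Pr(fly|bird ∧ penguin) = Pr(fly|penguin) = 0 (since penguins are certainly birds), and Pr(fly|bird ∧ ¬penguin) ∈ [0.98, 1], this gives Pr(fly|bird) ∈ [0.98 × 0.9 + 0, 1 × 0.9 + 0] = [0.882, 0.9].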
5 Related Work and Conclusion
Related work: Traditionally, forgetting deletes some concepts (atoms or facts) from a given theory in a classical logic-based language. In this paper, we extended the concept of forgetting to forgetting conditional events rather than facts, in the framework of conditional probabilistic logic programming. Since facts can be represented as a special kind of conditional events, i.e., conditional events that have tautologies as their antecedents, it is not surprising that our forgetting method subsumes the original approach to forgetting facts. In [15], forgetting facts is deployed in belief change in propositional logic. When reducing the forgetting of conditional events to the forgetting of facts in our operator ◦₀ (since, when the bounds of every probabilistic formula are either [0,0] or [1,1], a PLP actually contains a set of propositional formulae), we obtain the update operator defined in [15]. However, there is no counterpart of our ◦ in [15].

In the literature on probabilistic belief revision, most revision operators are model-based, that is, a revision operator revises a single or a set of probability distributions, and the result is also a single or a set of probability distributions. This kind of revision keeps the probabilistic knowledge implicit, especially when this knowledge is in the form of a PLP. On the contrary, our operators are defined at the syntax level, and a revised PLP is obtained as the result. Many probabilistic belief revision operators require that new knowledge be consistent with the original knowledge [1, 8, 3, 4, 10, 17]. In contrast, since any conditional event can be forgotten from a PLP, we do not require that new knowledge be consistent with a given PLP. Furthermore, our revision results can still be imprecise (see Example 3), while some other revision operators [1, 8, 3, 4, 5, 6, 10] produce single probability distributions as the result of revision.
Conclusions: In this paper, we extended the concept of forgetting to forgetting conditional events in PLPs and proposed two revision operators based on our approach to forgetting conditional events. Our revision operators forget inconsistent knowledge and retain knowledge irrelevant to the new information. Among the two operators we have defined, the second operator (◦) is particularly designed for situations where the antecedent of a conditional event (the new information) is imprecise in the original PLP. The first revision operator does not change anything (bounds of probabilities) about the antecedent after revision, whilst the second operator decreases the imprecision of the antecedent (in terms of probability bounds). The rationale of operator ◦ comes from the assumption that if new information contradicts the original PLP, then this suggests that the antecedent may be a special case of a general concept defined in this PLP (such as penguin being a special, but not common, type of bird). Our operators satisfy most of the postulates for probabilistic belief revision and operate at the syntax level of a PLP, so that a new PLP is explicitly returned as the result of revision.
REFERENCES
[1] Hei Chan and Adnan Darwiche, 'On the revision of probabilistic beliefs using uncertain evidence', Artif. Intell., 163(1), 67–90, (2005).
[2] Adnan Darwiche and Judea Pearl, 'On the logic of iterated belief revision', Artif. Intell., 89(1-2), 1–29, (1997).
[3] Didier Dubois and Henri Prade, 'Focusing vs. belief revision: A fundamental distinction when dealing with generic knowledge', in Proc. of ECSQARU-FAPR'97, pp. 96–107, (1997).
[4] B. Van Fraasen, 'Probabilities of conditionals', in Proc. of Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science, pp. 261–300, (1976).
[5] I. R. Goodman and Hung T. Nguyen, 'Probability updating using second order probabilities and conditional event algebra', Inf. Sci., 121(3-4), 295–347, (1999).
[6] Adam J. Grove and Joseph Y. Halpern, 'Probability update: Conditioning vs. cross-entropy', in Proc. of UAI'97, pp. 208–214, (1997).
[7] Adam J. Grove and Joseph Y. Halpern, 'Updating sets of probabilities', in Proc. of UAI'98, pp. 173–182, (1998).
[8] Peter Grünwald and Joseph Y. Halpern, 'Updating probabilities', J. Artif. Intell. Res. (JAIR), 19, 243–278, (2003).
[9] Gabriele Kern-Isberner, 'Postulates for conditional belief revision', in Proc. of IJCAI'99, pp. 186–191, (1999).
[10] Gabriele Kern-Isberner and Wilhelm Rödder, 'Belief revision and information fusion on optimum entropy', Int. J. Intell. Syst., 19(9), 837–857, (2004).
[11] Jérôme Lang, Paolo Liberatore, and Pierre Marquis, 'Propositional independence: Formula-variable independence and forgetting', J. Artif. Intell. Res. (JAIR), 18, 391–443, (2003).
[12] Fangzhen Lin and Raymond Reiter, 'Forget it!', in Working Notes, AAAI Fall Symposium on Relevance, eds., Russell Greiner and Devika Subramanian, pp. 154–159, Menlo Park, California, (1994). American Association for Artificial Intelligence.
[13] Thomas Lukasiewicz, 'Probabilistic logic programming', in Proc. of ECAI'98, pp. 388–392, (1998).
[14] Thomas Lukasiewicz, 'Probabilistic logic programming with conditional constraints', ACM Trans. Comput. Log., 2(3), 289–339, (2001).
[15] Abhaya C. Nayak, Yin Chen, and Fangzhen Lin, 'Forgetting and knowledge update', in Proc. of Australian Conference on Artificial Intelligence, pp. 131–140, (2006).
[16] Damjan Skulj, 'Jeffrey's conditioning rule in neighbourhood models', Int. J. Approx. Reasoning, 42(3), 192–211, (2006).
[17] Frans Voorbraak, 'Probabilistic belief change: Expansion, conditioning and constraining', in Proc. of UAI'99, pp. 655–662, (1999).
[18] Anbu Yue and Weiru Liu, 'Revising imprecise probabilistic beliefs in the framework of probabilistic logic programming', in Proc. of AAAI'08, (2008).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-376
Mastering the Processing of Preferences by Using Symbolic Priorities in Possibilistic Logic Souhila Kaci and Henri Prade 1 Abstract. The paper proposes a new approach to the handling of preferences expressed in a compact way under the form of conditional statements. These conditional statements are translated into classical logic formulas associated with symbolic levels. Ranking two alternatives then leads to comparing their respective amounts of violation with respect to the set of formulas expressing the preferences. These symbolic violation amounts, which can be computed in a possibilistic logic manner, can be partially ordered lexicographically once put in a vector form. This approach is compared to the ceteris paribus-based CP-net approach, which is the main existing artificial intelligence approach to the compact processing of preferences. It is shown that the partial order obtained with the CP-net approach fully agrees with the one obtained with the proposed approach, but generally includes further strict preferences between alternatives (considered as being not comparable by the symbolic-level logic-based approach). These additional strict preferences are in fact debatable, since they are not the reflection of explicit user preferences but the result of the application of the ceteris paribus principle, which implicitly, and quite arbitrarily, favors father-node preferences in the graphical structure associated with the conditional preferences. By adding constraints between symbolic levels expressing that the violation of father nodes is less permitted than that of children nodes, it is shown that it is possible to recover the CP-net-induced partial order. Due to existing results in possibilistic logic with symbolic levels, the proposed approach is computationally tractable. Key words: preference, priority, partial order, CP-net, possibilistic logic.
1 Introduction
The compact representation of preferences has raised a vast interest in artificial intelligence in the last decade [5, 9, 18, 14, 10]. Indeed, it has been recognized early on that, since value functions cannot be explicitly defined in the case of a great number of alternatives described by means of attributes, preferences should be handled in a compact way, starting from not completely explicit preferences expressed by a user. In particular, conditional statements are often used for describing preferences in a local, contextualized manner. Moreover, some generic principle is often used for completing the preferences [5, 14].

The CP-net approach [6] has emerged in the last decade as the preeminent and prominent method for processing preferences in artificial intelligence, due to its intuitive appeal. The CP-net approach directly exploits sets of conditional preferences and their associated graphical structures, assuming an apparently natural and innocuous ceteris paribus principle which expresses that conditional preferences, which in general refer to two incompletely described alternatives, still hold when the specifications of the two alternatives are completed in the same way. However, the CP-net approach may be computationally costly for dominance queries, which ask whether a ranking of two alternatives holds in any preference ordering that satisfies the CP-net requirements, rather than just asking whether it holds in at least one of these preference orderings. This has led to a search for tractable approximations of CP-nets [10, 18, 16].

Generally speaking, conditional statements express, in a given context, preferences about what the most plausible states of the world are according to pieces of default knowledge, or what the most satisfactory states are when expressing desires. It has been shown that conditional statements can be expressed under the form of constraints that may be turned into sets of prioritized logical formulas [17, 2, 1]. However, although the case might be encountered in practice, the available approaches for handling preferences do not usually allow for the simultaneous expression of general preferences and of more specific ones that are reversed with respect to the general tendency. In this latter case, the various levels of specificity of the conditionals induce a complete preorder on the logical formulas encoding the defaults. But, in the case of a set of (monotonic) conditional preference statements, we do not necessarily have indications about their respective levels of importance. This is why, in the following, we encode the conditional preference statements by means of classical logical formulas associated with symbolic priorities (since no a priori ordering between them is known), as already done in the approximation of CP-nets recently proposed [16]. Then the respective amounts of preference violation of an alternative, with respect to the set of formulas encoding the preferences, can be computed in a possibilistic logic manner [13, 3], and result in a conjunctive combination of symbolic levels. Such combinations of symbolic levels can be partially ordered lexicographically, once they are put in a vector form.

After introducing the basic definitions in Section 2, this is explained in Section 3 on a motivating example taken from the CP-net literature. In Section 4, after a refresher on the CP-net approach, it is shown that the partial order obtained with the CP-net approach fully agrees with the one obtained with the symbolic priorities approach, but generally includes further strict preferences. A discussion shows that this is due to a debatable use of the ceteris paribus principle on pairs of alternatives for which there is no inclusion relation between the two sets of preferences that they violate. Section 5 shows how the CP-net partial order can be recovered by adding constraints between symbolic levels expressing that the violation of father nodes is less permitted than that of children nodes. Such a representation framework, where logical formulas are associated with symbolic priority levels between which further constraints may be added, is akin to the one presented in [3] (for handling multiple-source information), for which tractable computational procedures exist.

1 Souhila Kaci, Université Lille-Nord de France, Artois, F-62307 Lens, CRIL, CNRS UMR 8188, F-62307 - IUT de Lens, kaci@cril.univ-artois.fr; Henri Prade, IRIT, Université Paul Sabatier, 118 route de Narbonne, 31062 Toulouse Cedex 9, France, prade@irit.fr
2 Definitions and notations
Let V = {X1, ..., Xl} be a set of l variables. Each variable Xi takes its values in a domain denoted Dom(Xi) = {xi1, ..., ximi}. Let V′ be a subset of V. An assignment of V′ is the result of giving a value in Dom(Xi) to each variable Xi in V′. Asst(V′) is the set of all possible assignments to the variables in V′. In particular, Asst(V), denoted Ω, is the set of all possible assignments of the variables in V. Each element of Ω, denoted ω, is called an alternative. When dealing with binary variables, formulas of propositional logic are denoted a, b, c, ....

Let ⪰ (resp. ≻) be a binary relation on a finite set A = {x, y, z, ...} such that x ⪰ y (resp. x ≻ y) means that x is at least as preferred as (resp. strictly preferred to) y. x = y means that both x ⪰ y and y ⪰ x hold, i.e. x and y are equally preferred. Lastly, x ∼ y means that neither x ⪰ y nor y ⪰ x holds, i.e. x and y are incomparable. ⪰ is a partial preorder on A if and only if ⪰ is reflexive (x ⪰ x) and transitive (if x ⪰ y and y ⪰ z then x ⪰ z). ≻ is a partial order on A if and only if ≻ is irreflexive (x ≻ x does not hold) and transitive. A partial order ≻ may be defined from a partial preorder ⪰ by: x ≻ y if x ⪰ y holds but y ⪰ x does not. A (pre-)order is asymmetric if and only if ∀x, y ∈ A, if x ≻ y holds then y ≻ x does not. A preorder ⪰ on A is complete if and only if all pairs are comparable, i.e. ∀x, y ∈ A, we have x ⪰ y or y ⪰ x.
3 Motivating example and preference encoding
We first motivate the proposed approach on an example inspired from [11] about how to be dressed for an evening party.

Example 1 Let V (vest), P (pants), S (shirt) and C (shoes) be four binary variables taking their values in {Vb, Vw}, {Pb, Pw}, {Sr, Sw} and {Cr, Cw} respectively, where b, w and r stand for black, white and red respectively. Clearly there are sixteen possible evening dresses: Ω = {Vb Pb Sr Cr, Vb Pb Sw Cr, Vb Pw Sr Cr, Vb Pw Sw Cr, Vw Pb Sr Cr, Vw Pb Sw Cr, Vw Pw Sr Cr, Vw Pw Sw Cr, Vb Pb Sr Cw, Vb Pb Sw Cw, Vb Pw Sr Cw, Vb Pw Sw Cw, Vw Pb Sr Cw, Vw Pb Sw Cw, Vw Pw Sr Cw, Vw Pw Sw Cw}. Assume that when choosing his evening dress, Peter is not able to compare the sixteen possible choices but expresses the following partial preferences: (P1): he prefers a black vest to a white vest; (P2): he prefers black pants to white pants; (P3): when the vest and the pants have the same color, he prefers a red shirt to a white shirt, otherwise he prefers a white shirt; and (P4): when the shirt is red he prefers red shoes, otherwise he prefers white shoes. The problem now is how to rank-order the sixteen possible choices according to Peter's preferences.

The above preferences are conditionals of the form "in context c, a is preferred to b", where c may be a tautology. Such a preference can be modelled as a pair of prioritized goals {(¬c ∨ a ∨ b, 1), (¬c ∨ a, 1 − α)}, which stand for "when c is true, one should have a or b (the choice is only between a and b), and in context c, it is somewhat imperative to have a true". These pairs of propositional formulas associated with a level are known as
possibilistic formulas [13]. Indeed, e.g. (¬c ∨ a, 1 − α) encodes a constraint of the form Π(c ∧ ¬a) ≤ α (≡ N(¬c ∨ a) ≥ 1 − α), where Π, N are dual possibilistic measures (1 − Π(¬p) = N(p)). This expresses that the satisfaction level when the constraint is violated is upper bounded by α. Note that when b ≡ ¬a, the clause (¬c ∨ a ∨ b, 1) becomes a tautology, and thus does not need to be written. Indeed, the clause (¬c ∨ a, 1 − α) expresses a preference for a over ¬a in context c. The clause (¬c ∨ a ∨ b, 1) is only needed if a ∨ b does not cover all the possible choices. Assume a ∨ b ≡ ¬d (where ¬d is not a tautology); then it makes sense to understand the preference for a over b in context c as the fact that, in context c, b is a default choice if a is not available. If one wants to open the door to the remaining choices, it is always possible to use (¬c ∨ a ∨ b, 1 − α′) with 1 − α′ > 1 − α, instead of (¬c ∨ a ∨ b, 1). Thus, the approach would easily extend to non-binary choices.

Example 2 (Example 1 continued) Thus P1 and P2 are encoded by means of (i): {(Vb, 1 − α)} and (ii): {(Pb, 1 − β)} respectively. P3 is encoded by (iii): {(¬Vb ∨ ¬Pb ∨ Sr, 1 − γ)}, (iv): {(¬Vw ∨ ¬Pw ∨ Sr, 1 − η)}, (v): {(¬Vw ∨ ¬Pb ∨ Sw, 1 − δ)} and (vi): {(¬Vb ∨ ¬Pw ∨ Sw, 1 − ε)}. Lastly, P4 is encoded by (vii): {(¬Sr ∨ Cr, 1 − θ)} and (viii): {(¬Sw ∨ Cw, 1 − ρ)}. Note that we have chosen here, in order to be as general as possible, to give distinct symbolic priority levels to the formulas associated with the different contexts covered by a preference Pi.

Since one does not know precisely how imperative the preferences are, the weights will be handled in a symbolic manner. However, they are assumed to belong to a linearly ordered scale (the strict order on this scale will be denoted by >), with a top element (denoted 1) and a bottom element (denoted 0). Thus, 1 − (.) should be regarded here just as denoting an order-reversing map on this scale (without necessarily having a numerical flavor), with 1 − (0) = 1 and 1 − (1) = 0. On this scale, one has 1 > 1 − α as soon as α ≠ 0. The order-reversing map exchanges two scales: the one graded in terms of necessity degrees, or if we prefer here in terms of imperativeness, and the one graded in terms of possibility degrees, i.e. here, in terms of satisfaction levels. Thus, the priority level 1 − α for satisfying a preference is changed by the involutive mapping 1 − (.) into a satisfaction level α < 1 when this preference is violated. Since in the example the values of the weights 1 − α, 1 − β, 1 − γ, 1 − η, 1 − δ, 1 − ε, 1 − θ and 1 − ρ are unknown, no particular ordering is assumed between them.

Table 1 gives the satisfaction levels of the above clauses for the sixteen possible choices. The last column gives the vector of global satisfaction, exhibiting a symbolic satisfaction level different from 1 each time a formula is violated. In practice, these violation amounts can be syntactically computed using the approach proposed in [3]. Even if the values of the weights are unknown, a partial order between the sixteen choices can be naturally induced. For example, Vb Pb Sr Cr is preferred to all remaining alternatives since it is the only alternative that satisfies all of Peter's preferences. Also, Vw Pb Sw Cw is preferred to Vw Pw Sr Cr since the former falsifies (Vb, 1 − α) while the latter falsifies both (Vb, 1 − α) and (Pb, 1 − β). This partial order is depicted in Figure 1. An edge from ω to ω′ means that ω is preferred to ω′.
Indeed an alternative ω is naturally preferred to an alternative ω′ when the set of clauses falsified by ω is (strictly) included in the set of clauses falsified by ω′.
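To make this concrete, here is a minimal Java sketch – the class and method names are ours, hypothetical, not from the paper – that evaluates the eight clauses (i)–(viii) of Example 2 on an alternative and records, for each clause, either 1 or the symbolic level reached when the clause is violated; its output reproduces the rows of Table 1 below.

```java
import java.util.*;
import java.util.function.Predicate;

/** Minimal sketch (hypothetical names): computing the symbolic
 *  satisfaction vector of an alternative, as in Table 1. */
public class SatisfactionVector {

    /** An alternative assigns one value to each of V, P, S, C,
     *  e.g. {V=b, P=b, S=r, C=r} for Vb Pb Sr Cr. */
    record Alt(char v, char p, char s, char c) {}

    /** A clause is a propositional condition plus the symbolic
     *  satisfaction level reached when the clause is violated. */
    record Clause(Predicate<Alt> holds, String levelIfViolated) {}

    static List<Clause> clauses() {
        return List.of(
            new Clause(a -> a.v() == 'b', "alpha"),                              // (i)   Vb
            new Clause(a -> a.p() == 'b', "beta"),                               // (ii)  Pb
            new Clause(a -> !(a.v()=='b' && a.p()=='b') || a.s()=='r', "gamma"), // (iii)
            new Clause(a -> !(a.v()=='w' && a.p()=='w') || a.s()=='r', "eta"),   // (iv)
            new Clause(a -> !(a.v()=='w' && a.p()=='b') || a.s()=='w', "delta"), // (v)
            new Clause(a -> !(a.v()=='b' && a.p()=='w') || a.s()=='w', "epsilon"),// (vi)
            new Clause(a -> a.s() != 'r' || a.c() == 'r', "theta"),              // (vii)
            new Clause(a -> a.s() != 'w' || a.c() == 'w', "rho"));               // (viii)
    }

    /** Each component is "1" when the clause is satisfied, or its
     *  symbolic level when it is violated. */
    static List<String> vector(Alt a) {
        List<String> v = new ArrayList<>();
        for (Clause cl : clauses())
            v.add(cl.holds().test(a) ? "1" : cl.levelIfViolated());
        return v;
    }

    public static void main(String[] args) {
        // Vw Pw Sr Cr -> [alpha, beta, 1, 1, 1, 1, 1, 1], as in Table 1
        System.out.println(vector(new Alt('w', 'w', 'r', 'r')));
    }
}
```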
Table 1. Satisfaction levels.

              (i)  (ii) (iii) (iv)  (v) (vi) (vii) (viii)
Vb Pb Sr Cr    1    1    1    1     1    1    1     1
Vb Pb Sw Cr    1    1    γ    1     1    1    1     ρ
Vb Pw Sr Cr    1    β    1    1     1    ε    1     1
Vb Pw Sw Cr    1    β    1    1     1    1    1     ρ
Vw Pb Sr Cr    α    1    1    1     δ    1    1     1
Vw Pb Sw Cr    α    1    1    1     1    1    1     ρ
Vw Pw Sr Cr    α    β    1    1     1    1    1     1
Vw Pw Sw Cr    α    β    1    η     1    1    1     ρ
Vb Pb Sr Cw    1    1    1    1     1    1    θ     1
Vb Pb Sw Cw    1    1    γ    1     1    1    1     1
Vb Pw Sr Cw    1    β    1    1     1    ε    θ     1
Vb Pw Sw Cw    1    β    1    1     1    1    1     1
Vw Pb Sr Cw    α    1    1    1     δ    1    θ     1
Vw Pb Sw Cw    α    1    1    1     1    1    1     1
Vw Pw Sr Cw    α    β    1    1     1    1    θ     1
Vw Pw Sw Cw    α    β    1    η     1    1    1     1

Figure 1. Basic partial order.
Definition 1 (Basic preference relation) Let Σ = {(ai, αi)} be a set of formulas associated with symbolic weights. Let ω and ω′ be two alternatives and Fω and Fω′ be the sets of formulas of Σ falsified by ω and ω′ respectively. ω is basically preferred to ω′, denoted ω ≻b,Σ ω′, iff Fω ⊂ Fω′.

Thus ω is preferred to ω′ only when the components of its associated satisfaction vector are equal to 1 for those components that are different in the two satisfaction vectors associated with ω and ω′. Formally, we describe the basic preference relation as follows. Let v = (v1, ..., vn) and v′ = (v′1, ..., v′n) be two vectors of satisfaction levels, ordered according to the order in which we consider the formulas (in our example, from (i) to (viii)). The discrimin criterion [12] ignores the values that are the same in both v and v′ for a given vector component pertaining to the same formula. For example, the two vectors v = (α, β, 1, 1, 1, 1, 1, 1) and v′ = (1, β, 1, 1, 1, ε, 1, 1) reduce to d(v) = (α, 1) and d(v′) = (1, ε) respectively. For further comparing the reduced vectors, we define the following preference
relation (called "ordered Pareto" and denoted ≻OP) that exploits the available information about the relative values of the symbolic levels. Then v is preferred to v′, denoted v ≻OP v′, if there is a reordering of each vector of symbolic levels such that ∀i, do(v)i ≥ do(v′)i and ∃j, do(v)j > do(v′)j according to the current knowledge about the ordering between symbolic levels, where do(v) is the reordered vector associated with d(v). Initially the only available knowledge about the ordering between the symbolic levels is α < 1 when α ≠ 1, and 1 ≤ 1. Then, for example, d(v) = (α, 1) and d(v′) = (1, ε) are incomparable. Now if we also know that α > ε then v ≻OP v′ (i.e. (α, β, 1, 1, 1, 1, 1, 1) ≻OP (1, β, 1, 1, 1, ε, 1, 1)) since do(v) = (α, 1) and do(v′) = (ε, 1) are now Pareto comparable.

Proposition 1 Let Σ = {(ai, αi)} be a set of formulas. Let ω and ω′ be two alternatives. Let Fω and Fω′ be the sets of formulas of Σ falsified by ω and ω′ respectively. Let v and v′ be the satisfaction vectors of ω and ω′ respectively. Then ω ≻b,Σ ω′ iff v ≻OP v′.
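A small Java sketch – hypothetical names, with a brute-force permutation test that is reasonable only for short reduced vectors like these – of the discrimin reduction and the ordered-Pareto comparison of Proposition 1:

```java
import java.util.*;

/** Sketch (hypothetical names) of the discrimin reduction and the
 *  ordered-Pareto test on symbolic satisfaction vectors. */
public class OrderedPareto {

    /** Known strict facts "x > y"; "1" dominates every other symbol. */
    static boolean knownGreater(String x, String y, Set<String> facts) {
        return (x.equals("1") && !y.equals("1")) || facts.contains(x + ">" + y);
    }

    static boolean geq(String x, String y, Set<String> facts) {
        return x.equals(y) || knownGreater(x, y, facts);
    }

    /** Discrimin: keep only the positions where the two vectors differ. */
    static List<List<String>> discrimin(List<String> v, List<String> w) {
        List<String> dv = new ArrayList<>(), dw = new ArrayList<>();
        for (int i = 0; i < v.size(); i++)
            if (!v.get(i).equals(w.get(i))) { dv.add(v.get(i)); dw.add(w.get(i)); }
        return List.of(dv, dw);
    }

    /** v >_OP w: some reordering of d(v) dominates d(w) componentwise,
     *  strictly on at least one component, under the known facts. */
    static boolean strictlyPreferred(List<String> v, List<String> w, Set<String> facts) {
        List<List<String>> d = discrimin(v, w);
        return dominates(d.get(0), new ArrayList<>(), d.get(1), facts);
    }

    // Brute force over permutations of the reduced vector.
    static boolean dominates(List<String> rest, List<String> prefix,
                             List<String> target, Set<String> facts) {
        if (rest.isEmpty()) {
            boolean strict = false;
            for (int i = 0; i < target.size(); i++) {
                if (!geq(prefix.get(i), target.get(i), facts)) return false;
                if (knownGreater(prefix.get(i), target.get(i), facts)) strict = true;
            }
            return strict;
        }
        for (int i = 0; i < rest.size(); i++) {
            List<String> r = new ArrayList<>(rest);
            String s = r.remove(i);
            prefix.add(s);
            if (dominates(r, prefix, target, facts)) return true;
            prefix.remove(prefix.size() - 1);
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> v = List.of("alpha","beta","1","1","1","1","1","1");
        List<String> w = List.of("1","beta","1","1","1","epsilon","1","1");
        System.out.println(strictlyPreferred(v, w, Set.of()));                // false: incomparable
        System.out.println(strictlyPreferred(v, w, Set.of("alpha>epsilon"))); // true, as in the text
    }
}
```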
Each additional preference between two alternatives should be the consequence of an explicit constraint between symbolic weights. For example Vb Pb Sw Cw and Vb Pb Sr Cw are incomparable since γ and θ are incomparable. Now if we state that θ > γ then Vb Pb Sr Cw would be preferred to Vb Pb Sw Cw .
4 Conditional Preference Networks (CP-nets)
Conditional preference networks (CP-nets for short) [5] encode comparative conditional statements and are based on the ceteris paribus principle. More precisely, a CP-net is a directed graphical representation of conditional preferences, where nodes represent variables and edges express preference links between variables. When there exists a link from X to Y, X is called a parent of Y. Pa(X) denotes the set of parents of a given node X; it determines the user's preferences over the possible values of X. For the sake of simplicity, we suppose that variables are binary. Preferences are expressed at each node by means of a conditional preference table (CPT for short) such that:
• For root nodes Xi, the conditional preference table, denoted CPT(Xi), provides the strict preference (we restrict ourselves to a complete order over xi and ¬xi, as is generally the case with CP-nets; this can easily be extended to a preorder) over xi and its negation ¬xi, other things being equal, i.e. ∀y ∈ Asst(Y), xi y ≻ ¬xi y where Y = V \ {Xi}. This is the ceteris paribus principle.
• For other nodes Xj, CPT(Xj) describes the preferences over xj and ¬xj, other things being equal, given any assignment of Pa(Xj), i.e. xj z y ≻ ¬xj z y, ∀z ∈ Asst(Pa(Xj)) and ∀y ∈ Asst(Y) where Y = V \ ({Xj} ∪ Pa(Xj)). For each assignment z of Pa(Xj) we write for short a statement of the form z : xj ≻ ¬xj. Note that this is a parent-dependent specification.
Definition 2 A complete preorder ⪰ on Ω, also called a preference ranking, satisfies a CP-net N if and only if it satisfies each conditional preference expressed in N. In this case, we say that the preference ranking ⪰ is consistent with N. A CP-net N is consistent when there exists an asymmetric preference ranking that is consistent with N. We focus in this paper on acyclic CP-nets in order to ensure their consistency.

Definition 3 (Preference entailment) Let N be a CP-net over a set of variables V, and ω, ω′ ∈ Ω. N entails that ω is strictly preferred to ω′, denoted ω ≻N ω′, if and only if ω ≻ ω′ holds in every preference ranking ⪰ that satisfies N.

Indeed ≻N is the intersection of all preference rankings consistent with N. When ω ≻N ω′ holds, we say that ω dominates ω′. The preferential comparison in CP-nets is based on the notion of worsening flip. A worsening flip is a change of the assignment of a variable to an assignment that is less preferred following the conditional preference table of that variable, under the ceteris paribus assumption, w.r.t. the CP-net N. Then ω is preferred to ω′ w.r.t. N iff there is a chain of worsening flips from ω to ω′.

Example 3 (Example 1 continued) Peter's preferences can be represented by the CP-net depicted in Figure 2. As one would expect, the CP-net fully agrees with the basic preference relation.

Figure 2. A CP-net and its associated order. [Nodes V, P, S, C, with V and P parents of S, and S parent of C; CPTs: Vb ≻ Vw; Pb ≻ Pw; Vb Pb : Sr ≻ Sw; Vw Pw : Sr ≻ Sw; Vw Pb : Sw ≻ Sr; Vb Pw : Sw ≻ Sr; Sr : Cr ≻ Cw; Sw : Cw ≻ Cr.]
Proposition 2 Let N be a CP-net. Let Σ = {(¬ui ∨ x, αi)} where ui : x ≻ ¬x are the unconditional/conditional local preferences expressed in N. Then, ∀ω, ω′ ∈ Ω, if ω ≻b,Σ ω′ then ω ≻N ω′.

For example Vw Pw Sr Cr falsifies (Vb, 1−α), (Pb, 1−β) and Vw Pw Sw Cr falsifies (Vb, 1−α), (Pb, 1−β), (¬Vw ∨ ¬Pw ∨ Sr, 1−η), and we have Vw Pw Sr Cr ≻N Vw Pw Sw Cr. However, as we can check in Figure 2, the partial order associated to the CP-net is more refined than the basic preference relation, i.e. some incomparabilities in the latter have been turned into strict comparabilities in the former. For example Vw Pb Sr Cw is preferred to Vw Pw Sr Cw w.r.t. the CP-net while they are incomparable w.r.t. ≻b,Σ, since Vw Pb Sr Cw falsifies (Vb, 1−α), (¬Vw ∨ ¬Pb ∨ Sw, 1−δ) and (¬Sr ∨ Cr, 1−θ) while Vw Pw Sr Cw falsifies (Vb, 1−α), (Pb, 1−β) and (¬Sr ∨ Cr, 1−θ).

These additional strict preferences are due to the fact that preferences in CP-nets depend on the structure of the graph. More precisely, since preferences over the values of a variable are conditioned on the values of its parents, the application of the ceteris paribus principle implicitly gives priority to father nodes. For example Vw Pb Sr Cw ≻N Vw Pw Sr Cw due to Pb ≻ Pw. Indeed Vw Pw Sr Cw is less preferred than Vw Pb Sr Cw since the former falsifies (Pb, 1−β) while the latter falsifies (¬Vw ∨ ¬Pb ∨ Sw, 1−δ) (they both falsify (Vb, 1−α) and (¬Sr ∨ Cr, 1−θ)). Indeed when two alternatives ω and ω′ differ on the value of one variable only, ω is preferred to ω′ w.r.t. a CP-net if and only if
• either Fω ⊂ Fω′ (cf. Definition 1),
• or ω falsifies a father node preference while ω′ falsifies a child node preference.

Figure 3. A CP-net and its associated order.
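Since CP-net dominance reduces to chains of worsening flips, the elementary operation is the test of whether one flip worsens an alternative w.r.t. the flipped variable's CPT. A minimal Java sketch (our own hypothetical names, hard-coding CPT(S) of Figure 2):

```java
import java.util.*;

/** Sketch (hypothetical names): a worsening flip on variable S of the
 *  CP-net of Figure 2, i.e. moving from the value preferred under the
 *  current parent context to the other one, all else being equal. */
public class WorseningFlip {

    // CPT(S): preferred shirt value given (vest, pants), as in Figure 2.
    static final Map<String, Character> CPT_S = Map.of(
        "bb", 'r', "ww", 'r',   // same colour      -> red shirt preferred
        "wb", 'w', "bw", 'w');  // different colour -> white shirt preferred

    /** True iff changing S from sBefore to sAfter, with vest v and
     *  pants p unchanged, is a worsening flip. */
    static boolean worseningFlipOnS(char v, char p, char sBefore, char sAfter) {
        char preferred = CPT_S.get("" + v + p);
        return sBefore == preferred && sAfter != preferred;
    }

    public static void main(String[] args) {
        // Vw Pw Sr ... -> Vw Pw Sw ... is worsening (Sr preferred when Vw Pw)
        System.out.println(worseningFlipOnS('w', 'w', 'r', 'w')); // true
    }
}
```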
5 Encoding CP-nets
We show in this section that the partial order associated with a CP-net can be retrieved in our approach using additional constraints on symbolic levels. This encoding follows three steps:
• Let X be a node in the CP-net N and CPT(X) be its associated conditional preference table. For each local preference
ui : x ≻ ¬x in CPT(X), we associate a base made of one formula ¬ui ∨ x, as follows: ΣX,ui = {(¬ui ∨ x, 1 − αi)}. We do not add (¬ui ∨ x ∨ ¬x, 1) since we are dealing with binary variables.
• For each node X in the CP-net N, build ΣX = ∪i ΣX,ui, where the bases ΣX,ui have been obtained at the previous step. Then Σ = ∪X ΣX is the partially ordered base associated with N.
• For each formula (¬ui ∨ x, 1 − αi) in ΣX and each formula (¬uj ∨ y, 1 − αj) in ΣY such that X is a father of Y and we are in the same context, i.e. ¬uj = ¬x ∨ ¬uk, we put 1 − αi > 1 − αj.

Example 4 (Example 1 cont'd) We have Σ = {(Vb, 1 − α), (Pb, 1 − β), (¬Vb ∨ ¬Pb ∨ Sr, 1 − γ), (¬Vw ∨ ¬Pw ∨ Sr, 1 − η), (¬Vw ∨ ¬Pb ∨ Sw, 1 − δ), (¬Vb ∨ ¬Pw ∨ Sw, 1 − ε), (¬Sr ∨ Cr, 1 − θ), (¬Sw ∨ Cw, 1 − ρ)}. We define the following constraints between symbolic weights, which express that constraints associated with father nodes have priority over the ones associated with their child nodes: 1 − α > 1 − γ, 1 − α > 1 − ε, 1 − β > 1 − γ, 1 − β > 1 − δ, 1 − γ > 1 − θ, 1 − η > 1 − θ, 1 − δ > 1 − ρ, 1 − ε > 1 − ρ, which are equivalent to α < γ < θ, α < ε < ρ, β < δ < ρ, β < γ and η < θ.

Then we have the following general result:

Proposition 3 Let N be a CP-net, Σ be its associated formula base as described above, and > be its associated partial order on symbolic levels. Then ∀ω, ω′ ∈ Ω, v ≻OP v′ iff ω ≻N ω′, where v (resp. v′) is the vector of satisfaction levels associated with ω (resp. ω′).

Example 5 (Example 1 continued) Let us consider again the two alternatives ω : Vw Pb Sr Cw and ω′ : Vw Pw Sr Cw. We have vω = (α, 1, 1, 1, δ, 1, θ, 1) and vω′ = (α, β, 1, 1, 1, 1, θ, 1). Then vω ≻OP vω′ since vω and vω′ reduce to (1, δ) and (β, 1) following the discrimin criterion, i.e. d(vω) = (1, δ) and d(vω′) = (β, 1). Now since δ > β, (1, δ) and (β, 1) can be reordered into do(vω) = (δ, 1) and do(vω′) = (β, 1) such that we have δ > β and 1 ≥ 1. We can check that Vw Pb Sr Cw ≻N Vw Pw Sr Cw.

Generally speaking, the proposed approach allows us to add any further constraint between priority levels, which may privilege a particular child node if desirable, or express, as in TCP-nets [7], a conditional relative importance of the satisfaction of a particular requirement over another. Indeed a contextual preference in favor of a variable attached to a node can be expressed in our framework by means of additional constraints between symbolic levels.
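As an illustration of these three steps, the following Java sketch (hypothetical names; the clause and constraint lists are simply the data of Example 4 above rather than being derived from an arbitrary CP-net) shows the shape of the encoding:

```java
import java.util.*;

/** Sketch (hypothetical names) of the Section 5 encoding: one weighted
 *  clause per CPT line (step 1), gathered into Sigma (step 2), plus
 *  "father > child" priority constraints in matching contexts (step 3). */
public class CpNetEncoding {

    record WeightedClause(String clause, String level) {}

    public static void main(String[] args) {
        List<WeightedClause> sigma = List.of(
            new WeightedClause("Vb", "1-alpha"),
            new WeightedClause("Pb", "1-beta"),
            new WeightedClause("¬Vb ∨ ¬Pb ∨ Sr", "1-gamma"),
            new WeightedClause("¬Vw ∨ ¬Pw ∨ Sr", "1-eta"),
            new WeightedClause("¬Vw ∨ ¬Pb ∨ Sw", "1-delta"),
            new WeightedClause("¬Vb ∨ ¬Pw ∨ Sw", "1-epsilon"),
            new WeightedClause("¬Sr ∨ Cr", "1-theta"),
            new WeightedClause("¬Sw ∨ Cw", "1-rho"));

        // Step 3: a father's clause dominates a child's clause whose
        // context mentions one of the father's values.
        List<String> constraints = List.of(
            "1-alpha > 1-gamma", "1-alpha > 1-epsilon",   // V over S
            "1-beta > 1-gamma",  "1-beta > 1-delta",      // P over S
            "1-gamma > 1-theta", "1-eta > 1-theta",       // S over C
            "1-delta > 1-rho",   "1-epsilon > 1-rho");
        System.out.println(sigma.size() + " clauses, "
                           + constraints.size() + " constraints");
    }
}
```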
6 Conclusion
The paper has proposed an encoding of conditional preferences by means of classical logic formulas associated with symbolic priority levels, in a possibilistic logic manner. It has led to the definition of a natural partial order that is always more cautious than the corresponding partial order obtained with a CP-net approach. Moreover, adding constraints between symbolic priority levels has enabled us to recover the CP-net partial order exactly (although, as explained in the paper, the strict preferences found with the CP-net approach but not with ours are debatable). The approach can benefit from the
existence of a computationally tractable inference procedure in possibilistic logic with partially ordered symbolic levels [3]. Besides, it is worth noticing that the representation obtained looks similar to a hybrid possibilistic Bayesian-like network [4], since each node of the graphical structure reflecting the conditional preferences is associated with a set of constraints encoded by possibilistic logic-like formulas. The precise linkage between the representation presented in this paper and hybrid possibilistic networks is a topic for further research. Lastly, the proposed approach might be applied to the management of preference queries addressed to a database, for rank-ordering the answers according to their amounts of violation of the conditional preferences associated with the queries, and would thus contribute to an active database research trend [8, 15].
REFERENCES
[1] S. Benferhat, D. Dubois, S. Kaci, and H. Prade, 'Bridging logical, comparative, and graphical possibilistic representation frameworks', in ECSQARU, pp. 422–431, (2001).
[2] S. Benferhat, D. Dubois, and H. Prade, 'Representing default rules in possibilistic logic', in KR, pp. 673–684, (1992).
[3] S. Benferhat and H. Prade, 'Encoding formulas with partially constrained weights in a possibilistic-like many-sorted propositional logic', in IJCAI, pp. 1281–1286, (2005).
[4] S. Benferhat and S. Smaoui, 'Hybrid possibilistic networks', International Journal of Approximate Reasoning, 44(3), 224–243, (2007).
[5] C. Boutilier, R. Brafman, C. Domshlak, H. Hoos, and D. Poole, 'CP-nets: A tool for representing and reasoning with conditional ceteris paribus preference statements', Journal of Artificial Intelligence Research, 21, 135–191, (2004).
[6] C. Boutilier, R.I. Brafman, H.H. Hoos, and D. Poole, 'Reasoning with conditional ceteris paribus preference statements', in UAI, pp. 71–80, (1999).
[7] R.I. Brafman and C. Domshlak, 'Introducing variable importance tradeoffs into CP-nets', in UAI, pp. 69–76, (2002).
[8] J. Chomicki, 'Database querying under changing preferences', Annals of Mathematics and Artificial Intelligence, 50(1-2), 79–109, (2007).
[9] S. Coste-Marquis, J. Lang, P. Liberatore, and P. Marquis, 'Expressive power and succinctness of propositional languages for preference representation', in KR, pp. 203–212, (2004).
[10] C. Domshlak, F. Rossi, K.B. Venable, and T. Walsh, 'Reasoning about soft constraints and conditional preferences: complexity results and approximation techniques', in IJCAI, pp. 215–220, (2003).
[11] C. Domshlak, F. Rossi, K.B. Venable, and T. Walsh, 'Reasoning about soft constraints and conditional preferences: complexity results and approximation techniques', in IJCAI, pp. 215–220, (2003).
[12] D. Dubois, H. Fargier, and H. Prade, 'Beyond min aggregation in multicriteria decision: (ordered) weighted min, discri-min, leximin', in The Ordered Weighted Averaging Operators – Theory and Applications (R.R. Yager, J. Kacprzyk, eds.), pp. 181–192, Kluwer Acad. Publ., (1997).
[13] D. Dubois, J. Lang, and H. Prade, 'Possibilistic logic', in Handbook of Logic in Artificial Intelligence and Logic Programming, 439–513, (1994).
[14] R. Gérard, S. Kaci, and H. Prade, 'Ranking alternatives on the basis of generic constraints and examples – a possibilistic approach', in IJCAI, pp. 393–398, (2007).
[15] A. HadjAli, S. Kaci, and H. Prade, 'Database preferences queries – a possibilistic logic approach with symbolic priorities', in FoIKS, pp. 291–310, (2008).
[16] S. Kaci and H. Prade, 'Relaxing ceteris paribus preferences with partially ordered priorities', in ECSQARU, pp. 660–671, (2007).
[17] J. Pearl, 'System Z: A natural ordering of defaults with tractable applications to default reasoning', in Proceedings of the 3rd Conference on Theoretical Aspects of Reasoning about Knowledge (TARK'90), pp. 121–135, (1990).
[18] N. Wilson, 'An efficient upper approximation for conditional preference', in ECAI, pp. 472–476, (2006).
7. Distributed and Multi-Agents Systems
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-383
Interaction-Oriented Agent Simulations: From Theory to Implementation

Yoann Kubera and Philippe Mathieu and Sébastien Picault1

Abstract. This paper deals with the software architecture of individual-centered simulations, i.e. simulations involving many entities interacting together. Many software architectures have been developed in this context, especially many advanced – but domain-specific – frameworks. Yet those frameworks imply tight software dependencies between agents, behaviors and action selection mechanisms, which leads to many difficulties in modelling and programming. We propose a method and an architecture where interactions are reified regardless of agents, in order to obtain a complete interaction-oriented design process for simulations. An agent is then only an entity that can perform or undergo a set of interactions, even ones not specifically developed for it. Thus most interactions can be re-used in many contexts. In addition, our method clearly separates knowledge about behaviors from its processing, and thus makes the design of simulations easier. Moreover, this new and user-friendly approach helps programmers to build simulations with a large number of different behaviors at the same time, especially in the context of large-scale simulations.
1 Introduction
In recent years, agent-based simulation has become preponderant among tools for simulating living beings, either to understand their mechanisms or to copy them for leisure purposes (video games, animation in films, etc.). It links up experts from specific domains (biology, sociology, etc.) and from computer science. Its multidisciplinary nature has given birth to more-or-less domain-specific platforms. A large subset of those – like Swarm [4], Madkit [6] or Magique [2] – are open, and thus enable the user to freely implement agents, behaviors and environments. They offer different levels of software refinement and allow the use of many engineering tools – design patterns, components, inheritance, etc. Moreover, the platforms cited above are not only dedicated to simulations, but can also be used to build agent-based applications. Others – like Netlogo [12] – are based on a simple programming paradigm designed for non-computer-scientists. The generic aspect of all those open platforms is obtained at the expense of a formal way to guide the design of behaviors. Data is indeed mixed with its processing – i.e. the action selection mechanism is mixed with the behavior representation – which implies a complete reimplementation of the agent when adding or deleting an interaction in which it is involved. On the opposite, many formalisms – like Petri nets, subsumption, rule sets, artificial neural networks – may strongly
University of Lille, France, email: name.surname@lifl.fr
guide the agent architecture, at the expense of reusability in other formalisms. Some of the rare ones that make behavior reuse possible are cognitive architectures with plans, like Act-R [1], where knowledge is separated from its processing. However, they are often fitted neither to build multiagent simulations, because of their poor performance, nor to design reactive agents. In order to build reusable and generic behaviors, we promote in this paper the Interaction-Oriented Design of Agent simulations (IODA) formal method and architecture, based on the works of [9, 8]. It consists in abstracting from the agents the actions they participate in, by reifying them into the notion of interaction. An agent may perform or undergo a set of interactions which are not specifically developed for it; thus most interactions can be re-used in many contexts. In addition, this architecture clearly separates data from processing, and thus makes the design of simulations easier. We also describe the Java Environment for the Design of agent Interactions (JEDI) platform, which is a Java implementation of IODA for simulations with reactive situated agents. The second section contains a brief introduction to related work on generic agent behavior architectures. The third section describes the IODA methodology and its advantages – like the separation between data and its processing, interaction libraries, or large-scale simulation construction. The fourth section presents the generic features of IODA concepts, through an easy-to-customize simulation platform called JEDI. Eventually, the last section concludes about IODA and JEDI.
2 Additional Related Work
Research on multiagent systems and on agent design is very active, and many generic agent description models do exist. Formal description methods and generic architectures for agent behavior can be examined from two points of view. The first one is about interaction design: the ways agents communicate with each other are extracted from their model into abstract communication patterns and protocols. Generally, this abstraction is limited to the model design step, and the interaction protocol and the agent's behavior are mixed together during implementation – as in JADE, AgenTalk, Swarm, etc. This leads to decreased maintainability due to the dispersal of the protocol's implementation. As proposed in [5], one solution is to abstract the interaction protocol from the agents, and then reify it as a single entity defined by roles and message sequences, which uses functionalities that agents implement on their own according to their role.
The second one is about the agents' behavior itself. Many generic methodologies stop at the formal specification of a simulation, giving place at worst to implementation errors and at best to mixing data (i.e. the actions an agent can perform) and its processing (i.e. the selection of an action given a particular valuation of the global state of the simulation).

Definition 1 The global state of the simulation is the union of the set of all states of the environment and the states of all agents in the environment.

Formal methods and architectures allow keeping the separation between data and processing with agent-independent actions, as in [3], where actions are agent-independent components, so that the behavior of an agent is defined by a set of interconnected components. This kind of solution is well suited to complex action scheduling, but the connectivity of these components decreases the maintainability of the agents, especially if their behavior changes during the simulation, or if the simulation uses a large-scale knowledge representation.

Definition 2 A simulation is called a large scale simulation if its environment contains a great number of agents (namely, simulation with large scale computations) or if it contains a large number of agents with different behaviors and a large number of actions per agent (namely, simulation with large scale knowledge representation).

In the following sections, we propose a formal method and an architecture providing the advantages of both interaction reification and separation between knowledge and processing, fitting large scale knowledge representation requirements with a homogeneous design of agents and interactions.
3 The IODA Methodology
In general, a communication protocol is used to describe a particular abstract process involving many agents, for instance "to exchange goods". In order to build reusable and generic behaviors, we present in this paper the IODA formal method and architecture. It relies on a homogeneous representation of the actions performed by agents, called Interaction, close to Norman's concept of design/perceived affordance [11]. This formal representation is suited to represent actions involving only one agent as well as complex actions involving many communicating agents.
3.1 An Interaction-centered Methodology
The behavior of an agent is defined by a specific arrangement of semantic blocks called interactions (see § 3.5). An interaction is itself a set of primitives simultaneously involving a fixed number of agents, which describes how and under what kind of conditions agents may interact with each other or with the environment. An agent owns a set of perception primitives – used to get information from the global state of the simulation – and a set of action primitives – used to change this global state (change the environment's, another agent's or its own local state). These are the atomic elements of interactions.

Definition 3 An Interaction is a structured set of action primitives involving simultaneously a fixed number of agents.
An interaction can occur only if the activation conditions – a boolean expression of perception primitives – are met.

Definition 4 Agents involved in an interaction generally do not play the same role. We make a difference between Source agents, that may perform the interaction, and Target agents, that may undergo it.

As described in Def. 3, an interaction sets the logical sequence of primitives required to make agents interact. These primitives may be implemented differently according to the agents' specificities. As a consequence, this leads to a more enhanced and easier-to-use polymorphism in agent behavior compared to other agent architectures like [3], where close behaviors cannot be expressed without complex means. An interaction is not agent-dependent and may be re-used in other simulations. Thus, building simulations leads to the construction of interaction and agent libraries, and facilitates further simulation design.
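To fix ideas, here is a rough Java sketch of what such a reified interaction might look like; the class and primitive names are ours, not the actual JEDI API, and the Eat example only assumes that the participating agents expose the needed perception and action primitives:

```java
import java.util.List;

/** Hypothetical sketch of a reified interaction (not the actual JEDI
 *  classes): an activation condition built from perception primitives,
 *  and a sequence of action primitives, both independent of any
 *  particular agent family. */
abstract class Interaction {
    /** Activation conditions: a boolean expression of perception primitives. */
    abstract boolean condition(Primitives source, List<Primitives> targets);
    /** The structured set of action primitives performed when triggered. */
    abstract void actions(Primitives source, List<Primitives> targets);
}

/** Hypothetical primitives that each agent family may implement in its own way. */
interface Primitives {
    boolean isHungry();        // perception primitive
    boolean isEdible();        // perception primitive
    void gainEnergy(double e); // action primitive
    void die();                // action primitive
}

/** A reusable Eat interaction: any source/target exposing the
 *  primitives above can participate, whatever their families. */
class Eat extends Interaction {
    @Override boolean condition(Primitives src, List<Primitives> tgt) {
        return src.isHungry() && tgt.get(0).isEdible();
    }
    @Override void actions(Primitives src, List<Primitives> tgt) {
        src.gainEnergy(1.0); // implementation-dependent amount
        tgt.get(0).die();
    }
}
```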
3.2 IODA Agents
In IODA, agents follow a simple architecture which makes it possible to design homogeneously agents with different specificities in the same simulation.

Definition 5 An agent x is an autonomous entity of a simulation. Its minimal specification:
• has properties;
• has a local state, which is a valuation of its properties;
• implements a set of action and perception primitives;
• perceives other agents and the state of the environment only in a subset of the environment H(x) called halo; the set N(x) of agents present in H(x) is called its neighborhood;
• is assigned a set of interactions it can perform or undergo (see § 3.5);
• implements an interaction selection process (see § 3.7).
Definition 6 An agent family (or agents equivalence class, or agent class) is an abstract set of agents, in which all agents share all or part of their properties, action or interaction primitives, or behavior.

From this point on, if S ∈ F, x ∈ S means that x is an agent from the S agent family. An IODA agent is not restricted to a particular kind of agent. Programmers may freely define a cognitive or reactive interaction selection process, reactive or cognitive perception primitives, and more or less complex neighborhood computations. Besides, the neighborhood computation apart, the interaction selection process is independent of the environment's topology, and needs only a notion of distance between agents.
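The minimal specification of Definition 5 can be sketched as follows in Java (hypothetical classes, not the JEDI ones); only the halo/neighborhood part is made explicit here:

```java
import java.util.*;

/** Hypothetical sketch of Definition 5 (not the actual JEDI classes):
 *  an agent with a local state, a halo, and a neighborhood derived
 *  from it. */
abstract class IodaAgent {
    // Local state: a valuation of the agent's properties.
    final Map<String, Object> properties = new HashMap<>();

    /** Halo H(x): the part of the environment the agent perceives. */
    abstract Set<Position> halo();

    abstract Position position();

    /** Neighborhood N(x): the agents currently inside the halo. */
    List<IodaAgent> neighborhood(Collection<IodaAgent> all) {
        List<IodaAgent> n = new ArrayList<>();
        for (IodaAgent a : all)
            if (a != this && halo().contains(a.position())) n.add(a);
        return n;
    }
}

record Position(int x, int y) {}
```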
3.3 Interactions and cardinality
As its name implies, an interaction may occur between a source agent and a target agent. However, complex problems need to define other situations, like the interaction of an agent with itself (to sleep, to think) or with the environment (to move, to die). Even more complex situations may occur, where interactions involve more than one source or target (for instance to burst, involving many casualties). Cardinality (see Def. 7) unifies those notions.
Y. Kubera et al. / Interaction-Oriented Agent Simulations: From Theory to Implementation
385
Definition 7 The cardinality of an interaction I is the pair (cardS(I), cardT(I)), where cardS(I) (resp. cardT(I)) is the number of source agents (resp. target agents) involved in the interaction. Particular interactions where an agent interacts with itself or with the environment, i.e. with T = ∅, are called degenerate interactions.
Definition 8 An interaction I is in normal form if and only if cardS(I) = 1.

It has been shown that any interaction can be expressed in normal form [8]. Thus, in the following sections of this paper, interactions are supposed to be in normal form, mainly for complexity reasons [8].

3.4 Problem analysis

In addition to the formal specification of simulations, IODA provides a set of algorithms to go from model analysis to concrete implementation. Those algorithms are demonstrated in the JEDI platform (see § 4) in the context of reactive and situated agents, but could also be implemented for any other kind of multi-agent system. According to our methodology, the design of a simulation follows 5 steps:
1. Identify all agent families as well as all interactions of the simulation. This leads to the definition of a matrix between source agents and target agents containing interactions. This step is called "assignation of interactions to source and target agents".
2. Define all primitives needed to write the activation conditions and the action sequence of the interactions.
3. Identify the action and perception primitives that will be implemented by each agent family, and how they will be implemented.
4. Define for each assigned interaction I a priority p(I) and a limit distance d(I) (see § 3.7). This implies refining the initial matrix.
5. Define how the matrix evolves during the simulation, i.e. whether agents can change their own or others' behavior by changing a line or a column of the matrix.

To help the design of simulations, the assignation of interactions to source and target agents is summarized into a matrix called the Interaction Matrix.
3.5 The Interaction Matrix
Agents may interact only if target agents are present in the neighborhood of the source agent, but interaction is also constrained by a limit distance. Indeed, seeing a target does not mean that a source agent may perform the interaction to slap with it: it has to be close enough to the target, and this distance depends on the source agent's properties. This notion is independent of grid-like environments: it may be a Minkowski distance as well as a social distance, etc. Additionally, every assigned interaction is endowed with a priority, so as to build a hierarchy between them from the viewpoint of the source agent, which is used in the interaction selection process (see § 3.7). These priorities may be constant or dynamic, depending on the nature of the source agent.

Definition 9 The assignation aS/T of an interaction set (Ij)j∈[1,n] between a source agent family S and a set of target agent families T describes the set of interactions that agents belonging to S may perform as sources together with sets of agents from T as targets. It is defined by a set of tuples (Ij, pj, cj, dj)j∈[1,n], named assignation elements, where:
• Ij is an interaction that S can perform and all x ∈ T can undergo;
• pj is the priority of this assignation of interaction Ij;
• cj is the interaction's cardinality (i.e. the number of awaited targets);
• dj is the limit distance allowed between S and all x ∈ T so that S may perform the interaction with T.
N.B.: Elements of the assignation aS/∅ of degenerate interactions are (Ij, pj) pairs.
Definition 10 If F is the set of all agent families in a simulation, then the interaction matrix of the simulation is the set M = (aS/T)S∈F,T⊆F of all assignations between all relevant source agent families S and target agent family sets T, according to the behaviors to be modeled (see Fig. 1).
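A minimal Java sketch of Definitions 9 and 10 (our own hypothetical types, not JEDI's): an assignation element is just the tuple (Ij, pj, cj, dj), and the matrix maps a (source family, target families) cell to its assignation:

```java
import java.util.*;

/** Hypothetical sketch of Definitions 9-10 (not the actual JEDI API). */
record AssignationElement(String interaction, int priority,
                          int cardinality, double limitDistance) {}

class InteractionMatrix {
    // One cell a_{S/T} per (source family, set of target families),
    // keyed here by a string such as "Wolf/Sheep" or "Grass/∅".
    private final Map<String, List<AssignationElement>> cells = new HashMap<>();

    void add(String cell, AssignationElement e) {
        cells.computeIfAbsent(cell, k -> new ArrayList<>()).add(e);
    }

    /** The assignation a_{S/T}: empty if the cell was never filled. */
    List<AssignationElement> assignation(String cell) {
        return cells.getOrDefault(cell, List.of());
    }
}
```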
3.6 Agent libraries
Because agents from different families may have some similar behaviors, agents from an A agent family may be a particular subset of a B agent family. Thus, if M = (aS/T)S∈F,T⊆F is the interaction matrix of a simulation, S and T may be abstract sets of agent families like groups, teams, etc. We define a particular algebra to specify the relations between agent families, especially how they share their assignation elements, through 3 matrix modification operators:

Definition 11 Let F be the set of all agent families.
• The specialization of an agent family X by an agent family Y is noted Y : X. It means that agents of the Y family inherit all assignation elements, perception processes, primitives and properties of the X family.
• The addition of an assignation element e with source agent family S ∈ F and target agent families T ⊆ F to the interaction matrix is noted +(aS/T, e).
• The suppression of an inherited assignation element e with source agent family S ∈ F and target agent families T ⊆ F is noted −(aS/T, e).
• The modification of an inherited assignation element e = (I, p, c, d) with source agent family S ∈ F and target agent families T ⊆ F is noted ∗(aS/T, e, I′, p′, c′, d′).
• The modification of an inherited assignation element e = (I, p) with source agent family S ∈ F and target agent families T ⊆ F is noted ∗(aS/T, e, I′, p′).

Property 1 Let F be the set of all agent families, X, S, Y ∈ F, T ⊆ F, e an assignation element, I, I′ two interactions, d, d′ ∈ R and c, c′, p, p′ ∈ N.
• Generally, (Y : X) ⇒ ∀T ⊆ F, aX/T ⊆ aY/T
• +(aS/T, e) ⇒ e ∈ aS/T
• −(aS/T, e) ⇒ e ∉ aS/T
• ∗(aS/T, (I, p, c, d), I′, p′, c′, d′) ⇒ ((I, p, c, d) ∉ aS/T ∧ (I′, p′, c′, d′) ∈ aS/T)
source \ target         ∅                   Grass          Sheep           Goat            Wolf
Grass                   +(Grow;0)
Animal                  +(Die;3) +(Move;0)
Herbivore                                   +(Eat;2;1;0)
Sheep:Animal,Herbivore                                     +(Breed;1;1;1)
Goat:Animal,Herbivore                                                      +(Breed;1;1;1)
Wolf:Animal             *((Die;3),Die,4)                   +(Eat;2;1;0)    +(Eat;3;1;0)    +(Breed;1;1;1)

Figure 1. Example of an interaction matrix for a predator/prey simulation with 4 species. The '∅' column contains degenerate interactions. In this example, the '+' operator uses either one integer representing the degenerate assignation element's priority, or three integers representing the assignation element's priority, its cardinality and its limit distance. The '∗' operator, in this case, is used to modify the priority of the inherited "Die" interaction for wolves.
• ∗(aS/T, (I, p), I′, p′) ⇒ ((I, p) ∉ aS/T ∧ (I′, p′) ∈ aS/T)

In the interaction matrix, a cell is the intersection of a line, corresponding to the interactions that an agent of the S family can perform, and a column, corresponding to the interactions that a set of agents of T families can undergo. Thus aS/T is implicit in the operators used in the matrix of Fig. 1. Such a formalism is platform-independent, especially the specialization notion, whose meaning changes along with the programming language: inheritance for an object language, kind-of in a frame language, etc.
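A sketch of how these operators can act on one line of the matrix (hypothetical Java, with elements reduced to (name, priority) pairs as for degenerate interactions):

```java
import java.util.*;

/** Hypothetical sketch of the Definition 11 operators on one matrix
 *  line: specialization copies the parent family's elements, '-' and
 *  '*' then suppress or modify inherited ones. */
class FamilyLine {
    // target families -> assignation elements, here just (name, priority)
    final Map<String, List<String[]>> line = new HashMap<>();

    /** Y : X — start from a copy of the parent family's line. */
    static FamilyLine specialize(FamilyLine parent) {
        FamilyLine y = new FamilyLine();
        parent.line.forEach((t, es) -> y.line.put(t, new ArrayList<>(es)));
        return y;
    }

    void add(String target, String name, String priority) {       // +
        line.computeIfAbsent(target, k -> new ArrayList<>())
            .add(new String[]{name, priority});
    }

    void suppress(String target, String name) {                    // -
        line.getOrDefault(target, new ArrayList<>())
            .removeIf(e -> e[0].equals(name));
    }

    void modify(String target, String name, String newPriority) {  // *
        suppress(target, name);
        add(target, name, newPriority);
    }
}
```

With such a structure, the Wolf row of Fig. 1 would be obtained by specializing the Animal row and then calling modify("∅", "Die", "4").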
3.7 Interaction Selection Basics
The core of an agent's behavior is the interaction selection process (see Def. 12). This process checks whether activation conditions are met, finds targets to interact with, selects a particular set of targets, considers interactions with the correct priorities, and finally performs the sequence of actions.

Definition 12 Interaction selection is the process an agent uses in order to select an interaction to perform (i.e. as a source) on particular targets, given a particular valuation of the global state of the simulation.

Both the eligibility (syntactic) criterion and the realizability (semantic) criterion, as well as the interaction potential set, are defined in this section to help the census of all possible interactions for a source agent x.

Definition 13 Let dist(x, y) be the distance between two agents x and y. The assignation element e = (Ij, pj, cj, dj) is said to be eligible for the source agent x and the set Targ of target agents – written eligible(e, x, Targ) – if and only if e ∈ ax/Targ and (cardT(Ij) ≠ 0 ⇒ ∀y ∈ Targ, y ∈ N(x) ∧ dist(x, y) ≤ dj).
Definition 14 Let cond(I, x, Targ) be the activation conditions of the interaction I applied to the source agent x and the set of target agents Targ. The assignation element e = (Ij, pj, cj, dj) is said to be realizable for the source x and the set Targ of targets – written realizable(e, x, Targ) – if and only if: eligible(e, x, Targ) ∧ cond(Ij, x, Targ).

Definition 15 The "p-level interaction potential" of an agent x – written Pp(x) – is the set of all realizable assignation elements with x as a source, for any target set: Pp(x) = {(e, T), e = (Ie, pe, ce, de) | T ⊆ N(x) ∧ p = pe ∧ realizable(e, x, T)}.

As a consequence, interaction selection is the process where an agent x selects an element from Ppmax(x), where pmax is the highest priority such that Ppmax(x) ≠ ∅. All the definitions and properties given in this section are platform-independent. Their implementation in a specific programming language implies many choices. We propose in the following section a possible implementation of IODA concepts in the Java language.
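The selection mechanism itself can be sketched as the search for the highest-priority non-empty potential (hypothetical Java; the realizable callback stands for the conjunction of Definitions 13 and 14):

```java
import java.util.*;

/** Hypothetical sketch of Definitions 13-15 (not the actual JEDI code):
 *  collect the realizable (element, target set) pairs of the highest
 *  priority p such that P_p(x) is non-empty. */
class InteractionSelection {

    record Elem(String interaction, int priority, int cardinality, double limitDistance) {}

    /** Stands for eligible(e, x, Targ) ∧ cond(e, x, Targ). */
    interface Realizable { boolean test(Elem e, List<String> targets); }

    static List<Map.Entry<Elem, List<String>>> highestPotential(
            List<Elem> assigned, List<List<String>> targetSets, Realizable realizable) {
        List<Map.Entry<Elem, List<String>>> best = new ArrayList<>();
        int bestP = Integer.MIN_VALUE;
        for (Elem e : assigned)
            for (List<String> t : targetSets)
                if (t.size() == e.cardinality() && realizable.test(e, t)) {
                    if (e.priority() > bestP) { best.clear(); bestP = e.priority(); }
                    if (e.priority() == bestP) best.add(Map.entry(e, t));
                }
        return best; // the agent then picks one pair, e.g. at random
    }
}
```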
4 From Methodology to Implementation
The JEDI platform implements the formal concepts defined in IODA, which means there is a univocal path from problem analysis in IODA to implementation in JEDI. Besides, this transition between model and implementation is automated by a generator called JEDI-Builder. Note that JEDI is more a proof of the usefulness of IODA concepts than a regular simulation platform: the IODA methodology may be implemented in other languages, as we did in Netlogo.
4.1 Implementation Choices
Implementation choices define the scope of simulation models supported by a platform. Their consequences are discussed in [7]; therefore this section does not argue in detail about the reasons for those choices. In JEDI, these are:
• Interaction cardinality is restricted.
• Simulation is in discrete time.
• Situated: simulation takes place in a two-dimensional grid.
• Everything is an agent, which allows a uniform treatment of things (called artifacts, objects, tools, patches, etc.) and "true" agents at implementation.
Definition 16 An agent is said to be active if it can perform at least one interaction. Otherwise it is said to be passive.

In JEDI, the only difference between passive and active agents lies in the interaction matrix. This homogeneous representation of agents makes the transition of agents between passive and active easier. Interactions are reified in a Java abstract class called Interaction. Each agent family S ∈ F – represented by a class inheriting from Agent – contains a set canPerform which is a part of the interaction matrix. It is defined such that ∀x ∈ S, canPerform(x) = {aS/T, ∀T ⊆ F}. Thus each line of the interaction matrix is defined in an agent family. The abstract class Moteur is the core of the simulation, where the run() method executes the main algorithm of the simulation, i.e. performs every step of the simulation (see Fig. 2).
Let A be the set of agents in the environment and Aact ⊆ A the set of active agents.
1. Reorder Aact according to a particular criterion (see Sect. 4.2), for instance a random order (equitable choice);
2. Set all agents in A operative;
3. For each operative agent a ∈ Aact do:
(a) Define the part of the environment H(a) perceived by a;
(b) Define the set of all neighboring agents N(a), and remove from it all non-operative agents;
(c) Let p = maximal priority in canPerform(a);
(d) Compute Pp(a); while Pp(a) = ∅, decrement p and compute again;
(e) If p = 0 and P0(a) = ∅, then a cannot perform any interaction. It remains operative but ends its simulation step;
(f) Else, select an element from Pp(a), i.e. an element ((I, p, c, d), Targ) containing an assignation element and a set of target agents, using the interaction selection process of the agent; for instance a random choice;
(g) Perform the interaction I with a as source and Targ as targets;
(h) Deactivate a and all agents in Targ.

Figure 2. Algorithm of a simulation step.
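A condensed Java sketch of this loop (our own hypothetical generic engine, not the actual Moteur class; the abstract hooks stand for steps (a)-(g)):

```java
import java.util.*;

/** Hypothetical sketch of the step of Figure 2 (not the actual JEDI
 *  engine). The hooks neighborhood(), select() and perform() are
 *  assumptions standing for steps (a)-(g) of the figure. */
abstract class EngineSketch<A> {

    record Choice<A>(String interaction, A source, List<A> targets) {}

    abstract List<A> neighborhood(A agent, Set<A> operative);        // (a)-(b)
    abstract Optional<Choice<A>> select(A agent, List<A> neighbors); // (c)-(f)
    abstract void perform(Choice<A> choice);                         // (g)

    void step(List<A> active, Set<A> all, Random rng) {
        Collections.shuffle(active, rng);      // 1. equitable random order
        Set<A> operative = new HashSet<>(all); // 2. all agents operative
        for (A a : active) {
            if (!operative.contains(a)) continue;
            Optional<Choice<A>> c = select(a, neighborhood(a, operative));
            if (c.isEmpty()) continue;              // (e) nothing realizable
            perform(c.get());                       // (g)
            operative.remove(a);                    // (h) deactivate source...
            operative.removeAll(c.get().targets()); // ...and its targets
        }
    }
}
```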
4.2 JEDI Tuning
In order to build simulations with large-scale computations, the programmer has to control the complexity of many parts of the simulation platform in order to find a tradeoff between performance and implementation bias. JEDI's modular decomposition defines a set of parameters for this purpose:
• An agent's halo H(x) may be defined at will as a set of cells.
• The "p-level interaction potential" computing complexity (3d in Fig. 2) may be reduced if needed, though this may introduce a bias in the evaluation order of assignation elements and target sets; for instance, a census of only one target set Targ per assignation element e.
• The interaction selection process may easily be customized by writing how to select an element from Pp(x).
• Pseudo-parallelism may be tuned by the order according to which agents are evaluated (1 in Fig. 2), knowing what kinds of bias are introduced [10].
• The interaction matrix is an object shared between agents when it is not modified during the simulation.
5 Conclusion
Designing a simulation is the art of finding a tradeoff between model precision – in order to implement the model without any ambiguity – and model universality – in order to easily implement it on any simulation platform. Most simulation platforms neglect one of those points and sometimes do not even clearly define the model they use. In this paper we have presented a formal method and an architecture for the design of multiagent simulations, called IODA, which uses a homogeneous representation of the actions performed by agents, named Interaction. Actions involving a single agent and complex actions involving many communicating agents are both represented with the same formalism. As a consequence, the interaction selection process is also
the same for all agents, and can be defined independently from both agents and interactions. Knowledge and processing are not mixed; therefore the user is able to build reusable agent and interaction libraries along with simulations. Moreover, the interaction matrix helps to design simulations with large-scale knowledge representation, and to build automatically the corresponding implementation through a code generator. The JEDI simulation platform provides a simple implementation tool for IODA models, and defines an interaction selection process suitable to reactive, cognitive or any other kind of agents. In addition, it points up a set of parameters that can be tuned at will. This aims at controlling implementation bias when adapting the complexity of the platform to match large-scale computation requirements.
Acknowledgements This research is supported by the FEDER and the "Contrat Plan État Région TAC" of Nord-Pas de Calais.
REFERENCES
[1] J. R. Anderson, D. Bothell, M. D. Byrne, S. Douglass, C. Lebiere, and Y. Qin, 'An integrated theory of the mind', Psychological Review, 111(4), (2004).
[2] Nourredine Bensaid and Philippe Mathieu, 'A hybrid and hierarchical multi-agent architecture model', in Proceedings of the Second International Conference and Exhibition on the Practical Application of Intelligent Agents and Multi-Agent Technology, London, UK, (April 1997).
[3] Jean-Pierre Briot, Thomas Meurisse, and Frédéric Peschanski, 'Une expérience de conception et de composition de comportements d'agents à l'aide de composants', L'Objet, Revue des Sciences et Technologies de l'Information, 12(4), (2006).
[4] R. Burkhart, 'The Swarm multi-agent simulation system', in Position Paper for OOPSLA'94 Workshop on 'The Object Engine', (1994).
[5] Takuo Doi, Yasuyuki Tahara, and Shinichi Honiden, 'IOM/T: an interaction description language for multi-agent systems', in AAMAS'05: Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems, ACM, (2005).
[6] Olivier Gutknecht, Jacques Ferber, and Fabien Michel, 'Integrating tools and infrastructures for generic multi-agent systems', in Proceedings of the Fifth International Conference on Autonomous Agents, eds., Jörg P. Müller, Elisabeth Andre, Sandip Sen, and Claude Frasson, Montreal, Canada, (2001). ACM Press.
[7] Yoann Kubera, Philippe Mathieu, and Sébastien Picault, 'La complexité dans les simulations multi-agents', in Actes des Journées Francophones sur les Systèmes Multi-Agents (JFSMA'07), Cépaduès-Editions, Carcassonne, France, (2007).
[8] Philippe Mathieu, Sébastien Picault, and Jean-Christophe Routier, 'Donner corps aux interactions (l'interaction enfin concrétisée)', in Actes de la conférence MFI'07, Paris, France, (2007).
[9] Philippe Mathieu, Jean-Christophe Routier, and Pascal Urro, 'Un modèle de simulation agent basé sur les interactions', in Actes des Premières Journées Francophones sur les Modèles Formels de l'Interaction (MFI'01), Toulouse, France, (2001).
[10] Fabien Michel, Jacques Ferber, and Olivier Gutknecht, 'Generic simulation tools based on MAS organization', in Proceedings of the 10th European Workshop on Modelling Autonomous Agents in a Multi Agent World MAMAAW'2001, Annecy, France, (2001).
[11] Donald A. Norman, The Psychology of Everyday Things, Basic Books, 1988.
[12] Uri Wilensky, 'Netlogo', Technical report, Center for Connected Learning and Computer-Based Modeling, (1999).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-388
Optimal Coalition Structure Generation In Partition Function Games

Tomasz Michalak, Andrew Dowell, Peter McBurney and Michael Wooldridge
Department of Computer Science, The University of Liverpool, L69 3BX
Email: {tomasz, adowell, mcburney, mjw}@liv.ac.uk

Abstract. In multi-agent systems (MAS), coalition formation is typically studied using characteristic function game (CFG) representations, where the performance of any coalition is independent of co-existing coalitions in the system. However, in a number of environments, there are significant externalities from coalition formation, where the effectiveness of one coalition may be affected by the formation of other distinct coalitions. In such cases, coalition formation can be modeled using partition function game (PFG) representations. In PFGs, to accurately generate an optimal division of agents into coalitions (the so-called CSG problem), one would have to search through the entire space of coalition structures since, in the general case, one cannot predict the values of the coalitions affected by the externalities a priori. In this paper we consider four distinct PFG settings and prove that in such environments one can bound the values of every coalition. From this insight, which bridges the gap between PFG and CFG environments, we modify the existing state-of-the-art anytime CSG algorithm for the CFG setting and show how this approach can be used to generate the optimal CS in the PFG settings.
1 Introduction & Motivation
In multi-agent systems (MAS), coalition formation occurs when distinct autonomous agents group together to achieve something more efficiently than they could accomplish individually. One of the main research issues in co-operative MAS is to determine which division of agents into disjoint coalitions (i.e. a coalition structure (CS)) maximizes the total payoff of the system [12, 10]. To this end, coalition formation is typically studied using characteristic function game (CFG) representations, which consist of a set of agents A and a characteristic function v which takes, as input, all feasible coalitions C ⊆ A and outputs numerical values reflecting how these coalitions perform. Furthermore, it is assumed that the performance of any coalition is independent of co-existing coalitions in the system. In other words, a coalition C has the same value in a structure CS as it does in another distinct structure CS′. Based on this characteristic of CFGs, Rahwan et al. [10] proposed an algorithm that usually generates an optimal CS without searching through the entire space of CSs.
1. The authors are grateful for financial support received from the UK EPSRC through the project Market-Based Control of Complex Computational Systems (GR/T10657/01). The authors are also thankful to Jennifer Mcmanus, School of English, University of Liverpool, for excellent editorial assistance.
In many real-life MAS environments, CFG representations are sufficient to model coalition formation, as the coalitions either do not interact with each other while pursuing their own goals or such interactions are small enough to be neglected. However, in a number of other environments, there are significant externalities from coalition formation (henceforth externalities), where the performance of one coalition may be affected by the formation of another distinct coalition. For example, as more commercial activity moves to the internet, we can expect online economies to become increasingly sophisticated, as is happening, for instance, with real-time electronic purchase of wholesale telecommunications bandwidth or computer processor resources. In such contexts, ad hoc coalition formation will need to allow for coalition externalities, thus rendering the CFG representation inadequate to model coalition formation. In contrast, externalities are accounted for in the partition function game (PFG) representation. A PFG consists of a set of agents A and a partition function which takes, as input, every feasible coalition structure (CS) and, for each coalition in each structure, outputs a numerical value that reflects the performance of the coalition in that structure. Now, the value of a coalition C in a structure CS may not be the same in another distinct structure CS′. This means that it is not generally possible to pre-determine the value of a coalition in a certain CS without actually computing it in this specific CS. Consequently, one must search through the entire space of CSs to guarantee an optimal solution. This presents a major computational challenge as, even for a moderate number of agents, there are billions of structures to search through (for example, for 14 agents there are 190,899,322 CSs and for 15 agents there are 1,382,958,545 CSs). In this paper we contribute to the literature as follows:
• We prove that it is possible to bound the coalition values in two commonly used PFG settings, thus bridging the gap between PFG and CFG environments;
• We show that our theorems regarding bounded values can be used to modify the existing state-of-the-art CSG algorithm for the CFG settings. Consequently, our new algorithm can be applied to generate the optimal CS in these PFG settings;
• Using numerical simulations we demonstrate the effectiveness of our approach which, in a number of cases, is comparable to results obtained for the CFG setting.

Much research effort has been directed at optimal CS generation in the CFG setting. Sandholm et al. [12] proposed a new way to
represent the entire set of CSs in the form of a coalition structure tree. For this representation, they developed an algorithm which generates CS values within a finite bound from the optimal value for the entire system. It initially searches the two lowest rows of the tree and then searches from the top downwards, either until the whole space has been searched or until the running time of the algorithm has expired. Based on this representation, Dang and Jennings proposed a much faster algorithm which, after performing the same initial step as that of Sandholm et al., searches exclusively through particular coalition structures in the remaining space [3]. Nevertheless, both solutions have drawbacks; notably, the worst-case bounds they provide are relatively low, and both algorithms must always search the whole space in order to guarantee an optimal solution. To circumvent these problems, Rahwan et al. recently proposed a more efficient anytime CSG algorithm for the CFG setting [10]. Using a novel representation of the search space, this algorithm is significantly faster than its existing counterparts. The input to the algorithm are coalition lists structured according to the distributed coalitional value calculation (DCVC) algorithm presented in [9]. In contrast, in the field of economics, much research has been directed at coalition formation in PFG settings. Particular efforts have been made towards computing both the Shapley value and the core solution in such settings [7, 4]. Furthermore, PFGs have been used to represent coalition formation in many practical applications, such as fisheries on the high seas [8], fuel emission reduction [5] or Research & Development (R&D) cooperation between firms [2]. Both of the former settings are examples of games with positive externalities, where the decision by one group of countries to reduce fishing activities or fuel emissions may have a positive impact on other countries. In contrast, an R&D cooperation between a group of companies could be modeled as a game with negative externalities, since the market positions of some companies could be hindered by the increased competitiveness resulting from a collusion of other companies. An excellent overview of both CFG and PFG approaches in economics is provided in [1].
2 Partition Function Games
For a set of agents A = {a1, . . . , an} and a coalition C ⊆ A, a PFG generates a non-negative integer value v(C; CS), where CS is a coalition structure of A and C ∈ CS. Following Hafalir [6], a PFG is said to have weak positive externalities if for every three pairwise disjoint subsets C, S, T ⊆ A and for any structure CS′ of A \ (S ∪ T ∪ C):

v(C; {S ∪ T, C} ∪ CS′) ≥ v(C; {S, T, C} ∪ CS′).

In the case where the inequality is ≤, the PFG is said to exhibit weak negative externalities. Intuitively, this property means that a game has positive (respectively, negative) externalities if a merger between two coalitions makes every other coalition better (worse) off. Furthermore, a PFG is weakly super-additive (sub-additive) if for any S, T ⊆ A with S ∩ T = ∅ and any structure CS′ of A \ (S ∪ T):

v(S ∪ T; {S ∪ T} ∪ CS′) ≥ (≤) v(S; {S, T} ∪ CS′) + v(T; {S, T} ∪ CS′).

Intuitively, this means that a PFG is super-additive (sub-additive) if, when two coalitions Ci and Cj in a structure, say CS = {C1, . . . , Ci, . . . , Cj, . . . , Ck}, join together to form a coalition C′ = Ci ∪ Cj, then the value of C′ in the structure CS′ = {C′, C1, . . . , Ci−1, Ci+1, . . . , Cj−1, Cj+1, . . . , Ck} is at least (at most) as large as the sum of the values of Ci and Cj in CS.

Classic results in game theory tell us that for super-additive CFGs (where for any two disjoint coalitions S, T: v(S ∪ T) ≥ v(S) + v(T)) the optimal CS is the grand coalition (i.e. the coalition containing every agent in the system), whereas in sub-additive CFGs (where for any two disjoint coalitions S, T: v(S ∪ T) ≤ v(S) + v(T)) the optimal structure is the CS of singletons, i.e. the structure where all the agents act as individuals.² We now show, with the aid of an example (taken from [6]), that this does not necessarily hold in a super- (sub-) additive PFG setting. Consider the following super-additive PFG for A = {1, 2, 3}, where, in addition, there are negative externalities:
• v((i); {(1), (2), (3)}) = 4 for i = 1, 2, 3;
• v((j, k); {(i), (j, k)}) = 9 and v((i); {(i), (j, k)}) = 1 for all i, j, k ∈ A with i ≠ j ≠ k; and
• v(A; {A}) = 11.

Clearly, the super-additivity requirement is met but the grand coalition is not the optimal structure, since v(A; {A}) = 11 < Σ_{i=1}^{3} v((i); {(1), (2), (3)}) = 12. Thus, this example shows that the grand coalition is not always the optimal structure in a super-additive PFG with negative externalities. Equally, for the same A, suppose that the values of the partition function are as follows:
• v((i); {(1), (2), (3)}) = 3 for i = 1, 2, 3;
• v((j, k); {(i), (j, k)}) = 2 and v((i); {(i), (j, k)}) = 7 for all i, j, k ∈ A with i ≠ j ≠ k; and
• v(A; {A}) = 4.

In this game, the sub-additivity property is met but the CS of singletons is not the optimal CS, due to the positive externalities. This shows that this structure is not always optimal in sub-additive PFGs with positive externalities. Thus, the classic results from the CFG setting do not always hold in the PFG one. Consequently, in this paper, we shall study four classes of PFG:
1. super-additive games with positive externalities (PF^+_sup);
2. super-additive games with negative externalities (PF^−_sup);
3. sub-additive games with positive externalities (PF^+_sub);
4. sub-additive games with negative externalities (PF^−_sub).

² There also exist similar definitions for the strong positive and negative externalities and strong super- and sub-additivity, in which the signs ≤ and ≥ are replaced with < and >. In the remainder of this paper, whenever we refer to externalities and additivity, we mean their weak forms. Note that strong relationships are a subset of weak ones.
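The first counterexample above is small enough to check exhaustively. Here is a Java sketch (names are ours) that enumerates the five coalition structures over three agents with the stated values and confirms that the singleton structure, not the grand coalition, is optimal:

```java
import java.util.*;

/** Sketch verifying the 3-agent counterexample: a super-additive PFG
 *  with negative externalities whose optimal CS is not the grand
 *  coalition. The values are exactly those given in the text. */
public class PfgCounterexample {
    public static void main(String[] args) {
        // Every CS over {1,2,3}, mapped to the sum of its coalition values.
        Map<String, Integer> csValues = new LinkedHashMap<>();
        csValues.put("{(1),(2),(3)}", 4 + 4 + 4); // 12
        csValues.put("{(1),(2,3)}",   1 + 9);     // 10
        csValues.put("{(2),(1,3)}",   1 + 9);     // 10
        csValues.put("{(3),(1,2)}",   1 + 9);     // 10
        csValues.put("{(1,2,3)}",     11);        // grand coalition

        String best = Collections.max(csValues.entrySet(),
                Map.Entry.comparingByValue()).getKey();
        System.out.println("optimal CS: " + best); // {(1),(2),(3)}, value 12
    }
}
```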
v(C; {S ∪ T, C} ∪ CS ) ≥ v(C; {S, T, C} ∪ CS ). In the case where the inequality is ≤ the PFG is said to exhibit weak negative externalities. Intuitively, this property means that a game has positive (respectively, negative) externalities if a merger between two coalitions makes every other coalition better (worse) off. Furthermore, a PFG is weakly super-additive (sub-additive) if for any S, T ⊆ A with S∩T = ∅ and structure CS of A\S∪T then:
Figure 1: Paths for a six agent setting
v(S ∪ T ; {S ∪ T } ∪ CS ) ≥ (≤)v(S; {S, T } ∪ CS ) + v(T ; {S, T } ∪ CS ). Intuitively, this means that a PFG is super-additive (subadditive) if two coalitions Ci and Cj in a structure, say CS = C1 , . . . Ci , Cj , . . . , Ck , join together to form coalition C = Ci ∪ Cj then the value of C in the structure CS =
The Sandholm et al. tree representation of the CS space, briefly described in Section 2, is very useful in solving the CSG prob2
There also exists similar definitions for the strong positive and negative externalities and strong super- and sub-additivity, in which signs ≤ and ≥ are replaced with < and >. In the remainder of this paper, whenever we refer to externalities and additivity, we mean their weak forms. Note that strong relationships are a subset of weak ones.
Figure 1: Paths for a six agent setting

The Sandholm et al. tree representation of the CS space, briefly described in Section 2, is very useful in solving the CSG problem for PFGs. Figure 1 displays a modified version of the Sandholm et al. tree for six agents, where nodes (hereafter configurations) represent subspaces of CSs containing coalitions of the particular sizes indicated (cf. [11]). For instance, the configuration {5, 1} denotes the subspace of all CSs containing exactly two coalitions, of sizes 5 and 1, for 6 agents, i.e. {(12345), (6)}, {(12346), (5)}, {(12356), (4)}, {(12456), (3)}, {(13456), (2)} and {(1), (23456)}. The arrows between the subspaces show how a merger of two coalitions converts one CS to the other. For example, the arrow {4, 1, 1} → {4, 2} shows how the merger of the two coalitions of size 1 converts the configuration {4, 1, 1} to {4, 2}.

The notion of weakness implies that there can be many CSs with the optimal value. Therefore, in actual fact, we should speak about a set of optimal coalition structures which, in a special case, might contain every feasible CS; this could occur, for example, when all weak externalities are zero and weak super- (sub-) additivity does not increase (decrease) the combined value of merging coalitions.

Theorem 1 In PF^+_sup (PF^-_sub) the grand coalition (the coalition structure of singletons) always belongs to the set of optimal coalition structures. Furthermore, assuming that super- (sub-) additivity is not weak and both the positive and negative externalities are not weak, then in PF^+_sup (PF^-_sub) the grand coalition (the coalition structure of singletons) is the only optimal structure.

Proof: Consider PF^+_sup (PF^-_sub). Beginning with configuration {1, 1, 1, . . . , 1}, it is possible to reach configuration {n} by a variety of paths. Assume that we move from a coalition structure CS in configuration G of size k to a structure CS′ in configuration G′ of size k − 1, ∀k = n, . . . , 2. In such a case, CS′ must contain one coalition which is the union of exactly two coalitions in CS ∈ G, and k − 2 'other' coalitions of CS which were not involved in the merge. Due to the super-additivity (sub-additivity) property, the value of the merged coalition in CS′ must be greater than (less than) or equal to the sum of the component coalitions in CS. Furthermore, as a result of the positive (negative) externalities, the value of the other coalitions in CS′ must not be smaller (bigger) than in CS. Consequently, the value of CS′ ∈ G′ is not smaller (not bigger) than the value of CS ∈ G. Without loss of generality, this is applicable to every path; thus the configuration {n} ({1, 1, 1, . . . , 1}) must contain a structure whose value is not smaller than the values of the CSs in every other configuration. Hence, the grand coalition (structure of singletons) always belongs to the set of optimal coalition structures in PF^+_sup (PF^-_sub).

Waiving the assumption of weakness (where the '≤' and '≥' signs are replaced with '<' and '>', respectively, in both super- and sub-additivity as well as positive and negative externality), the above proof remains valid, and it is not difficult to show that in the PF^+_sup (PF^-_sub) setting the grand coalition (structure of singletons) is the only optimal structure.
It immediately follows that for both these PFGs, it is not necessary to search the entire CS space to find the optimal CS.
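The one-step argument in the proof can be phrased as an executable check. The sketch below uses our own encoding (structures as tuples of sorted tuples) and assumes v is a Python function implementing the partition function; iterating the assertion along any path from the singletons to {n} yields Theorem 1 for PF^+_sup.

    def merge(cs, i, j):
        """Merge the i-th and j-th coalitions of structure `cs`."""
        merged = tuple(sorted(cs[i] + cs[j]))
        rest = tuple(c for k, c in enumerate(cs) if k not in (i, j))
        return tuple(sorted(rest + (merged,)))

    def check_merge_monotone(v, cs):
        """In a PF^+_sup game, no single merge may decrease the total
        value of a structure: super-additivity covers the merged pair,
        positive externalities cover the bystanders."""
        total = sum(v(c, cs) for c in cs)
        for i in range(len(cs)):
            for j in range(i + 1, len(cs)):
                cs2 = merge(cs, i, j)
                assert sum(v(c, cs2) for c in cs2) >= total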
3 Bounded Coalition Values in PF^-_sup and PF^+_sub
In the PFG setting, each coalition (with the exception of the grand coalition and some coalitions in the second level of the Sandholm et al. tree) may have many values, depending on which CS it belongs to. This means that we cannot determine an exact value of a coalition in a particular structure without actually searching it. However, we will now show that, by searching only certain paths in the Sandholm et al. representation, it is possible to bound the value of every coalition in the entire tree. As the PF^-_sup problem is dual to the PF^+_sub problem, our result can be presented for both classes of games simultaneously.

Figure 2: An extract from the Sandholm et al. tree for 6 agents

Theorem 2 Consider the PF^-_sup (PF^+_sub) setting and the coalition Cx in the structures CS′ = {Cx, (i1), . . . , (i_{n−|Cx|})} and CS″ = {Cx, Cy}, where (i1), . . . , (i_{n−|Cx|}) ∉ Cx and Cy = A \ Cx. The value of Cx in CS′ is the greatest (smallest) value of Cx in every coalition structure it belongs to, i.e. for every CS containing Cx, v(Cx; CS′) ≥ (≤) v(Cx; CS). The value of Cx in CS″ is the smallest (greatest) value of Cx in every coalition structure it belongs to, i.e. for every CS containing Cx, v(Cx; CS″) ≤ (≥) v(Cx; CS).
Proof: First consider the value of Cx in CS′ (i.e. v(Cx; CS′)). In Figure 1, CS′ can belong to any of the configurations on the following path: {1, 1, 1, 1, 1, 1} → {2, 1, 1, 1, 1} → {3, 1, 1, 1} → {4, 1, 1} → {5, 1}. Every coalition Cx with |Cx| > 1 which appears in any configuration on this path is the only non-trivial coalition that has been formed. This guarantees that v(Cx; CS′) has never been affected by a negative (positive) externality. Conversely, in all the other configurations where Cx appears, other non-trivial coalitions co-exist whose creation, by definition, has induced a negative (positive) externality on Cx. In such configurations the values of Cx will be at most (least) equal to v(Cx; CS′) since, as is visible in Figure 1, one can always reach any other configuration containing CSs with Cx starting from CS′.³ Since, on such a path, Cx is only subject to negative (positive) externalities, v(Cx; CS′) must be at least as big (small) as in any other CS. Therefore, v(Cx; CS′) is the greatest (smallest) value of Cx in every CS that it belongs to.

Now consider the value of Cx in CS″ (i.e. v(Cx; CS″)). Cx is a part of both CS′ and CS″; therefore, it is always possible to find a path which starts from CS′ and leads to CS″, i.e. CS′ → . . . → CS″. Since Cx is only subject to consecutive negative (positive) externalities, the value of Cx will decrease (increase) or, at most (least), remain the same every time one traverses this path, moving from one configuration to another. Consequently, v(Cx; CS″) will not be greater (smaller) than v(Cx; CS′), or than the value of Cx in any other configuration on this path. Similarly, starting from any other configuration containing Cx, it is always possible to find a path leading to CS″. Since Cx is subject to consecutive negative (positive) externalities along such paths, the above argument is equally compelling. Therefore, the value of Cx in CS″ is the smallest (greatest) value of Cx in every coalition structure it belongs to.

Consider a few elements of the original Sandholm et al. tree in Figure 2. Theorem 2 says that under PF^-_sup, for every CS containing (123), v((123); CS_a) ≥ v((123); CS). Initially, it may seem possible for v((123); CS_d) to be higher than v((123); CS_a), because the former structure emerged after agent 3 joined coalition (12) in CS_b and, due to the super-additivity property, v((123); CS_d) could become much higher than v((123); CS_a). However, in actual fact, this cannot happen because of the assumed negative externalities. It is always possible to find a path from {(123), (4), (5), (6)} to any other CS that contains (123), and on such a path the value of (123) is only subject to negative externalities. Consequently, v((123); CS_d) cannot be higher than v((123); CS_a). Such reasoning can also show that v((123); CS_e) is the smallest value of (123) in Figure 2, and similar reasoning can be used to back up our claims for the PF^+_sub setting.

³ With the exception of CS′, CS″ and the grand coalition, any coalition Cx might have a number of different values in one configuration, as it belongs to a number of distinct CSs. Thus, we use the plural for "values" and "coalition structures".
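In code, the two bounding structures of Theorem 2 are easy to materialize. The helper below (names ours) returns CS′ and CS″ for a given coalition, so that its maximum and minimum values under PF^-_sup can be read off with two partition-function evaluations.

    def bounding_structures(cx, agents):
        """Return (CS', CS'') for coalition Cx over `agents`:
        CS'  = {Cx} plus a singleton for every outside agent,
        CS'' = {Cx, A \ Cx}.  Under PF^-_sup, v(Cx; CS') is Cx's
        greatest and v(Cx; CS'') its smallest value; under PF^+_sub
        the roles are reversed.  Assumes Cx is a proper, non-empty
        subset of `agents`."""
        cx = tuple(sorted(cx))
        outsiders = tuple(a for a in agents if a not in cx)
        cs_prime = tuple(sorted((cx,) + tuple((a,) for a in outsiders)))
        cs_second = tuple(sorted((cx, outsiders)))
        return cs_prime, cs_second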
4 CSG Algorithm For The PFG Setting
The Rahwan et al. CSG algorithm relies on the fact that coalition values are always constant in the CFG setting. This makes it possible to collect a number of basic statistics at the very beginning to assess which configurations are most promising and which are not. In the PFG setting, coalition values depend on the CS they belong to, so such a technique is not generally feasible. However, for both PF^-_sup and PF^+_sub, Theorem 2 allows us to construct bounds on the values of every coalition in every CS. Subsequently, we can use these bounds to construct upper and lower bounds for each configuration. In other words, our theorem bridges the gap between the two settings, making it possible to modify the existing state-of-the-art CSG algorithm so that it can generate a set of optimal CSs in the PF^-_sup (PF^+_sub) setting, often without searching the entire CS space. Let L_s denote the (structured) list containing all coalitions of size s.⁴ Our CSG algorithm can be summarized as follows:

Step 1. Compute the value of the grand coalition. For every coalition C in list L_s, 1 ≤ s = |C| < n, compute its value in the CSs where: (i) all the other agents not in C form the coalition C′ = A \ C, and (ii) every other agent not in C acts alone. These are the maximum and minimum (minimum and maximum) values of each coalition in the entire CS space, and are stored in lists L_s^max and L_s^min, which are structured as in the DCVC algorithm (see [9]);

Step 2. Partition the search space into configurations. Prune those which were searched in Step 1;

Step 3. Compute the upper bound of every remaining configuration G, denoted UB_G, using the lists of maximum values from Step 1, i.e. UB_G = Σ_{s∈G} max L_s^max. Set the upper bound of the entire system, UB, to be the value of the highest upper bound, i.e. UB = max_G UB_G, and set the lower bound LB to be max{v(CS∗_N), max_G Avg_G}, where Avg_G = Σ_{s∈G} avg L_s^min is the lower bound on the average value of each configuration G and CS∗_N is the CS with the highest value found thus far. Order the configurations w.r.t. the value of UB_G;

Step 4. Prune away those subspaces which cannot deliver a CS with value greater than LB, i.e. those with UB_G < LB;

Step 5. Search the configuration with the highest upper bound, updating LB to be the value of the highest-valued structure found thus far (CS∗_N). During the search process, a refined branch-and-bound technique should be used;

Step 6. Once the search of the configuration in Step 5 is completed, check whether v(CS∗_N) = UB or whether all configurations have been searched or pruned. If either of these conditions holds, then the optimal CS has been found. Otherwise, go to Step 4.

⁴ See [9] for more details.

In Step 1 we compute the maximum (minimum) and the minimum (maximum) values of each coalition C in the entire tree. Storing both numbers per coalition requires twice as much memory as in the CFG setting, but ensures that the highest and lowest values of each list L_s can be computed. This makes it possible to determine upper and lower bounds for each configuration, as well as the upper bound of the entire system. Furthermore, in contrast to Rahwan et al., we cannot compute an exact average value of all the coalitions of size s_i, ∀i = 1, . . . , m, for a given configuration G = {g_{s1}, . . . , g_{sm}}. However, it is possible to compute a lower bound for such an average value using L_s^min, as no average value can be smaller than the one computed from the lists containing minimum values. In addition, the upper bound for each configuration G can be defined as the sum of the maximal values that every coalition of size s in CS ∈ G can take, i.e. Σ_{s∈G} max L_s^max.

In the PFG setting, partitioning and pruning of the search space is done as in the Rahwan et al. algorithm for the CFG setting. Also, the process of searching through the promising subspaces is similar. In particular, certain techniques ensure that no redundant calculations are performed, i.e. no CS is considered twice. However, the branch-and-bound rule needs to be modified for the PFG setting. This rule prevents traversing hopeless paths while constructing CSs in the considered configuration.

Branch and Bound Rule. Suppose that G∗ = {g_{s1}, g_{s2}, g_{s3}, g_{s4}} is the configuration with the highest upper bound which has not yet been searched. In the CFG setting, the branch-and-bound rule of Rahwan et al. goes as follows. Suppose the algorithm has already added coalitions C_{gs1}, C_{gs2} to the CS under construction. When adding the next coalition from list L_{gs3}, the rule ignores those cases which, together with max L_{gs4}, would render the value of the CS less than the current LB of the entire system. From Theorem 2, instead of the exact values of coalitions, which we do not know beforehand, we can use the maximum values as computed in Step 1 and incorporate this rule into both the PF^-_sup and PF^+_sub settings. However, with only maximum values, such a branch-and-bound rule is likely to be less effective than in the original setting.

Anytime properties. When the arguments of Sandholm et al. [12] are applied to the upper bounds of the values of coalitions, it can be proven that after Step 1 of our algorithm, where levels 1, 2 and n have already been searched, the value of CS∗_N is no smaller than 2/n of the value of the optimal CS.
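A skeleton of Steps 3 to 6 is sketched below, under the assumption that Steps 1 and 2 have already produced the per-configuration bounds. Here configurations are hashable keys (e.g. sorted tuples of part sizes), search_configuration stands in for the branch-and-bound subspace search, and all names are ours.

    def csg_pfg(configurations, ub, lb_avg, initial_best, search_configuration):
        """Steps 3-6 of the CSG algorithm, given:
          configurations  -- configurations left after Step 2 (non-empty),
          ub[G], lb_avg[G] -- UB_G and Avg_G bounds from the Step 1 lists,
          initial_best     -- (CS*_N, value) found while searching in Step 1,
          search_configuration(G, LB) -- searches subspace G with branch
                              and bound, returning its best structure/value."""
        best_cs, LB = initial_best
        UB = max(ub[G] for G in configurations)                 # Step 3
        LB = max(LB, max(lb_avg[G] for G in configurations))
        queue = sorted(configurations, key=lambda G: ub[G], reverse=True)
        while queue and LB < UB:                                # Step 6 check
            queue = [G for G in queue if ub[G] >= LB]           # Step 4 pruning
            if not queue:
                break
            G = queue.pop(0)                                    # Step 5
            cs, value = search_configuration(G, LB)
            if value > LB:
                best_cs, LB = cs, value
        return best_cs, LB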
5 Numerical Simulations
To the best of our knowledge, the CSG algorithm for the PFG setting proposed in this paper is the only one in the literature; thus there is no benchmark algorithm that can be used for a numerical comparison. Although it would be possible to adapt the CFG dynamic-programming techniques to the PFG setting, due to lack of space we will compare our results to the CSG algorithm for the CFG setting instead. As noted at the beginning of the paper, this solution has already been proven to be significantly superior to the dynamic-programming alternatives, because it does not need to search all the feasible CSs. We will show that, in many cases, our modification of this algorithm for the PFG setting also searches through only a fraction of the CS space, thus saving a vast amount of calculation time.

Simulations are performed for the PF^-_sup setting. When a new coalition is formed, the 'gain' from super-additivity is accounted for by adding a factor α^a to its value. In addition, the 'loss' from the externality imposed on the other coalitions in the structure is accounted for by multiplying their values by factors 1 − β^b, where α, β ∈ [0, 1) are randomly-generated uniform variables and a, b ≥ 1 are constants. We assume that in the system there are 10 agents, from which 115,975 CSs can be formed.⁵ In Step 1, 2028 CSs are searched, i.e. the grand coalition, the CS of singletons and 2C(10,2) + 2C(10,3) + . . . + 2C(10,8) + C(10,9) other CSs. This amount accounts for 1.75% of the search space.

The vertical axis of Figure 3 represents the proportion of the CS space searched, whereas a and b are indicated on the x and y axes, respectively. As the values of a and b increase, the 'gain' from super-additivity and the 'loss' from externalities decrease. We performed our simulations 25 times for each combination of a and b. The surface shown in Figure 3 is the average proportion of space searched by our algorithm. Furthermore, as the original CSG algorithm for the CFG setting, under the uniform distribution of coalition payoffs, searches on average about 2.5%, and as this result is independent of a and b, we do not report it in Figure 3.

We observe that when the 'gain' from super-additivity is high and the 'loss' from the negative externality is low, only a minimal proportion (under 4%) of the space need be searched in order to compute the optimal structure. In fact, in such cases, the grand coalition or a CS in the first few levels of the Sandholm et al. tree is usually the optimal structure. Consequently, it would seem that the smaller the externality, the more the PF^-_sup setting becomes like the super-additive CFG setting, thus explaining why so little of the space is searched. Conversely, when the 'gain' from super-additivity is low and the 'loss' from the negative externality is high, again only a fraction of the search space was searched. This time, the PF^-_sup setting becomes more akin to the sub-additive CFG setting, so that the CS of singletons or a CS with a relatively small number of cooperating agents tends to be optimal. However, in situations where the 'loss' from the externality and the 'gain' from the super-additivity are both either high or low, pruning appears to be ineffective, since nearly all of the search space has to be searched in order to guarantee an optimal outcome (more than 98% in many cases). This is due to an inherent characteristic of the PF^-_sup setting: namely, that the values of the structures in each configuration depend on the values of the structures in the configuration in the previous level (see Figure 1). Consequently, when the gain from super-additivity and the loss from externalities are of a similar magnitude, the extreme values of CSs in different configurations are more likely to be akin, making pruning techniques less effective. This effect is magnified by the use of the uniform distribution, since the CSs' values in all configurations tend to be relatively dispersed.

⁵ The particular challenge of simulations in the PFG setting is that (in contrast to the CFG setting) one must generate the values of all CSs beforehand. Furthermore, during the random generation of coalition values, it is important to ensure that all the CSs meet the PF^-_sup (PF^+_sub) properties. Consequently, we restrict our simulations to 10 agents and 115,975 CSs. Although this is less than the system of 27 agents considered for the CFG setting (cf. [11]), such a system in the PFG setting would require generating a CS space with more than 5.24 × 10^20 CSs.

Figure 3: Simulation results for the PF^-_sup setting
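The value-generation scheme can be sketched as follows; note that the exact functional forms α^a and 1 − β^b are our reading of the partly garbled description above, not a verbatim specification from the authors.

    import random

    def apply_merge_effects(merged_value, bystander_values, a, b):
        """When a merge creates a new coalition: add a super-additive
        'gain' of alpha**a to its value, and multiply every other
        coalition's value by the 'loss' factor (1 - beta**b), with
        alpha, beta ~ U[0, 1) drawn per coalition.  Larger a and b
        make both effects smaller, matching Figure 3's axes.
        (Functional forms are our reconstruction.)"""
        new_merged = merged_value + random.random() ** a
        new_bystanders = [v * (1 - random.random() ** b)
                          for v in bystander_values]
        return new_merged, new_bystanders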
6 Conclusion & Future Work
In this paper, we considered coalition structure formation in the presence of coalition externalities, a novel topic in the multi-agent systems literature. We modeled coalition formation with a partition function game (PFG), and considered four cases: (1) super-additive games with positive externalities (PF^+_sup) or (2) negative externalities (PF^-_sup); and (3) sub-additive games with positive externalities (PF^+_sub) or (4) negative externalities (PF^-_sub). For cases (1) and (4), we proved that computing the optimal structure is straightforward, because either the grand coalition or the CS of singletons belongs to the set of optimal CSs. In contrast, this is not true for cases (2) and (3), where any CS can belong to the set of optimal coalition structures. Therefore, for these two cases we proved that it is possible to bound the value of each coalition. From this insight, we modified the existing state-of-the-art anytime CSG algorithm for the CFG setting and showed how it can be used to generate the optimal CS in these two PFG settings. In future work, we plan to study the numerical performance of the new algorithm under different distributional assumptions regarding coalition values, and also to develop a distributed version of our approach.
REFERENCES
[1] F. Bloch, 'Non-cooperative models of coalition formation in games with spillovers', in Carraro, C. (ed.), Endogenous Formation of Economic Coalitions, ch. 2, pp. 35–79, (2003).
[2] E. Catilina and R. Feinberg, 'Market power and incentives to form research consortia', Review of Industrial Organization, 28(2), 129–144, (2006).
[3] V. Dang and N. Jennings, 'Generating coalition structures with finite bound from the optimal guarantees', in AAMAS, New York, USA, (2004).
[4] K. Do and H. Norde, 'The Shapley value for partition function form games', Discussion Paper, Tilburg University, Center for Economic Research, (2002).
[5] M. Finus and B. Rundshagen, 'Endogenous coalition formation in global pollution control: a partition function approach', Working Paper No. 307, University of Hagen, (2001).
[6] I. Hafalir, 'Efficiency in coalition games with externalities', Games and Economic Behavior, 61(2), 209–238, (2007).
[7] L. Kóczy, 'A recursive core for partition function form games', Theory and Decision, 63(1), 41–51, (2007).
[8] P. Pintassilgo, 'A coalition approach to the management of high seas fisheries in the presence of externalities', Natural Resource Modeling, 16(2), 175–197, (2003).
[9] T. Rahwan and N. Jennings, 'Distributing coalitional value calculations among cooperating agents', in AAAI, pp. 152–157, Pittsburgh, USA, (2005).
[10] T. Rahwan, S. Ramchurn, V. Dang, A. Giovannucci, and N. Jennings, 'Anytime optimal coalition structure generation', in Proceedings of AAAI 2007, (2007).
[11] T. Rahwan, S. Ramchurn, V. Dang, A. Giovannucci, and N. Jennings, 'Near-optimal anytime coalition structure generation', in IJCAI, Hyderabad, India, (2007).
[12] T. Sandholm, K. Larson, M. Andersson, O. Shehory, and F. Tohme, 'Coalition structure generation with worst case guarantees', Artificial Intelligence, 111(1–2), 209–238, (1999).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-393
Coalition Structures in Weighted Voting Games

Edith Elkind and Georgios Chalkiadakis and Nicholas R. Jennings¹

Abstract. Weighted voting games are a popular model of collaboration in multiagent systems. In such games, each agent has a weight (intuitively corresponding to resources he can contribute), and a coalition of agents wins if its total weight meets or exceeds a given threshold. Even though coalitional stability in such games is important, existing research has nonetheless only considered the stability of the grand coalition. In this paper, we introduce a model for weighted voting games with coalition structures. This is a natural extension in the context of multiagent systems, as several groups of agents may be simultaneously at work, each serving a different task. We then proceed to study stability in this context. First, we define the CS-core, a notion of the core for such settings, discuss its non-emptiness, and relate it to the traditional notion of the core in weighted voting games. We then investigate its computational properties. We show that, in contrast with the traditional setting, it is computationally hard to decide whether a game has a non-empty CS-core, or whether a given outcome is in the CS-core. However, we then provide an efficient algorithm that verifies whether an outcome is in the CS-core if all weights are small (polynomially bounded). Finally, we also suggest heuristic algorithms for checking the non-emptiness of the CS-core.
1 Introduction
Coalitional games [8] provide a rich framework for the study of cooperation both in economics and politics, and have been successfully used to model collaboration in multiagent systems [11, 3]. In such games, teams (or coalitions) of agents come together to achieve a common goal, and derive individual benefits from this activity. A particularly simple, yet expressive, class of coalitional games is that of weighted voting games (WVGs) [13]. In a weighted voting game, each player (or agent) has a weight, and a coalition wins if its members' total weight meets or exceeds a certain threshold, and loses otherwise. Weighted voting has straightforward applications in a plethora of societal and computer science settings, ranging from real-life elections to computer operating systems, as well as in a variety of settings involving multiagent coordination. In particular, an agent's weight can be thought of as the amount of resources available to this agent, and the threshold indicates the amount of resources necessary to achieve a task. A winning coalition then corresponds to a team of agents that can successfully complete this task.

Originally, research in weighted voting games was motivated by a desire to model decision-making in governmental bodies. In such settings, the threshold is usually at least 50% of the total weight, and the issues of interest relate to the distribution of payoffs within the grand coalition, i.e., the coalition of all agents. Perhaps for this reason, to date, all research on weighted voting games tacitly assumes that the grand coalition will form. However, in multiagent settings such as those described above, the threshold can be significantly smaller than 50% of the total weight, and several winning coalitions may be able to form simultaneously. Moreover, in this situation the formation of the grand coalition may not, in fact, be a desirable outcome: instead of completing several tasks, forming the grand coalition concentrates all agent resources on finishing a single task. In contrast, the overall efficiency will be higher if the agents form a coalition structure (CS), i.e., a collection of several disjoint coalitions.

To model such scenarios, in this paper we introduce a model for WVGs with coalition structures. We then focus on the issue of stability in this setting. A structure is stable when rational agents are not motivated to depart from it, and thus they can concentrate on performing their task, rather than looking for ways to improve their payoffs. Therefore, stability provides a useful balance between individual goals and overall performance. To study it, we extend the notion of the core—a classic notion of stability for coalitional games—to our setting, by defining the CS-core for WVGs. We then provide a detailed study of this concept, comparing it with the classic core and analyzing its computational properties.

Our main contributions are as follows: (1) we define a new model that allows weighted voting games to admit coalition structures (Sec. 3); (2) we define the CS-core for such games, relate it to the classic core, and describe sufficient conditions for its non-emptiness (Sec. 4); (3) we show that several natural CS-core-related problems are intractable—namely, it is NP-hard to decide the non-emptiness of the CS-core and coNP-complete to check whether a given outcome is in the CS-core (Sec. 5); interestingly, this contrasts with what holds in weighted voting games without coalition structures, where both of these problems are polynomial-time solvable; (4) we provide a polynomial-time algorithm to check if a given outcome is in the CS-core in the important special case of polynomially-bounded weights. We then show how to use this algorithm to efficiently check if a given coalition structure admits a stable payoff distribution, and suggest a heuristic algorithm to find an allocation in the core (Sec. 6). We begin with some background and a brief review of related work.

¹ School of Electronics and Computer Science, University of Southampton, UK; email: {ee, gc2, nrj}@ecs.soton.ac.uk
2 Background and Related Work
In this section, we provide an overview of the basic concepts in coalitional game theory. Let I, |I| = n, be a set of players. A subset C ⊆ I is called a coalition. A coalitional game with transferable utility is defined by its characteristic function v : 2^I → R, which specifies the value v(C) of each coalition C [14]. Intuitively, v(C) represents the maximal payoff the members of C can jointly receive by cooperating, and it is assumed that the agents can distribute this payoff between themselves in any way.

While the characteristic function describes the payoffs available to coalitions, it does not prescribe a way of distributing these payoffs. We say that an allocation is a vector of payoffs x = (x1, . . . , xn) assigning some payoff to each i ∈ I. We write x(S) to denote Σ_{i∈S} xi. An allocation is feasible for the grand coalition if x(I) ≤ v(I). An imputation is a feasible allocation that is also efficient, i.e., x(I) = v(I).

A weighted voting game (WVG) is a coalitional game G given by a set of agents I = {1, . . . , n}, their weights w = {w1, . . . , wn}, wi ∈ R+, and a threshold T ∈ R; we write G = (I; w; T). We use w(S) to denote Σ_{i∈S} wi. For a coalition S ⊆ I, its value v(S) is 1 if w(S) ≥ T; otherwise, v(S) = 0. Without loss of generality, the value of the grand coalition I is 1 (i.e., w(I) ≥ T). One of the best-known solution concepts describing coalitional stability is the core [8].
Definition 1. An allocation x is in the core of G iff x(I) = v(I) and for any S ⊆ I we have x(S) ≥ v(S).

If an allocation x is in the core, then no subgroup of agents can guarantee all of its members a higher payoff than the one they receive in the grand coalition under x. This definition of the core can therefore be used to characterize the stability of the grand coalition.

The setting where several coalitions can form at the same time can be modeled using coalition structures. Formally, a coalition structure (CS) is an exhaustive partition of the set of agents. CS(G) denotes the set of all coalition structures for G. Given a structure CS = {C1, . . . , Ck}, an allocation x is feasible for CS if x(Ci) ≤ v(Ci) for i = 1, . . . , k, and efficient for CS if this holds with equality.

Games with coalition structures were introduced by Aumann and Dreze [2], and are obviously of interest from an AI/multiagent-systems point of view, as illustrated in Section 1. Indeed, in this context, dealing with coalition structures other than the grand coalition is of utmost importance: simply put, there is a plethora of realistic application scenarios where the emergence of the grand coalition is either not guaranteed, is plainly impossible, or might be perceivably harmful (for instance, it usually makes little sense to allocate all available robots to a single task). In particular, in the context of WVGs, by forming several disjoint winning coalitions, the agents generate more payoff than in the grand coalition. Additional motivation from an economics perspective is given in [2], which contains a thorough and insightful discussion of why coalition structures arise.

Now, there exists a handful of approaches in the multiagent literature that do take coalition structures explicitly into account. Sandholm and Lesser [11] discuss the stability of coalition structures when examining the problem of allocating computational resources to coalitions. Apt and Radzik [1] also do not restrict themselves to problems where the outcome is the grand coalition only. Instead, they introduce various stability notions for abstract games whose outcomes can be coalition structures, and discuss simple transformations by which stable partitions of the set of players may emerge. Dieckmann and Schwalbe [5] also propose a version of the core with coalition structures when studying dynamic coalition formation, and so do Chalkiadakis and Boutilier when tackling coalition formation under uncertainty [4]. None of these papers studies WVGs, however.

A thorough discussion of weighted voting games can be found in [13]. The stability-related solution concepts for WVGs (without coalition structures) have recently been studied by Elkind et al. [6], who also investigate them from a computational perspective. However, there is no existing work in the literature studying WVGs with coalition structures—a class of games that we now proceed to define.
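The background definitions translate directly into code. The sketch below uses 0-based agent indices and names of our own choosing.

    from math import isclose

    def wvg_value(weights, T, coalition):
        """Characteristic function of the WVG (I; w; T): a coalition is
        winning (value 1) iff its total weight meets or exceeds T."""
        return 1 if sum(weights[i] for i in coalition) >= T else 0

    def is_imputation(weights, T, x):
        """x is an imputation iff it is feasible and efficient for the
        grand coalition I = {0, ..., n-1}, i.e. x(I) = v(I)."""
        return isclose(sum(x), wvg_value(weights, T, range(len(weights))))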
3 Coalition structures in WVGs
We now extend the traditional model for WVGs to allow for coalition structures. First, an outcome of a game is now a pair of the form (coalition structure, allocation) rather than just an allocation. Furthermore, in the traditional model, any allocation of payoffs among the participating agents is required to be an exhaustive partition of the value of the grand coalition. In other words, it is always an imputation, i.e., an allocation of payoffs that is feasible and efficient for the grand coalition I. As we now allow WVGs to admit coalition structures, we replace the aforementioned requirement with similar requirements with respect to a coalition structure. First, we no longer require an allocation to be an imputation in the classic sense. Instead, we demand that, for a given outcome (CS, x), the allocation x of payoffs for I is feasible for CS. In this way, CS may contain zero or more winning coalitions. Furthermore, we define an imputation for a coalition structure CS as a vector p of nonnegative numbers (p1, . . . , pn) (one for each agent in I), such that for every C ∈ CS it holds that p(C) = v(C) ≤ 1; we write p ∈ I(CS). That is, an imputation is now a feasible and efficient allocation of the payoff of any coalition C ∈ CS.
4 Core and CS-core of weighted voting games
In this section we define the core of WVGs with coalition structures, relate it to the "classic" core of WVGs without coalition structures, and obtain some core characterization results for a few interesting classes of WVGs. The definition of the core (Def. 1) takes the following simple form in the traditional WVG setting (see, e.g., [6]):

Definition 2. The core of a WVG G = (I; w; T) is the set of imputations p such that, ∀S ⊆ I, w(S) ≥ T ⇒ p(S) ≥ 1.

Intuitively, an imputation p is in the core whenever the payoffs defined by p are such that any winning coalition already receives a collective payoff of 1 (and therefore no coalition can improve its payoff by breaking away from the grand coalition). This notion of the core cannot be directly used for coalition structures: indeed, it demands that an allocation is an imputation in the traditional sense, and therefore no imputation for a coalition structure with more than one winning coalition can ever be in the core. We will now extend this definition to the setting with coalition structures. Namely, we define the core of weighted voting games with coalition structures, or CS-core, as follows:

Definition 3. The CS-core of a WVG G = (I; w; T) with coalition structures is the set of outcomes (CS, p) such that ∀S ⊆ I, w(S) ≥ T ⇒ p(S) ≥ 1, and ∀C ∈ CS it holds that p(C) = v(C).

Intuitively, given an outcome that is in the CS-core, no coalition has an incentive to break away from the coalition structure. Now, it is well known (see, e.g., [6]) that in weighted voting games the core is non-empty if and only if there exists a veto player, i.e., a player that belongs to all winning coalitions, and an imputation is in the core if and only if it distributes the payoff in some way between the veto players. This directly implies the following result.

Observation 1 (An imputation in the core induces an outcome in the CS-core). Let G = (I; w; T). If the core of G is non-empty, then, for any p in the core, the outcome ({I}, p) is in the CS-core of G.

However, it turns out that the CS-core may be non-empty even when the core is empty.

Example 1. Consider a weighted voting game G = (I; w; T), where I = {1, 2, 3}, w = (1, 1, 2) and T = 2. It is easy to see that none of the players in G is a veto player, so G has an empty core. On the other hand, the outcome (CS, p), where CS = {{1, 2}, {3}}, p = (1/2, 1/2, 1), is in the CS-core of G. Indeed, agent 3 is getting a payoff of 1 under this outcome, so his payoff cannot improve. Therefore, the only deviation available to the other two players is to form singleton coalitions, and this is clearly not beneficial.

We now show that if the threshold T is strictly greater than 50% of the total weight, the CS-core and the core coincide.

Proposition 1 (In absolute majority games, the cores coincide). Let G = (I; w; T) be a WVG with T > w(I)/2. Then there is an outcome (CS, p) in the CS-core of G if and only if p is in the core of G. Consequently, G has a non-empty core if and only if it has a non-empty CS-core.

Proof. Suppose that an outcome (CS, p) is in the CS-core of G. As T > w(I)/2, CS can contain at most one winning coalition C, and hence p(I) = 1. Consider any player i ∈ C such that pi > 0. If i is not a veto player, we have w(I \ {i}) ≥ T, p(I \ {i}) < 1, so (CS, p) is not in the CS-core of G, a contradiction. Hence, under p only the veto players get any payoff, which implies that p is in the core of G. Conversely, if p is in the core of G, it is easy to see that ({I}, p) is in the CS-core of G.

We can also prove the following sufficient condition for non-emptiness of the CS-core.

Theorem 1. Any WVG G = (I; w; T) that admits a partition of players into coalitions of weight T has a non-empty CS-core.

Proof. Let CS = {C1, . . . , Ck} be the corresponding partition, such that w(Ci) = T for all i = 1, . . . , k. Define p by setting pj = wj/T for all j = 1, . . . , n. Consider any winning coalition S. We have w(S) ≥ T, so p(S) = w(S)/T ≥ 1, and hence S does not want to deviate. As this holds for any S with v(S) = 1, the outcome (CS, p) is in the CS-core of G.

However, it is not the case that the CS-core of a weighted voting game is always non-empty. In particular, this follows from the fact that the CS-core coincides with the core in games with T > w(I)/2, and such games may have an empty core. We now show that the CS-core can be empty also if T < w(I)/2:

Example 2. Consider a WVG G = (I; w; T), where I = {1, 2, 3, 4, 5}, w = (1, 1, 1, 1, 1) and T = 2. We now show that this game has an empty CS-core. Indeed, consider any CS ∈ CS(G) and any p ∈ I(CS). Clearly, CS can contain at most two winning coalitions, so p(I) ≤ 2. Now, if there is a coalition C ∈ CS, |C| ≥ 3, such that pi > 0 for all i ∈ C, then any two players i, j ∈ C can deviate by forming a winning coalition and splitting the surplus p(C \ {i, j}). If all coalitions have size at most 2, then there is a player i that forms a singleton coalition (and hence pi = 0). There also exists another player j ≠ i such that pj < 1 (otherwise p(I) ≥ 4). But then S = {i, j} satisfies w(S) ≥ T, p(S) < 1, so it is a successful deviation.
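For games as small as those in the examples, Definition 3 can be checked directly by brute force. The sketch below uses 0-based indices and our own naming; the tolerance constants and the exponential subset enumeration make it suitable only for tiny games.

    from itertools import chain, combinations

    def in_cs_core(weights, T, cs, p):
        """Brute-force Definition 3: every coalition of cs must be paid
        exactly its value, and every winning subset S must already
        collect payoff at least 1 (exponential in the number of agents)."""
        n = len(weights)
        w = lambda S: sum(weights[i] for i in S)
        v = lambda S: 1 if w(S) >= T else 0
        if any(abs(sum(p[i] for i in C) - v(C)) > 1e-9 for C in cs):
            return False
        subsets = chain.from_iterable(
            combinations(range(n), k) for k in range(1, n + 1))
        return all(sum(p[i] for i in S) >= 1 - 1e-9
                   for S in subsets if w(S) >= T)

    # Example 1, with agents renumbered 0-based:
    print(in_cs_core((1, 1, 2), 2, [(0, 1), (2,)], (0.5, 0.5, 1.0)))  # True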
5 Non-emptiness of the CS-core: hardness results
In the rest of the paper, we deal with computational questions related to the notion of the CS-core. This topic is important, since in practical applications agents have limited computational resources, and may not be able to find a stable outcome if this requires excessive computation. To provide a formal treatment of complexity issues in our setting, we assume that all weights and the threshold are integers given in binary. As any rational weights can be scaled up to integers, this can be done without loss of generality.

In the previous section, we explained how to verify whether the core is non-empty or whether a given outcome is in the core. It is not hard to see that this verification can be done in polynomial time: e.g., to check the non-emptiness of the core, we simply check if w(I \ {i}) ≥ T for all i ∈ I. In WVGs with coalition structures, the situation is very different. Namely, we will show that it is NP-hard to decide whether a given WVG has a non-empty CS-core. Moreover, even if we are given an imputation, it is coNP-complete to decide whether it is in the CS-core of a given WVG. We now state these computational problems more formally.

Name: NonEmptyCSCore. Instance: Weighted voting game G = (I; w; T). Question: Does G have a non-empty CS-core?

Name: InCSCore. Instance: Weighted voting game G = (I; w; T), a coalition structure CS ∈ CS(G) and an imputation p ∈ I(CS). Question: Is (CS, p) in the CS-core of G?

Both of our reductions rely on the well-known NP-complete PARTITION problem. An input to this problem is a pair (A; K), where A is a list of positive integers A = {a1, . . . , an} such that Σ_{i=1}^{n} ai = 2K. It is a "yes"-instance if there is a subset of indices J such that Σ_{i∈J} ai = K, and a "no"-instance otherwise [7, p. 223].
Theorem 2. The problem NonEmptyCSCore is NP-hard.

Proof. We will describe a polynomial-time procedure that maps a "yes"-instance of PARTITION to a "yes"-instance of NonEmptyCSCore and a "no"-instance of PARTITION to a "no"-instance of NonEmptyCSCore. Suppose that we are given an instance (a1, . . . , an; K) of PARTITION. If there is an i such that ai > K, then obviously it is a "no"-instance of PARTITION, so we map it to a fixed "no"-instance of NonEmptyCSCore, e.g., by setting G = ({1, 2, 3, 4, 5}; (1, 1, 1, 1, 1); 2) as in Example 2. Otherwise, we construct a game G = (I; w; T) by setting I = {1, . . . , n}, wi = ai for i = 1, . . . , n, and T = K. Note that in this case we have w(I \ {i}) ≥ T for any i, so there are no veto players in G.

Suppose that we have started with a "yes"-instance of PARTITION, and let J be such that Σ_{i∈J} ai = K. Consider the coalition structure CS = {J, I \ J} and the imputation p given by pi = wi/K for i = 1, . . . , n. Note that w(J) = w(I \ J) = K, so p(J) = p(I \ J) = 1, i.e., p is a valid imputation. It is easy to see that (CS, p) is in the CS-core of G. Indeed, for any winning coalition S we have w(S) ≥ K, so p(S) ≥ 1, i.e., the members of S would not want to deviate.

On the other hand, suppose that we have started with a "no"-instance of PARTITION. Consider any outcome (CS, p) in the resulting game. Clearly, CS can contain at most one winning coalition: if there were two disjoint winning coalitions, each of them would have weight exactly K, i.e., each could be used as a "yes"-certificate for PARTITION. If CS contains no winning coalitions, then it is clearly unstable, as w(I) ≥ T, p(I) = 0. Now, suppose that CS contains exactly one winning coalition S. In this case we have p(S) = p(I) = 1 and pi = 0 for all i ∉ S. We have pi > 0 for some i ∈ S, so p(I \ {i}) < 1. Moreover, by construction, w(I \ {i}) ≥ T. Hence, I \ {i} can deviate, so (CS, p) is not in the CS-core of G.

Theorem 3. The problem InCSCore is coNP-complete.

Proof. We will show that the complementary problem, i.e. checking that a given outcome is not in the CS-core, is NP-complete.
First, it is easy to see that this problem is in NP: we can guess a coalition S such that w(S) ≥ T but p(S) < 1; this coalition can successfully deviate from (CS, p). Now, to show that this problem is NP-hard, we construct a reduction from PARTITION as follows. Given an instance (a1, . . . , an; K) of PARTITION, we set I = {1, . . . , n, n + 1, n + 2} and wi = 2ai for i = 1, . . . , n. Define also I′ = {1, . . . , n}. The weights wn+1 and wn+2 and the threshold T are determined as follows. We construct a coalition S by adding agents 1, 2, . . . to it one by one until the weight of S is at least 2K.

If the weight of S is exactly 2K, this means that we have started with a "yes"-instance of PARTITION. In this case, we set wn+1 = wn+2 = 0, T = 2K, CS = {I}, and pi = wi/w(I) for all i ∈ I. It is easy to see that the outcome (CS, p) is not stable: the agents in S can deviate and increase their total payoff from 1/2 to 1. Hence, in this case we have mapped a "yes"-instance of PARTITION to a "no"-instance of InCSCore.

Now, suppose that w(S) > 2K. As all weights are even, we have w(S) = 2Q for some integer Q > K. Also, we have w(I′ \ S) = 4K − 2Q. Set T = 2Q, and let wn+1 = wn+2 = 2Q − 2K. Now we have w(I \ S) = 4K − 2Q + 4Q − 4K = 2Q, i.e., both S and I \ S are winning coalitions. Set CS = {S, I \ S}. Now, p is defined as follows: for all i ∈ I′ set pi = wi/T, set pn+1 = wn+1/(T + 1), and set pn+2 = 1 − p(I′ \ S) − pn+1. We have p(S) = w(S)/T = 1 and p(I \ S) = p(I′ \ S) + pn+1 + pn+2 = 1, so p is an imputation. Note also that we have pn+1 + pn+2 = 1 − p(I′ \ S) = 1 − w(I′ \ S)/T = (wn+1 + wn+2)/T. Moreover, we have pn+1 < wn+1/T and p(I′ \ S) = w(I′ \ S)/T, and hence pn+2 > wn+2/T.

We now show that if (a1, . . . , an; K) is a "yes"-instance of PARTITION, then ((I; w; T), CS, p) is a "no"-instance of InCSCore. Indeed, suppose there is a set J such that Σ_{i∈J} ai = K. Consider the coalition J′ = J ∪ {n + 1}. We have w(J′) = 2K + 2Q − 2K = 2Q, so it is a winning coalition. On the other hand, p(J′) = p(J) + pn+1 = w(J)/T + wn+1/(T + 1) < w(J′)/T = 1. Hence, J′ can benefit from deviating, i.e., (CS, p) is not in the CS-core.

On the other hand, suppose that ((I; w; T), CS, p) is a "no"-instance of InCSCore, i.e., there is a set J′ such that w(J′) ≥ T, p(J′) < 1. Suppose that w(J′) > T, i.e., w(J′) ≥ T + 1. We have pi ≥ wi/(T + 1) for all i ∈ I (indeed, we have pi ≥ wi/T for i ≠ n + 1 and pi = wi/(T + 1) for i = n + 1), so p(J′) ≥ w(J′)/(T + 1) ≥ 1, a contradiction. Hence, we have w(J′) = T. Moreover, if n + 1 ∉ J′, we have p(J′) ≥ w(J′)/T = 1, a contradiction again. Therefore, n + 1 ∈ J′. Finally, if n + 2 ∈ J′, we have p(J′) = p(J′ ∩ I′) + pn+1 + pn+2 = w(J′ ∩ I′)/T + (wn+1 + wn+2)/T = w(J′)/T = 1, also a contradiction. We conclude that w(J′) = T, n + 1 ∈ J′, n + 2 ∉ J′, and hence w(J′ ∩ I′) = 2Q − (2Q − 2K) = 2K, which means that Σ_{i∈J′∩I′} ai = K, i.e., J′ ∩ I′ is a witness that we have a "yes"-instance of PARTITION.
6 Algorithms for the CS-core
The hardness results presented in the previous section rely on all weights being given in binary. However, in practical applications it is often the case that the weights are not too large, or can be rounded down so that the weights of all agents are drawn from a small range of values. In such cases, we can assume that the weights are given in unary, or, alternatively, are at most polynomial in n. It is therefore natural to ask if our problems can be solved efficiently in such settings. It turns out that for InCSCore this is indeed the case.

Theorem 4. There exists a pseudopolynomial² algorithm A_InCsCore for InCSCore, i.e., an algorithm that correctly decides whether a given outcome (CS, p) is in the CS-core of a weighted voting game (I; w; T) and runs in time poly(n, w(I), |p|), where |p| is the number of bits in the binary representation of p.

² An algorithm whose running time is polynomial if all numbers in the input are given in unary is called pseudopolynomial.

Proof. The input to our algorithm is an instance of InCSCore, i.e., a weighted voting game G = (I; w; T), a coalition structure CS ∈ CS(G) and an imputation p ∈ I(CS). The outcome (CS, p) is not stable if and only if there exists a set S such that w(S) ≥ T but p(S) < 1. This means that our problem is essentially reducible to the classic KNAPSACK problem [7], which is known to have a pseudopolynomial-time algorithm based on dynamic programming. In what follows, we present this algorithm for completeness.

Let W = w(I). For j = 1, . . . , n and w = 1, . . . , W, let P(j, w) be the smallest total payoff of a coalition with total weight w all of whose members appear in {1, . . . , j}: P(j, w) = min{p(J) | J ⊆ {1, . . . , j}, w(J) = w}. Now, if min_{w=T,...,W} P(n, w) < 1, it means that there is a winning coalition whose total payoff is less than 1. Obviously, this coalition would like to deviate from (CS, p), i.e., in this case (CS, p) is not in the CS-core. Otherwise, the payoff to any winning coalition (not necessarily in CS) is at least 1, so no group wants to deviate from CS, and thus (CS, p) is in the CS-core.

It remains to show how to compute P(j, w) for all j = 1, . . . , n, w = 1, . . . , W. For j = 1, we have P(1, w) = p1 if w = w1 and P(1, w) = +∞ otherwise. Now, suppose we have computed P(j, w) for all w = 1, . . . , W. Then we can compute P(j + 1, w) as min{P(j, w), pj+1 + P(j, w − wj+1)}. The running time of this algorithm is polynomial in n, W and |p|, i.e., in the input size.
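A direct transcription of this dynamic program is sketched below, under the assumption that the imputation conditions p(C) = v(C) for C ∈ CS have been checked separately; names and the space-saving one-dimensional table are ours.

    def no_profitable_deviation(weights, T, p):
        """The Theorem 4 dynamic program as a 0/1-knapsack variant:
        P[w] is the smallest total payoff of a coalition of total weight
        exactly w.  The outcome is stable against deviations iff every
        weight w >= T has P[w] >= 1 (unreachable weights stay +inf and
        are vacuously stable).  Integer weights, T <= w(I) assumed.
        Runs in O(n * w(I)) time, i.e. pseudopolynomial."""
        W = sum(weights)
        INF = float("inf")
        P = [INF] * (W + 1)
        P[0] = 0.0                           # the empty coalition
        for wi, pi in zip(weights, p):
            for w in range(W, wi - 1, -1):   # downwards: use each agent once
                if P[w - wi] + pi < P[w]:
                    P[w] = P[w - wi] + pi
        return min(P[T:]) >= 1

For Example 1 above, no_profitable_deviation((1, 1, 2), 2, (0.5, 0.5, 1.0)) returns True, matching the brute-force check.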
We now show how to use the algorithm A_InCsCore to check whether, for a given coalition structure CS, there exists an imputation p such that the outcome (CS, p) is in the CS-core. Our algorithm for this problem also runs in pseudopolynomial time.

Theorem 5. There exists a pseudopolynomial algorithm A_p that, given a weighted voting game G = (I; w; T) and a coalition structure CS ∈ CS(G), correctly decides whether there exists an imputation p ∈ I(CS) such that the outcome (CS, p) is in the CS-core of G, and runs in time poly(n, w(I)).

Proof. Suppose CS = {C1, . . . , Ck}. Consider the following linear feasibility program (LFP) with variables p1, . . . , pn:

    pi ≥ 0                 for all i = 1, . . . , n
    Σ_{i∈Cj} pi = 1        for all j such that w(Cj) ≥ T
    Σ_{i∈Cj} pi = 0        for all j such that w(Cj) < T          (1)
    Σ_{i∈J} pi ≥ 1         for all J ⊆ I such that w(J) ≥ T
The first three groups of constraints require that p is an imputation for CS: all payments are non-negative, the sum of payments to the members of each winning coalition in CS is 1, and the sum of payments to the members of each losing coalition in CS is 0. The last group of constraints states that there is no profitable deviation: the payoff to each winning coalition (not necessarily in CS) is at least 1. Clearly, we can implement the algorithm A_p by solving this LFP, as follows. The size of this LFP may be exponential in n, as there is a constraint for each winning coalition. Nevertheless, it is well known that such LFPs can be solved in polynomial time by the ellipsoid method, provided that they have a polynomial-time separation oracle.
A separation oracle is an algorithm that, given an alleged feasible solution, checks whether it is indeed feasible and, if not, outputs a violated constraint [12]. In our case, such an oracle has to verify whether a given vector p violates one of the constraints in (1). It is straightforward to verify whether all pi are non-negative, whether the payment to each winning coalition in CS is 1, and whether the payment to each losing coalition in CS is 0. If any of these constraints is violated, our separation oracle outputs the violated constraint. If this is not the case, we can use the algorithm A_InCsCore described in the proof of Theorem 4 to decide whether there exists a winning coalition J such that w(J) ≥ T, p(J) < 1; this algorithm can easily be adapted to return such a coalition if one exists. If A_InCsCore produces such a coalition, our separation oracle outputs the corresponding violated constraint. If A_InCsCore reports that no such coalition exists, then (CS, p) is in the CS-core of G, so we can output p and stop.

The algorithm A_p described in the proof of Theorem 5 allows us to check whether a given weighted voting game G has a non-empty CS-core: we can enumerate all coalition structures in CS(G), and for each of them check whether there is an imputation p which, combined with the coalition structure under consideration, results in a stable outcome. However, the number of coalition structures in CS(G) is exponential in n, and solving a linear feasibility problem for each of them using the ellipsoid method is prohibitively expensive. We now describe heuristics that can be used to speed up this process.

First, observe that we can exclude from consideration coalition structures that contain more than one losing coalition. Indeed, if any such coalition structure is stable, the coalition structure obtained from it by merging all losing coalitions will also be stable. Moreover, we can assume that each winning coalition C in our coalition structure is minimal, i.e., if we delete any element from C, it becomes a losing coalition. The argument is similar to the previous case: if any coalition structure with a non-minimal coalition C is stable, the coalition structure obtained by moving the extraneous element from C to the (unique) losing coalition is also stable.

Now, suppose that we have a coalition structure CS = {C0, C1, . . . , Ck} such that v(C0) = 0 (C0 can be empty), v(Ci) = 1 for i = 1, . . . , k, and all Ci, i > 0, are minimal. Consider an agent j ∈ Ci, i > 0. If pj > 0 and w(C0) ≥ wj, then CS is not stable: the players in (C0 ∪ Ci) \ {j} can deviate by forming a winning coalition and redistributing the extra payoff of pj between themselves. Set Ci′ = {j ∈ Ci | wj ≤ w(C0)}. The argument above shows that the members of the sets Ci′ get paid 0 under any imputation p such that (CS, p) is stable. Now, set C′ = ∪_{i>0} Ci′. If w(C′) + w(C0) ≥ T, there is no imputation p such that (CS, p) is stable: any such imputation would have to pay 0 to the players in C0 and in each Ci′, but then the players in these sets can jointly deviate and form a winning coalition.

Therefore, we can speed up the algorithm in the proof of Theorem 5 as follows: given a coalition structure CS = {C0, C1, . . . , Ck}, compute the sets Ci′, i = 1, . . . , k, and check whether w(C′) + w(C0) ≥ T. If this is indeed the case, there is no imputation p such that (CS, p) is stable. Otherwise, run the algorithm A_p.
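The preprocessing filter admits a compact implementation. This is a sketch with our own naming; cs is assumed to be ordered with the (possibly empty) losing coalition first, followed by minimal winning coalitions.

    def rejects_structure(weights, T, cs):
        """Return True if the structure cs = [C0, C1, ..., Ck] can be
        discarded without solving the LFP: members of Ci whose weight
        fits inside C0 must be paid 0 in any stable outcome, and if
        those agents together with C0 can win on their own, no stable
        imputation exists for cs."""
        w = lambda S: sum(weights[i] for i in S)
        c0 = cs[0]
        zero_paid = [j for Ci in cs[1:] for j in Ci if weights[j] <= w(c0)]
        return w(zero_paid) + w(c0) >= T

A structure that passes this O(n) test still needs the LFP/ellipsoid check; one that fails it can be discarded immediately.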
Clearly, this preprocessing step is very fast (in particular, unlike A_p, it runs in polynomial time even if the weights are large, i.e., given in binary), and in many cases we will be able to reject a candidate coalition structure without having to solve the LFP (which is computationally expensive). We can also try to optimize the order in which we consider the candidate coalition structures. Heuristics for social-welfare-maximizing coalition structure generation might be of use here [10, 9].
7 Conclusions
In this paper, we extended the model of weighted voting games (WVGs) to allow for the formation of coalition structures, thus permitting more than one coalition to be winning at the same time. We then studied the problem of stability of the resulting structure in such games. Specifically, we introduced the CS-core (the core with coalition structures), and discussed its properties by relating it to the traditional concept of the core for WVGs and proving sufficient conditions for its non-emptiness. Following that, we showed that deciding CS-core non-emptiness or checking whether an outcome is in the CS-core are computationally hard problems (unlike what holds in the traditional WVG setting). However, for specific classes of games, we presented polynomial-time algorithms for checking if a given outcome is in the CS-core, and for discovering a CS-core element given a coalition structure. We then suggested heuristics that, combined with these algorithms, can be used to generate an outcome in the CS-core.

We believe that the line of work presented here is important: weighted voting games are well understood, and the addition of coalition structures increases the usability of this intuitive framework in multiagent settings (where weights can represent resources and thresholds do not necessarily exceed 50%). In terms of future work, we intend, first of all, to come up with new heuristics to speed up our algorithms. In addition, notice that the algorithms and heuristics of Sec. 6 provide essentially centralized solutions to their respective problems. Therefore, we are interested in studying decentralized approaches; to begin, we intend to speed up, in the WVG context, the exponential decentralized coalition formation algorithm of [5]. Finally, studying other solution concepts in this context, such as the Shapley value [8], is also within our intentions.

Acknowledgements. This research was undertaken as part of the ALADDIN (Autonomous Learning Agents for Decentralised Data and Information Networks) project. ALADDIN is jointly funded by a BAE Systems and EPSRC strategic partnership (EP/C548051/1).
REFERENCES
[1] K. Apt and T. Radzik, Stable Partitions in Coalitional Games, 2006. Working Paper, available at http://arxiv.org/abs/cs.GT/0605132.
[2] R.J. Aumann and J.H. Dreze, 'Cooperative Games with Coalition Structures', International Journal of Game Theory, 3(4), 217–237, (1974).
[3] P. Caillou, S. Aknine, and S. Pinson, 'Multi-agent models for searching Pareto optimal solutions to the problem of forming and dynamic restructuring of coalitions', in Proc. of ECAI'02, pp. 13–17, (2002).
[4] G. Chalkiadakis and C. Boutilier, 'Bayesian Reinforcement Learning for Coalition Formation Under Uncertainty', in Proc. of AAMAS'04.
[5] T. Dieckmann and U. Schwalbe, Dynamic Coalition Formation and the Core, 1998. Economics Dept. Working Paper Series, National University of Ireland, Maynooth.
[6] E. Elkind, L.A. Goldberg, P.W. Goldberg, and M. Wooldridge, 'Computational complexity of weighted threshold games', in Proc. of AAAI'07.
[7] M. Garey and D. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, 1990.
[8] R. Myerson, Game Theory: Analysis of Conflict, 1991.
[9] T. Rahwan, S. Ramchurn, A. Giovannucci, V. Dang, and N. R. Jennings, 'Anytime optimal coalition structure generation', in Proc. of AAAI'07.
[10] T. Sandholm, K. Larson, M. Andersson, O. Shehory, and F. Tohme, 'Anytime coalition structure generation with worst case guarantees', in Proc. of AAAI'98, (1998).
[11] T. Sandholm and V.R. Lesser, 'Coalitions Among Computationally Bounded Agents', Artificial Intelligence, 94(1), 99–137, (1997).
[12] A. Schrijver, Combinatorial Optimization: Polyhedra and Efficiency, Springer, 2003.
[13] A. Taylor and W. Zwicker, Simple Games: Desirability Relations, Trading, Pseudoweightings, Princeton University Press, Princeton, 1999.
[14] J. von Neumann and O. Morgenstern, Theory of Games and Economic Behavior, Princeton University Press, Princeton, 1944.
Agents Preferences in Decentralized Task Allocation
Mark Hoogendoorn 1,2 and Maria L. Gini 2

Abstract. The ability to express preferences for specific tasks in multi-agent auctions is an important element for potential users who are considering using such auctioning systems. This paper presents an approach to make such preferences explicit and to use these preferences in bids for reverse combinatorial auctions. Three different types of preference are considered: (1) preferences for particular durations of tasks, (2) preferences for certain time points, and (3) preferences for specific types of tasks. We study empirically the trade-offs between the quality of the solutions obtained and the use of preferences in the bidding process, focusing on effects such as increased execution time. We use both synthetic data as well as real data from a logistics company.
1 Introduction

Auctions are used in multi-agent systems, among other things, to perform allocation of tasks (see e.g. [13] and [14]). Such reverse auctions, where the buyer is the auctioneer, can be of a combinatorial type, allowing for bidding on bundles of tasks. Sandholm [12] notes that reverse auctions are not economically efficient because optimal bundling depends on suppliers' preferences, which traditionally cannot be expressed. Enabling the agents to express the preferences of their users is an important requirement for actual companies and people to use agents for bidding. In this paper we propose a concrete preference function to be used by an agent to express preferences over tasks. This function expresses preferences for specific properties of tasks, and it is used in a decentralized task allocation setting. We introduce a bidding algorithm, where an agent bids on its most preferred tasks that are feasible given its current commitments. This algorithm uses a pricing mechanism which depends on the actual cost to perform the tasks and on the preference for the task. The influence of preferences on the price can be varied by setting a parameter (see the role of the parameter p in the algorithm in Section 3.5). Using this algorithm, we investigate the impact of preferences upon other aspects of task execution, such as execution time. We use both synthetic as well as real data from a logistics company.

This paper is organized as follows. First, the auctioning system used throughout the paper is introduced in Section 2. Section 3 introduces a function to express preferences and a bidding algorithm based upon such preferences. Experiments to evaluate the bidding algorithm and to study the trade-off between preferences and efficiency of task execution are presented in Section 4. Section 5 discusses related work, and finally, Section 6 concludes the paper.
1 Vrije Universiteit Amsterdam, Department of Artificial Intelligence, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands, email: mhoogen@cs.vu.nl
2 University of Minnesota, Department of Computer Science and Engineering, 200 Union Street SE, Minneapolis, MN 55455, United States of America, email: gini@cs.umn.edu
2 The MAGNET System

The approach we present exploits some unique features of the MAGNET [4] system, which allows autonomous agents to negotiate over coordinated tasks with precedence and time constraints. The MAGNET system consists of: (1) a customer agent, which puts tasks up for auction; the tasks have time constraints and other restrictions; (2) supplier agents, which bid on the tasks and execute them if awarded; and (3) the MAGNET market server, which keeps track of the activities of the agents and of the auctions. The main interactions between agents in the MAGNET system are as follows:

• A customer agent issues a Request for Quotes (RFQ) which specifies the tasks, their precedence relations, and a time line for the bidding process. For each task, a time window specifies the earliest time the task can start and the latest time the task can end.
• Supplier agents submit bids. A bid includes one or more tasks, a price, the portion of the price to be paid as a non-refundable deposit, and the estimated duration and time window for task execution. Bids reflect supplier resource availability and constrain the customer's scheduling process.
• The customer agent decides which bids to accept. Each task needs to be mapped to one bid, and the constraints of all awarded bids must be satisfied in the final schedule. In MAGNET the customer can choose from a collection of winner-determination algorithms (A*, IDA* [2], simulated annealing, and integer programming [3]).
• The customer agent awards bids and specifies the work schedule.
3 Preference Algorithm

In the bidding algorithm we propose, price is used as a mechanism to express preferences for tasks. Preferences in our case can be a combination of the following: (1) a preference for tasks of a particular duration (e.g. I hate performing very short tasks), (2) a preference for tasks at particular times during the day (e.g. I love getting up early in the morning, so give me tasks that ought to start early in the morning), and (3) a preference for particular types of tasks (e.g. I really hate to perform a task like that). We show how to express these preferences and how to combine them. The preference for a task is referred to as φ_task, which we express using a real number in the interval [0, 1]. Hereby, 1/2 indicates a neutral preference, 0 is not preferred, and 1 is fully preferred. Since humans typically do not think in terms of a number when specifying preferences, we provide for each of the preference types covered a more intuitive formulation, as explained next. The specifics of how preferences are computed could be adapted for different domains, while keeping the overall approach.
3.1 Preferences for Duration

Let the preference to perform tasks of a certain duration be an integer. Such an integer can indicate either a minimum or a maximum duration (i.e. d_min, d_max). Let d_min be the minimum duration you want a task to last, i.e. you want the task to last longer than d_min. Durations below d_min are not preferred. If the duration is precisely d_min your preference is 1/2, i.e. neutral. Let d_close be an integer that indicates how much longer than d_min you want the task to last for it to be fully preferred. Tasks with duration in the range [d_min, d_min + d_close] are more preferred than neutral, but not fully preferred. Any duration longer than d_min + d_close is fully preferred. Then the preference φ_duration of a task with duration d_task can be calculated as follows:

• if there is a preference for minimum duration d_min:
  d_task ≥ d_min: φ_duration,task = 1/2 + min(1/2 × (d_task − d_min)/d_close, 1/2)
  d_task < d_min: φ_duration,task = max(d_task/d_min − 1/2, 0)
• if there is a preference for maximum duration d_max:
  d_task ≤ d_max: φ_duration,task = 1/2 + min(1/2 × (d_max − d_task)/d_close, 1/2)
  d_task > d_max: φ_duration,task = max(d_max/d_task − 1/2, 0)

3.2 Preferences for Time Points

Let the preference for particular time points be indicated by a time of the day (e.g. 6.30 a.m.). Such a preference can indicate that the time needs to be before a particular time point t_before, or after a time point t_after. Let t_close indicate a time which is considered close to a particular time point. Again, the preference for a task which is precisely at the specified time point t_before or t_after is 1/2, i.e. neutral. The preference for a given start time t_task can now be calculated as follows (note that for calculations using time points these are represented in seconds of the day):

• if a preference has been set for a task time before t_before:
  t_task ≤ t_before: φ_time,task = 1/2 + min(1/2 × (t_before − t_task)/t_close, 1/2)
  t_task > t_before: φ_time,task = max(t_before/t_task − 1/2, 0)
• if a preference has been set for a task time after t_after:
  t_task ≥ t_after: φ_time,task = 1/2 + min(1/2 × (t_task − t_after)/t_close, 1/2)
  t_task < t_after: φ_time,task = max(t_task/t_after − 1/2, 0)

3.3 Preferences for Tasks

The last way to express preferences is for particular types of tasks. Let type_task be the type of a given task. The type of a task is specified by means of a certain range of integers, whereby integers are ordered based upon similarity of the tasks. For example, if the tasks are represented on the interval [0, 100], then the task identified with 1 is completely different from the task identified with 100, but has great similarity with the task identified with 2. Let the preferred tasks include a certain range of tasks [type_lower, type_upper]. Furthermore, let type_close be an integer that expresses when a task is close to another task. The preference is calculated as follows:

• if (type_lower ≤ type_task) ∧ (type_upper ≥ type_task): φ_type,task = 1
• if type_lower > type_task: φ_type,task = max(type_close/(type_lower − type_task), 0)
• if type_upper < type_task: φ_type,task = max(type_close/(type_task − type_upper), 0)

3.4 Combining Preferences

The preferences specified above are usually combined. We use a weighted sum of the preferences, setting the weight to 0 if a preference is not expressed:

φ_task = w_duration × φ_duration,task + w_time × φ_time,task + w_type × φ_type,task, where w_duration + w_time + w_type = 1
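To make the preference functions and their combination concrete, here is a minimal sketch in Python (our own illustration; the paper gives formulas, not code, and all identifiers below are ours). It implements the minimum-duration and before-time variants; the maximum-duration and after-time variants are symmetric:

```python
def pref_duration(d_task, d_min, d_close):
    """Minimum-duration preference: 0.5 (neutral) at d_min, 1.0 beyond
    d_min + d_close, decaying towards 0 for durations below d_min."""
    if d_task >= d_min:
        return 0.5 + min(0.5 * (d_task - d_min) / d_close, 0.5)
    return max(d_task / d_min - 0.5, 0.0)

def pref_time(t_task, t_before, t_close):
    """Before-time preference; time points are in seconds of the day."""
    if t_task <= t_before:
        return 0.5 + min(0.5 * (t_before - t_task) / t_close, 0.5)
    return max(t_before / t_task - 0.5, 0.0)

def pref_type(type_task, type_lower, type_upper, type_close):
    """Task-type preference over an integer similarity scale."""
    if type_lower <= type_task <= type_upper:
        return 1.0
    gap = type_lower - type_task if type_task < type_lower else type_task - type_upper
    return max(type_close / gap, 0.0)

def phi_task(prefs, weights):
    """Weighted combination; unexpressed preferences get weight 0 and
    the weights sum to 1."""
    return sum(w * p for w, p in zip(weights, prefs))

# A 90-minute task starting at 6:00 (21600 s) with task type 40:
prefs = [pref_duration(90, d_min=60, d_close=60),
         pref_time(21600, t_before=25200, t_close=7200),
         pref_type(40, type_lower=30, type_upper=50, type_close=10)]
print(phi_task(prefs, [0.4, 0.3, 0.3]))  # 0.825
```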
3.5 Bidding Algorithm with Preference for Tasks

We assume that the supplier agent owns a single resource with a particular capability (with which, of course, a number of different task types can be performed, as explained earlier). Furthermore, the resource has an availability slot (i.e. a begin and end time), a particular type_begin with which the resource is initially set up, and a type_end at which the use of the resource needs to end. The supplier agent maintains a schedule of the tasks planned for its resource.

We now present a bidding algorithm that takes the preference values φ_task into account. The algorithm is greedy: supplier agents try to bid upon as many tasks as feasible to maximize the usage of their resource. The algorithm uses a parameter, p, to vary the influence of the preference upon the eventual price bid. The tasks within an RFQ are first ordered based upon their preference. If some tasks have identical preferences, they are ordered according to the earliest start time specified in the RFQ for the tasks included. We assume that there exists a function switch_time: TASKTYPE × TASKTYPE → DURATION that calculates the switching time from one task type to another (when it can be performed on the resource). Furthermore, performance_time: TASKTYPE → DURATION expresses the time needed to perform the task.

Bidding Algorithm. Let latest_end_time_previous be the latest end time of the previous task in the current schedule of the resource (or the schedule start time in case no such task exists), and type_previous the type of the previous task (or the start type in case of no prior task). Let latest_start_time_next be the latest start time of the next task (or the schedule end time in case no such task exists), and type_next the type of the next task (or the end type in case no such task exists).

For each preference-ordered task:
  Check if the task (current) can be done using the resource. If yes, see if it fits in the current schedule (see below).
  From the beginning of the schedule, for each empty slot in the schedule do:
    if the task fits in the current empty slot in the schedule, then insert the task in the bid, add its time parameters to the schedule, and compute the price of the bid (see below);
    else if latest_end_time_current > latest_end_time_next, then continue with the next slot;
    else continue with the next task.

To see if the task fits in the schedule, check whether the following holds:

[(latest_end_time_previous + switch_time(type_previous, type_current)) ≤ latest_start_time_current] ∧
[(latest_start_time_next − switch_time(type_current, type_next) − performance_time(type_current)) ≥ earliest_start_time_current] ∧
[(latest_end_time_previous + switch_time(type_previous, type_current)) ≤ (latest_start_time_next − switch_time(type_current, type_next) − performance_time(type_current))]

The price of the bid is computed as follows (note the parameter p):

price_task = (1 + (p × (1 − φ_task))) × [switch_time(type_previous, type_current) + switch_time(type_current, type_next) + performance_time(type_current)]

We have shown earlier how to calculate the value of φ_task for the different types of preferences. This price equation assumes a certain standard price for each minute of time spent. In case these costs vary, the cost per minute can be included as an additional parameter.
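The slot-fit test and the price computation above translate into a short sketch (again our own illustration with hypothetical types; the MAGNET implementation itself is not shown in the paper):

```python
from dataclasses import dataclass

@dataclass
class Slot:
    """Minimal stand-in for a task or a schedule boundary."""
    type: int
    earliest_start: float = 0.0
    latest_start: float = 0.0
    latest_end: float = 0.0

def fits(task, prev, nxt, switch_time, performance_time):
    """The three conjuncts of the fit condition above."""
    ready = prev.latest_end + switch_time(prev.type, task.type)
    must_leave = (nxt.latest_start - switch_time(task.type, nxt.type)
                  - performance_time(task.type))
    return (ready <= task.latest_start
            and must_leave >= task.earliest_start
            and ready <= must_leave)

def bid_price(task, prev, nxt, phi, p, switch_time, performance_time):
    """price_task = (1 + p*(1 - phi)) * time occupied by the task; a
    per-minute cost factor could multiply the result if costs vary."""
    occupied = (switch_time(prev.type, task.type)
                + switch_time(task.type, nxt.type)
                + performance_time(task.type))
    return (1 + p * (1 - phi)) * occupied
```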
4 Experimental Setup

We now describe the effect of adjusting the parameter p in the bidding algorithm defined above. Furthermore, we study the effect of the preferences on the duration of task execution, which is an indicator of how efficiently the tasks are being performed. Together these form the utility function of the suppliers. Of course it is expected that having more preferences awarded will result in a less efficient execution. We are interested in assessing the severity of these effects. We performed experiments using synthetic data, and experiments using a real dataset obtained from a trucking company.

4.1 Experimental Setup with Synthetic Data

We start by describing the parameters in the setup with synthetic data, and specify the actual settings used. There are many parameters that can influence the results. Many of them influence the difficulty of the task allocation problem in general. These include:

1. The number of tasks to be allocated.
2. The number of resources available.
3. The ratio between the resources required to perform the tasks and the availability of those resources (e.g. one resource might be more scarce than another). This also includes the specification of duration of tasks, switching time, and initial resource settings.
4. The tightness of the time windows specified in the tasks. Wider time windows allow more flexible scheduling of tasks, therefore finding a solution is easier.

The preference value itself is influenced by other parameters, including the following:

1. The parameter setting for the preference functions (e.g. what is considered to be a close-by task; the stricter this norm is, the more easily preferences can be met).
2. The variation of tasks that exist (i.e. more variation means that it will be easier to get your preferences met).

Finally, other parameter settings can be varied, such as the number of iterations, and the value of the parameter p, which is used in the bidding algorithm to determine the influence of preferences on price.

4.1.1 Parameter Settings Used

We set the parameters of the preference functions and the variety of tasks to fixed values. This means that the preference function itself remains constant over time, so that the influence of the parameter p is the only variation regarding the preference function.

We used several variations of the difficulty of the task allocation throughout the experiments. In particular we considered a market where more than sufficient resources are available (overflow) versus a market where resources are insufficient (shortage). Furthermore, the tightness of the time windows was varied by either setting them very tight or setting them wide. More precisely, the following parameters have been used to affect the tightness of tasks:

1. The number of tasks was fixed to 10.
2. The number of resources available varied between 12 (tight market) and 50 (plenty of resources available).
3. The ratio between the required resources to perform tasks and the availability of those resources was fixed. We had three types of resources, each generated with an equal probability. The number of different tasks per resource was set to 9999. The maximum time to change from one task to another was set to 100 minutes. Task types were generated in a random fashion with an equal probability as well.
4. The tightness of the time windows specified in the tasks was varied between just sufficient time to perform the task and twice the time needed plus two full hours.

The parameter settings for the preferences are set so that initially the preference for tasks is around 60%, equally divided over the different preferences. Each of the agents is assigned one type of preference at random. The parameter p varied between 0 and 5.

4.1.2 Results

[Figure 1. Preferences met for varying values of p]
Figure 1 shows the average preferences for tasks with varying p for the different market types and time window settings. As can be seen, the easiest way to get the preferences met is the overflow market with wide time windows. The most difficult is the shortage market with tight time windows. The curves of the shortage market are less steep compared to the overflow market. The influence of the time windows on the average preference value is that the curve is basically lower by a certain constant value. The shape of the curve does not change for varying time window settings (i.e. in both the shortage market and
the overflow market, the shape of the curve is the same for narrow and wide time windows).

[Figure 2. Preferences met versus increase in duration]

Figure 2 shows the average preference for tasks on the x-axis and the increase in the average duration to perform the tasks on the y-axis. This clearly shows the trade-off between preferences awarded and the efficiency of task execution. All curves look similar (an x^n type shape) except for the point where the huge increase starts, which varies for the different types of markets. The only exception is the curve of the overflow market with narrow time windows. In this case the results are less stable compared to the other results. The curve with the lowest preference value, after which a steep increase is observed, is the one in the shortage market with narrow time windows. This makes sense because there is hardly any room for allocating tasks to other agents. The curve with the highest point is the overflow market with wide time windows, in which there is plenty of space to express preferences and get them awarded.

4.2 Trucking Data

Besides synthetic data, we tested our approach using a real company dataset from the trucking domain. The dataset consists of a number of container transports that need to take place. Tasks require a certain transportation from one zip code (the pickup location) to an intermediate location (the delivery location), ending at a third location (the return location). Therefore, a task description does not consist of one integer specifying the task, as before, but of three integers. Furthermore, each task is associated with a certain early start time and a particular deadline at which the container needs to be returned at the return location. In addition to the containers that require transportation, the dataset also specifies which trucks are available. These can carry one container at a time (so only one type of resource is available), and have a certain availability slot specifying when the truck becomes available and when the truck needs to be returned. A location is also specified where the truck starts, and where it has to end. This nicely maps to the algorithm specified. The performance time is now defined as the time to go from the pickup to the delivery location, plus the time to go from the delivery location to the return location. The switching time is no longer an artificial time, but the actual driving time from one zip code to another. The only artificial data we have generated are the preferences of the various trucks. This is done according to the method mentioned for the synthetic data. Finally, the preference for type of tasks is the average of the three different integers included in the task description (i.e. pickup, delivery, and return location).

4.2.1 Results

[Figure 3. Preferences met for trucking dataset, for varying values of p]

Figure 3 shows how the value of p affects the average preference for tasks. It can be seen that the value of p required to increase the average preference significantly is much lower than for the random dataset. Furthermore, the limit seems to be comparable with the overflow market with wide time windows.

[Figure 4. Average task preference versus duration of performing the tasks]

Figure 4 shows the preference value versus the average duration increase, i.e. the trade-off between preferences met and the efficiency of execution. It can be seen that there is hardly any correlation between the average preference value of the trucks and the average increase in duration. This is of course very good news for the trucking company, because it means they can award drivers their preferences without increasing the total driving time. This is assuming that preferences are equally divided amongst the truckers, as in the experimental setup.
5 Related Work

In the field of combinatorial auctions, a lot of attention has been devoted to finding out the exact preference for particular bundles of tasks (see e.g. [5] and [11]). In general a certain preference for each of the bundles is assumed, but no detail is given on how the bidder arrives at such a preference value. In this paper we introduce a preference function that allows for a more intuitive specification of preferences, taking multiple aspects of the tasks into account. Preferences for different aspects of the tasks are combined using a weighted average to produce a single preference value. In research on preference elicitation, typically the impact on selling is addressed, but not the precise influence of preferences upon the quality of the solution. In this paper, we show how the allocation of tasks in a decentralized fashion directly influences the quality of the solution, and we explore the relationship between the average preference of tasks and the solution quality.

In [7] an approach for scheduling a meeting between agents is proposed, which takes into account the preferences of the agents. The relationship between such preferences and the quality of the solution is addressed, but the problem is not studied from the perspective of combinatorial auctions. Task allocation can also be performed from a centralized perspective, using preferences as soft constraints. See, for example, [9] for an approach to consider preferences in decision making. There are decentralized variants of constraint optimization, but the agents in our case are not necessarily cooperative. In the field of planning and scheduling, preferences have been considered as well. Languages have been developed that allow for the specification of preferences and soft constraints (see e.g. [8]).

The logistics domain we use for our experiments has been researched for quite some time (see e.g. [10]), mainly focusing on calculating optimal solutions from a centralized perspective. For instance, in [6] the problem addressed is to find optimal routes for transportation orders of a large set of users. Orders have to be picked up and delivered at specific locations, within a given time window, and using a limited number of trucks. The solution proposed is centralized, and it is used to support a human dispatcher. The current trend in logistics requires an even more distributed setting because of the use of fourth party logistics (4PL) [1]. 4PL companies sign contracts with large companies to arrange their entire transportation demand. These companies, however, do not have sufficient resources of their own to arrange all these transports and therefore distribute many of those tasks to other (partner) companies. Centralized calculation might no longer be feasible due to lack of complete information (availability of resources is too sensitive for a company to communicate) as well as the complexity of calculating an optimal solution within a short period (time is crucial in this business).
6 Conclusions

We have presented an approach to specify preferences for tasks in a combinatorial auction setting. Allowing users to specify such preferences is essential for them to use auctions and to increase the economic efficiency of reverse auctions, as reported, for instance, in [12]. We propose a preference function and use it in a bidding algorithm where bids on non-preferred tasks have a higher price.

We evaluated our approach in two ways, first by rigorously testing it with synthetic data. Several parameters have been varied, namely the tightness of the time windows within a certain schedule and the relative availability of resources. It was shown that it was easiest to get preferences awarded in markets with wide time windows. The trade-off between meeting preferences and overall execution time has been studied in depth. We have shown that the overall execution time is influenced most in the case of the overflow market, due to the fact that in the shortage market there are hardly any alternatives at hand and therefore, although the agent might not prefer a task, it will still get its bid awarded. The curves observed tend to have the same shape when the time window setting changes but the market type remains the same. For different market types, the curves vary in steepness.

Besides testing with synthetic data, we have also used a real company dataset from the trucking domain. We have shown that the bidding algorithm is effective in awarding suppliers more preferred tasks. The influence of this preference on the overall solution quality was not observed using the real dataset. Hence, in this setting the preferences being met have much less influence on the efficiency of the solution found.

For future work, it would be interesting to find out whether other real datasets would show the same results as the dataset used in this paper. Furthermore, exploring how well companies can express their preferences using these functions would be interesting as well.

Acknowledgments: Partial support is gratefully acknowledged from NSF under grant IIS-0414466.
REFERENCES
[1] P. Briggs. The hand-off: the future of outsourced logistics may be found in the latest buzzword [fourth party logistics]. Canadian Transportation Logistics, 102(5):18, 1999.
[2] J. Collins, G. Demir, and M. Gini. Bidtree ordering in IDA* combinatorial auction winner-determination with side constraints. In J. Padget, O. Shehory, D. Parkes, N. Sadeh, and W. Walsh, editors, Agent Mediated Electronic Commerce IV, volume LNAI 2531, pages 17–33. Springer-Verlag, 2002.
[3] J. Collins and M. Gini. An integer programming formulation of the bid evaluation problem for coordinated tasks. In B. Dietrich and R. V. Vohra, editors, Mathematics of the Internet: E-Auction and Markets, volume 127 of IMA Volumes in Mathematics and its Applications, pages 59–74. Springer-Verlag, New York, 2001.
[4] J. Collins, W. Ketter, and M. Gini. A multi-agent negotiation testbed for contracting tasks with temporal and precedence constraints. Int'l Journal of Electronic Commerce, 7(1):35–57, 2002.
[5] W. Conen and T. Sandholm. Preference elicitation in combinatorial auctions. In Proc. First Int'l Conf. on Autonomous Agents and Multi-Agent Systems, volume 1, pages 168–169, Bologna, Italy, July 2002.
[6] K. Dorer and M. Calisti. An adaptive solution to dynamic transport optimization. In Proc. Fourth Int'l Conf. on Autonomous Agents and Multi-Agent Systems, pages 45–51, 2005.
[7] M. Franzin, E. Freuder, F. Rossi, and R. Wallace. Multi-agent meeting scheduling with preferences: efficiency, privacy loss, and solution quality. In Proc. of AAAI Workshop on Preference in AI and CP, 2002.
[8] A. Gerevini and D. Long. Preferences and soft constraints in PDDL3. In Proc. ICAPS Workshop on Planning with Preferences and Soft Constraints, 2006.
[9] R. L. Keeney and H. Raiffa. Decisions with Multiple Objectives: Preferences and Value Tradeoffs. Wiley, 1976.
[10] T. Magnanti. Combinatorial optimization and vehicle fleet planning: Perspectives and prospects. Networks, 11:179–214, 1981.
[11] D. C. Parkes. Auction design with costly preference elicitation. Annals of Mathematics and Artificial Intelligence, 44:269–302, 2005.
[12] T. Sandholm. Expressive commerce and its application to sourcing: How we conducted $35 billion of generalized combinatorial auctions. AI Magazine, 28(3):45–58, Fall 2007.
[13] R. G. Smith. The contract net protocol: High level communication and control in a distributed problem solver. IEEE Trans. Computers, 29(12):1104–1113, December 1980.
[14] W. Walsh and M. Wellman. A market protocol for decentralized task allocation and scheduling with hierarchical dependencies. In Proc. of 3rd Int'l Conf. on Multi-Agent Systems, 1998.
Game Theoretical Insights in Strategic Patrolling: Model and Algorithm in Normal-Form
Nicola Gatti
Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy, ngatti@elet.polimi.it

Abstract. In the artificial intelligence literature there is a rising interest in studying strategic interaction situations. In these situations a number of rational agents act strategically, being in competition, and their analysis is carried out by employing game theoretical tools. One of the most challenging strategic interaction situations is strategic patrolling: a guard patrols a number of houses in the attempt to catch a robber, which, in its turn, chooses a house to rob in the attempt not to be caught by the guard. Our contribution in this paper is twofold. Firstly, we provide a critique of the models presented in the literature and we propose a model that is game theoretically satisfactory. Secondly, we exploit the game theoretical analysis to design a solving algorithm more efficient than state-of-the-art ones.
1 Introduction
The study of strategic interaction situations, commonly named non-cooperative games, has been receiving more and more attention in the artificial intelligence literature [7]. For instance, the problem of automating agents in negotiations [4] and in auctions [7] is usually modeled as a strategic interaction problem. Commonly, strategic interaction situations are tackled by employing game theoretical tools [3], in which one distinguishes the mechanism (i.e., the rules according to which agents interact) from the strategies (i.e., the behaviors of the agents in the game). Given a mechanism, rational agents should behave in order to maximize their revenue. An interesting open strategic interaction problem is strategic patrolling [8, 9]. This problem is characterized by a guard that decides which houses to patrol and how often, and by a robber that decides which house to strike. Obviously, the guard will not know in advance exactly where the robber will choose to strike. Moreover, the guard does not know with certainty what adversary it is facing. A common approach for choosing a strategy for agents in such a scenario is to model the scenario as a Bayesian game [3]. A Bayesian game is a game in which agents may belong to one or more types; the type of an agent determines its payoffs. The probability distribution over agents' types is common knowledge. The appropriate solution concept for these games is the Bayes-Nash equilibrium [3]. In [8] the authors propose a model for strategic patrolling and an algorithm to solve it. Specifically, they model the situation as a Bayesian game. The guard's actions are all the possible routes of houses, while the robber's action is the choice of a single house to rob. The robber can be of several types with a given probability distribution. Moreover, the robber can observe the actions undertaken by the guard and choose its optimal action on the basis of this observation. We show in this paper that the model proposed in [8] is not game
theoretically satisfactory. Indeed, we show that such a model does not effectively capture the possibility available to the robber to observe the actions undertaken by the guard. A further issue raised by [8] concerns the time required to compute solutions. Although the algorithm proposed by the authors finds solutions that are computationally less hard than Bayes-Nash, the computation of a solution is not affordable even in very simple settings. This paper provides two original contributions. The first contribution concerns the design of a strategic interaction model for strategic patrolling that is game theoretically satisfactory. Precisely, we provide a critique of the model presented in [8], showing why it is not satisfactory in real-world settings. Subsequently, we provide a satisfactory Bayesian game model. The second contribution concerns the design of an efficient solving algorithm. The algorithmic game theory literature provides a number of off-the-shelf algorithms able to solve a large class of games [7]. However, these algorithms have exponential complexity in the worst case and cannot address real-world settings. The exploitation of game theoretical analysis can lead to improving the efficiency of the solving algorithms and therefore to addressing real-world problems. This approach, although very preliminary, has been successfully followed in [2, 4], where the authors provide efficient algorithms for bargaining situations. The contribution of game theoretical analysis in the design of efficient algorithms can be twofold. Firstly, game theoretical analysis can be employed to reduce the space of search, e.g. by excluding all the strategy profiles that can be assured not to be of equilibrium independently of the parameters of the game. Secondly, it can be employed to "guide" the searching algorithm, e.g. by choosing specific orders over the strategy profiles according to which the algorithm searches for the equilibrium [10]. In this paper we exploit game theoretical analysis to the limited extent of the first issue: the reduction of the space of search. We propose an algorithm much more efficient than off-the-shelf ones, its space of search being dramatically reduced with respect to the one considered by these algorithms. However, the space of search of the proposed algorithm grows exponentially in the size of the problem, and therefore the algorithm needs to be improved by considering also the second issue: the exploitation of information to efficiently guide the search. This second issue will be considered in future work. This paper is structured as follows. The next section reviews the strategic patrolling model presented in [8] and provides a critique of it. Section 3 proposes a satisfactory game model for the considered situation. Section 4 provides some game theoretical insights concerning the proposed model and Section 5 exploits these to design a solving algorithm. Section 6 closes the paper.
2 Basic Strategic Patrolling Model and Critique
We briefly review the model proposed in [8]. The strategic situation to be considered is constituted by m houses, denoted by 1, . . . , m, and two agents: a guard, denoted by g, and a robber, denoted by r. Essentially, g chooses a patrolling strategy, i.e. a route of houses, in the attempt to catch r, which, in its turn, chooses the house to rob in the attempt not to be caught by g. For the sake of simplicity, the following assumptions are commonly made:

• time is discretized in turns;
• g takes one turn to patrol one house, independently of the patrolled house;
• r takes d turns to rob one house, independently of the robbed house;
• the time needed by g to move between two houses is negligible.

Agents act simultaneously and their available actions are:

g: it can choose a route of d houses to patrol, e.g. 1, 2, 3, . . .;
r: it can choose one house to rob, e.g. 1.

Possible outcomes are the following: if the house chosen by r is within the route chosen by g, then g catches r; otherwise r robs the house. Players' preferences over the outcomes are expressed by the following payoffs:

g: it assigns the outcome wherein r is caught an evaluation x^0 and assigns each outcome wherein house i is robbed an evaluation x^i. If g catches r, then g's payoff is x^0, otherwise its payoff is x^i where i is the robbed house. Customarily, it is assumed that x^0 > max{x^i} with i > 0;
r: it assigns each house i an evaluation y^i and assigns its being caught an evaluation y^0. If r is caught by g, then r's payoff is y^0, otherwise its payoff is y^i where i is the robbed house. Customarily, it is assumed that y^0 < min{y^i} with i > 0.

Finally, it is assumed that g's preferences are common knowledge, while r's are not. Precisely, it is assumed that r can be of n types with a given probability distribution. We denote type i of r by r_i. According to Harsanyi, such a game is cast into an imperfect-information game wherein nature, denoted by N, initially chooses the type of r and g does not perfectly know which game it is playing [3]. An example with m = 2, n = 2, and d = 1 is depicted in Fig. 1.
[Figure 1. Game tree with two houses, denoted by 1 and 2, and with d = 1.]
The appropriate solution concept for a game such as the one we are dealing with is the Bayes-Nash equilibrium [3]. It prescribes one strategy σ_g* for g and one, generally different, strategy σ_{r_i}* for each r_i. The peculiarity of this solution concept is that g maximizes its
expected payoff according to its beliefs, i.e. the probability distribution over r's types. It can be shown – we omit the pertinent proof for reasons of space – that agents' equilibrium strategies prescribe that g randomizes over all the possible routes wherein houses are patrolled only one time, e.g. with d = 2 all routes i, j such that i ≠ j. It can also be shown that, in order for a strategy profile to be an equilibrium, at least one of r's types must randomize. The above model is satisfactory when g and r act simultaneously. However, in real-world applications it is unreasonable to assume that r always acts at the turn where g starts to patrol. This is essentially due to two reasons. Firstly, g cannot synchronize the beginning of its patrolling route with r's action, since g cannot observe r. Secondly, r could wait for one or more turns before choosing the house to rob in order to observe g's strategy and take advantage of this observation. Thus, there is a discrepancy between the situation captured by the above model, i.e. r cannot do anything but choose the house to rob, and the real-world situation, i.e. r can wait for some turns observing g's strategy. This discrepancy must be carefully studied in order to evaluate the effectiveness of the above model. Exactly, we need to verify whether in real-world situations r violates the protocol prescribed by the above model. Technically speaking, we need to verify whether r can improve its revenue by waiting. In the affirmative case, r will wait, thus violating the protocol, and then the above model will not be satisfactory. In what follows we show that on the equilibrium path r waits. At first, if r waits for one or more turns, the game could close after d turns. However, the above model captures a strategic situation d turns long and does not prescribe how g behaves after t = d. Since we are limiting our analysis to the above model, we can only assume that g repeats its equilibrium strategy every d turns. With this extended model it can be shown that r can improve its expected utility by waiting for one or more turns in order to partially observe the route of g, exploiting this information to choose its strategy. (This is essentially because the model does not perfectly capture the situation we are considering: it implicitly assumes that r can enter the house to rob only every d turns, whereas r could enter it at every turn.) We report an example. Consider a setting with three houses, d = 2, one r type, and y^i = y^j = y^H for all i, j > 0. Call α_{ij} the probability prescribed by σ_g* to make the route i, j. It can be easily shown that α_{ij} = 1/6 for all i ≠ j with i, j > 0. If r immediately enters the house to rob, playing at the initial turn, its optimal strategy is to randomize with probability 1/3 among the three houses and its expected utility is (2/6)·y^H + (4/6)·y^0. If r waits for one turn to observe g's action, it can improve its expected utility. Precisely, if r has observed that g has patrolled house i in the first turn, then r's expected utility of choosing house i in the second turn is (4/6)·y^H + (2/6)·y^0 which, being y^H > y^0, is strictly greater than r's expected utility of robbing at the initial turn. Therefore, r will wait for one turn rather than rob a house immediately.
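The waiting argument can be checked by brute-force enumeration. The sketch below (a verification of the example only, not code from the paper) enumerates the six equiprobable routes for three houses and d = 2, and compares the robber's expected utility when robbing immediately versus after observing the first patrolled house, assuming g restarts an independent equilibrium route every d turns:

```python
import itertools

Y_H, Y_0 = 1.0, 0.0          # any values with Y_H > Y_0 give the same conclusion
houses = [0, 1, 2]
routes = list(itertools.permutations(houses, 2))   # 6 routes, probability 1/6 each

# Rob immediately: r occupies house h during turns 1-2; caught iff h is on the route.
eu_now = sum(Y_H if h not in route else Y_0
             for route in routes for h in houses) / (len(routes) * len(houses))

# Wait one turn: after seeing g at house i, rob house i during turns 2-3.
# Turn 2 is safe (a route never repeats a house); turn 3 starts a fresh route.
eu_wait = sum(Y_0 if route2[0] == route1[0] else Y_H
              for route1 in routes for route2 in routes) / len(routes) ** 2

print(eu_now)   # 2/6 * Y_H + 4/6 * Y_0 = 0.333...
print(eu_wait)  # 4/6 * Y_H + 2/6 * Y_0 = 0.666...
```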
3 The Proposed Normal-Form Model
The failure of the model previously described is due to neglecting the possibility that r can wait: since in real-world situations r can improve its revenue by waiting, it will violate the protocol. To overcome the drawbacks of this model, we must take into account the real-world possibility that r waits for one or more turns. Two routes can be followed:

1. we cast the game into an extensive-form game and explicitly take into account the action wait for r by introducing it at every decision node of r;
2. we develop a normal-form game wherein the action wait is not explicitly taken into account, but, when the game is played in real-world situations, r cannot improve its expected utility by waiting.

In this paper we limit our study to normal-form models for patrolling and therefore we follow the second route. The development of an appropriate extensive-form game will be explored in future work. The model we propose is simple. We initially describe it and subsequently discuss why it is satisfactory. The model prescribes that g and r act simultaneously. The actions available to r are the same as in the model presented in the previous section. The actions available to g are the following:
g: it chooses a house to patrol among all the possible ones. The strategy of g will be repeated at every turn. For instance, if the strategy chosen by g is to patrol house 1, then g will always patrol it. Practically, on the equilibrium path g's strategy will be fully mixed and therefore g will randomize over all the houses at every turn with the same probability distribution.

Agents' payoffs are exactly the same ones we defined in the previous section. We provide agents' expected utilities since they will be fundamental in the analysis we carry out in the next section. The expected utility of r can be easily calculated. Precisely, calling α^i the probability prescribed by σ_g* to patrol house i, the expected utility for r_j of robbing house i is

EU_{r_j}(i) = (1 − α^i)^d · y^i_{r_j} + (1 − (1 − α^i)^d) · y^0_{r_j}.

Essentially, it is the convex combination between y^i_{r_j}, i.e. r_j's evaluation of house i, and y^0_{r_j}, i.e. the evaluation of r_j's being caught, where the parameter of the convex combination is (1 − α^i)^d, i.e. the probability that g will never patrol house i for d turns.

The calculation of g's expected utility is more complicated. We give it by degrees. Suppose initially that r can be only of one type. Calling β^i the probability prescribed by σ_r* to rob house i, and supposing that g will follow a mixed strategy based on probabilities α^l for the next d − 1 turns, the expected utility for g of patrolling house j at the current turn is:

EU_g(j) = Σ_{i=1, i≠j}^m [x^i · β^i · (1 − α^i)^{d−1}] + x^0 · (β^j + Σ_{i=1, i≠j}^m [β^i · (1 − (1 − α^i)^{d−1})])

Essentially, EU_g(j) gives the expected utility of choosing house j at the current turn given that g will employ a mixed strategy from turn t = 1 to turn t = d. Suppose now that r can be of different types. The formula of EU_g(j) is defined as a weighted sum of a number of terms. The weights are the types' probabilities. The terms to sum are defined exactly as in the previous formula of EU_g(j) and refer to the single types. The formula of EU_g(j) is:

EU_g(j) = Σ_{k=1}^n ω_{r_k} · ( Σ_{i=1, i≠j}^m [x^i · β^i_{r_k} · (1 − α^i)^{d−1}] + x^0 · (β^j_{r_k} + Σ_{i=1, i≠j}^m [β^i_{r_k} · (1 − (1 − α^i)^{d−1})]) )

Now we produce some considerations concerning the proposed model. Precisely, we need to verify whether in real-world situations agents will violate the protocol prescribed by the proposed model. Consider r. The proposed model does not take into account the possibility that r can wait for some turns, but in real-world settings it can. Anyway, if r waits for one or more turns, it can be easily observed that r's expected utility and g's do not change. Hence this possibility does not affect the employment of the model in real-world situations. Consider g. The proposed model requires that g employs the same strategy at every turn, but in real-world situations it could employ different strategies at different turns. Anyway, g cannot do anything better than employing the same strategy at every turn, since it has no information concerning when r acts. It can be shown – we omit the pertinent proof for reasons of space – that g's optimal strategy in the proposed model is consistent with g's optimal strategy in the extensive-form game wherein r's action wait is explicitly taken into account.
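The two expected-utility formulas transcribe directly into code; the following sketch (our illustration, with function names of our own choosing) evaluates EU_{r_j}(i) and the multi-type EU_g(j) for given strategy vectors:

```python
def eu_robber(i, alpha, d, y, y0):
    """EU_{r_j}(i): convex combination of the house value y[i] and the
    capture value y0, weighted by the escape probability (1 - alpha[i])^d."""
    escape = (1.0 - alpha[i]) ** d
    return escape * y[i] + (1.0 - escape) * y0

def eu_guard(j, alpha, d, x, x0, omega, beta):
    """Multi-type EU_g(j); omega[k] is the probability of type r_k and
    beta[k][i] the probability that type r_k robs house i."""
    m, n = len(alpha), len(omega)
    total = 0.0
    for k in range(n):
        robbed = sum(x[i] * beta[k][i] * (1.0 - alpha[i]) ** (d - 1)
                     for i in range(m) if i != j)
        caught = beta[k][j] + sum(beta[k][i] * (1.0 - (1.0 - alpha[i]) ** (d - 1))
                                  for i in range(m) if i != j)
        total += omega[k] * (robbed + x0 * caught)
    return total
```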
4 Game Theoretical Insights
A game such as the one we are dealing with can be solved by employing off-the-shelf algorithms. Specifically, such a problem can be cast into a linear-complementarity problem and then solved by employing the Lemke-Howson algorithm [6]. However, the computational complexity of the Lemke-Howson algorithm is exponential in the size of the problem, i.e. the number m of houses and the number n of r's types. Practically, the production of exact solutions in real-world situations is not affordable, and the computation of approximate solutions for very simple problems requires a long time, e.g. more than 30 minutes with m = 3 and n = 7 [8]. The drawbacks related to off-the-shelf algorithms are due to the principle on which they are based: they search for an equilibrium strategy profile among all the possible ones, neglecting any information concerning the specific problem to solve. Since the space composed of all the possible strategy profiles grows exponentially in the size of the problem, the search is inefficient even for very simple problems. This makes the study of real-world strategic situations by employing off-the-shelf algorithms unaffordable. A route to follow to solve strategic situations more efficiently is to exploit game theoretical analysis. Precisely, game theoretical analysis allows one to derive insights concerning regularities and singularities of the problem that can be employed to reduce the space of strategy profiles among which the algorithm searches for the equilibrium. Examples of similar works can be found in [1, 2, 4]. In what follows we analyze the proposed game game theoretically in the attempt to produce several insights to employ in the design of a solving algorithm more efficient than the state of the art.

Considering g's strategies, we can state the following lemma.

Lemma 4.1 On the equilibrium path g's strategy cannot be pure.

Proof. The proof is by contradiction. Assume σ* to be an equilibrium strategy profile wherein g's strategy is pure. On the equilibrium path every one of r's types believes that g employs a pure strategy choosing a specific house to patrol (say house i). On the basis of these beliefs, since any r type strictly prefers not to be caught rather than to be caught, no r type will choose house i. On the basis of this fact, g can improve its expected utility by patrolling a house different from i. We reach a contradiction and therefore σ* is not an equilibrium. □

We can state the following lemma, whose proof is omitted, being similar to the proof of Lemma 4.1 but much longer.

Lemma 4.2 On the equilibrium path g's strategy prescribes that every house can be patrolled with a strictly positive probability.

Considering the strategies of r's types, we can state the following lemma.

Lemma 4.3 On the equilibrium path at least one of r's types employs a mixed strategy.
Proof. The proof is by contradiction. Assume σ* to be an equilibrium strategy profile wherein the strategy of all r's types is pure. It can be easily shown that g's optimal strategy is a unique action, except for a null-measure subset [5] of the space of the parameters. However, by Lemma 4.1, there is no equilibrium wherein g's strategy is pure. We reach a contradiction and therefore σ* is not an equilibrium. □
5 Improving Solving Algorithm Efficiency
In this section we show how the previous three lemmas can be employed to reduce the space of the strategy profiles among which one can search for an equilibrium. Precisely, we can exclude a large number of strategy profiles that we can assure not to be of equilibrium independently of the values of the agents' parameters, e.g. x^0 and x^1. Although the proposed algorithm searches within a space of strategy profiles dramatically reduced with respect to the state-of-the-art one, this space grows exponentially in the size of the problem. Therefore, in order to tackle real-world problems, the proposed algorithm must be improved by introducing heuristics that efficiently guide the search. We will discuss this topic in future work. On the basis of Lemma 4.2, every equilibrium strategy profile for the game we are dealing with is characterized by α^i ∈ (0, 1) for any i ∈ {1, . . . , m}. Since these variables are bound by the equation Σ_{i=1}^m α^i = 1, the number of free variables related to g's strategy is m − 1. Furthermore, on the basis of Lemma 4.3 we know that every equilibrium strategy profile is characterized by at least one r type that randomizes. The exact number of r's types that randomize, and the number of actions over which each specific randomizing type randomizes in an equilibrium strategy profile, can be determined by studying the pertinent solving equation sets and by excluding their singularities. For the sake of clarity, we study the possible randomizations of r's types by degrees: at first when the number of r's types is one, and subsequently when r's types are more than one.
5.1 The Base Case: One Robber's Type
We consider the situation in which the number of r's types is one. As is customary in game theory, in a two-player game the randomization probabilities related to each player are computed in such a way that the other player can effectively randomize, i.e. every action over which a player randomizes gives it the same expected utility and no other action gives it more than randomizing does. In the game we are studying, the randomization probabilities of g will be computed in such a way that the actions over which r randomizes give r the same expected utility, and vice versa. Technically speaking, we have two equation sets: the first one, say Φ_g, wherein the variables are the randomization probabilities of g, i.e. the α^i s, and the equations are of the form EU_r(i) = EU_r(j) for all actions i, j over which r randomizes; the second one, say Φ_r, wherein the variables are the randomization probabilities of r, i.e. the β^i s, and the equations are of the form EU_g(i) = EU_g(j) for all actions i, j over which g randomizes. On the basis of Lemma 4.2, we know that equation set Φ_g is characterized by m − 1 variables and that equation set Φ_r is characterized by m − 1 independent equations. We need to find the number of actions over which r randomizes in order to have two well-defined equation sets. Since, when r randomizes over m actions, m − 1 variables are introduced in equation set Φ_r and m − 1 equations are introduced in equation set Φ_g, the appropriate number of actions over which r randomizes on the equilibrium path is m. Notice that, if r randomizes over a number of actions lower than m, then equation set Φ_r would
present a number of variables lower than the number of equations and therefore it does not admit any solution. Easily, since at the equilibrium both g and r randomize over all the possible actions and since Φ_g and Φ_r admit a unique solution, the game admits a unique equilibrium strategy. In this equilibrium all α^i, β^j ∈ (0, 1). Since agents' equilibrium strategies can be provided in closed form, no search is needed.

By imposing EU_r(i) = EU_r(j) for any i, j ∈ {1, . . . , m}, we can calculate the values of the α^i s. Exactly, calling

γ(i, j) = ((y^i − y^0) / (y^j − y^0))^{1/d},

by trivial mathematics we obtain:

α^i = (1 + Σ_{j=1}^m [γ(i, j) − 1]) / Σ_{j=1}^m [γ(i, j)].

By imposing EU_g(i) = EU_g(j) for any i, j ∈ {1, . . . , m}, we can calculate the values of the β^i s. Exactly, calling

ε(i, j) = ((x^0 − x^i) · (1 − α^i)^{d−1}) / ((x^0 − x^j) · (1 − α^j)^{d−1}),

by trivial mathematics we obtain:

β^i = 1 / Σ_{j=1}^m [ε(i, j)].
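These closed forms are immediate to evaluate; the sketch below (our transcription of the two formulas above, not the paper's C implementation) computes the unique single-type equilibrium:

```python
def single_type_equilibrium(x0, x, y0, y, d):
    """Closed-form equilibrium for one robber type; x0, y0 are the capture
    evaluations, x[i] and y[i] the house evaluations, d the robbery length."""
    m = len(x)
    gamma = [[((y[i] - y0) / (y[j] - y0)) ** (1.0 / d) for j in range(m)]
             for i in range(m)]
    alpha = [(1 + sum(g - 1 for g in gamma[i])) / sum(gamma[i]) for i in range(m)]
    eps = [[((x0 - x[i]) * (1 - alpha[i]) ** (d - 1)) /
            ((x0 - x[j]) * (1 - alpha[j]) ** (d - 1)) for j in range(m)]
           for i in range(m)]
    beta = [1.0 / sum(eps[i]) for i in range(m)]
    return alpha, beta

# Sanity check: with three identical houses both players randomize uniformly.
alpha, beta = single_type_equilibrium(x0=1.0, x=[0.0, 0.0, 0.0],
                                      y0=0.0, y=[1.0, 1.0, 1.0], d=2)
print(alpha, beta)  # [1/3, 1/3, 1/3] and [1/3, 1/3, 1/3]
```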
5.2 The General Case: More Robber's Types
We consider the situation in which the number of r's types can be any. The analysis is similar to the base case, but it is more complicated. While with a unique r type there is a unique possible set of actions over which r can randomize that makes the above equation sets well-defined, i.e. all the m houses, with more r types this is not so; e.g. r_1 could randomize over m − 3 houses and r_2 over 3 houses. Furthermore, among all the possible ways in which r's types can randomize that make the pertinent equation sets well-defined, only one leads to an equilibrium. We therefore need to search for it.

At first, we characterize the strategy profiles with respect to the actions over which r's randomizing types randomize (we exclude all r's types that do not randomize). We use an n × m binary matrix R_r where the rows denote r's types and the columns denote the houses, e.g.

    R_r = [ 1 1 ... 0 0 ]
          [ 0 0 ... 1 0 ]    (n rows, one per type; m columns, one per house)
          [ 0 0 ... 0 0 ]

Precisely, the meaning of R_r is the following: R_r(i, j) = 1 means that r_i randomizes over house j, while R_r(i, j) = 0 means that r_i does not randomize over house j. Notice that, in order for R_r to be well-defined, the following constraint must hold: Σ_{j=1}^m R_r(i, j) ≠ 1 for any i ∈ {1, . . . , n} (i.e., a randomizing agent must randomize over at least two actions). We call this constraint C1.

Given a matrix R_r, we can build equation set Φ_g for the calculation of g's randomization probabilities. Trivially, in order for Φ_g to be well-defined, two properties must hold: Φ_g must be composed of m − 1 independent equations, and all the α^i s must be present in Φ_g. These two properties can be translated into the following two constraints over R_r:

C2: for any j ∈ {1, . . . , m} it holds Σ_{i=1}^n R_r(i, j) > 0 (i.e., each variable α^j must be present in Φ_g);
C3: Σ_{i=1}^n [max{Σ_{j=1}^m R_r(i, j) − 1, 0}] = m − 1 (i.e., the number of independent equations must be m − 1).

Similarly, given a matrix R_r we can also build equation set Φ_r for the calculation of the randomization probabilities of r's types. It can be shown that, in order for Φ_r to be well-defined, no further constraint over R_r is needed.
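Constraints C1–C3 are easy to encode, and for small m and n the feasible matrices can simply be enumerated; the sketch below (our illustration, not the paper's implementation) reproduces the eight feasible matrices listed next:

```python
from itertools import product

def feasible(Rr, m):
    """Check C1-C3 for a candidate binary matrix Rr given as a tuple of rows."""
    c1 = all(sum(row) != 1 for row in Rr)                  # no type randomizes over exactly one house
    c2 = all(any(row[j] for row in Rr) for j in range(m))  # every alpha^j appears in Phi_g
    c3 = sum(max(sum(row) - 1, 0) for row in Rr) == m - 1  # m - 1 independent equations
    return c1 and c2 and c3

def all_feasible(n, m):
    rows = list(product((0, 1), repeat=m))
    return [Rr for Rr in product(rows, repeat=n) if feasible(Rr, m)]

print(len(all_feasible(2, 3)))  # 8
```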
With m = 3 and n = 2, all the feasible R_r s, i.e. all those that satisfy C1, C2, and C3, are:

    [1 1 0]  [1 1 0]  [1 0 1]  [1 0 1]  [0 1 1]  [0 1 1]  [1 1 1]  [0 0 0]
    [1 0 1]  [0 1 1]  [1 1 0]  [0 1 1]  [1 1 0]  [1 0 1]  [0 0 0]  [1 1 1]
RAM. Experimental results are reported in Tab. 5.3. Although the proposed algorithm is a prominent step ahead with respect to the state-of-the-art, it cannot address real-world settings. For instance, it requires more than one day computation for settings with m = 20 and n = 10. The efficiency of the algorithm can be improved by employing heuristics to order dynamically the feasible Rr s.
Given a matrix R_r, it is possible to find univocally the values of the α_i's and β_{r_j}^i's by employing equations similar to the ones employed in the previous section and, subsequently, it is possible to verify whether the agents' strategies computed on the basis of R_r lead to an equilibrium or not. Precisely, we need to verify that: • all α_i ∈ (0, 1); • all the β_{r_j}^i's prescribed by R_r belong to (0, 1); • no randomizing r's type can do anything better than randomizing. Therefore, we can limit the search for an equilibrium to the search for a feasible R_r that leads to an equilibrium. This dramatically reduces the search space and thus the time needed to compute a solution. Consider for instance the setting with m = 3, n = 2, and d = 2. The space over which off-the-shelf algorithms search is the set of vertices of a complex 9-polytope, while the space of all the feasible R_r's is composed of eight elements. We report our algorithm in Algorithm 1. Currently, all the feasible R_r's are statically ordered in lexicographic order.

Algorithm 1: EQUILIBRIUM FINDER
1  for all feasible R_r do
2    solve Φ_g
3    if all α_i ∈ (0, 1) then
4      calculate the optimal strategies of the randomizing r's types on the basis of the α_i's
5      if no randomizing type deviates from the actions in R_r then
6        calculate the optimal strategies of the non-randomizing r's types on the basis of the α_i's
7        solve Φ_r
8        if all β_{r_j}^i ∈ (0, 1) then
9          return R_r, the α_i's, and the β_{r_j}^i's
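A minimal sketch of Algorithm 1's search skeleton follows. The three callables stand in for the equation-set solvers and the deviation test described above; they are assumptions of this sketch, not part of the paper.

    def equilibrium_finder(feasible_Rrs, solve_phi_g, no_deviation, solve_phi_r):
        for Rr in feasible_Rrs:                     # statically ordered, e.g. lexicographically
            alpha = solve_phi_g(Rr)                 # g's randomization probabilities from Phi_g
            if not all(0.0 < a < 1.0 for a in alpha):
                continue                            # line 3 fails: try the next Rr
            if not no_deviation(Rr, alpha):         # some randomizing type deviates: try the next Rr
                continue
            beta = solve_phi_r(Rr, alpha)           # randomization probabilities of r's types from Phi_r
            if all(0.0 < b < 1.0 for b in beta):
                return Rr, alpha, beta              # equilibrium found (line 9)
        return None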
                     types (n)
    houses (m)     6      7      8      9      10     11     12
        3        0.007  0.011  0.017  0.024  0.033  0.043  0.055
        4        0.190  0.352  0.720  1.015  1.532  1.852  2.231

Table 1. Average time (in seconds) required by Algorithm 1 for the computation of the equilibrium.
5.3
Experimental Considerations
We provide a preliminary experimental evaluation of the proposed algorithm. In order to evaluate it, we compare the average time it requires for the computation of an equilibrium with respect to the time required by the algorithm proposed in [8]. Since our algorithm considers a different model and is thus not directly comparable with the algorithm presented in [8], we have modified our algorithm to solve the model of [8]. The experimental results reported below refer to this modified version of the algorithm. No significant difference in terms of computational time (< 5%) has been found between the application of our algorithm to the model proposed in [8] and to the one we present in Section 3. The algorithm proposed in [8], implemented in CPLEX, requires more than 30 minutes to compute approximate solutions for settings with m = 3, n = 7 and for settings with m = 4, n = 6. We have implemented our algorithm in C and we have considered all the settings with m = 3, 4 and n ∈ {6, ..., 13}. For each setting we have considered 10^3 different agents' payoffs drawn at random in (0, 1). We have used a 1.4GHz CPU with 500MB of RAM. Experimental results are reported in Table 1. Although the proposed algorithm is a prominent step ahead with respect to the state-of-the-art, it cannot yet address real-world settings. For instance, it requires more than one day of computation for settings with m = 20 and n = 10. The efficiency of the algorithm can be improved by employing heuristics to order the feasible R_r's dynamically.

6
Conclusions and Future Works

Strategic patrolling is a challenging problem that has received a lot of attention in the artificial intelligence literature. In this paper we considered the principal strategic patrolling model presented in the literature and provided two main contributions. First, we showed that the model proposed in the state-of-the-art presents some unsatisfactory issues from a game-theoretic point of view, and we provided a model that is game-theoretically satisfactory. Then, we analyzed the considered game in order to produce some insights concerning regularities and singularities of the corresponding solving equation sets. These insights were subsequently employed in the design of a solving algorithm, which was shown to be much more efficient than state-of-the-art ones. We intend to develop this work along two main directions. The first concerns the provision of an appropriate extensive-form model for the considered strategic interaction; we will furthermore study leadership with commitment to mixed strategies in our model. The second is more general and concerns the development of a general approach that exploits game-theoretical analysis to enable algorithms to tackle real-world game settings, e.g. by employing genetic algorithms.

REFERENCES
[1] F. Di Giunta and N. Gatti, 'Alternating-offers bargaining under one-sided uncertainty on deadlines', in Proceedings of ECAI, pp. 225-229, Riva del Garda, Italy, (2006).
[2] F. Di Giunta and N. Gatti, 'Bargaining over multiple issues in finite horizon alternating-offers protocol', Annals of Mathematics and Artificial Intelligence, 47(3-4), 251-271, (2006).
[3] D. Fudenberg and J. Tirole, Game Theory, The MIT Press, Cambridge, MA, USA, 1991.
[4] N. Gatti, F. Di Giunta, and S. Marino, 'Alternating-offers bargaining with one-sided uncertain deadlines: an efficient algorithm', Artificial Intelligence, 172(8-9), 1119-1157, (2008).
[5] P. R. Halmos, Measure Theory, Springer, Berlin, Germany, 1974.
[6] C. Lemke, 'Some pivot schemes for the linear complementarity problem', Mathematical Programming Study, 7, 15-35, (1978).
[7] N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani, Algorithmic Game Theory, Cambridge University Press, New York, USA, 2007.
[8] P. Paruchuri, J. P. Pearce, M. Tambe, F. Ordonez, and S. Kraus, 'An efficient heuristic approach for security against multiple adversaries', in Proceedings of AAMAS, pp. 311-318, Honolulu, USA, (2007).
[9] P. Paruchuri, M. Tambe, F. Ordonez, and S. Kraus, 'Security in multiagent systems by policy randomization', in Proceedings of AAMAS, pp. 273-280, Hakodate, Japan, (2006).
[10] R. Porter, E. Nudelman, and Y. Shoham, 'Simple search methods for finding a Nash equilibrium', in Proceedings of AAAI, pp. 664-669, San Jose, USA, (2004).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-408
Monitoring the Execution of a Multi-Agent Plan: Dealing with Partial Observability Roberto Micalizio and Pietro Torasso1 Abstract. The paper addresses the task of monitoring and diagnosing the execution of a multi-agent plan (MAP) which involves actions concurrently executed by a team of cooperating agents. The paper describes a weak commitment strategy to deal with cases where observability is only partial and it is not sufficient for inferring the outcome of all the actions executed so far. The paper discusses the role of target actions in providing sufficient conditions for inferring the pending outcomes in a finite time window. The action outcome provides the basis for computing plan diagnosis and for singling out the goals which will not be achieved because of an action failure.
1 Introduction
The problem of diagnosing the execution of a single-agent plan was investigated long ago (see the pioneering work by Birnbaum et al. [1], where the concept of plan threat is introduced). However, only recently have a number of Model-Based approaches (see [4, 8, 5]) started to address the complex problem of diagnosing the execution of a multi-agent plan (MAP), i.e. a plan involving a team of cooperating agents which execute actions concurrently. These works are essentially based on a distributed approach where each agent is responsible for supervising (monitoring and diagnosing) the actions it executes. Typically these approaches assume that action failures are not consequences of plan flaws, but are due to the occurrence of unexpected events (such as discrepancies in the shared assumptions or the occurrence of faults in some agents' functionalities). Thus, the plan execution needs to be supervised in order to detect and explain an action failure as soon as possible. As discussed in [8], the plan diagnosis consists in a subset of actions whose failure is consistent with the anomalous observed behavior of the system. However, this notion of plan diagnosis can be complemented with a notion of threatened actions, which estimates the impact of the failure, since the harmful effects of an action failure may propagate to the whole MAP. In this paper, similarly to the previous approaches, a distributed approach for supervising the MAP execution is adopted. However, we address the problem of diagnosing plans characterized by the presence of joint actions, which introduce further dependencies among the agents as they need to synchronize and to communicate among themselves. Moreover, we have to deal with actions whose faulty behavior may be non-deterministic. In the paper we show that the nominal plan execution imposes some requirement on observability (we will call it the minimal observability requirement) in order to guarantee the inter-agent communication, and we introduce a weak commitment strategy to deal with cases where observability is only partial and is not sufficient for inferring the outcome of all the actions executed so far. We will show how the minimal observability requirement combined with the weak commitment strategy guarantees that the outcome of each action can be inferred within a finite time window. The paper is organized as follows. In the following sections we introduce the basic notions of global and local plans, then we formalize the processes of monitoring and diagnosis of a MAP and discuss the role of the minimal observability requirement and the weak commitment strategy in inferring the action outcomes which cannot be directly observed; finally we discuss some computational issues and conclude.

2 Distributed Plan Execution and Supervision
In this paper we consider a specific class of MAS where a team T of agents cooperates to reach a common complex goal G. In particular, the global goal G is decomposed into a set of (easier) sub-goals, each of which is assigned to an agent in the team. In most cases, however, the sub-goals are not independent of one another, as the agents have to cooperate by exchanging services or by executing joint actions; this cooperative behavior introduces causal dependencies among activities, hence when an unexpected event causes the failure of an agent activity, this failure may propagate through the whole system, affecting the activities of the other agents in the team.
Global plan. The notion of multi-agent plan (MAP), as formalized by Cox et al. in [2], is well suited for modeling both the agents' activities and the causal dependencies existing among them. According to [2], given a team T of agents, the MAP is the tuple ⟨A, E, CL, CC, NC⟩ such that: A is the set of the action instances the agents have to execute; each action a is assigned to a specific agent i of the team T and is modeled in terms of preconditions and direct effects. Within the set A there are two special actions, a_0 and a_∞: a_0 is the starting action, it has no preconditions and its effects specify which propositions are true in the initial state; a_∞ is the ending action, it has no effects and its preconditions specify the propositions which must hold in the final state, i.e., the preconditions of a_∞ specify the MAP's goal G. E is a set of precedence links between actions; CL is a set of causal links of the form l : a →_q a', where the link l states that the action a provides the action a' with the service q, and q is an atom occurring in the preconditions of a'; finally, CC and NC are respectively the concurrency and non-concurrency symmetric relations over the action instances in A. In particular, a pair ⟨a, a'⟩ in CC models a joint action, whereas constraints in NC prevent conflicts in accessing the resources; this is equivalent to the concurrency requirement introduced in [8].
Plan Distribution. The execution of the MAP P is a critical step, as the agents have to concurrently execute the actions assigned to them without violating the constraints introduced in the planning phase.
1 Dipartimento di Informatica - Università di Torino, Italy, email: {micalizio, torasso}@di.unito.it
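For concreteness, the MAP tuple ⟨A, E, CL, CC, NC⟩ can be rendered as a small data structure; all the field types below are illustrative choices of this sketch, not the authors' encoding.

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Action:
        name: str
        agent: str                      # executing agent (a0 and a_inf are special actions)
        pre: frozenset = frozenset()    # preconditions (atoms)
        eff: frozenset = frozenset()    # direct effects (atoms)

    @dataclass
    class MAP:
        A: set                          # action instances, including a0 and a_inf
        E: set                          # precedence links (a, a2)
        CL: set                         # causal links (a, q, a2): a provides a2 with service q
        CC: set                         # concurrency pairs, modeling joint actions
        NC: set = field(default_factory=set)   # non-concurrency (resource) constraints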
[Figure: the MAP P as a DAG from a_0 to a_∞, with the actions 1-13 of agents A1, A2, A3 (Move, Push, PutOn) as nodes, solid arrows for causal links, dashed arrows for precedence links, and bidirectional arrows labeled CC/NC for the concurrency and non-concurrency constraints.]
Figure 1. The MAP P to be monitored.
It is therefore quite natural to conceive a distributed approach to the supervision of the plan execution. In this paper we adopt a distributed approach to the supervision (similar to the ones discussed in [8, 5]) where each agent performs a (local) control loop over the actions it executes.
Local Plans. The MAP P under consideration is decomposed into as many sub-plans P_i as there are agents in T, and each sub-plan P_i is assigned to agent i. The decomposition can be easily done by selecting from P all the actions an agent i has to execute. Formally, the sub-plan for agent i is the tuple P_i = ⟨A_i, E_i, CL_i, CC_i, NC_i⟩, where A_i, E_i, CL_i, CC_i and NC_i are the same as in P restricted to the actions agent i has to execute (i.e., at least one action belongs to A_i). We consider time as a discrete sequence of instants; the actions are executed synchronously by the agents in the team and each action in P takes one time unit to be executed (this common assumption is also made in [8, 6]). At a given time t, an agent i can execute just one action a (in the following the notation a_t^i will denote the action executed by agent i at time t). After the execution of action a_t^i the agent i may receive a set of observations, denoted as obs_{t+1}^i, relevant for the status of i itself.
Minimal Observability Requirement and Agent Communication. Since the agents need to communicate to achieve coordination during the plan execution, a minimal observability requirement must be satisfied. To figure out which events are included in the minimal observability requirement, consider that coordination is required in three cases during the nominal plan execution. First, when an agent i has to provide a service q to another agent j; technically, this case is encoded by a causal link l : a →_q a' in the MAP P (where a ∈ A_i and a' ∈ A_j). After the execution of a, the agent i must be able to observe the achievement (or the absence) of service q and must notify agent j whether the service has been provided or not; in fact, because of partial observability, the agent j cannot directly observe the service q and has to wait for a message from i. The second situation which requires explicit coordination during the execution regards the joint actions. Every pair of actions ⟨a, a'⟩ included in the set of concurrency constraints CC models a joint action, where a and a' are actions assigned to agents i and j, respectively. In order to execute the joint action ⟨a, a'⟩ in a synchronized way, the two agents i and j need to observe whether the preconditions of the actions a and a' are satisfied; in fact the joint action can be performed only when the preconditions of both actions are satisfied and both agents have to be aware of this. Explicit coordination is also required for executing actions bound by non-concurrency constraints: in this case coordination is ruled by the set of non-concurrency constraints NC in P, which prevents the simultaneous execution of the constrained actions. Given
a pair of actions ⟨a, a'⟩ ∈ NC (where a ∈ A_i and a' ∈ A_j respectively), when agent i intends to execute action a it must inform agent j, and in case of conflict the two agents have to negotiate. As we will see in section 4, agents communicate even in case of action failures: an agent must notify the other agents when a service will not be provided as a consequence of a failure.
Running Example. In the paper we will use a simple example from the blocks world to illustrate the concepts and the techniques we propose. Let us consider three agents that cooperate to achieve a global goal G where two blocks O1 and O2 are moved to a target position T and O1 is put on top of block O2; initially, the blocks are located in position P4. In its nominal behavior an agent can move a block by pushing it; however, in some cases a block may be too heavy and two agents need to join their efforts to push it. Figure 1 shows a possible MAP which achieves the goal G; in particular, the agents A2 and A3 cooperate to move the (heavy) block O2 to position T (see the joint actions ⟨6,11⟩ and ⟨7,12⟩); the agent A1 moves the block O1 to position T, then it puts O1 on top of O2 (see action PutOn). The MAP is a DAG where nodes are actions, solid and dashed arrows are causal and precedence links respectively, while concurrency and non-concurrency constraints are solid, bidirectional arrows labeled CC and NC respectively. The dashed rectangles specify which actions are included in the sub-plans assigned to the three agents. The operations within the target position are constrained: at each time instant only one block can be moved in, so there are non-concurrency constraints between the joint action ⟨7,12⟩ and the simple action 3; moreover, since the block O2 must be positioned in T earlier than O1, precedence links exist between the actions ⟨7,12⟩ and 3.
3
Monitoring with uncertain action outcomes
The monitoring performed by agent i over the execution of its sub-plan provides two important services: 1) estimating the state of agent i after the execution of an action a; 2) detecting the outcome of the action a. However, before describing the monitoring process we need to introduce some important concepts.
Agent state. Intuitively, the system status can be expressed in terms of the status variables of the agents in the team T and of the status of the system resources RES. However, the distributed approach to the supervision prevents the adoption of a global notion of status, while it allows a local view based on a single agent. The status of agent i is expressed in terms of a set of status variables VAR_i, which is partitioned into three subsets END_i, ENV_i and HLT_i. END_i and ENV_i denote the sets of endogenous (e.g., the agent's position) and environment (e.g., the resources' state) status variables, respectively. Note that, because of the partitioning, each
agent i has to maintain a private copy of the resource status variables; more precisely, for each resource res_k ∈ RES (k : 1..|RES|) the private variable res_{k,i} is included in the set ENV_i. Since we are interested in monitoring the plan execution even when action failures occur, we introduce a further set of variables in order to model the agent faults which may cause action failures. HLT_i denotes the set of variables concerning the health status of an agent's functionalities (e.g., mobility and power); in particular, for each agent functionality f, a variable v_f ∈ HLT_i represents the health status of f; the domain of variable v_f is the set {ok, abn_1, ..., abn_n}, where ok denotes the nominal mode while abn_1, ..., abn_n denote non-nominal modes. It is worth noting that the observations obs_t^i agent i may receive convey information about just a subset of the variables in VAR_i. First of all, an agent can directly observe just the status of the resources it is actively exploiting: the status of other resources is not directly observable, but an agent can communicate with other agents to determine it. Moreover, the observations obs_t^i provide in general the value of just a subset of the variables in END_i, whereas the variables in HLT_i are not directly observable and their actual value can only be inferred. Given this partial observability, at each time t the agent i can just estimate a set of alternative states which are consistent with the received observations obs_t^i; in the literature this set is known as a belief state and will be denoted as B_t^i.
Action models. The model of a simple action a_t^i (assigned to agent i at time t) is the tuple ⟨var(a_t^i), pre(a_t^i), eff(a_t^i), event(a_t^i), Δ(a_t^i)⟩, where var(a_t^i) ⊆ VAR_i is the subset of status variables over which the set pre(a_t^i) of preconditions and the set eff(a_t^i) of effects are defined; event(a_t^i) is the set of exogenous events (e.g., faults) which may occur during the execution of action a_t^i and which may possibly affect its outcome; finally, Δ(a_t^i) is a transition relation where every tuple d ∈ Δ(a_t^i) models a possible state transition which may occur while i is executing a_t^i. Each tuple d has the form d = ⟨s_t, event, s_{t+1}⟩, where s_t and s_{t+1} represent two agent states at time t and t+1 respectively (each state is a complete assignment of values to the status variables in var(a_t^i)) and event (possibly empty) represents the occurrence of an unexpected event in event(a_t^i). Since Δ(a_t^i) is a relation, the action model can represent non-deterministic, anomalous action effects. The healthy formula healthy(a_t^i) of action a_t^i is computed by restricting each variable v_f ∈ healthVar(a_t^i) to the nominal behavioral mode ok, and represents the nominal health status of agent i required to successfully complete the action itself. Therefore, the (expected) nominal effects of a_t^i are nominalEff(a_t^i) = {q ∈ eff(a_t^i) | pre(a_t^i) ∪ healthy(a_t^i) ⊢ q}. On the contrary, when the healthy formula does not hold, the behavior of the action may be non-deterministic and some (even all) of the expected effects may be missing.
Joint actions. The notion of simple action can be extended to cover the notion of joint action which, as discussed in [3], can be seen as the simultaneous execution of a subset of simple actions.
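The action model can likewise be sketched as a record whose transition relation Δ is an explicit set of ⟨s_t, event, s_{t+1}⟩ tuples; several successors for the same (state, event) pair encode the non-deterministic anomalous behavior. Types and names below are illustrative assumptions.

    from dataclasses import dataclass, field
    from typing import FrozenSet, Optional, Set, Tuple

    State = FrozenSet[Tuple[str, str]]   # complete assignment, e.g. frozenset({("pos", "P4"), ("mobility", "ok")})

    @dataclass
    class ActionModel:
        var: FrozenSet[str]              # status variables over which pre/eff are defined
        pre: FrozenSet[Tuple[str, str]]  # preconditions
        eff: FrozenSet[Tuple[str, str]]  # direct (possibly anomalous) effects
        event: FrozenSet[str]            # exogenous events (faults) that may occur during execution
        delta: Set[Tuple[State, Optional[str], State]] = field(default_factory=set)  # {(s_t, event, s_t+1)}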
In this paper we consider a stronger notion of joint action: two simple actions a_t^i and a_t^j are part of a joint action a_t^{i,j} not only because they are executed at the same time, but also because the agents i and j actively cooperate to reach an effect. The notion of dependency set, introduced in [5], is exploited to homogeneously represent both simple and joint actions. Intuitively, a dependency set I(t) highlights the subset of agents whose strict cooperation is required at a specific time instant t and can be easily determined from the concurrency constraints defined in the MAP. The agents within the same dependency set I(t) synchronize in order to build a joint belief state B_t^I (resulting from the conjunction of their local belief states) and the joint model of the action a_t^{I(t)} (see details in [5]). Thus, given the dependency set I(t), a_t^{I(t)} may denote a simple or a joint action.
The prediction process. In [5] Micalizio et al. have proposed a distributed strategy for monitoring the execution of a MAP. Their approach can be summarized as follows: let a_t^{I(t)} denote the (joint) action executed by the agent(s) in the dependency set I(t) at time t (for the sake of readability we will write a_t^I whenever the time of the dependency set is obvious from the context), let B_t^I be the (joint) belief state of the agents in I, and let Δ(a_t^I) be the model of the (joint) action the agents in I have to execute at time t; the joint belief state at time t+1 (i.e., after the action execution) can be inferred as:

    B_{t+1}^I = π_{VAR^I}( σ_{obs_{t+1}^I}( B_t^I ⋈ Δ(a_t^I) ) ).

The join operation B_t^I ⋈ Δ(a_t^I) represents the prediction step, as it estimates the set of possible states of the agents in I at time t+1. However, the set of predictions resulting from the join operation is in general spurious, as it predicts all possible evolutions. The selection operation σ_{obs_{t+1}^I} has the effect of pruning off all those predictions which are inconsistent with the observations received by the agents in I at time t+1, where obs_{t+1}^I = ⋃_{i∈I} obs_{t+1}^i. Of course, the precision of the estimated joint belief strongly depends on the amount of available observations: in the worst case obs_{t+1}^I is empty and the selection operator cannot discard any of the predicted states; in the best (unrealistic) case obs_{t+1}^I is so complete as to reduce the estimated states to the actual agent state. Finally, the (joint) belief state B_{t+1}^I is inferred by projecting the set of refined predictions over the agent status variables at time t+1.
Strong Commitment. Intuitively, the outcome of action a_t^I is succeeded when all the nominal, expected effects are achieved after its execution; the action outcome is failed otherwise. However, since the belief state B_{t+1}^I may be highly ambiguous, in [5] the authors adopt a strong commitment policy by considering a_t^I as successfully completed iff all its nominal effects nominalEff(a_t^I) have been achieved in every possible state s included in B_{t+1}^I; formally:

    a_t^I succeeded ↔ ∀q ∈ nominalEff(a_t^I), ∀s ∈ B_{t+1}^I : s ⊨ q.    (1)

Moreover, [5] requires that the outcome of action a_t^I must be immediately assessed after the execution of the action, at time t+1. Therefore, when the effects in nominalEff(a_t^I) are not satisfied in at least one state included in the joint belief B_{t+1}^I, the outcome of action a_t^I is assumed to be failed. This strong commitment policy is based on the assumption that, whenever the action a_t^I is successfully completed, the amount of observations available at time t+1 is sufficient for pruning off from B_{t+1}^I any state s where the nominal effects do not hold. Under this assumption it is sufficient that each agent maintains just the last belief state B_{t+1}^I, as it represents a synthesis of the past history up to time t+1. Unfortunately this assumption may not hold in many domains and, as a consequence of the partial observability, it may happen that even when an action is successfully completed an agent concludes a failure because it cannot univocally assert a success.
Weak Commitment. In order to avoid this problem we propose a more flexible strategy where the outcome of an action a_t^I can be inferred within a time window rather than at the precise time instant t+1. In particular we assume that the system observability satisfies just the minimal observability requirement, and we propose a methodology for monitoring the plan execution which is able to cope with this constraint. This means that, when an agent is unable to determine the outcome of action a_t^I, the agent does not conclude the failure of a_t^I, but postpones the assessment of the action outcome. In fact, although the outcome of a_t^I cannot be precisely determined
at the current time t+1, it may be determined at a future time instant by exploiting the observations that each agent i in I will receive. For this reason, each agent i has to maintain a list pO^i(t) of actions whose outcome has not been determined yet at time t, i.e., a list of pending outcomes. Moreover, the agent i has to maintain a trajectory Tr^i[0, t+1], which relates all the belief states agent i has inferred so far. In particular, since the belief states are ambiguous and, in general, include a number of alternative states, Tr^i[0, t+1] is a set of trajectories.
Refining the agent trajectory. Given the action a_t^I, such that agent i ∈ I, the process for estimating the belief state of agent i at time t+1 consists in extending the agent trajectory Tr^i[0, t] to cover the time instant t+1. Also in this case we adopt the Relational Algebra operators to formalize this process:

    Tr^i[0, t+1] = σ_{obs_{t+1}^I}( Tr^i[0, t] ⋈ Δ(a_t^I) ).

The join operator represents the step which extends the agent trajectory; in fact any of the transitions modeled in Δ(a_t^I) is appended at the end of one (or more) traces in Tr^i[0, t]. Observe that the join operator implicitly refines also the agent trajectory: all the traces in Tr^i[0, t] which do not participate in the join are discarded. The selection operator further refines the agent trajectory, as it filters out all those traces which are inconsistent with the observations available at time t+1. Therefore, Tr^i[0, t+1] maintains all the possible agent trajectories which are consistent with the observations received so far, given the sequence of actions executed by agent i in the interval [0, t+1].
Inferring action outcomes. Since the extension of the agent trajectory refines the trajectory itself, it may reduce the ambiguity in some of the previous belief states. Thus, agent i can try to infer the outcome of some of the actions in pO^i(t+1); in fact, for each action a_k^{I(k)} ∈ pO^i(t+1) (where I(k) represents the dependency set including i at time k ∈ [0, t+1]), it is possible to determine the belief state inferred by the agent at time k+1 from the agent trajectory Tr^i[0, t+1] as follows:

    B_{k+1}^I = π_{k+1}( Tr^i[0, t+1] ).

Observe that this B_{k+1}^I is potentially different from the belief state inferred by the agent i at time k; in fact it results from the progressive extension of the agent trajectory from time k to time t+1, and at each step B_{k+1}^I may have been refined. The nominal outcome of action a_k^I is therefore inferred similarly to the definition in formula (1); i.e., if the nominal effects of the action a_k^I hold in every state s ∈ B_{k+1}^I, the action outcome is succeeded. However, the achievement of the nominal effects of action a_k^I is a consequence not only of the nominal execution of this action but also of the previous actions which are causally related to a_k^I. The relation between the nominal outcome of a_k^I and the previous actions is formalized in the following property:
Property 1 Given the agent i and its dependency set I at time k, let a_k^I be an action with outcome succeeded; then all the actions a_h in pO^i(k) ∩ dependsOn(a_k^I) have outcome succeeded too, where dependsOn(a_k^I) denotes the subset of actions {a_1, ..., a_n} in A_i which directly or indirectly provide a_k^I with a service (i.e., through a sequence of causal links).
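The trajectory-extension step is a join followed by a selection over sets of traces. The sketch below spells it out with plain Python sets in place of the OBDD encoding used by the authors; the outcome test mirrors formula (1) together with the pending and non-nominal cases discussed next.

    def extend_trajectories(trajs, delta, obs):
        # join: append every applicable Delta-transition to each trace;
        # selection: keep only extensions consistent with the observations obs_{t+1}
        extended = set()
        for tr in trajs:                             # tr is a tuple of states
            for (s, ev, s2) in delta:
                if s == tr[-1] and obs <= s2:
                    extended.add(tr + (s2,))
        return extended                              # traces that cannot be extended are discarded

    def belief_at(trajs, k):
        return {tr[k] for tr in trajs}               # B_k as the projection on time k

    def outcome(nominal_eff, belief):
        holds = [all(q in s for q in nominal_eff) for s in belief]
        if all(holds):
            return "succeeded"                       # formula (1)
        return "non-nominal" if not any(holds) else "pending"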
It is also possible that the extension of the trajectory does not sufficiently refine the belief state B_{k+1}^I, in such a way that the nominal effects of the action a_k^I hold just in a subset of the states included in B_{k+1}^I; in this case the outcome of a_k^I remains pending. Finally, if the nominal effects do not hold in any state included in B_{k+1}^I, we conclude that the outcome of a_k^I is non-nominal. This does not necessarily imply that the action a_k^I has failed, since the non-achievement of the nominal effects may depend on the failure of previous actions causally related to a_k^I.
4 Plan Diagnosis
As soon as the outcome of an action a_t^I is determined to be non-nominal, a diagnostic process is activated in order to provide a possible explanation for such a non-nominal outcome. In this paper we adopt the same notion of plan diagnosis introduced by Roos et al. in [8]: once a non-nominal outcome of action a_t^I has been observed, the plan diagnosis PD(a_t^I) singles out a subset of actions executed by the agents in I whose failure is consistent with the anomalous, observed behavior of the system. Given an agent i ∈ I, every action a in EXP^i(a_t^I) = (pO^i(t) ∩ dependsOn(a_t^I)) ∪ {a_t^I} is a minimal explanation of the non-nominal outcome of a_t^I. Therefore, the plan diagnosis for agent i is PD^i(a_t^I) = ⋁_{a ∈ EXP^i(a_t^I)} a; in fact, due to the causal dependencies, it is sufficient to assume the failure of at least one of these actions to explain the observed non-nominal outcome of a_t^I. It is easy to extend the plan diagnosis to the dependency set I as PD(a_t^I) = ⋃_{i∈I} PD^i(a_t^I). Essentially, the plan diagnosis explains the observed, non-nominal outcome of a_t^I by singling out a subset of actions whose failure may be the root cause of that observation.
Missing Goals. The plan diagnosis can be refined by determining the set of missing goals. A missing goal is a service which cannot be provided by agent i as a consequence of the failure of action a_t^I (where i belongs to the dependency set I). To formally characterize the concept of missing goal we introduce the notion of primary effect: given an action a^I, the nominal effect q ∈ nominalEff(a^I) is a primary effect if at least one of the following conditions holds:
1. q ∈ pre(a_∞), i.e., q belongs to the global goal;
2. q is a service that a^I provides to a subset J of agents, i.e., there exists a causal link l : a^I →_q a^J where I ≠ J. Observe that a^I and a^J can be joint or simple actions.
In general, given an action a^I, primary(a^I) denotes the (possibly empty) set of primary effects provided by a^I. To determine the set of missing goals we adopt a conservative policy and assume that all the actions included in the plan diagnosis have actually failed. Therefore, the subset of missing goals that the agents in I can no longer achieve is missingGoals(a_t^I) = ⋃_{a ∈ PD(a_t^I)} primary(a). In principle, it is sufficient to achieve all the missing goals in an alternative way in order to reach the MAP's global goal G despite the occurrence of the failure. Therefore the missing goals may be the starting point for any plan recovery strategy.
Propagating the Plan Diagnosis. As said above, the failure of a_t^I may propagate through the plan, preventing the execution of actions assigned to different agents in the team (not limited to the dependency set I) and possibly causing the stop of the whole system. For this reason, we complement the notion of plan diagnosis with the set of threatened actions ThrActs(a_t^I), which could be indirectly affected by the failure of a_t^I (through a sequence of causal links). Intuitively, an action a' is threatened through a causal link l : a →_q a' when it is no longer guaranteed that the action a provides the service q; this may happen either because a has failed (i.e., it is included in the plan diagnosis PD(a_t^I)) or because a is in turn threatened.
Formally, the set ThrActs(a_t^I) is defined as: ThrActs(a_t^I) = {a' ∈ A | a_t^I ≺ a' and ∃ a causal link l : a →_q a', l ∈ CL, with a ∈ PD(a_t^I) or a ∈ ThrActs(a_t^I)}. Observe that the propagation is a form of communication among agents which conveys negative information; hence an agent does not wait indefinitely for services which will never be provided.
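Plan diagnosis, missing goals and failure propagation then reduce to simple set computations over the causal links; the function and argument names below are our own illustrative choices.

    def plan_diagnosis(a_failed, pending, depends_on):
        # EXP(a) = (pO(t) intersected with dependsOn(a)), plus a itself
        return (pending & depends_on(a_failed)) | {a_failed}

    def missing_goals(diagnosis, primary):
        return set().union(*(primary(a) for a in diagnosis))

    def threatened(diagnosis, causal_links):
        # least fixpoint: a2 is threatened if some provider a of a2 is failed or threatened
        thr, changed = set(), True
        while changed:
            changed = False
            for (a, q, a2) in causal_links:
                if (a in diagnosis or a in thr) and a2 not in thr and a2 not in diagnosis:
                    thr.add(a2)
                    changed = True
        return thr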
Running Example. Let us consider the blocks world example and assume that at time 4 the failure of the joint action ⟨7,12⟩ (whose dependency set is {A2, A3}) is detected. To determine whether this failure may be the consequence of a previous failure, one has to single out which actions in pO^{A2}(4) (pO^{A3}(4)) directly or indirectly provide ⟨7,12⟩ with a service. According to the definition of primary effect, the outcome of actions 5 and 10 must be observable, whereas the outcome of action ⟨6,11⟩ may not be. Let us suppose that the observations available at time 3 are not sufficient for inferring the outcome of action ⟨6,11⟩; thereby both the agents A2 and A3 include the joint action ⟨6,11⟩ in their sets of pending outcomes, i.e., pO^{A2}(4) = pO^{A3}(4) = {⟨6,11⟩}. Since ⟨6,11⟩ directly provides a service to action ⟨7,12⟩, the failure of the second action may be a consequence of the failure of the first one; thus the plan diagnosis includes both actions: PD(⟨7,12⟩) = {⟨6,11⟩, ⟨7,12⟩}. Given the plan diagnosis, the set of missing goals is missingGoals(⟨7,12⟩) = {AT(O2,T)}; moreover the propagation of the plan diagnosis highlights that the failure of action ⟨7,12⟩ affects not only the actions 8 and 13 of the agents A2 and A3 respectively, but also the action 4 of the agent A1. Note that, were the missing service AT(O2,T) provided in an alternative way, the agent A1 would be able to accomplish its task without any adjustment to its sub-plan.
5 Computational Issues
So far we have discussed in a declarative way a methodology for supervising a MAP; in this section we analyze some computational issues which may arise while implementing the approach.
Agent Trajectories. Since maintaining a set of trajectories from the initial time instant may be computationally expensive, we can limit the length of the agent trajectories by considering that the primary effects of an action a_t^I must always be observable at time t+1 (according to the minimal observability requirement). In order to make evident in the MAP which actions provide primary effects we introduce the notion of target action; in particular, an action a_t^I is said to be a target action iff primary(a_t^I) is not empty. Since the outcome of a target action is always observable, target actions can be considered as milestones in the plan and exploited for determining the temporal windows the agent trajectories must cover. In particular, under some requirements on the causal dependencies in the MAP, the following property holds.
Property 2 Given the agent i, the target action a_t^I, where i ∈ I, and the set pO^i(t), if each action in pO^i(t) provides (directly or indirectly) a_t^I with a service, the detection of the outcome of a_t^I allows one to infer the outcome of each action included in pO^i(t).
Property 2 states that, after the execution of a target action a_t^I, every agent i ∈ I can determine the outcome of all the actions in its set of pending outcomes pO^i(t). Moreover, in case no failure has been detected, every agent i can replace the trajectory with the belief state B_{t+1}^I, as it represents a synthesis of the past history up to time t+1. For example, in the MAP of Figure 1, the simple actions 4, 5, 8 and 13 are target actions, as well as the joint action ⟨7,12⟩.
Implementation and preliminary results. From a computational point of view, managing relations such as belief states and action models, which may have a huge dimension, may be very expensive. In order to implement both the monitoring and the diagnostic processes in an effective way, we have encoded the relations by means of the symbolic formalism of Ordered Binary Decision Diagrams (OBDDs); the relational operations have been mapped into standard operations on OBDDs. A prototype has been implemented in Java JDK 1.6 and exploits the JavaBDD package (http://sourceforge.net/projects/javabdd) for manipulating OBDDs. The approach has been tested in an office domain, and the robotic agents, simulated in a software environment, are implemented as threads running on the same Intel Pentium (1.86 GHz, 1 GB RAM, Windows XP OS). The preliminary results collected so far are encouraging: given MAPs involving up to 6 agents and 60 actions on average, the plan supervision (monitoring, agent diagnosis and failure propagation) performed by each agent requires on average 5 msec per time instant (the maximum absolute CPU time per instant being 30 msec); exploiting the target actions, an agent maintains a trajectory whose length is 5 instants in the worst case (3 on average).
6 Discussion and Conclusion
The problem of diagnosing a multi-agent plan has recently been addressed by exploiting methods and techniques developed within the MBD community, in particular for the diagnosis of distributed systems (see e.g., [7]). In [4] the authors consider multi-agent systems where, at each time instant, every agent chooses the most appropriate behavior to assume according to its beliefs; the authors introduce the notion of social diagnosis to explain the disagreements among cooperating agents. The approach presented in this paper has some resemblance to [8], where a distributed approach to monitoring and diagnosing the execution of a MAP is proposed. It assumes that each agent monitors and diagnoses the actions it is responsible for, where actions are atomic and are modeled as functions of their nominal behavior only. Since the anomalous behavior of the actions is not explicitly modeled, the monitoring cannot estimate faulty system states. In this paper, we have proposed a distributed approach for monitoring and diagnosing the execution of a multi-agent plan in a system which is only partially observable. Differently from the approach in [8], we adopt extended action models for capturing both nominal and anomalous execution; thereby the monitoring process we propose is able to estimate system states even after the occurrence of faults. Moreover, by exploiting the notion of dependency set introduced in [5], the approach uniformly deals with simple as well as joint actions. Finally, the paper has discussed a methodology based on the weak commitment strategy which is able to infer a plan diagnosis and to determine two important pieces of knowledge about the system status: the set of missing goals and the set of actions threatened by the plan diagnosis. These two sets play a critical role in any plan recovery strategy, since one has to find (if possible) an alternative way for reaching the global goal which is not achievable because of the action failure. In general such a recovery step requires a re-planning phase, where the set of missing goals contributes to reduce the search space since it clearly points out what must be achieved.
REFERENCES
[1] L. Birnbaum, G. Collins, M. Freed, and B. Krulwich, 'Model-based diagnosis of planning failures', in Proc. AAAI'90, pp. 318-323, (1990).
[2] J. S. Cox, E. H. Durfee, and T. Bartold, 'A distributed framework for solving the multiagent plan coordination problem', in Proc. AAMAS'05, pp. 821-827, (2005).
[3] R. M. Jensen and M. M. Veloso, 'OBDD-based universal planning for synchronized agents in non-deterministic domains', JAIR, 13, 189-226, (2000).
[4] M. Kalech and G. A. Kaminka, 'Towards model-based diagnosis of coordination failures', in Proc. AAAI'05, pp. 102-107, (2005).
[5] R. Micalizio and P. Torasso, 'On-line monitoring of plan execution: a distributed approach', Knowledge-Based Systems, 20(2), 134-142, (2007).
[6] R. Micalizio and P. Torasso, 'Plan diagnosis and agent diagnosis in multi-agent systems', volume 4733 of LNCS, pp. 434-446, (2007).
[7] Y. Pencolé and M.-O. Cordier, 'A formal framework for the decentralised diagnosis of large scale discrete event systems and its application to telecommunication networks', AI, 164, 121-170, (2005).
[8] C. Witteveen, N. Roos, R. van der Krogt, and M. de Weerdt, 'Diagnosis of single and multi-agent plans', in Proc. AAMAS'05, pp. 805-812, (2005).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-413
A hybrid approach to multi-agent decision-making Paulo Trigo 1 and Helder Coelho 2 Abstract. In the aftermath of a large-scale disaster, agents' decisions derive from self-interested (e.g. survival), common-good (e.g. victims' rescue) and teamwork (e.g. fire extinction) motivations. However, current decision-theoretic models are either purely individual or purely collective and find it difficult to deal with motivational attitudes; on the other hand, mental-state based models find it difficult to deal with uncertainty. We propose a hybrid, CvI-JI, approach that combines: i) collective 'versus' individual (CvI) decisions, founded on the Markov decision process (MDP) quantitative evaluation of joint-actions, and ii) the joint-intentions (JI) formulation of teamwork, founded on the belief-desire-intention (BDI) architecture of general mental-state based reasoning. The CvI-JI evaluation explores the performance improvement during the process of learning a coordination policy in a partially observable stochastic domain.
1
INTRODUCTION
The agents that cooperate to mitigate the effects of a large-scale disaster, e.g. an earthquake or a terrorist incident, make decisions that fall into two broad behavioral classes: the individual (ground) activity and the collective (institutional) coordination of such activity. Additionally, agents are motivated to form teams and jointly commit to goals that supersede their individual capabilities [8]. Despite such motivation, communication is usually insufficient to ensure that decision-making is supported by a single and coherent world perspective. The communication constraint causes the decision-making process to evolve simultaneously, both at the collective (common-good) and at the individual (self-interested) strata, sometimes in a conflicting manner. For instance, an ambulance searches for a policy to rescue a perceived civilian, while the ambulance command center, when faced with a global view of multiple injured civilians, searches for a policy to decide which ambulance should rescue which civilian. However, despite the intuition of a 2-strata decision process, research on multi-agent coordination often proposes a single model that amalgamates both strata and searches for optimality within that model. The approaches based on the multi-agent Markov decision process (MMDP) [1] are purely collective and centralized, thus too complex to coordinate while requiring unconstrained communication. The multi-agent semi-Markov decision process (MSMDP) [7], although decentralized, requires each individual agent to represent the whole decision space (states and actions), which may become very large, thus causing the individual policy learning to be slow and highly dependent on up-to-date information about the decisions of all other agents. The game-theoretic approach requires an agent to compute the utility of all combinations of actions executed by all other
1 GuIAA/LabMAg; DEETC, ISEL - Instituto Superior de Engenharia de Lisboa, Lisbon, Portugal, email: ptrigo@deetc.isel.ipl.pt
2 LabMAg; DI, FCUL - Faculdade de Ciências da Universidade de Lisboa, Lisbon, Portugal, email: hcoelho@di.fc.ul.pt
agents (payoff matrix), which is then used to search for Nash equilibria (where no agent increases his payoff by unilaterally changing his policy); thus, if several equilibria exist, agents may adhere to purely individual policies, never being pulled by a collective perspective. The multi-agent collective 'versus' individual (CvI) decision model [15], which is founded on the semi-Markov decision process (SMDP) framework, is neither purely collective nor purely individual and explores the explicit separation of concerns between both (collective and individual) decision strata while aiming to conciliate their reciprocal influence. Despite that, the CvI misses the agents' intentional stance toward team activity. On the other hand, the joint-intentions (JI) formulation of teamwork [5], based on the belief-desire-intention (BDI) mental-state architecture [9, 16], captures the agents' intentional stance, but misses the MDP domain-independent support for sequential decision-making in stochastic environments. Research on single-agent MDP-BDI hybrids formulates the correspondence between the BDI plan and the MDP policy concepts [11] and empirically compares each model's performance [10]. Multi-agent MDP-BDI hybrid models often exploit BDI plans to improve MDP tractability, and use MDP to improve BDI plan selection [13]. In this paper, instead of exploring the MDP-BDI policy-plan relation, we focus on the link between the BDI intention concept and the MDP temporally abstract action concept [12]. We see an intention as an action that executes for variable time periods and, when terminated, yields a reward to the agent. We extend this view to the joint-intentions concept and integrate the resulting formulation in the 2-strata multilevel hierarchical CvI decision model. Thus, the CvI-JI is a hybrid approach that combines the MDP temporally abstract action concept and the BDI mental-state architecture. The motivation for the hybrid CvI-JI model is to use the JI as a heuristic constraint that reduces the space of admissible MDP joint-actions, thus enabling the approach to scale to larger problems. The experiments show the CvI-JI learning improvement in a partially observable environment.
2
THE CvI DECISION MODEL
The premise of the CvI decision model is that the individual choice coexists with the collective choice and that coordinated behavior happens (is learned) from the prolonged relation (in time) of the choices exercised at both of those strata (individual and collective). Coordination is exercised on high level, hierarchically organized cooperation tasks, founded on the framework of Options [12], which extends the MDP theory to include temporally abstract actions (variable time duration tasks, whose execution resorts to primitive actions).
2.1
The framework of Options
Formally, an MDP is a tuple M ≡ ⟨S, A, Ψ, P, R⟩ modeling stochastic sequential decision problems, where S is a set of states, A
is a set of actions, Ψ ⊆ S × A is the set of admissible state-action pairs, R(s, a) is the expected reward when action a is executed at s, and P(s' | s, a) is the probability of being at state s' after executing a at state s. Given an MDP, an option o ≡ ⟨I, π, β⟩ consists of a set of states I ⊆ S from which the option can be initiated, a policy π for the choice of actions, and a termination condition β which, for each state, gives the probability that the option terminates when that state is reached. The computation of optimal value functions and optimal policies, π*, resorts to the relation between options and actions in a semi-Markov decision process (SMDP): 'any MDP with a fixed set of options is a SMDP' [12]. Thus, all the SMDP learning methods can be applied to the case where temporally extended options are used in an MDP. The options define a multilevel hierarchy where the policy of an option chooses among other lower-level options. At each time, the agent's decision is entirely among options; some persist for a single time step (primitive actions or one-step options), others are temporally extended (multi-step options).
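As a concrete but purely illustrative rendering, an option ⟨I, π, β⟩ and the standard SMDP Q-learning update over options might look as follows; this sketches the framework of [12] under our own naming, not the authors' code.

    from dataclasses import dataclass
    from typing import Callable, Set

    @dataclass(eq=False)                     # identity-based hash, so options can key the Q-table
    class Option:
        initiation: Set[int]                 # I: states from which the option may start
        policy: Callable[[int], int]         # pi: maps a state to a lower-level option/action
        beta: Callable[[int], float]         # termination probability for each state

    def smdp_q_update(Q, s, o, r, k, s_next, options, lr=0.1, gamma=0.95):
        # r is the cumulative discounted reward collected while o ran for k steps
        admissible = [o2 for o2 in options if s_next in o2.initiation]
        target = r + (gamma ** k) * max((Q.get((s_next, o2), 0.0) for o2 in admissible), default=0.0)
        q = Q.get((s, o), 0.0)
        Q[(s, o)] = q + lr * (target - q)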
2.2
The CvI collective and individual strata
The individual stratum is simply a set of agents Υ, each agent j ∈ Υ with its capabilities described as a hierarchy of options. The collective stratum is an agent (e.g. institutional) that cannot act on its own (its actions are executed by the individual stratum agents) and whose purpose is to coordinate the individual stratum. Formally, at the collective stratum, each action is defined as a collective option o⃗ = ⟨I_o⃗, π_o⃗, β_o⃗⟩, where o⃗ = ⟨o^1, ..., o^|Υ|⟩ represents the simultaneous execution of an option o^j ≡ ⟨I^j, π^j, β^j⟩ by each agent j ∈ Υ. The set of agents Υ defines an option space O⃗ ⊆ O^1 × ... × O^|Υ|, where O^j is the set of agent j's options and each o⃗ ∈ O⃗ is a collective option. The O⃗ decomposes into disjoint subsets O⃗_d, each with the collective options available at the d-th hierarchical level, where 0 < d ≤ D − 1, level 0 is the root and D is the hierarchy depth. A level-d policy π_d is implicitly defined by the SMDP M_d with state set S and action set O⃗_d. The M_d solution is the optimal way to choose the level-d individual policies which, in the long run, gathers the highest collective reward.
The CvI structure. Figure 1 illustrates the CvI structure, where the individual stratum (each agent j) is a 3-level hierarchy and thus the collective stratum (the two collective option instances o⃗_1 and o⃗_2) is a 2-level hierarchy; at each level, the set of diamond-ended arcs links the collective option to each of its individual policies.
[Figure 1 here: the CvI structure and inter-strata links, showing the 2-level collective stratum (collective options o⃗_1 = ⟨o^1_{p-2}, o^2_{p-3}⟩ and o⃗_2 = ⟨o^1_{p-2.1}, o^2_{p-3.2}⟩) linked by diamond-ended arcs to the 3-level option hierarchies of agent 1 and agent 2; superscript j refers to agent j, subscripts k and p-k refer to the k-th hierarchical level and the k tree path.]
The CvI dynamics. At each decision epoch, agent j gets the partial perception ω^j and decides-who-decides (d-w-d), i.e., agent j either: i) chooses an option o^j ∈ O^j, or ii) requests a decision from the collective stratum, which replies with an option o^j. The d-w-d process represents the importance that an agent credits to each stratum, defined as the ratio between the maximum expected benefits of choosing a collective and an individual decision. The expected benefit is given, at each hierarchical level d, by the value functions of the corresponding SMDP M_d. A threshold κ ∈ [0, 1] grades the focus between the collective and individual strata, thus enabling the (human) designer to specify diverse social attitudes, ranging from common-good (κ = 0) to self-interested (κ = 1) motivated agents. The CvI is a decentralized model, as each agent decides whether to make a decision by itself or to ask the collective stratum for a decision. For a comprehensive description of the CvI model see [15].
2.3
The design of CvI agents

Given the individual stratum set of agents Υ and a collective stratum agent υ, the design of a CvI instance is a 3-step process:
i. For each j ∈ Υ, specify O^j: the set of options and its hierarchical organization.
ii. For each j ∈ Υ, and from the agent υ perspective, identify the subset of cooperation tasks C^j ⊆ O^j: the options most effective for achieving coordination skills; the remaining options, J^j = O^j − C^j, represent purely individual tasks.
iii. For each j ∈ Υ, assign κ its regulatory value: κ = 0 is a common-good motivated agent, κ = 1 is a self-interested attitude, and κ ∈ ]0, 1[ embraces the whole spectrum between those two extreme decision motivations.
A simple, domain-independent design defines C^j (item ii above) as the multi-step options, hence J^j as the one-step options. Also, the highest hierarchical level(s) are usually effective for achieving coordination skills, as they escape from getting lost in lower-level details.
3
THE JOINT-INTENTIONS (JI) MODEL
The precise semantics of the intention concept varies across the literature. An intention is often taken to represent an agent's internal commitment to perform an action, where a commitment is specified as a goal that persists over time, and a goal (often named a desire) is a proposition that the agent wants to get satisfied; an intention can also represent a plan that an agent has adopted, or a state that the agent is committed to bring about [3, 4, 9, 16]. The framework of joint-intentions (JI) adopts the semantics of the 'intention as a commitment to perform an action' and extends it to describe the concept of teamwork. A team is described as a set of two or more agents collectively committed to achieve a certain goal [5]. The teamwork agents (those acting within a team) are expected to first form future-directed joint-intentions to act, keep those joint-intentions over time, and then jointly act. Formally, given a set of agents Υ, a team is described as a 2-tuple T ≡ ⟨α, g⟩, where the team members are represented by α ⊆ Υ and the team goal is g. In a team all members α are jointly committed to achieve the goal g while mutually believing that they are all acting towards that same goal. The teamwork terminates as soon as all members mutually believe that there exists at least one member that considers g finished (achieved, impossible to achieve, or irrelevant).
4
THE HYBRID CvI-JI DECISION MODEL
Given the CvI (cf. section 2) decision-theoretic model we regard the JI approach as a way to reduce the collective option space exponentially in the number of team members. For example, given Υ
P. Trigo and H. Coelho / A Hybrid Approach to Multi-Agent Decision-Making
agents, all with the same cooperation tasks C, there are at most |C|^|Υ| admissible options to choose from; during ⟨α, g⟩ teamwork, that number reduces to |C|^(|Υ|−|α|), and such a reduction motivates the formulation of the hybrid CvI-JI decision model. The next sections address two questions: i) how to specify, at design time, the JI using the CvI components, and ii) how to integrate, at execution time, the JI specification in the CvI decision process.
4.1
Specify JI using the CvI components
The teamwork goal. The JI describes teamwork in terms of goals which, in general, take multiple time periods until satisfaction. The CvI specifies decisions in terms of options, which are temporally abstract actions. Therefore, a (team) goal corresponds to a (team) option. Given a goal g described as a proposition ϕ, we formulate the corresponding option as ⟨I, π, β⟩, where I is the set of states where ¬ϕ is satisfied, β(s) = 1 if s ∈ (S − I) and β(s) = 0 otherwise, and π is any policy to satisfy ϕ (i.e., to terminate the option).
The teamwork commitment. The JI only requires agents to 'keep the joint-intentions commitment over time, and then jointly act'. It is up to the agent to decide when to terminate an ongoing task and effectively start acting to achieve the team goal. Thus, being jointly committed to a goal g does not imply immediate action toward that same goal g. For example, two ambulances may jointly commit to the same disaster while one of them is executing an action (e.g., delivering an injured civilian); as soon as the ongoing task is terminated, the ambulance starts acting towards the team goal. Therefore, our CvI-JI formulation assumes that, at each decision epoch, an agent may establish a JI while still acting to satisfy another intention (either individual or joint). Thus, at each instant, an agent may have an ongoing activity and also (at most) one established JI. Our approach enables teamwork decisions to be asynchronous; agents do not need to wait for each other's option termination before committing to a JI. Our hybrid CvI-JI option selection function distinguishes two teamwork stages: i) the 'ongoing task continue' stage, when an agent decides to establish a JI (becomes a team member) even though it is still executing some other task, and ii) the 'team option startup' stage, when a team member decides to start executing the team option. Given a team member j, a team option o and its initiation set I, we define the ongoing states I^{ongo:j} ⊂ I, where j is allowed to continue executing an ongoing task while jointly committed to achieve the team option o.
The teamwork reconsideration. The JI assumes that once an agent commits to a team goal he will fulfil that commitment. The CvI is a stochastic model, so we account for the possibility that an agent drops a previous commitment before actually starting to act as a team member. Given agent j, we define the commitment probability p^{commit:j} that j meets his engagement.
The teamwork design component. The CvI-JI combines all the above (team option, ongoing set and commitment probability) into a 'teamwork design component' tdc^j ≡ ⟨o^j, I^{ongo:j}, p^{commit:j}⟩, which describes, for agent j ∈ Υ and team option o^j ∈ O^j, the set of states I^{ongo:j} where the agent may continue an ongoing task before starting to execute o^j, and the probability p^{commit:j} of effectively committing to o^j. The design of the tdc structure assumes that: i) a team option is always represented in more than one agent, ii) a tdc^j is specified for each team option that j may get committed to, and iii) the I^{ongo:j} specification considers j's local view of the environment. The CvI-JI model describes, via tdc, the domain-dependent teamwork knowledge which contributes to reduce the collective option space. Thus, the CvI integrates the JI as a heuristic filter (at the collective stratum) that reifies the (human) designer's domain knowledge. The next section integrates the heuristic filter in the decision process.
415
tum) that reifies the (human) designer domain knowledge. The next section integrates the heuristic filter in the decision process.
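To make these definitions concrete, the sketch below encodes the option triple ⟨I, π, β⟩ and the teamwork design component as plain Python data structures. This is a minimal illustration under our own naming conventions; the paper prescribes no particular implementation.

from dataclasses import dataclass
from typing import Any, Callable, Set

State = Any  # the concrete state encoding is domain-dependent (hypothetical)

@dataclass
class Option:
    # A temporally abstract action <I, pi, beta>.
    initiation: Set[State]                 # I: states where the option may start
    policy: Callable[[State], Any]         # pi: maps a state to a primitive action
    termination: Callable[[State], float]  # beta: probability of terminating in s

def goal_to_option(holds, states, policy):
    # Option for a goal proposition phi (passed as the predicate `holds`):
    # I = states where phi is NOT satisfied; beta(s) = 1 iff phi holds in s.
    return Option(initiation={s for s in states if not holds(s)},
                  policy=policy,
                  termination=lambda s: 1.0 if holds(s) else 0.0)

@dataclass
class TDC:
    # Teamwork design component tdc_j = <o_j, I_ongo:j, p_commit:j>.
    agent: str            # j, the team member this component refers to
    team_option: Option   # o_j
    ongoing: Set[State]   # I_ongo:j: states where j may keep its ongoing task
    p_commit: float       # probability that j honours the commitment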
4.2
Integrate JI in the CvI decision process
The integration of the JI in the CvI decision process is designed, at the collective stratum, by modifying the CvI option selection process, which chooses, at each decision epoch, a level d collective option, o^d, given the perceived state, s, and the set of agents, B, that request a collective stratum decision. Algorithm 1 shows the option selection function, CHOOSEOPTION, and the inclusion of the two subroutines, APPLYFILTER-JI (cf. line 3) and UPDATEFILTER-JI (cf. line 5), that implement the CvI-JI integration.

Algorithm 1 Choose option at level d of CvI collective stratum.
1 function CHOOSEOPTION( s, O^d, π^d, B )
2   Õ^d ← getAdmissibleOptionSet( s, O^d, B )
3   Õ^d ← APPLYFILTER-JI( s, Õ^d, B )
4   o^d ← applyPolicy( s, Õ^d, π^d )
5   UPDATEFILTER-JI( o^d, B )
6   return o^d
7 end function

The getAdmissibleOptionSet function (cf. algorithm 1, line 2) is the same as in CvI; it evaluates I_o of each collective option, o^d, and returns the set, Õ^d, of admissible options (given the perceived state s and the set of agents, B, that requested a level d collective stratum decision). The applyPolicy function (cf. algorithm 1, line 4) chooses the next collective option to execute; the policy, π^d, is either predefined or follows some explore-and-exploit reinforcement learning method. We followed the learning approach and implemented an ε-greedy policy, which picks: i) a random admissible collective option, o^d ∈ Õ^d, with probability ε, and ii) otherwise, the collective option with the highest estimated action value at the current state, s, already considering the JI commitments (i.e., picks arg max_{o^d ∈ Õ^d} Q(s, o^d)).

Algorithm 2, the APPLYFILTER-JI function, shows the integration of the JI commitments through the manipulation of the tdc instances. The set of goals that call for teamwork effort is represented by the global TDC set (cf. line 3), which is initially empty. The first part (cf. lines 2 to 10, algorithm 2) determines the TDC′ set of admissible tdc from agents that requested a level d collective stratum decision. The teamwork reconsideration concept (cf. section 4.1) is represented by the possibility of discarding a previously established and currently admissible JI (cf. algorithm 2, line 5). The second part (cf. lines 11 to 16, algorithm 2) restricts the collective options to those that are compatible (all o^d components match) with the team options of all tdc ∈ TDC′; the remaining collective options are discarded.

Algorithm 2 Apply JI to reduce collective options' admissible set.
1 function APPLYFILTER-JI( s, Õ^d, B )
2   TDC′ ← ∅
3   for each tdc ∈ TDC do
4     if ( s[ tdc.j ] ∉ tdc.I_ongo:j ) ∧ ( tdc.j ∈ B ) then
5       if random ≤ tdc.p_commit:j then
6         TDC′ ← TDC′ ∪ { tdc }
7       end if
8       TDC ← TDC − { tdc }
9     end if
10  end for
11  Õ′^d ← ∅
12  for each o^d ∈ Õ^d do
13    if o^d is compatible with TDC′ then
14      Õ′^d ← Õ′^d ∪ { o^d }
15    end if
16  end for
17  return Õ′^d        ! Õ′^d = Õ^d when TDC′ = ∅
18 end function

Algorithm 3, the UPDATEFILTER-JI function, describes the strategy used, at each decision epoch, to select a team goal and to find the set of agents that are available to commit to that team goal (i.e., select a goal, g, and find the set, α ⊆ Υ, of agents available to form a team T ≡ ⟨α, g⟩). The implemented strategy simply selects the first admissible team goal and assumes that each agent "is available to commit to a team goal as long as he is not already a team member". The TDC set is updated (cf. algorithm 3) according to that strategy, for all agents, at each decision epoch.

Algorithm 3 Strategy to update the set, TDC, containing the selected team goal and the agents available for a JI.
1 function UPDATEFILTER-JI( o^d, B )
2   teamOption ← false
3   for each tdc ∈ DTDC do        ! DTDC ≡ designed tdc elements
4     if ¬ teamOption then
5       o ← tdc.o                 ! o ≡ a team option
6     end if
7     for each ag ∈ Υ do
8       if ( o^d[ ag ] = o ) ∧ ( o^d[ tdc.j ] = o ) ∧
9          ( ag ∈ B ) ∧ ( ag ≠ tdc.j ) then
10        TDC ← TDC ∪ { tdc }
11        if ¬ teamOption then
12          teamOption ← true
13        end if
14      end if
15    end for
16  end for
17 end function
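As an illustration, a minimal Python rendering of Algorithm 1 combined with the ε-greedy applyPolicy follows. The Q-table and the filter subroutines are passed in as callables; their concrete forms are not fixed by the paper.

import random

def choose_option(s, options, Q, epsilon, B,
                  get_admissible, apply_filter_ji, update_filter_ji):
    # Algorithm 1: pick a level-d collective option for the requesting agents B.
    admissible = get_admissible(s, options, B)      # line 2
    admissible = apply_filter_ji(s, admissible, B)  # line 3: JI heuristic filter
    admissible = list(admissible)                   # assumed non-empty here
    if random.random() < epsilon:                   # explore with probability eps
        o = random.choice(admissible)
    else:                                           # exploit: highest action value
        o = max(admissible, key=lambda opt: Q.get((s, opt), 0.0))
    update_filter_ji(o, B)                          # line 5
    return o                                        # line 6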
5
EXPERIMENTS AND RESULTS
We propose the teamwork taxi coordination problem, which extends the previous taxi coordination problem [6, 15] and enforces teamwork
behavior, as follows: "passengers appear at an origin site and want to be transported to a destination site; at some predefined sites, passengers only accept to be transported all together (as in a family)"; those sites are named teamwork sites, as taxis must work as a team to transport all the passengers at the same time. The experimental setup is given by: i) a 5 × 5 grid, ii) 4 sites, S_b = { b1, b2, b3, b4 }, iii) 2 taxis, S_t = { t1, t2 }, iv) 3 passengers, S_psg = { psg_1, psg_2, psg_3 }, and v) a single teamwork site, b_tw ∈ S_b. The primitive actions available to each taxi are pick, put and move( m ), where m ∈ { N, E, S, W } are the cardinal directions; the wait action supports the agents' synchronization (at teamwork sites). The problem is partially observable, as a taxi does not perceive the other taxis' locations; it is collectively observable, as the combination of all individual observations determines a single world state. We defined 3 different CvI-JI configurations, each assigning to all j ∈ Υ the same p_commit:j ∈ { 0, 1/2, 1 } value. Therefore, we define: i) never JI, when p_commit:j = 0, ii) sometimes JI, when p_commit:j = 1/2, and iii) always JI, when p_commit:j = 1. The goal of the individual stratum is to learn how to execute tasks (e.g., how to navigate to a site and when to pick up a passenger). The
goal of the collective stratum is to learn to coordinate the individual tasks so as to minimize the resources (time) needed to satisfy the passengers' needs. The learning of the policy at the collective stratum occurs simultaneously with the learning of each agent's policy at the individual stratum. The results of the experiments (cf. section 5.4) show the performance improvement that the hybrid CvI-JI brings to the collective stratum learning process, when compared with the pure CvI (i.e., never JI) approach.
5.1
JI specification
The JI is specified as a set of predefined tdc instances. The tdc instance is defined, for each taxi (agent) t_j ∈ S_t, as ⟨b_tw, I_ongo:tj, p_commit:tj⟩. The b_tw is the teamwork site. The I_ongo:tj specifies the following ongoing state set: i) the taxi, t_j, already transports a passenger, or ii) there is a passenger to pick up at t_j's current location. The p_commit:tj is assigned the value 0, 1/2 or 1, respectively, for the never JI, sometimes JI or always JI experiment configuration.
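In code, the JI specification of this experiment amounts to one tdc instance per taxi. The sketch below represents I_ongo:tj as a predicate over the taxi's local observation rather than an explicit state set; the observation attribute names are hypothetical stand-ins for the two informal conditions above.

def make_taxi_tdcs(taxis, team_option, p_commit):
    # One tdc per taxi t_j: <b_tw team option, I_ongo:tj, p_commit:tj>.
    # p_commit is 0, 0.5 or 1 for the never/sometimes/always JI configurations.
    def ongoing(obs):
        # (i) already transporting a passenger, or
        # (ii) there is a passenger to pick up at the current location.
        return obs.carrying_passenger or obs.passenger_at_location
    return {t: {"team_option": team_option,
                "ongoing": ongoing,
                "p_commit": p_commit}
            for t in taxis}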
5.2
Individual stratum specification
Each taxi's observation, ω = ⟨x, y, psg_1, psg_2, psg_3⟩, is its (x, y)-position plus the passengers' state, psg_i = ⟨loc_i, dest_i, orig_i, status⟩, where loc_i ∈ S_b ∪ S_t ∪ { t1_acc, t2_acc } (tj_acc means that taxi t_j accomplished the delivery), dest_i ∈ S_b, and orig_i ∈ S_b. Therefore, the state space perceived by each taxi is described by a total of 5 × 5 × (8 × 4 × 4)^3 = 52,428,800 states. The taxi capability is a 3-level hierarchy, where root is the multi-step level-zero option, navigate( b ) is the multi-step level-one option, pick, put and wait are the one-step level-one options, and move( m ) are the level-two one-step options (one for each navigate( b )); a total of 5 multi-step options and 7 one-step actions. The taxi is not equipped with any explicit definition of its goal; also, it does not hold any internal representation of the maze grid. The taxi j decision is based solely on the information available at each decision epoch: i) its perception, ω_j, and ii) the immediate reward provided by the last executed one-step action. The immediate taxi rewards are: i) 20 for delivering a passenger, ii) −10 for an illegal pick or put, iii) −12 for any illegal move action in a teamwork site, and iv) −1 for any other action, including moving into walls and picking more than one passenger in a teamwork site.
5.3
Collective stratum specification
The collective stratum perceives s = ⟨t_1, t_2, psg_1, psg_2, psg_3⟩, which combines all the individual stratum partial observations, where t_j is the (x, y)-position of agent j. Therefore, the collective stratum state space is described by (5 × 5)^2 × (8 × 4 × 4)^3 = 1,310,720,000 states. The collective stratum chooses mainly among multi-step options, so we specify: i) C = { navigate( b ) for all b ∈ S_b } ∪ { wait } ∪ { indOp }, and ii) J = { pick, put }, where indOp is an implicit option representing J at the collective stratum. The indOp option gives rise to a ping-pong decision scenario between strata, whenever an agent chooses to "request a collective stratum decision" and the collective stratum replies: "decide yourself, but consider only your purely individual tasks". Hence, the decision is forwarded back to the agent (via indOp), raising a second opportunity for the agent "to choose an option in J". The ping-pong effect, while giving a second decision opportunity, does not increase the communication between strata and reduces the individual decision space to |J|. We assume that agents contribute equitably to the current state. Thus, the collective reward is the sum of the rewards provided to each agent; our purpose is to maximize the long-run collective reward.
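The two state-space counts follow directly from the component sizes (8 passenger locations, 4 destinations and 4 origins per passenger, per the paper's count); a quick arithmetic check:

passenger = 8 * 4 * 4                        # configurations per passenger
individual = 5 * 5 * passenger ** 3          # one taxi (x, y)-position
collective = (5 * 5) ** 2 * passenger ** 3   # both taxi positions combined
assert individual == 52_428_800
assert collective == 1_310_720_000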
5.4
The CvI-JI experimental evaluation
Our experiments evaluate the influence of the JI integration in the CvI model by measuring the learning process performance (quantified as the collective stratum cumulative reward). An episode starts with 2 passengers in the teamwork site and the third passenger in another site; the episode terminates as soon as all passengers reach their destinations; each experiment executes for 700 episodes. Policy learning follows the SMDP Q-learning [2, 12] approach with the ε-greedy strategy (cf. section 4.2). Each experiment starts with ε = 0.15 and, after the first 100 episodes, ε decays by 0.004 every 50 episodes. We ran 3 experiments, one for each CvI-JI configuration. Figure 2 shows that the never JI configuration exhibits the worst performance: about 6.5% worse than always JI and about 12% worse than sometimes JI; the difference remains almost uniform throughout the whole experiment. The sometimes JI configuration reveals an unexpected behavior: around episode 300, it starts to outperform always JI.
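A sketch of the SMDP Q-learning update used at the collective stratum, with the ε schedule just described; the learning rate and discount factor are placeholders of our own, and the exact timing of the decay steps is our reading of the schedule.

def smdp_q_update(Q, s, o, reward, tau, s_next, admissible_next,
                  alpha=0.1, gamma=0.95):
    # SMDP Q-learning: `reward` is the reward accumulated over the tau time
    # steps the option o lasted; the discount is raised to the duration tau.
    best_next = max((Q.get((s_next, o2), 0.0) for o2 in admissible_next),
                    default=0.0)
    q = Q.get((s, o), 0.0)
    Q[(s, o)] = q + alpha * (reward + gamma ** tau * best_next - q)

def epsilon_schedule(episode):
    # eps = 0.15 for the first 100 episodes, then decays by 0.004
    # every 50 episodes, floored at 0.
    if episode < 100:
        return 0.15
    return max(0.0, 0.15 - 0.004 * ((episode - 100) // 50 + 1))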
Figure 2. The influence of JI on the performance of the learning process (collective stratum cumulative reward per episode, over 700 episodes, for the never JI, always JI and sometimes JI configurations).
An insight from these results is that the JI teamwork heuristic is exploited by the collective stratum without compromising the exploration (search for novelty) that is required by the learning process. Somewhat unexpectedly, being able not to fulfil a previous teamwork commitment (cf. sometimes JI) enables the agents to find improvements over the fully reliable commitment attitude (cf. always JI). The CvI-JI enables a continuous (non-interrupted) flow of decision-making and task execution activities. Such an asynchronous process opens a time window between the instant the agent establishes a JI and the instant the agent actually begins acting to achieve the JI. The possibility of reconsidering a commitment, just before actually starting to act, explores alternatives to teamwork. The ability to drop a pre-established JI makes it possible to pursue individual activity in states where the heuristic approach (JI) would suggest a teamwork approach. The results (cf. figure 2) show that the exploration of individual policies, combined with the heuristic teamwork approach, improves the process of learning a coordination policy.

The experiment's dimension. In this experiment, an agent perceives 52,428,800 states, and the collective stratum contains 1,310,720,000 states. Each decision considers 6 individual options and 36 collective options. Hence, this experimental world captures some of the complexity of the decision-making process that aims to achieve coordinated behavior in a disaster response environment.
6
CONCLUSIONS AND FUTURE WORK
In this paper, we identified a series of relations between the 2-strata decision-theoretic CvI approach and the joint-intentions (JI) mental-state-based reasoning. We extended CvI by exploring the algorithmic aspects of the CvI-JI integration. Such integration represents our novel contribution to a multi-agent hybrid decision model within a reinforcement learning framework. The initial experimental results of the CvI-JI model sustain the hypothesis that the JI heuristic reduction of the action space improves the process of learning a policy to coordinate multiple agents. An interesting conclusion is that, taking into account our preliminary results, the teamwork reconsideration concept suggests investigating the hypothesis that not fulfilling a commitment (at a specific state) is an opportunity to find an alternative path that, in the long run, is globally better than teamwork. This work describes the ongoing research steps to construct agents that participate in the decision-making process that occurs in the response to a large-scale disaster. Future work will apply the CvI-JI in a simulated disaster response environment [8] and will explore teamwork (re)formation strategies [14] at the collective stratum.

ACKNOWLEDGEMENTS This research was partially supported by LabMAg FCT R&D unit.

REFERENCES
[1] Craig Boutilier, 'Sequential optimality and coordination in multiagent systems', in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), pp. 478–485, (1999).
[2] Steven Bradtke and Michael Duff, 'Reinforcement learning methods for continuous-time Markov decision problems', in Proceedings of Advances in Neural Information Processing Systems, volume 7, pp. 393–400. The MIT Press, (1995).
[3] Michael Bratman, 'What is intention?', in Intentions in Communication, 15–31, MIT Press, Cambridge, MA, (1990).
[4] Philip Cohen and Hector Levesque, 'Intention is choice with commitment', Artificial Intelligence, 42(2–3), 213–261, (1990).
[5] Philip Cohen and Hector Levesque, 'Teamwork', Noûs, Cognitive Science and Artificial Intelligence, 25(4), 487–512, (1991).
[6] Thomas Dietterich, 'Hierarchical reinforcement learning with the MAXQ value function decomposition', Journal of Artificial Intelligence Research, 13, 227–303, (2000).
[7] Mohammad Ghavamzadeh, Sridhar Mahadevan, and Rajbala Makar, 'Hierarchical multi-agent reinforcement learning', Autonomous Agents and Multi-Agent Systems, 13(2), 197–229, (2006).
[8] Hiroaki Kitano and Satoshi Tadokoro, 'RoboCup Rescue: A grand challenge for multi-agent systems', AI Magazine, 22(1), 39–52, (2001).
[9] Anand Rao and Michael Georgeff, 'BDI agents: From theory to practice', in Proceedings of the First International Conference on Multiagent Systems, pp. 312–319, San Francisco, USA, (1995).
[10] Martijn Schut, Michael Wooldridge, and Simon Parsons, 'On partially observable MDPs and BDI models', in Foundations and Applications of Multi-Agent Systems, volume 2403 of LNCS, 243–260, Springer-Verlag, (2002).
[11] Gerardo Simari and Simon Parsons, 'On the relationship between MDPs and the BDI architecture', in Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-06), pp. 1041–1048, Hakodate, Japan, (2006). ACM Press.
[12] Richard Sutton, Doina Precup, and Satinder Singh, 'Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning', Artificial Intelligence, 112(1–2), 181–211, (1999).
[13] Milind Tambe, E. Bowring, H. Jung, Gal Kaminka, R. Maheswaran, J. Marecki, P. Modi, Ranjit Nair, S. Okamoto, J. Pearce, P. Paruchuri, David Pynadath, P. Scerri, N. Schurr, and Pradeep Varakantham, 'Conflicts in teamwork: Hybrids to the rescue', in Proceedings of the 4th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-05), pp. 3–10. ACM Press, (2005).
[14] Paulo Trigo and Helder Coelho, 'The multi-team formation precursor of teamwork', in Progress in Artificial Intelligence, EPIA-05, volume 3808 of LNAI, 560–571, Springer-Verlag, (2005).
[15] Paulo Trigo, Anders Jonsson, and Helder Coelho, 'Coordination with collective and individual decisions', in Advances in Artificial Intelligence, IBERAMIA/SBIA 2006, volume 4140 of LNAI, 37–47, Springer-Verlag, (2006).
[16] Michael Wooldridge, Reasoning About Rational Agents, chapter Implementing Rational Agents, The MIT Press, 2000.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-418
Coalition Formation Strategies for Self-Interested Agents
Thomas Génin and Samir Aknine 1
Abstract. Coalition formation is a major research issue in multiagent systems in which the agents are self-interested. In these systems, agents have to form groups in order to achieve common goals which they are not able to achieve individually. A coalition formation mechanism requires two definition levels: first, agents need a common protocol to reach an agreement; second, individual strategies are required to make efficient proposals. Both issues are addressed in this paper. First, we propose a two-phase decentralized protocol that allows agents to interact directly through message passing. Second, we propose some strategies which allow agents to make clever proposals using the information that has already been collected from other agents. The experimental evaluation shows that the proposed mechanism allows agents to efficiently form coalitions and that the strategies bring real improvements to the coalition search process.
1
INTRODUCTION
In a multi-agent system where independent agents evolve autonomously, guided by their specific objectives, it is often difficult, or even impossible, for an agent to reach these objectives individually. This is mainly due to a lack of resources or expertise. One way to overcome this difficulty is to share the agents' resources and capabilities, which enables the agents to jointly reach their individual objectives. Such a temporary grouping dedicated to the achievement of common goals is called a coalition, and the process that allows agents to build these coalitions is coalition formation. In this article, we assume that agents are self-interested. These agents have several tasks to perform, and it is often impossible for a single agent to perform them alone. In addition, these tasks can be combined together. The preferences of each agent for the achievement of task combinations are represented by a utility function that is, naturally, different for each agent. An agent does not know the preferences, i.e., the utility functions, of the other agents of the system. This means that, initially, agents are not able to estimate which agents are likely to form a coalition for the achievement of a specific combination of tasks. In this article we propose a decentralized coalition formation protocol for self-interested agents. This protocol allows agents to interact directly through message passing. Then we propose some strategies that allow agents to select combinations of tasks cleverly, targeting potentially interested agents. These strategies are based on the computation of characteristic vectors and on logic rule generation. An experimental evaluation of the mechanism has been carried out. The results obtained show that our mechanisms allow agents to efficiently form their coalitions. Moreover, using information about the potential preferences of other agents in the proposed strategies speeds up the coalition formation process.
1 LIP6, Université Paris 6, Paris, France. Email: thomas.genin@lip6.fr, samir.aknine@lip6.fr
This article is structured as follows. Section 2 presents the context of this work. Section 3 presents the protocol and the different strategies we propose. Section 4 describes the experimental evaluation of this mechanism and discusses the obtained results. Section 5 analyzes the main related works. Finally, section 6 concludes our work.
2
CONTEXT AND PROBLEM DESCRIPTION
We consider a set of n agents and a set of m tasks. Agents want to perform combinations of tasks. Each agent a_i has a utility function u_i defined on the combinations of tasks and does not know the utility functions of the other agents a_j. The aim of each agent is to maximize its own utility. As agents are not able to perform their tasks alone, they try to form coalitions with other agents. These coalitions will then perform the required tasks. In the following sections, N = {a_1, . . . , a_n} is defined as the set of agents in the system and M = {T_1, . . . , T_m} is the set of all tasks. A is a subset of N and comb is a subset of M: A ⊂ N and comb ⊂ M. To form coalitions, agents have to negotiate and exchange several messages before reaching agreements, so a shared protocol is required to perform the interaction. Moreover, agents do not initially know the preferences of the other agents. Consequently, these agents need some strategies to select and interpret relevant proposals of task combinations and to target potentially interested agents.

Definition 1 (Coalition) A coalition C is a pair comprised of a set of agents A and a combination of tasks comb performed by A: ⟨A, comb⟩, with A ⊂ N and comb ⊂ M.

A combination of tasks is represented by a binary vector of size m, where m is the total number of tasks in the system. A task included in the combination is represented by 1 and a task that is not included is represented by 0.
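For instance, the binary encoding of Definition 1 can be written as a one-line helper (0-based task indices are our own convention):

def encode(comb, m):
    # Binary vector of size m: 1 iff the task (by index) is in the combination.
    return [1 if t in comb else 0 for t in range(m)]

# tasks {T1, T3} among m = 6 tasks, using 0-based indices 0 and 2:
assert encode({0, 2}, 6) == [1, 0, 1, 0, 0, 0]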
Application example. To clarify our problem, we consider an example of buyers connected to a virtual bookshop in order to buy books. In this example, group-buying allows buyers to reduce the unit price of the books [12]. However, the buyers do not have the same preferences about the books they want to buy. The interest of coalition formation in this problem is to allow agents to organize themselves to benefit from wholesale prices. Let us assume a set of 3 buyers, a_1, a_2 and a_3, interested in getting books and connected to a virtual bookshop. Six different items are available in this bookshop: a comic book o_1, a historic book o_2, a travel book o_3, a science fiction novel o_4, a philosophical essay o_5 and a cookery book o_6. Each potential buyer has $10. Each book is worth a unit price of $6, but there is also a wholesale price: the unit price is $5 when two copies of the same book are bought. Therefore buyers try to
form groups when they purchase the same books. We assume that a_1 is interested in o_6 but not in o_1, a_2 is interested in o_2 but not in o_3, and a_3 is interested in o_5 but refuses o_6. The j-th preferred combination of agent i is noted comb_i^j. The 5 most preferred combinations of the 3 agents are comb_i^j, i ∈ [1, 3] and j ∈ [1, 5]. For instance, for a_1: (o_5, o_6) ≻ (o_4, o_6) ≻ (o_3, o_6) ≻ (o_2, o_6) ≻ (o_6); for a_2: (o_1, o_2) ≻ (o_1, o_6) ≻ (o_1, o_4) ≻ (o_1, o_5) ≻ (o_1); and for a_3: (o_4, o_5) ≻ (o_3, o_5) ≻ (o_2, o_5) ≻ (o_1, o_5) ≻ (o_5). Given these preferences, one example of a potential coalition for group book-buying could be the coalition formed by a_2 and a_3 for purchasing o_1 and o_5, together with the coalition of a_1 alone for purchasing o_6. We notice here that the main difficulty for the agents is to find other agents potentially interested in a particular combination of books when they do not have any information about their preferences at this stage of the interaction.
3
PROTOCOL AND STRATEGIES
In this section, we propose a coalition formation protocol and different strategies for making proposals, and we describe the decision making method used by agents.
3.1
Protocol
The coalition formation protocol is based on message passing between the initiator agent of a proposal and its solicited agents. The initiator agent manages the negotiation concerning its proposal. Any agent in the system can make a proposal. When an agent chooses an interesting task combination and a group of solicited agents, the interaction follows the protocol below:
1. The initiator agent sends its proposal to the solicited agents.
2. The solicited agents reply to the initiator agent, indicating whether they are interested in the proposal or not.
3. (a) If a solicited agent refuses the proposal, the initiator sends a withdrawal of its proposal to the solicited agents and the interaction is cancelled. (b) Otherwise the initiator sends a confirmation request to the solicited agents.
4. The solicited agents either confirm or do not confirm their involvement in the coalition.
5. (a) If a solicited agent refuses to confirm, the initiator sends a withdrawal of its proposal to the solicited agents and the interaction is cancelled. (b) Otherwise the initiator confirms the formation of the coalition by sending a message to the solicited agents.
The protocol is divided into two phases: the acceptance phase (1, 2, 3) and the confirmation phase (3, 4, 5). The confirmation phase acts as a contract between the initiator agent and the solicited agents. Any agent should always be able to achieve all the proposals that it has confirmed. Moreover, the first answer to a proposal (acceptance phase) is only informative. To sum up, an agent may accept as many proposals as it wants, but it confirms only the proposals it can comply with, depending on its resources or expertise.
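The initiator's side of the protocol can be summarised in a few lines; the send/receive primitives below are hypothetical placeholders for the underlying message-passing layer.

def run_negotiation(proposal, solicited, send, receive_all):
    # Two-phase protocol from the initiator's point of view.
    send(solicited, ("PROPOSE", proposal))              # step 1
    answers = receive_all(solicited)                    # step 2
    if not all(a == "ACCEPT" for a in answers):         # step 3(a)
        send(solicited, ("WITHDRAW", proposal))
        return False
    send(solicited, ("CONFIRM_REQUEST", proposal))      # step 3(b)
    confirmations = receive_all(solicited)              # step 4
    if not all(c == "CONFIRM" for c in confirmations):  # step 5(a)
        send(solicited, ("WITHDRAW", proposal))
        return False
    send(solicited, ("FORMED", proposal))               # step 5(b)
    return True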
3.2
Strategies
In this section we propose two different strategies for the selection of task combinations and solicited agents when making proposals. Our strategies are based on two vectors: a characteristic vector of acceptances and a characteristic vector of refusals, which represent, respectively, all the accepted proposals and all the refused proposals.

Definition 2 (Acceptance vector) For each agent a_i, the acceptance vector VA_i^j of another agent a_j is a vector of size m, representing all the task combinations proposed by a_i and accepted by a_j, and all the task combinations proposed by a_j to a_i. For each task t, VA_i^j(t) represents the ratio of these task combinations containing t.

In the same way, the vector of refusals VR_i^j represents all the task combinations proposed by a_i and refused by a_j. The distance d(comb, V) between a combination comb and a characteristic vector V is defined as d(comb, V) = (Σ_{i=1}^{m} (comb(i) − V(i))²)^{1/2}.
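Definition 2 and the distance translate directly into code; the check at the end reproduces a refusal vector from the example further below.

def characteristic_vector(combinations, m):
    # Per-task ratio of the given combinations that contain the task.
    n = len(combinations)
    if n == 0:
        return [0.0] * m
    return [sum(c[t] for c in combinations) / n for t in range(m)]

def distance(comb, v):
    # d(comb, V) = (sum_i (comb(i) - V(i))^2)^(1/2)
    return sum((b - x) ** 2 for b, x in zip(comb, v)) ** 0.5

# a_2's two proposals refused by a_1 and a_3 (example 1):
assert characteristic_vector([[1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]], 6) \
       == [1.0, 0.5, 0.0, 0.0, 0.0, 0.5]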
3.2.1
Characteristic vectors selection strategy (CV Strategy)
For each agent a_i, all the combinations accepted or refused by an agent a_j, and all the proposals of a_j to a_i, are represented by the two vectors VA_i^j and VR_i^j. In order to generate a proposal, an agent proceeds as follows:
• a_i generates a list of task combinations and first selects the q preferred ones, {comb_i^1, . . . , comb_i^q}.
• a_i computes the distances d(comb_i^k, VA_i^j) and d(comb_i^k, VR_i^j) between each combination comb_i^k and the acceptance and refusal vectors of each agent a_j. Then it computes the ratio of these distances, d(comb_i^k, VA_i^j) / d(comb_i^k, VR_i^j).
• For each comb_i^k, a_i looks for the set of agents that minimizes the mean (moy) of these ratios.
• At the end, a_i selects the combination that minimizes this mean:
c_i = arg min_{k ∈ [1,q]} moy_{j ∈ [1,n]} [ d(comb_i^k, VA_i^j) / d(comb_i^k, VR_i^j) ]

Example 1 (continued) Let us illustrate these strategies using the example presented in section 2. We assume that a_2 has proposed the combinations comb_2^1 = [1, 1, 0, 0, 0, 0] and comb_2^2 = [1, 0, 0, 0, 0, 1] to a_1 and a_3. We also assume that a_1 and a_3 have refused them. At the same time, a_2 received the proposals comb_1^1 = [0, 0, 0, 0, 1, 1] and comb_1^2 = [0, 0, 0, 1, 0, 1] from a_1, and comb_3^1 = [0, 0, 0, 1, 1, 0] and comb_3^2 = [0, 0, 1, 0, 1, 0] from a_3, which it refused. From the point of view of a_2, VR_2^1 and VR_2^3 are computed using comb_2^1 and comb_2^2, which have been refused by both a_1 and a_3: VR_2^1 = VR_2^3 = [1, 0.5, 0, 0, 0, 0.5]. Using the received proposals, a_2 computes VA_2^1 = [0, 0, 0, 0.5, 0.5, 1] and VA_2^3 = [0, 0, 0.5, 0.5, 1, 0]. When a_2 intends to make a new proposal, it pre-selects the two combinations comb_2^3 = [1, 0, 0, 1, 0, 0] and comb_2^4 = [1, 0, 0, 0, 1, 0]. It calculates the distances between these combinations and the four characteristic vectors VA_2^1, VA_2^3, VR_2^1 and VR_2^3 (cf. table 1). Finally, agent a_2 selects the combination and the agent that minimize the ratio of distances (cf. table 2). In this example, these are combination comb_2^4 and agent a_3.

Table 1. Distances between proposed combinations and characteristic vectors (example 1).
           VA_2^1  VA_2^3  VR_2^1  VR_2^3
comb_2^3    1.58    1.58    1.22    1.22
comb_2^4    1.58    1.22    1.22    1.22

Table 2. Ratio of distances between proposed combinations and characteristic vectors (example 1).
           a_1    a_3
comb_2^3   1.29   1.29
comb_2^4   1.29   1.00
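Reusing the helpers above, the CV selection of example 1, picking the (combination, agent) pair with the smallest acceptance/refusal distance ratio, can be sketched as follows; the epsilon guard against a zero refusal distance is our own safeguard.

def cv_select(candidates, VA, VR, eps=1e-9):
    # candidates: the q preferred combinations of a_i.
    # VA, VR: per-agent acceptance and refusal characteristic vectors.
    def ratio(c, j):
        return distance(c, VA[j]) / max(distance(c, VR[j]), eps)
    return min(((c, j) for c in candidates for j in VA),
               key=lambda pair: ratio(*pair))

On the data of example 1, this returns comb_2^4 paired with a_3, matching table 2.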
3.2.2
Rule based Selection Strategy (RG Strategy)
The strategy presented in this subsection is based on binary rule generation algorithms. Rule generation consists of building a logic formula from an incomplete logic table. This formula has to correctly match the table. To do so, we use the OCAT algorithm, which generates formulas in disjunctive normal form (DNF) (for details, see [6]). This rule generation algorithm processes the combinations of tasks and the expected predicate results. These combinations are used as an input table represented in a first-order logic formalism. The logic variables are the tasks: the presence of a task in a combination is labeled true and its absence is labeled false. Moreover, refused combinations are labeled false, while accepted ones and received proposals are labeled true. As an output, we obtain a logic formula in which the variables are the tasks (present or not). Each conjunction of the DNF rejects all the refused combinations of tasks and accepts a subset of the accepted combinations and received proposals.

After processing these formulas, each agent a_i computes, for every other agent a_j, several acceptance vectors VA_i^j(1), VA_i^j(2), . . ., one from each different subset of accepted combinations and received proposals. First, a_i gathers the combinations of tasks which are compatible with each conjunction of the DNF. From each set of combinations, a_i calculates an acceptance vector. A DNF formed of p conjunctions gives p acceptance vectors. Finally, for each acceptance vector, a_i calculates the distance ratios and chooses the minimum.

Logic formulas are represented by vectors. A positive variable which is present in the formula is represented by 1, the presence of a negative variable is represented by 0, and the absence of a variable by -1. For instance, the conjunction T_1 ∧ ¬T_3 is represented by the vector [1, −1, 0].

Example 2 (continued) Now we assume that agents a_1, a_2 and a_3 have the following preferences. For a_1: (o_5, o_6) ≻ (o_4, o_6) ≻ (o_3, o_6) ≻ (o_2, o_6) ≻ (o_6); for a_2: (o_1, o_2) ≻ (o_1, o_6) ≻ (o_1, o_4) ≻ (o_1, o_5) ≻ (o_1); and for a_3: (o_4, o_5) ≻ (o_3, o_5) ≻ (o_1, o_3) ≻ (o_1, o_5) ≻ (o_5). We also assume that a_3 has proposed comb_3^1 = [0, 0, 0, 1, 1, 0], comb_3^2 = [0, 0, 1, 0, 1, 0] and comb_3^3 = [1, 0, 1, 0, 0, 0] to a_1 and a_2, which they have refused. Agent a_3 has also received comb_1^1 = [0, 0, 0, 0, 1, 1] and comb_1^2 = [0, 0, 0, 1, 0, 1] from a_1, and comb_2^1 = [1, 1, 0, 0, 0, 0] and comb_2^2 = [1, 0, 0, 0, 0, 1] from a_2, which it has refused. Agent a_3 can compute the refusal vectors of a_1 and a_2: VR_3^1 = VR_3^2 = [0.33, 0, 0.66, 0.33, 0.66, 0]. a_3 applies its rule generation algorithm to the accepted and refused proposals of a_1 and a_2. Using this algorithm on the proposals of a_2, a_3 obtains a formula composed of two conjunctions, conj_3^2(1) = [−1, 1, −1, −1, −1, −1] and conj_3^2(2) = [−1, −1, −1, −1, −1, 1]. Then a_3 collects the combinations of tasks that are compatible with these conjunctions. From each set of combinations, a_3 computes an acceptance vector. Consequently, there are as many acceptance vectors as conjunctions in the generated formula: VA_3^2(1) = [1, 1, 0, 0, 0, 0] and VA_3^2(2) = [1, 0, 0, 0, 0, 1]. For a_1, the algorithm generates only one conjunction, conj_3^1(1) = [−1, −1, −1, −1, −1, 1], and the corresponding acceptance vector VA_3^1(1) = [0, 0, 0, 0.5, 0.5, 1]. a_3 computes the distances between comb_3^4 = [1, 0, 0, 0, 1, 0] and the three vectors VA_3^1(1), VA_3^2(1) and VA_3^2(2) (cf. table 3). The distance ratios are shown in table 4. a_3 selects the agent which minimizes these ratios, i.e., a_2, with a ratio of 1.34 for the two acceptance vectors VA_3^2(1) and VA_3^2(2).

Table 3. Distances between proposed combinations and characteristic vectors (example 2).
           VA_3^1(1)  VA_3^2(1)  VA_3^2(2)  VR_3^1  VR_3^2
comb_3^4     1.58       1.41       1.41      1.05    1.05

Table 4. Ratio of distances between proposed combinations and characteristic vectors (example 2).
           a_1, VA_3^1(1)  a_2, VA_3^2(1)  a_2, VA_3^2(2)
comb_3^4        1.5             1.34            1.34
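The two RG-specific operations, testing a combination against a DNF conjunction in the [1/0/−1] encoding and building one acceptance vector per conjunction, look as follows (characteristic_vector is the helper defined earlier):

def compatible(comb, conj):
    # comb satisfies a conjunction iff it agrees on every variable present
    # (1 = positive literal, 0 = negated literal, -1 = variable absent).
    return all(lit == -1 or bit == lit for bit, lit in zip(comb, conj))

def acceptance_vectors(accepted, dnf, m):
    # One acceptance vector per conjunction, from the accepted/received
    # combinations compatible with it (a DNF of p conjunctions gives p vectors).
    return [characteristic_vector([c for c in accepted if compatible(c, conj)], m)
            for conj in dnf]

# example 2: the conjunction "o6 present" matches both proposals of a_1
assert compatible([0, 0, 0, 0, 1, 1], [-1, -1, -1, -1, -1, 1])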
3.3
Agent Decision Making
Agents face two levels of decision making during the coalition formation mechanism: one in the acceptance phase (step 2 of the protocol, section 3.1) and a second in the confirmation phase (step 4). To make these decisions, agents use two utility thresholds: the acceptance utility u_a and the confirmation utility u_c, with u_a ≤ u_c. The acceptance utility u_a is the threshold beyond which a proposal is accepted in step 2. The confirmation utility u_c is the threshold beyond which a proposal is confirmed in step 4.
4
EXPERIMENTAL STUDY
4.1
Experimental settings
In the following experiments, the multi-agent system is composed of 30 agents and each coalition is formed for 1 to 4 tasks. The maximum number of proposals that an agent can formulate is set to 200. At the end of each experiment, agents that did not form a coalition get the u_alone utility, which corresponds to the utility of the best combination of tasks the agent can perform alone (without a coalition). The results given in this section are averaged over 5 experiments. We have used an additive utility function to represent the preferences of agents; to generate these preferences, agents assign to each task a random score between -10 and 10. The utility of a combination is the sum of the utilities of all the tasks forming this combination. In the second part of the experiment, we have kept the same preferences, but we have applied logic formulas to the tasks: agents do not consider the combinations of tasks that do not validate them. The thresholds u_a and u_c are set from u_alone and u_max, which is the utility of the most preferred combination:

u_x = u_alone + λ_x · (u_max − u_alone), x ∈ {a, c}, λ_x ∈ [0, 1]    (1)
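Equation (1) and the resulting accept/confirm decisions in code:

def thresholds(u_alone, u_max, lam_a, lam_c):
    # u_x = u_alone + lambda_x * (u_max - u_alone); lam_a <= lam_c gives u_a <= u_c.
    assert 0.0 <= lam_a <= lam_c <= 1.0
    u_a = u_alone + lam_a * (u_max - u_alone)
    u_c = u_alone + lam_c * (u_max - u_alone)
    return u_a, u_c

def decide(utility, u_a, u_c):
    # Acceptance-phase and confirmation-phase answers for a proposal.
    return utility >= u_a, utility >= u_c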
4.2
Experimental Results
Firstly, we keep u_c = u_a and let λ_x vary between 0 and 1 (equation 1). We then measure the number of agents which have joined a coalition at the end of the experiment (the execution stops when every agent has made its 200 proposals or has joined a coalition), and we compare the strategies presented in the previous section. We use a third, basic strategy where the agents propose each combination in decreasing order of utility, starting with the most preferred one. Each combination is proposed to disjoint groups of agents. The initiator agent maintains a list of the agents which have confirmed, accepted or refused the combination; using this list, the combination is proposed to the agents which have already confirmed or accepted it. These results are shown in figure 1. We can observe that for values of λ_x greater than 0.6, agents are very selective and it is quite hard for the three strategies to form coalitions. For values of λ_x lower than 0.3, agents are less selective and we observe the same behavior: most agents are able to join a coalition. Finally, for values between 0.3 and 0.7, we notice that the CV and RG strategies allow more agents to form coalitions than the basic strategy.
Figure 3. Evolution of the number of formed coalitions with respect to the number of proposals for λ_a = λ_c = 0.45
Next we show the results of other experiments, performed by adding to the same system agents implementing different strategies. We used 10 agents implementing the CV strategy, 10 agents implementing the RG strategy and 10 agents implementing the basic strategy. We observe the distribution of the strategies implemented by the agents which are the initiators of the formed coalitions. These results are shown in table 5. We observe that more than half of the coalitions are initiated by agents implementing the CV strategy and that only 12% of the coalitions are initiated by agents implementing the basic strategy.

Table 5. Distribution of the strategies of the initiator agents of effectively formed coalitions, for a simulation of 30 agents (each strategy is implemented by 10 agents).
CV Strategy  RG Strategy  Basic Strategy
   54%          34%           12%
Figure 1. Evolution of the number of agents in a coalition with respect to λ when u_c = u_a
In figures 2 and 3 we can observe the evolution of the number of coalitions formed according to the number of proposals. We notice in figure 2 that, for less selective agents, the three strategies allow agents to form coalitions quickly: on average, by the 30th proposal almost all coalitions are formed for the CV and RG strategies, and most of them for the basic strategy. When agents have a higher selection threshold (figure 3), the CV and RG strategies still allow agents to form coalitions quickly, but with the basic strategy it is harder to form coalitions, and at the end of the experiment only a few coalitions are formed.
We notice in figure 4 that, for u_a strictly lower than u_c, the number of agents that finally join a coalition decreases. In fact, agents accept combinations of tasks that are not confirmed afterwards; strategies that rely on these combinations to guide the selection are then less efficient.
Figure 4. Number of agents in a coalition w.r.t. λ_a for a fixed λ_c = 0.5
Figure 2. Evolution of the number of formed coalitions with respect to the number of proposals for λ_a = λ_c = 0
In other experiments, we have modified the utility functions by adding logic reasoning on the tasks. We applied an exclusive OR (XOR) to the preferred tasks of each agent: we keep only the combinations of tasks including either the most preferred task or the second preferred task; the combinations including both of them are simply rejected. Next, we made other experiments with logical implication on the
preferred tasks. Intuitively, the RG strategy should allow agents to find the inherent logic formulas, and should be more efficient than the CV strategy. These results are shown in table 6; we observe that the CV and RG strategies are better than the basic strategy. However, the RG strategy is not as efficient on logic preferences as expected. This inefficiency is mainly due to the limited amount of data available to the rule generation algorithm. Some of the generated rules match the set of rejected, accepted, confirmed and received proposals but are not equivalent to the initial logic formulas (XOR or logical implication), which biases the results.

Table 6. Additive utility and logic. Average number of agents in a coalition, λ_a = λ_c = 0.6.
              CV Strategy  RG Strategy  Basic Strategy
XOR              17.2         16.2          10.4
IMPLICATION       7.6          8             1.2
5
RELATED WORK
Several coalition formation methods have been proposed, but only a few of them deal with self-interested agents and are really decentralized. Kraus, Shehory and Taase [4] proposed a coalition formation protocol for the request for proposal (RFP) domain. This protocol is used by Shehory in [9] and by Westwood and Allan in [13]. The process is based on a central manager (CM) entity, which manages communications and makes proposals to the agents. The process is performed in several rounds. At each negotiation round, the CM sorts the agents randomly. Each agent, in its turn, can either send a proposal for forming a coalition or accept a previous proposal received from another agent. Each agent has only one turn in each round and proposals are valid for one round. This protocol allows a decentralization of the decision process even though communications are still centralized at the CM level. Other works are based on defining a common utility function for the coalitions [7, 10, 8]. In game theory this problem is similar to coalitional function games (CFG). For example, the utility of a coalition can be a net income that the members of a coalition gain together and have to share. Zlotkin and Rosenschein [14] proposed a coalition formation mechanism that uses cryptographic techniques for the subadditive task-oriented domain. Sandholm et al. [7] developed an anytime algorithm for finding the optimal coalition structure, establishing a worst-case bound on the quality of the solution. Rahwan et al. [5] proposed an efficient algorithm for distributing the coalitional value calculations among agents in cooperative environments. Several other works used a common utility function, as in the request for proposal (RFP) problems [9, 4, 13]. In our work, the values of coalitions are utilities which are different for each agent, as in [2]. Additionally, we focus on the study of the strategies that agents should use to come to agreements on the coalitions. Some works use learning techniques in their coalition formation process. For example, Aknine and Shehory [2] propose a coalition formation mechanism based on task relationship analysis and the derivation of intentions. Chalkiadakis and Boutilier [3] implement Bayesian reinforcement learning in a way that enables coalition participants to reduce their uncertainty regarding coalitional values and the capabilities of others. Soh and Li [11] use learning mechanisms (reinforcement learning based on past helpful cooperation between agents, and case-based reasoning); their aim is to improve the quality of the coalition formation process. Finally, Aknine
et al. [1] propose a coalition formation method based on a preference model for cooperative and self-interested multi-agent systems.
6
CONCLUSION
In this article we have addressed the problem of coalition formation in multi-agent systems. Our work focuses especially on self-interested agents. Several works have underlined the difficulty of solving a coalition formation problem in such a context, mostly due to the intrinsic autonomy of the agents and to the difficulty these systems have in converging to acceptable solutions. In this paper, we have addressed this problem by considering agents which have different utility functions. This assumption is useful since it enlarges the application scope of the proposed mechanism. To solve this problem, we have proposed an original mechanism taking into account the main features of competitive agents. We have proposed a two-phase protocol which only requires information about the preferred task combinations that agents want to perform. This mechanism has then been enhanced with several strategies based on the analysis of the agents' proposals. The proposed mechanism has been implemented and tested. The results of the experiments have shown that our strategies allow agents to get their preferred coalitions and improve the coalition formation process. In future work, we intend to analyse the scalability of our mechanism and adapt it to large-scale systems.
REFERENCES [1] S. Aknine, S. Pinson, and M. Shakun, ‘A multi-agent coalition formation method based on preference models’, Group Decision and Negotiation, 13, 513–538(26), (2004). [2] S. Aknine and O. Shehory, ‘Reaching agreements for coalition formation through derivation of agents’ intentions’, in ECAI, pp. 180–184, (2006). [3] G. Chalkiadakis and C. Boutilier, ‘Bayesian reinforcement learning for coalition formation under uncertainty’, in AAMAS ’04, pp. 1090–1097, Washington, (2004). IEEE. [4] S. Kraus, O. Shehory, and G. Taase, ‘Coalition formation with uncertain heterogeneous information’, in AAMAS ’03, pp. 1–8, New York, (2003). ACM Press. [5] T. Rahwan, S.D. Ramchurn, V.D. Dang, and N.R. Jennings, ‘Nearoptimal anytime coalition structure generation’, in IJCAI ’07, pp. 2365– 2371, (2007). [6] S. N. Sanchez, E. Triantaphyllou, C. Jianhua, and T. W. Liao, ‘An incremental learning algorithm for constructing boolean functions from positive and negative examples’, Oper. Res., 29(12), 1677–1700, (2002). [7] T. Sandholm, K. Larson, M. Andersson, O. Shehory, and F. Tohme, ‘Coalition structure generation with worst case guarantees’, Artif. Intell., 111(1-2), 209–238, (1999). [8] T. Sandholm and V. R. Lesser, ‘Coalitions among computationally bounded agents’, Artificial Intelligence, 94(1-2), 99–137, (1997). [9] O. Shehory, ‘Coalition formation: Towards feasible solutions’, Fundam. Inf., 63(2-3), 107–124, (2004). [10] O. Shehory and S. Kraus, ‘Methods for task allocation via agent coalition formation’, Artificial Intelligence, 101(1–2), 165–200, (1998). [11] L.K. Soh and X. Li, ‘An integrated multilevel learning approach to multiagent coalition formation’, in IJCAI ’03, pp. 619–624, (2003). [12] M. Tsvetovat and K. Sycara, ‘Customer coalitions in the electronic marketplace’, in AGENTS ’00, pp. 263–264, New York, NY, USA, (2000). ACM Press. [13] K. Westwood and V.H. Allan, ‘Heuristics for dealing with a shrinking pie in agent coalition formation’, in IAT ’06, pp. 537–546, Washington, (2006). IEEE. [14] G. Zlotkin and J.S. Rosenschein, ‘Coalition, cryptography, and stability: Mechanisms for coalition formation in task oriented domains’, in National Conference on Artificial Intelligence, pp. 432–437, (1994).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-423
Of Mechanism Design and Multiagent Planning
R. van der Krogt et al.
[Only scattered formula fragments of this paper (pages 423-427) survived the text extraction; the body is not recoverable.]
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-428
IAMwildCAT: The Winning Strategy for the TAC Market Design Competition
Perukrishnen Vytelingum and Ioannis A. Vetsikas and Bing Shi and Nicholas R. Jennings 1
Abstract. In this paper we describe the IAMwildCAT agent, designed for the TAC Market Design game, which is part of the International Trading Agent Competition. The objective of an agent in this competition is to effectively manage and operate a market that attracts traders to compete for resources in it. This market, in turn, competes against markets operated by other competition entrants, and the aim is to maximise the market and profit share of the agent, as well as its transaction success rate. To do this, the agent needs to continually monitor, and adapt in response to the competing marketplaces, the rules it uses to accept offers, clear the market, price the transactions and charge the traders. Given this context, this paper details IAMwildCAT's strategic behaviour and describes the wide range of techniques we developed to operationalise it. Finally, we empirically analyse our agent in different environments, including the 2007 competition, where it ranked first.
1
Introduction
Continuous Double Auctions (CDAs) have traditionally been used in stock markets in order to trade securities and other financial commodities. Their attraction lies in the fact that any trader (buyer or seller) can come into the market, at any point, and place a shout for buying (resp. selling) at some desired price, and a trade will take place almost instantly if there is a matching offer to sell (resp. buy) at that or a better price. Given this, most of the existing work on CDAs addresses ways of designing effective strategies that maximize a trader's profit. However, there is considerably less literature on the design of market protocols for such auctions in order to promote desirable properties (such as improved efficiency or reduced market volatility). Moreover, in today's globalised economy, stocks are often traded simultaneously in different (competing) markets around the world. Thus, the different markets need to differentiate themselves and appeal to traders to conduct their business under their jurisdiction (e.g. by offering attractive prices for participation and trading). To rectify this shortcoming, the TAC Market Design Competition (CAT) provides a test-bed for exploring the problem of designing competitive and efficient markets (see [3] for the competition rules). Each CAT game lasts a number of days, and each day consists of a number of trading rounds, each of which lasts for a known, constant length of time. A number of traders and a number of markets participate (the former are determined by the competition organisers, while the latter are the competition entrants). Each trader is given a finite set of goods to trade and is assigned a private value (also referred to as a limit price) for each good. The difference between this price and the transaction price represents the profit of the agent in the transaction; their total profit in the market is the sum of these transaction prof-
University of Southampton, UK, email: {pv,iv,bs07r,nrj}@ecs.soton.ac.uk
its minus any fees that they incur in participating in the market. The traders use various well-known strategies from the CDA literature: ZI, ZIP, GD, RE (Roth-Erev) [5, 1, 4, 6], and are allowed to register with a different market at the beginning of each day. They also have a memory of the profit they achieved historically in each market, such that they are more likely to register with the market where they made the highest profit. Thus, the markets must compete for traders by clearing transactions efficiently and not charging excessive fees. The different competing markets are represented by specialists, each of which is an agent entered by a separate competitor. These specialists set the rules for their respective market; they determine which shouts are accepted in the market (quote-accepting rule), which shouts will be matched for transactions (clearing rule) and at what price (pricing rule), as well as the fees to charge for various services (charging policy). The score of each agent is a combination of three different metrics: the profit obtained as a percentage of the total profit obtained by all specialists, the market share of the agent (i.e. the percentage of traders who register with the specialist), and the transaction success rate (TSR) (i.e. the percentage of shouts accepted by the market that resulted in a transaction). To be successful, therefore, an agent needs to be competitive in making profit, attracting traders and ensuring that shouts placed in the market result in transactions. While these goals are not necessarily contrary to each other, there are a number of trade-offs to be resolved here. For example, charging larger fees will increase the profit but decrease the market share, while improving the TSR by accepting fewer shouts will result in fewer total transactions and thus less profit both for the specialist and the traders. In order to design an effective specialist agent, we decided to break the agent down into multiple components, where each one deals with a particular trade-off. Then, looking at each component, we designed it in such a way as to balance that trade-off. For example, we designed a clearing rule that allows us to maximize the TSR with a minimal drop in the efficiency of the transactions, and a pricing policy that manages to extract enough profit without compromising the agent's market share. Similar methods, of breaking down a complex problem into multiple parts and then selecting strategies for and optimizing each one separately, have successfully been used in other complex trading domains [9]. Drawing inspiration from this approach, we also started testing the various individual components using experimental comparisons. The goals of these experiments are two-fold: to determine the best possible agent design, and to examine the behaviour of the market and how it is affected by the different strategies. Against this background, in this paper we make the following contributions. First, we describe, for the first time, the various policies of our agent. We explain how the various trade-offs guided the design of the agent and how each one was addressed, in order to generate the most competent and successful agent that participated in the
competition. Thus, we designed a number of novel strategies, e.g. clearing shouts, in some rounds, to maximise the number of transactions cleared rather than the profit. Second, we experimentally evaluate the performance of our agent. We compare the efficiency and performance of our agent against that of the other competitors in the competition. Here we show that our agent achieved the best and most stable performance, both in the score and across other metrics (i.e. attracting “good” traders and maintaining a high market efficiency). This paper is organized as follows. In Section 2, we give a complete description of our agent and all its components. In Section 3, we present the experiments we conducted. Then, we conclude.
2
The IAMwildCAT Strategy
Given this background on the CAT game and its goals, our objective is to design an agent that maximises the scoring function. Specifically, our strategy consists of a set of different market rules and the charging policy (see Figure 1). Each of these policies involves a particular trade-off; in the rest of this section, we detail how we designed the agent in order to resolve each trade-off.
Figure 1. Structure of the IAMwildCAT Strategy

2.1
The Quote-Accepting Rule
We first consider the quote-accepting rule which selects the bids and asks that are accepted into the market (i.e. not all bids submitted by the traders will necessarily be accepted into the marketplace). Such rules are typically employed to speed up the bidding process (e.g. the NYSE quote-accepting rule [10] specifies that any new quote must improve upon the currently outstanding quote), as well as to improve the properties of the auction (e.g. reducing price fluctuations [8]). In the CAT platform, because TSR is a measure of success, it is important to reject the “poor” bids and asks that the market does not expect to clear. Now, we could maximise the TSR by accepting only a few really “good” shouts. However, the fewer shouts that are accepted, the smaller the number of transactions and thus the smaller the profit of both agents and traders; it also makes the market less attractive to traders, which impacts the market share. Thus, we need to select just the right shouts in order to balance this trade-off. The micro-economic theory of competitive equilibrium states that transaction prices are expected to converge to the competitive equilibrium price p∗ where demand meets supply [2]. Thus, we expect the bids (resp. asks) that will be cleared in the market to be at least as high (resp. low) as the competitive equilibrium price. The aim, then, is to accept these bids and asks, rejecting those bids below and those asks above this price. Now, because we can only estimate the equilibrium based on the convergence of transaction prices, we assume some error in our estimation and provide some slack, αr and αa , when deciding the minimum bid, bidmin = (1 − αr )p∗ − αa or maximum ask, askmax = (1 + αr )p∗ + αa to accept. We estimate the competitive equilibrium price using a weighted moving
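To make this concrete, the following minimal Python sketch (our illustration, not the competition code; the linear recency weights and the accept-everything behaviour before any transaction has been observed are assumptions) shows one way such a quote-accepting rule could be implemented:

    def estimate_equilibrium(prices):
        """Weighted moving average of the day's transaction prices;
        later (more recent) transactions receive linearly larger weights."""
        if not prices:
            return None
        weights = range(1, len(prices) + 1)
        return sum(p * w for p, w in zip(prices, weights)) / sum(weights)

    def accept_shout(kind, price, p_star, alpha_r, alpha_a):
        """Accept a bid above (1 - alpha_r) * p* - alpha_a,
        or an ask below (1 + alpha_r) * p* + alpha_a."""
        if p_star is None:
            return True  # no equilibrium estimate yet
        if kind == "bid":
            return price >= (1 - alpha_r) * p_star - alpha_a
        return price <= (1 + alpha_r) * p_star + alpha_a

On the last few rounds of a day, the rule above would additionally be restricted to shouts that can be cleared immediately, as described in the text.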
2.2 The Clearing Rule

The clearing rule defines when and how to clear the market. There are two parts to this rule. The first is when to clear. One approach is to collect all bids and asks and clear the market at the end of the trading day to maximise profits. However, because traders bid for single units at a time, this approach would imply that traders have the opportunity to trade only a single unit (unable to trade the rest of their multi-unit endowment). An alternative approach is to maximise the number of transactions (instead of profits) by clearing continuously whenever a bid or an ask is accepted in the market (e.g. the Continuous Double Auction clearing [10]). Given this, our strategy adopts a rule in between these two approaches, with the market clearing at the end of each round. In this way, we can be almost as efficient as clearing at the end of the day, while allowing the traders to still trade multiple times. By so doing, we get most of the benefits from both approaches without the drawbacks. The second part is how to match bids. At the end of each round, our agent has a list of shouts to clear. It can try to maximize the number of transactions by matching "bad" shouts with "good" shouts, but in so doing, it will reduce the efficiency of the market and give less average profit to the traders (which will primarily have an impact on the market share). On the other hand, it can match the shouts efficiently and maximise profits to the traders, but it will generate fewer transactions (and a lower TSR). As mentioned earlier, intra-marginal traders are expected to trade earlier than marginal (and extra-marginal) traders, such that the amount of profit to be extracted in the market is higher earlier during the trading day, with less profit to be made at the end of the trading day. Thus we chose the following strategy to deal with this trade-off: our agent clears the market for maximum profits at the end of the earlier rounds of the trading day, while, on the following rounds, with less profit to be made in the market, our agent clears to maximise the number of transactions. By so doing, some extra-marginal traders are allowed to transact while increasing the number
of transactions and hence the TSR, at the expense of some profits (though these are generally low at this point).
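A minimal sketch of the two matching modes follows (our reconstruction, not the competition code): profit-maximising clearing pairs the best bids with the best asks, while transaction-maximising clearing greedily gives each ask the cheapest bid that still covers it, so that the "good" shouts are spread over as many matches as possible:

    def clear_round(bids, asks, maximise_transactions):
        """Return a list of (bid, ask) matches with bid >= ask."""
        matches = []
        if maximise_transactions:
            bids, asks = sorted(bids), sorted(asks)
            i = 0
            for ask in asks:                      # cheapest asks first
                while i < len(bids) and bids[i] < ask:
                    i += 1                        # skip bids that cannot cover this ask
                if i == len(bids):
                    break
                matches.append((bids[i], ask))    # cheapest bid that still covers it
                i += 1
        else:
            for bid, ask in zip(sorted(bids, reverse=True), sorted(asks)):
                if bid < ask:                     # no further profitable pairs
                    break
                matches.append((bid, ask))
        return matches

Switching from the second branch to the first in the later rounds of the trading day implements the strategy described above.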
2.3 The Pricing Rule
The pricing rule determines the price at which a transaction occurs when a buyer and a seller are chosen to transact (by the clearing rule). This price can have any value between the ask and bid prices. Initially, we used primarily discriminatory k-pricing with k = 0.5 (the value of k determines the difference of the transaction price from the ask and bid prices); this means that the mean of the ask and bid prices is chosen as the transaction price. In the competition, we used a variation of this policy, called side-biased pricing, which varies k depending on the number of buyers and sellers participating in the market. Specifically, we looked at a window of the latest 10 trading days for the average number of buyers and sellers our agent attracted, and if the difference between the number of buyers and sellers is bigger than 10% of the total number of traders, we adjust k (proportionately to this difference) in order to give more profit to the side which is under-represented. We do this in an attempt to attract more of them. However, as we wanted to be somewhat conservative, we only allow k to vary in k ∈ [0.3, 0.7]. (In the CAT game, because traders consider their entire history of profits and there is some randomness in the trader's selection of the market to trade in, the effect of giving more profit to one side could be delayed; if we are too aggressive, it might lead us to overshoot our goal of balancing the populations of buyers and sellers and thus cause the behaviour to oscillate.) In Section 3.3, we discuss this issue in more detail and examine the performance of the two policies.
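The following sketch (ours; the proportionality constant that maps the imbalance onto k is an assumption, as the paper only states that k is adjusted proportionately) illustrates side-biased pricing:

    def side_biased_k(avg_buyers, avg_sellers, k_min=0.3, k_max=0.7):
        """Shift k towards the under-represented side, based on the average
        numbers of buyers and sellers over the last 10 trading days."""
        total = avg_buyers + avg_sellers
        imbalance = (avg_buyers - avg_sellers) / total if total else 0.0
        if abs(imbalance) <= 0.1:        # difference below 10% of all traders
            return 0.5                   # plain discriminatory k-pricing
        return max(k_min, min(k_max, 0.5 + 0.5 * imbalance))

    def transaction_price(bid, ask, k):
        """Discriminatory k-pricing: k = 0.5 gives the midpoint."""
        return k * bid + (1 - k) * ask

With more buyers than sellers, k rises above 0.5, moving the transaction price towards the bid and thus giving more profit to the (under-represented) sellers.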
2.4 The Charging Policy
The charging policy determines the specific charges that are levied from the traders in the system. A registration fee is paid by traders in order to register with the market agent at the beginning of the day, irrespective of whether they transact or not. An information fee is paid if transaction history information is obtained. A shout fee and a transaction fee are the amounts paid, respectively, when a shout is placed and when a transaction occurs. The profit fee is the percentage of the difference between the accepted shout and the transaction price that is paid by the traders to the market. (Note that if this is 100%, then the pricing rule does not matter at all, since all the difference between the ask and bid prices is levied by the market.) Before we describe our policy in detail, it is necessary to note the ways in which an agent's charging policy changes its score:

• the score is increased each day by the percentage of the profit that the agent achieved compared to all agents; this means that extracting profit is most efficient for small absolute values of the profit (compared to the total profit extracted by everyone else).
• the market share is decreased by an amount which is roughly proportional to the absolute value of the profit that any agent extracts in total.

These two facts led us to design a charging policy that mainly tries to maintain a minimum amount of target market share, while at the same time extracting the best possible score from the profit, without compromising the market share. More specifically, we use a target profit percentage charging policy that, during each day, aims to extract a predetermined profit score. This target score depends on the agent's current market share MS. Specifically, our agent aims to maintain a target market share MS_target which takes a value in MS_target ∈ [1/M, 1.25/M], where M is the total number of competing markets. Thus it tries to obtain a market share slightly higher than the average market share that all markets have. We regulate our market share by getting more profit than our opponents when our market share is high, and less when our market share is below our target. We thus distinguish between two states in this strategy:

• If MS < MS_target, then the market is in trader attraction mode and we aim to extract a small profit percentage equal to P% = 50%/M; as this percentage is about half that of the average profit made by other agents, it will lead (all other things being equal) to an increase of market share within some trading days. (To avoid thrashing, we also count the number of trading days since we last switched modes in the strategy; there is a minimum number of days since the last switch before the next switch is allowed.)
• If MS > MS_target, then the market is in trader exploitation mode and we aim to extract a larger profit percentage equal to P% = 200%/M; as this percentage is about twice that of the average profit made by other agents, it will lead (all other things being equal) to a reasonable score, but at a cost of some market share loss within the next trading days. (In fact we use an additional rule before we switch to this mode: we aim to exploit when the total profit made by the opponents drops below its historical average (by a certain discount), as this will allow us to get more score with less penalty to the market share. This discount is adjusted depending on the number of times that this rule succeeds or fails.)

The target share MS_target is gradually decreased if trader attraction mode lasts for more than 10 days and is increased for every day that the agent is in trader exploitation mode. In more detail, let Π, σ, τ and φ be, respectively, the total opponent profit, the number of traders in our market, the number of transactions and the average transaction profit (measured as the difference between the ask and bid prices in each transaction), averaged over the last few days. These average values are reasonable expectations for these variables during the following day. Our agent's target profit π_target is set to π_target = P% · Π. Therefore the average fee paid by each trader must be π_target/σ. In trader attraction mode, we set the registration fee equal to 75% of this value, while, in exploitation mode, this is set to 50%. The remaining profit is extracted through the profit fee by dividing the remainder by φ. If this value is more than 100%, then we set the profit fee to 100% and gain the remaining profit by additionally setting a transaction fee equal to the remaining profit divided by τ. We set neither an information fee nor a shout fee. The reason for choosing to extract most of the profit through the registration fee is that all traders, whether intra- or extra-marginal, pay this, while only successful (i.e. intra-marginal) traders pay the other two. In this way, we also achieve the effect of attracting the desirable, intra-marginal traders and driving away the undesirable, extra-marginal traders. A final adjustment to this strategy is made to account for the beginning and end of the game. As market share is more important at the beginning and becomes progressively less so towards the end, we try to build market share at the beginning, by not extracting any profit for a set number of days (set to 80), and increasing the target percentage during the last 100 days of the game, and in particular during the last 40, when the increase becomes quite pronounced. (It should be noted that the length of the CAT game, during the competition, was set to 500 days and each day had 10 rounds; these facts were common knowledge and this allowed us to use this end-game strategy.)
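The daily fee computation can be sketched as follows (our simplification; the text does not fully specify whether the remainder is normalised per transaction or in total, so the sketch assumes the total remainder is spread over the expected tau * phi of per-transaction margins):

    def daily_fees(Pi, sigma, tau, phi, M, attraction_mode):
        """Pi: average total opponent profit, sigma: traders in our market,
        tau: transactions, phi: average per-transaction margin (all averaged
        over the last few days); M: number of competing markets."""
        P = 0.5 / M if attraction_mode else 2.0 / M       # target profit percentage
        pi_target = P * Pi                                # target profit for the day
        per_trader = pi_target / max(sigma, 1)
        reg_share = 0.75 if attraction_mode else 0.5
        registration_fee = reg_share * per_trader
        remainder = pi_target - registration_fee * sigma
        profit_fee = remainder / max(tau * phi, 1e-9)     # fraction of the margin
        transaction_fee = 0.0
        if profit_fee > 1.0:                              # cap at 100%
            transaction_fee = (profit_fee - 1.0) * phi    # shift the excess
            profit_fee = 1.0
        return registration_fee, profit_fee, transaction_fee  # no shout/info fees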
3 Evaluation
In this section, we analyse the performance of our specialist against other competitors entered in the CAT competition. To this end, we adopt a similar experimental setup as in the competition (note that, in our experiments, we used all the available binaries of competition entrants, with the exception of Havana, because of the unavailability of the CPLEX optimisation library they employ, and PSUCAT, because of their unstable implementation), with a game running over 500 trading days, each lasting 10 trading rounds. (We repeat each game for 15 runs to improve our estimate of performance.) The trader population comprises 180 ZIP traders, 180 RE traders, 20 ZI traders and 20 GD traders, equally split between buyers and sellers. Each trader is endowed with 10 goods to buy or sell at a limit price that is independently drawn from a uniform distribution between 50 and 150, such that the theoretical equilibrium price is 100. (Because the limit prices are drawn from a uniform distribution, the demand and supply curves are expected to be linear, intersecting at 100.) In particular, we first analyse the competition results reported by Niu et al. [7] in Subsection 3.1. Then, we analyse in detail the performance of IAMwildCAT. Specifically, we consider the following aspects that Niu et al. do not analyse. First, we look at how the number of globally intra-marginal buyers and sellers compares over the trading days (to analyse its effectiveness in attracting "good" traders) in Subsection 3.2. (A trader is globally intra-marginal if it is intra-marginal when we consider all traders in the system. In our experiments, buyers (resp. sellers) are expected to be intra-marginal if their limit prices are higher (resp. lower) than the theoretical equilibrium price at 100.) Second, we look at our policy for side-biased pricing in Subsection 3.3 and how it improved our market share and, finally, we look at some more general experiments on the efficiency of our strategy in a homogeneous environment in Subsection 3.4. The purpose of this exercise is to observe its efficiency if all the agents adopt the IAMwildCAT strategy. Note that in Figures 2, 3 and 4 we plot only the 5 best strategies for clarity.
3.1 The CAT Competition

Niu et al. reported the results of the 2007 CAT competition, which was won by IAMwildCAT with the highest score (at 240.2), outperforming the second-placed entrant by 13% and the third by 25% [7]. They also empirically evaluated all strategies to identify how they perform in different cases. They showed that IAMwildCAT had the lowest standard deviation (at 2.8), which suggests consistent behaviour over all the runs. Furthermore, they showed that IAMwildCAT had the highest market share and the highest TSR throughout most of the games. We attribute the former to our strategic choice of maximising market share at the beginning, sacrificing all profits. After 80 trading days, our agent starts charging the traders (see Subsection 2.4), which gradually increases our profit share. We typically expect its market share to decrease (as traders are less profitable in its market). However, by adapting its charging policy effectively, IAMwildCAT does not compromise its market share and, indeed, it is able to increase its profit share while sustaining its market share. Furthermore, our quote-accepting and clearing strategies (see Subsections 2.1 and 2.2) proved to be very effective, with the TSR increasing from 0.92 at the beginning of the game to over 0.99 after 150 days, outperforming that of all the other agents.

3.2 Intra-Marginal and Extra-marginal Traders

Figure 2. Percentage of intra-marginal buyers.

Figure 3. Percentage of intra-marginal sellers.

We observe in Figures 2 and 3 that the ratio of intra-marginal traders registered with IAMwildCAT converges to 0.9 (which is considerably higher than that of the other agents). This suggests that our agent successfully incentivises intra-marginal traders to join its market, driving away extra-marginal ones. This is done through setting the fees appropriately (see the charging policy in Subsection 2.4) such that extra-marginal traders, which are not expected to trade, would make negative profit by being charged a registration fee. A market with more intra-marginal traders implies better bids and asks that can be cleared, which improves our TSR in the process. Now, the intuition behind this ratio capping at around 0.9 is that, given the trader's selection strategy, there is a probability of 0.1 that a trader, whether it is intra-marginal or extra-marginal, randomly selects a specialist. Thus, there is always a chance that extra-marginal traders will register with a specialist, such that the ratio can never be 1.
3.3 Discriminatory Versus Side-Biased Pricing
We next evaluate our side-biased pricing policy (where we vary the k parameter); we considered an experiment with 7 different agents, including IAMwildCAT (with this policy) and a modified version of IAMwildCAT which used the fixed discriminatory k-pricing policy. We believe it is necessary to vary k because intra-marginal traders in a specialist's market might not necessarily be globally intra-marginal. Thus, given our aim to incentivise only intra-marginal traders to join our market, we vary k to give more profit to globally intra-marginal traders than to globally extra-marginal ones. Here, we analyse the effect of side-biased pricing on our strategy. Now, from Figure 4, we observe that our side-biased pricing policy does increase our ratio of intra-marginal sellers to intra-marginal buyers in the market. However, it introduces a small bias for sellers, with more intra-marginal sellers than intra-marginal buyers. It is also interesting to note that IAMwildCAT has a ratio of globally intra-marginal sellers to buyers that is stable around 1, compared to the hugely varying one of the other agents. This is indeed effective behaviour, as a ratio that deviates from 1 implies an equilibrium price that is higher or lower than the theoretical equilibrium in the global market, such that some of the profits are distributed to globally extra-marginal traders at the expense of globally intra-marginal ones. While the pricing does not affect the specialist's profit share (but rather the distribution of profits among buyers and sellers) or its TSR, we can see from Figure 5 that our side-biased pricing is an improvement over the fixed discriminatory pricing, since it does increase the market share.

Figure 4. Ratio of intra-marginal buyers to sellers.

Figure 5. Market share with discriminatory and side-biased pricing.

3.4 Homogeneous and Heterogeneous Markets

Finally, as per previous evaluation methodologies of double auctions [10, 2], we analyse the global efficiency (and the convergence of the daily market efficiency) of the strategies in both homogeneous and heterogeneous settings. Now, if agents were allowed to select their strategy, they would all choose the most efficient one, i.e. IAMwildCAT, and it would then be very insightful to see how the market efficiency changes if all agents use the same strategy. In particular, in a homogeneous setting, IAMwildCAT does better than in the heterogeneous setting, with a global efficiency of 90.6% (see Figure 6). While PersianCAT has the highest global efficiency (slightly higher than IAMwildCAT, at 90.9%), it does poorly in the heterogeneous environment, where it scores 128.8, i.e. 47% less than IAMwildCAT. PersianCAT performs well in the homogeneous case because its strategy favours profit-maximisation (sacrificing its TSR), which contributes to the high efficiency. Thus, overall, IAMwildCAT performs well in both a homogeneous (with a high global market efficiency) and a heterogeneous environment (with a high score).

Figure 6. Efficiency of homogeneous and heterogeneous markets.

Experiment                Global Efficiency    Convergence Coefficient
6 PersianCATS             90.9%                8.1
6 IAMwildCATS             90.6%                6.2
6 Heterogeneous CATS      88.7%                6.4
6 CrocodileAgents         79.8%                6.1

4 Conclusions

This paper details the IAMwildCAT agent, winner of the 2007 TAC Market Design Competition. In particular, we presented the trade-offs present in the design of the agent and gave our strategic rules for quote-accepting, clearing, pricing and charging. We analysed the competition results and, in particular, the IAMwildCAT agent's market share, profit share and transaction success rate compared to the other agents. We then looked at how IAMwildCAT is very successful at incentivising intra-marginal traders to join its market, driving away extra-marginal ones. Furthermore, we examined experimentally the advantage of our side-biased pricing over the standard fixed discriminatory pricing and showed that our agent is able to balance the number of globally intra-marginal buyers and sellers, which avoids distributing profits to undesirable, extra-marginal traders. Finally, we analysed the strategies outside the scope of the competition by looking at the market efficiency in homogeneous and heterogeneous environments. As discussed in Subsection 3.4, such insights are particularly important if agents are allowed to change strategies and they all choose the most efficient one. We empirically demonstrated that a market with only IAMwildCAT agents does reasonably well, at only 0.3% less than the most efficient one, PersianCAT, while outperforming the heterogeneous market in terms of market efficiency. As future work, we intend to improve on all the policies we currently have. For example, we intend to improve our charging policy by better understanding how the different fees individually affect the market share and profit share. This would allow us to experiment with various combinations of strategies (as in [9]) and select the best combination, so as to improve our agent even more. As such strategies are designed to be more and more effective, they will be the foundations for automating real markets in a global economy.

ACKNOWLEDGEMENTS
We would like to thank Rajdeep K. Dash who participated in the initial design of IAMwildCAT. Part of this research was undertaken under ALADDIN (joint EPSRC and BAE project EP/C548051/1).
REFERENCES
[1] D. Cliff and J. Bruten, 'Minimal-intelligence agents for bargaining behaviors in market-based environments', Tech Report HPL-97-91, (1997).
[2] D. Friedman and J. Rust, The Double Auction Market: Institutions, Theories and Evidence, Addison-Wesley, New York, 1992.
[3] E. Gerding, P. McBurney, J. Niu, S. Parsons, and S. Phelps, 'Overview of CAT: A market design competition', Tech Report ULCS-07-006, Dept. of Computer Science, University of Liverpool, Liverpool, UK, (2007).
[4] S. Gjerstad and J. Dickhaut, 'Price formation in double auctions', Games and Economic Behavior, 22, 1–29, (1998).
[5] D. K. Gode and S. Sunder, 'Allocative efficiency of markets with zero-intelligence traders: Market as a partial substitute for individual rationality', Journal of Political Economy, 101(1), 119–137, (1993).
[6] J. Nicolaisen, V. Petrov, and L. Tesfatsion, 'Market power and efficiency in a computational electricity market with discriminatory double-auction pricing', IEEE Trans. Evolutionary Computation, 5(5), 504–523, (2001).
[7] J. Niu, K. Cai, E. Gerding, P. McBurney, and S. Parsons, 'Characterizing effective auction mechanisms: Insights from the 2007 TAC market design competition', in AAMAS-08, 1079–1086, (2008).
[8] S. Parsons, J. Niu, K. Cai, and E. Sklar, 'Reducing price fluctuation in continuous double auctions through pricing policy and shout improvement', in AAMAS-06, 1143–1150, (2006).
[9] I. A. Vetsikas and B. Selman, 'A principled study of the design tradeoffs for autonomous trading agents', in AAMAS-03, pp. 473–480, (2003).
[10] P. Vytelingum, The Structure and Behaviour of the Continuous Double Auction, Ph.D. dissertation, School of ECS, University of Southampton, 2006.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-433
Multi-Agent Reinforcement Learning Algorithm with Variable Optimistic-Pessimistic Criterion
Natalia Akchurina (International Graduate School of Dynamic Intelligent Systems, University of Paderborn, Germany, email: anatalia@mail.uni-paderborn.de)
Abstract. A reinforcement learning algorithm for multi-agent systems based on a variable Hurwicz optimistic-pessimistic criterion is proposed. A formal proof of its convergence is given. Hurwicz's criterion allows us to embed initial knowledge of how friendly the environment in which the agent is supposed to function will be. Thorough testing of the developed algorithm against well-known reinforcement learning algorithms has shown that in many cases its successful performance can be explained by its tendency to force the other agents to follow the policy which is more profitable for it. In addition, the variability of Hurwicz's criterion allowed it to converge to best-response against opponents with stationary policies.
1 Introduction
In the middle of the 1950s, a new field of research with the fascinating name "Artificial Intelligence" (AI) could not help arousing a lot of questions. One of them was: will computers ever be intelligent enough to compete with humans in chess? In 1997 the supercomputer Deep Blue did win the match against the world chess champion Garry Kasparov, with the only question left being whether the supercomputer was really intelligent. Around this time reinforcement learning was an AI technique that didn't need a supercomputer to play on the level of human world masters in backgammon. Nowadays a challenge to AI is multi-agent: to create a team of robots that will beat humans in football. Reinforcement learning, which provides a way of programming agents without specifying how the task is to be achieved, could again be of use here, but the convergence of reinforcement learning algorithms is only guaranteed under the condition of stationarity of the environment, which is violated in multi-agent systems. Several algorithms [5], [4], [6], [3], [2] were proposed to extend this approach to multi-agent systems. The convergence was proved either for a very restricted class of environments (strictly competitive or strictly cooperative) and/or against a very restricted class of opponents. In this paper we propose an algorithm based on Hurwicz's optimistic-pessimistic criterion that allows it to function effectively in a wider range of environments, and we prove its convergence. The variability of Hurwicz's criterion allows the proposed algorithm to be rational: to play best-response against stationary opponents. In self play, for all types (according to Rapoport's classification [8]) of repeated 2 × 2 games, the proposed algorithm has converged to a pure Nash equilibrium when the latter existed. Section 2 is devoted to the formal definition of stochastic games, the framework for multi-agent reinforcement learning, and presents the theorems that we will use in the proof of the convergence of our method in Section 3. Section 4 is devoted to the analysis of the results of thorough testing of our algorithm against other reinforcement learning algorithms.
2 Preliminary Definitions and Theorems
Definition 2.1 A 2-player stochastic game Γ is a 6-tuple ⟨S, A^1, A^2, r^1, r^2, p⟩, where S is the discrete state space (|S| = m), A^k is the discrete action space of player k for k = 1, 2, r^k : S × A^1 × A^2 → ℝ is the payoff function for player k, and p : S × A^1 × A^2 → Δ is the transition probability map, where Δ is the set of probability distributions over the state space S. It is assumed that for every s, s′ ∈ S and for every action a^1 ∈ A^1 and a^2 ∈ A^2, the transition probabilities p(s′|s, a^1, a^2) are stationary for all t = 0, 1, 2, . . . and Σ_{s′=1}^{m} p(s′|s, a^1, a^2) = 1. Each player k (k = 1, 2) strives to maximize its expected discounted cumulative reward:

v^k(s, π^1, π^2) = Σ_{t=0}^{∞} γ^t E(r^k_t | π^1, π^2, s_0 = s)

where γ ∈ [0, 1) is the discount factor, π^1 = (π^1(s_0), . . . , π^1(s_m)) and π^2 = (π^2(s_0), . . . , π^2(s_m)) are the policies of players 1 and 2 respectively, and s is the initial state. π^k(s) is a mixed policy in state s.
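As a concrete reading of this definition, the following sketch (ours; the game is given as explicit reward and transition tables) estimates the discounted value by Monte Carlo rollout:

    import random

    def estimate_value(s0, pi1, pi2, reward, trans, gamma=0.9, horizon=200, runs=1000):
        """Estimate v^1(s0, pi1, pi2). pi1[s], pi2[s] map actions to probabilities;
        reward[(s, a1, a2)] is player 1's payoff; trans[(s, a1, a2)] maps
        successor states to probabilities."""
        def draw(dist):
            return random.choices(list(dist), weights=list(dist.values()))[0]
        total = 0.0
        for _ in range(runs):
            s, discount = s0, 1.0
            for _ in range(horizon):        # truncated horizon: gamma^t decays fast
                a1, a2 = draw(pi1[s]), draw(pi2[s])
                total += discount * reward[(s, a1, a2)]
                discount *= gamma
                s = draw(trans[(s, a1, a2)])
        return total / runs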
Definition 2.2 A Nash equilibrium point is a pair of policies (π^1_*, π^2_*) such that for all s ∈ S and for all policies π^1 and π^2:

v^1(s, π^1_*, π^2_*) ≥ v^1(s, π^1, π^2_*)
v^2(s, π^1_*, π^2_*) ≥ v^2(s, π^1_*, π^2)

Repeated games are a special case of stochastic games where the same state recurs at each time period. In [2], two properties that any learning algorithm for multi-agent systems should satisfy were formulated:

Definition 2.3 (Rationality): If the other players' policies converge to stationary policies, then the learning algorithm will converge to a policy that is a best-response to the other players' policies.

Definition 2.4 (Convergence): The learner will necessarily converge to a stationary policy against agents using an algorithm from some class of learning algorithms.
2.1 Convergence Theorem

Theorem 1 [10] Let X be an arbitrary set and assume that B(X) is the space of bounded functions over X and T : B(X) → B(X) is an arbitrary mapping with fixed point v*. Let U_0 ∈ B(X) be an arbitrary value function and T = (T_0, T_1, . . .) be a sequence of random operators T_t : B(X) × B(X) → B(X) such that U_{t+1} = T_t(U_t, v*) converges to T v* uniformly over X. Let V_0 be an arbitrary value function, and define V_{t+1} = T_t(V_t, V_t). If there exist random functions 0 ≤ F_t(x) ≤ 1 and 0 ≤ G_t(x) ≤ 1 satisfying the conditions below with probability 1, then V_t converges to v* with probability 1 uniformly over X:

1. for all U_1 and U_2 ∈ B(X), and all x ∈ X, |T_t(U_1, v*)(x) − T_t(U_2, v*)(x)| ≤ G_t(x) |U_1(x) − U_2(x)|
2. for all U and V ∈ B(X), and all x ∈ X, |T_t(U, v*)(x) − T_t(U, V)(x)| ≤ F_t(x) sup_{x′} |v*(x′) − V(x′)|
3. Σ_{t=1}^{n} (1 − G_t(x)) converges to infinity uniformly in x as n → ∞
4. there exists 0 ≤ γ < 1 such that for all x ∈ X and large enough t, F_t(x) ≤ γ (1 − G_t(x))
2.2 Stochastic Approximation

Let M(x) denote the expected value at level x of the response to a certain experiment. It is assumed that to each value x corresponds a random variable Y = Y(x) with distribution function Pr[Y(x) ≤ y] = H(y|x), such that M(x) = ∫_{−∞}^{∞} y dH(y|x) is the expected value of Y for the given x. Neither the exact nature of H(y|x) nor that of M(x) is known to the experimenter. It is desired to estimate the solution x = θ of the equation M(x) = α, where α is a given constant, by making successive observations on Y at levels x_1, x_2, . . . We define a (nonstationary) Markov chain {x_n} by taking x_1 to be an arbitrary constant and defining

x_{n+1} − x_n = α_n (α − y_n)

where y_n is a random variable such that Pr[y_n ≤ y | x_n] = H(y|x_n).

Theorem 2 [9] If {α_n} is a fixed sequence of positive constants such that 0 < Σ_{n=1}^{∞} α_n² = A < ∞ and Σ_{n=1}^{∞} α_n = ∞, if ∃C > 0 : Pr[|Y(x)| ≤ C] = ∫_{−C}^{C} dH(y|x) = 1 for all x, and M(x) is nondecreasing, M(θ) = α, M′(θ) > 0, then lim_{n→∞} E(x_n − θ)² = 0.
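The iteration of Theorem 2 is short enough to state in code; the following toy sketch (ours) uses the step sizes alpha_n = 1/n, which satisfy the conditions of the theorem:

    import random

    def robbins_monro(observe, alpha_level, x1=0.0, steps=100000):
        """Estimate the root theta of M(x) = alpha_level from noisy samples:
        observe(x) returns Y(x) with E[Y(x)] = M(x)."""
        x = x1
        for n in range(1, steps + 1):
            x = x + (1.0 / n) * (alpha_level - observe(x))
        return x

    # Example: M(x) = 2x with bounded uniform noise; the root of M(x) = 1 is 0.5.
    print(robbins_monro(lambda x: 2 * x + random.uniform(-1, 1), alpha_level=1.0))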
3 Optimistic-Pessimistic Q-learning Algorithm with Variable Criterion (OPVar-Q)
Competitive or cooperative environments are just extreme cases. In most cases the environment where our agent will function is competitive / cooperative to some degree. In this section we propose a reinforcement learning algorithm (OPVar-Q) based on Hurwicz's optimistic-pessimistic criterion [1] that allows us to embed preliminary knowledge of how friendly the environment will be. For example, the parameter λ = 0.3 means that we believe that with 30% probability the circumstances will be favourable and the agents will act so as to maximize OPVar-Q's reward, while with 70% probability they will force it to achieve the minimum value; we choose the strategy in each state that maximizes our gain under the above described circumstances (OPVar-Q with λ = 0.3 tries more often to avoid low rewards than to get high rewards, in comparison with OPVar-Q(0.5)). The algorithm is presented for a 2-player stochastic game but can be extended without difficulty to an arbitrary number of players.

Algorithm 1 OPVar-Q (for player 1)
Input: parameters λ, ε, α (see Theorem 3)
for all s ∈ S, a^1 ∈ A^1, and a^2 ∈ A^2 do
    Q(s, a^1, a^2) ← 0
    V(s) ← 0
    π(s, a^1) ← 1/|A^1|
end for
loop
    In state s, choose action a^1 using policy π(s) with probability 1 − ε, and with probability ε select an action at random
    Take action a^1; observe the opponent's action a^2, reward r^1 and succeeding state s′ provided by the environment
    Q(s, a^1, a^2) ← (1 − α) Q(s, a^1, a^2) + α (r^1 + γ V(s′))
    if the opponent's policy π^2 has become stationary then
        π(s, a^1) ← 1 if a^1 = argmax_{a^1} Σ_{a^2} π^2(s, a^2) Q(s, a^1, a^2), and 0 otherwise
        V(s) ← max_{a^1} Σ_{a^2} π^2(s, a^2) Q(s, a^1, a^2)
    else
        π(s, a^1) ← 1 if a^1 = argmax_{a^1} [(1 − λ) min_{a^2} Q(s, a^1, a^2) + λ max_{a^2} Q(s, a^1, a^2)], and 0 otherwise
        V(s) ← max_{a^1} [(1 − λ) min_{a^2} Q(s, a^1, a^2) + λ max_{a^2} Q(s, a^1, a^2)]
    end if
end loop
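A compact Python rendering of the backup step of Algorithm 1 (our sketch; the detection of a stationary opponent and the data layout are simplifications) may clarify the two branches:

    import numpy as np

    def opvar_q_backup(Q, V, s, a1, a2, r1, s_next, alpha, gamma, lam, pi2=None):
        """One OPVar-Q update. Q[s] is an |A1| x |A2| array; pi2[s] is the
        opponent's empirical policy if detected as stationary, else None."""
        Q[s][a1, a2] = (1 - alpha) * Q[s][a1, a2] + alpha * (r1 + gamma * V[s_next])
        if pi2 is not None:                  # best response to a stationary opponent
            values = Q[s] @ pi2[s]           # expected payoff of each own action
        else:                                # Hurwicz optimistic-pessimistic criterion
            values = (1 - lam) * Q[s].min(axis=1) + lam * Q[s].max(axis=1)
        greedy = int(np.argmax(values))
        V[s] = float(values[greedy])
        return greedy                        # action the deterministic policy picks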
Lemma 3.1 Let Q : S × A^1 × A^2 → ℝ. Then for Hurwicz's criterion

H(Q(s)) = max_{a^1} [(1 − λ) min_{a^2} Q(s, a^1, a^2) + λ max_{a^2} Q(s, a^1, a^2)]

where 0 ≤ λ ≤ 1, the following inequality holds:

|H(Q_1(s)) − H(Q_2(s))| ≤ max_{a^1,a^2} |Q_1(s, a^1, a^2) − Q_2(s, a^1, a^2)|

Proof.
|H(Q_1(s)) − H(Q_2(s))|
= |max_{a^1} [(1 − λ) min_{a^2} Q_1(s, a^1, a^2) + λ max_{a^2} Q_1(s, a^1, a^2)] − max_{a^1} [(1 − λ) min_{a^2} Q_2(s, a^1, a^2) + λ max_{a^2} Q_2(s, a^1, a^2)]|
≤ max_{a^1} |(1 − λ)(min_{a^2} Q_1(s, a^1, a^2) − min_{a^2} Q_2(s, a^1, a^2)) + λ (max_{a^2} Q_1(s, a^1, a^2) − max_{a^2} Q_2(s, a^1, a^2))|
≤ max_{a^1} [|(1 − λ)(min_{a^2} Q_1(s, a^1, a^2) − min_{a^2} Q_2(s, a^1, a^2))| + |λ (max_{a^2} Q_1(s, a^1, a^2) − max_{a^2} Q_2(s, a^1, a^2))|]
≤ max_{a^1} [(1 − λ) max_{a^2} |Q_1(s, a^1, a^2) − Q_2(s, a^1, a^2)| + λ max_{a^2} |Q_1(s, a^1, a^2) − Q_2(s, a^1, a^2)|]
= max_{a^1} max_{a^2} |Q_1(s, a^1, a^2) − Q_2(s, a^1, a^2)|

The above holds due to the triangle inequality and the following inequalities [10]:

|max_{a^k} Q_1(s, a^1, a^2) − max_{a^k} Q_2(s, a^1, a^2)| ≤ max_{a^k} |Q_1(s, a^1, a^2) − Q_2(s, a^1, a^2)|
|min_{a^k} Q_1(s, a^1, a^2) − min_{a^k} Q_2(s, a^1, a^2)| ≤ max_{a^k} |Q_1(s, a^1, a^2) − Q_2(s, a^1, a^2)|

where k = 1, 2.

Lemma 3.2 Let Q : S × A^1 × A^2 → ℝ and let π^2 be the policy of player 2. Then for

BR(Q(s), π^2(s)) = max_{a^1} Σ_{a^2} π^2(s, a^2) Q(s, a^1, a^2)

the following inequality holds:

|BR(Q_1(s), π^2(s)) − BR(Q_2(s), π^2(s))| ≤ max_{a^1,a^2} |Q_1(s, a^1, a^2) − Q_2(s, a^1, a^2)|

Proof.
|BR(Q_1(s), π^2(s)) − BR(Q_2(s), π^2(s))|
= |max_{a^1} Σ_{a^2} π^2(s, a^2) Q_1(s, a^1, a^2) − max_{a^1} Σ_{a^2} π^2(s, a^2) Q_2(s, a^1, a^2)|
≤ max_{a^1} |Σ_{a^2} π^2(s, a^2) Q_1(s, a^1, a^2) − Σ_{a^2} π^2(s, a^2) Q_2(s, a^1, a^2)|
= max_{a^1} |Σ_{a^2} π^2(s, a^2) [Q_1(s, a^1, a^2) − Q_2(s, a^1, a^2)]|
≤ max_{a^1} |Σ_{a^2} π^2(s, a^2) max_{a^2} |Q_1(s, a^1, a^2) − Q_2(s, a^1, a^2)||
= max_{a^1} max_{a^2} |Q_1(s, a^1, a^2) − Q_2(s, a^1, a^2)|

The above holds due to the inequalities that we used for proving Lemma 3.1. Now we are ready to prove the convergence of our algorithm in the usual way [10], [5], [6], [4].

Theorem 3 If {α_t} is a sequence such that α_t > 0, Σ_{t=1}^{∞} χ(s_t = s, a^1_t = a^1, a^2_t = a^2) α_t = ∞ and Σ_{t=1}^{∞} χ(s_t = s, a^1_t = a^1, a^2_t = a^2) α_t² < ∞ with probability 1 uniformly over S × A^1 × A^2 (χ denotes the characteristic function; we assume here that OPVar-Q plays for the first agent), then the OPVar-Q algorithm converges to the stationary policy defined by the fixed point of the operator

[T Q](s, a^1, a^2) = r^1(s, a^1, a^2) + γ Σ_{s′} p(s′|s, a^1, a^2) BR(Q(s′), π^2(s′))

against an opponent with a stationary policy π^2, and to the stationary policy defined by the fixed point of the operator

[T Q](s, a^1, a^2) = r^1(s, a^1, a^2) + γ Σ_{s′} p(s′|s, a^1, a^2) H(Q(s′))

against other classes of opponents.

Proof. Let further V(Q(s)) = BR(Q(s), π^2(s)) when the opponent follows a stationary policy π^2, and V(Q(s)) = H(Q(s)) otherwise. Let Q* be the fixed point of the operator T and

M(x) = x − r^1(s, a^1, a^2) − γ Σ_{s′} p(s′|s, a^1, a^2) V(Q*(s′))

It is evident that the conditions of Theorem 2 on M are fulfilled, and M(Q*) = α = 0. The random approximating operator is

T_t(Q_t, Q*)(s, a^1, a^2) = (1 − α_t) Q_t(s_t, a^1_t, a^2_t) + α_t (r^1(s_t, a^1_t, a^2_t) + γ V(Q*(s′_t)))   if s = s_t and a^1 = a^1_t and a^2 = a^2_t
T_t(Q_t, Q*)(s, a^1, a^2) = Q_t(s, a^1, a^2)   otherwise

where y_t(s, a^1, a^2) = Q_t(s_t, a^1_t, a^2_t) − r^1(s_t, a^1_t, a^2_t) − γ V(Q*(s′_t)) if s = s_t and a^1 = a^1_t and a^2 = a^2_t. It is evident that the other conditions will be satisfied if s′_t is randomly selected according to the probability distribution defined by p(·|s_t, a^1_t, a^2_t). Then according to Theorem 2, T_t approximates the solution of the equation M(x) = 0 uniformly over X = S × A^1 × A^2. In other words, T_t(Q_t, Q*) converges to T Q* uniformly over X.

Let G_t(s, a^1, a^2) = 1 − α_t if s = s_t and a^1 = a^1_t and a^2 = a^2_t, and 1 otherwise. Let F_t(s, a^1, a^2) = γ α_t if s = s_t and a^1 = a^1_t and a^2 = a^2_t, and 0 otherwise. Let us check the conditions of Theorem 1:

1. When s = s_t and a^1 = a^1_t and a^2 = a^2_t:
|T_t(Q_1, Q*)(s, a^1, a^2) − T_t(Q_2, Q*)(s, a^1, a^2)|
= |(1 − α_t) Q_1(s_t, a^1_t, a^2_t) + α_t (r^1(s_t, a^1_t, a^2_t) + γ V(Q*(s′_t))) − (1 − α_t) Q_2(s_t, a^1_t, a^2_t) − α_t (r^1(s_t, a^1_t, a^2_t) + γ V(Q*(s′_t)))|
= G_t(s, a^1, a^2) |Q_1(s, a^1, a^2) − Q_2(s, a^1, a^2)|
When s ≠ s_t or a^1 ≠ a^1_t or a^2 ≠ a^2_t, it is evident that the condition holds.

2. When s = s_t and a^1 = a^1_t and a^2 = a^2_t:
|T_t(Q_1, Q*)(s, a^1, a^2) − T_t(Q_1, Q_2)(s, a^1, a^2)|
= |(1 − α_t) Q_1(s_t, a^1_t, a^2_t) + α_t (r^1(s_t, a^1_t, a^2_t) + γ V(Q*(s′_t))) − (1 − α_t) Q_1(s_t, a^1_t, a^2_t) − α_t (r^1(s_t, a^1_t, a^2_t) + γ V(Q_2(s′_t)))|
= F_t(s_t, a^1_t, a^2_t) |V(Q*(s′_t)) − V(Q_2(s′_t))|
≤ F_t(s, a^1, a^2) max_{a^1,a^2} |Q*(s′, a^1, a^2) − Q_2(s′, a^1, a^2)|
The last inequality holds due to Lemmas 3.1 and 3.2. When s ≠ s_t or a^1 ≠ a^1_t or a^2 ≠ a^2_t, it is evident that the condition holds.

3. Σ_{t=1}^{n} (1 − G_t(x)) converges to infinity uniformly in x as n → ∞ (see the assumption of the theorem).

4. The fourth condition evidently holds.
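The non-expansiveness established by Lemma 3.1 is also easy to check numerically; the following sketch (ours) samples random Q-tables and verifies the inequality:

    import numpy as np

    def hurwicz(Q, lam):
        """H(Q(s)) = max_a1 [(1 - lam) min_a2 Q + lam max_a2 Q] for one state."""
        return np.max((1 - lam) * Q.min(axis=1) + lam * Q.max(axis=1))

    rng = np.random.default_rng(0)
    for _ in range(10000):
        Q1 = rng.uniform(-100, 100, (3, 4))
        Q2 = rng.uniform(-100, 100, (3, 4))
        lam = rng.uniform()
        lhs = abs(hurwicz(Q1, lam) - hurwicz(Q2, lam))
        assert lhs <= np.abs(Q1 - Q2).max() + 1e-9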
4 Experiments
We tested the OPVar-Q algorithm on 14 classes of 10-state 2×2 stochastic games (with uniformly distributed transition probabilities) and 1000 random 10-state 6-agent 2-action stochastic games (with uniformly distributed payoffs and transition probabilities) derived with the use of Gamut [7]. For the sake of reliability we derived 100 instances of each game class and ran 20000 iterations. The agent plays as both the row agent and the column agent. Below in this section we present the average rewards (including the exploration stage) of the developed OPVar-Q algorithm against the following well-known algorithms for multi-agent reinforcement learning:

• Stationary opponent plays the first action in 75% of cases and the second action in 25% of cases.
• Q [11] was initially developed for single-agent environments. It learns from immediate rewards a tabular function Q(s, a) that returns the largest value for the action a that should be taken in each particular state s so as to maximize the expected discounted cumulative reward. When applied to multi-agent systems, the Q-learning algorithm totally ignores the presence of other agents, though the latter naturally influence its immediate rewards.
• MinimaxQ [5] was developed for strictly competitive games and chooses the policy that maximizes its notion of the expected discounted cumulative reward, believing that the circumstances will be against it.
• FriendQ [6] was developed for strictly cooperative games and chooses the action that will bring the highest possible expected discounted cumulative reward, believing that the circumstances will favor it.
• NashQ [4] believes that the opponent will play its part of a Nash equilibrium and is proved to converge against itself in games with only one Nash equilibrium.
• JAL [3] believes that the average opponent strategy approximates the opponent's future policy very well and takes it into account while choosing the action that maximizes its expected discounted cumulative reward.
• PHC [2], in contrast to the Q-learning algorithm, changes its policy gradually in the direction of the highest Q values.
• WoLF [2] differs from PHC only in that it changes its policy faster when losing and more slowly when winning.

The results of the experiments showed that the developed algorithm can function on the level of (sometimes better than) its opponents, which do not possess both properties: rationality (convergence to best-response against opponents with stationary policies) and convergence to stationary policies against all types of opponents. Because of the limitation on space we present only the analysis of
a few game classes, which should be sufficient to understand the general notion of interaction between the developed OPVar-Q and the above presented multi-agent reinforcement learning algorithms. The test classes are presented in general form, where A, B, C, D are payoffs uniformly distributed in the interval [−100, 100] and A > B > C > D. We analyze the results as though OPVar-Q played for the row agent. For all games we chose the neutral parameter λ = 0.5 for OPVar-Q. To illustrate the gain of the variable Hurwicz optimistic-pessimistic criterion against stationary opponents, we compare our algorithm with the algorithm OP-Q(0.5), which is based on the same principle as OPVar-Q but does not distinguish between opponents with stationary and non-stationary policies. The horizontal line in the figures is the average reward of the Nash equilibrium. Q, PHC, WoLF and JAL turned out to have very similar final behavior. The small differences in the performance of these algorithms are due to slightly different manners of tuning the policy and underlying mechanisms.
4.1 Battle of the Sexes

4.1.1 Type 1

Table 1. Battle of the sexes: type 1

A,B   C,C
C,C   B,A
After a short exploration phase, OP-Q (and OPVar-Q at first) chooses the first strategy in battle of the sexes type 1. Indeed, Hurwicz's criteria for the first and the second strategies are:

H_1 = 0.5 · (A + V) + 0.5 · (C + V)
H_2 = 0.5 · (C + V) + 0.5 · (B + V)

where V is OP-Q's (OPVar-Q's) notion of the expected discounted cumulative reward that it will get starting from the next step.

• Stationary opponent gets 0.75 · B + 0.25 · C as OP-Q (OPVar-Q) plays the first strategy. OP-Q gets on average 0.75 · A + 0.25 · C. After noticing that its opponent is stationary, OPVar-Q also plays the first strategy, since 0.75 · A + 0.25 · C > 0.75 · C + 0.25 · B, and gets on average 0.75 · A + 0.25 · C.
• Q, PHC, WoLF get the impression that in their environment (where the OP-Q (OPVar-Q) agent is constantly playing the first strategy) the first strategy is much more profitable than the second one (B against C, where B > C) and play it. As a result OP-Q gets A as average reward after the exploration stage and Q, PHC, WoLF get only B. On realizing that the opponent's strategy has become stationary (1, 0), OPVar-Q also plays the first strategy (A > C) and gets A as average reward.
• MinimaxQ strives to maximize its expected discounted cumulative reward in the worst case. But battle of the sexes is not strictly competitive. That's why OP-Q and OPVar-Q show better results.
• FriendQ, developed for cooperative environments, believes that when it gets the best reward so do the other agents in the environment, and that it is therefore most profitable for them to play the other part of the joint action that results in the largest reward to FriendQ. In battle of the sexes it constantly plays the second action. As a result OP-Q and FriendQ both get the very low reward C.
After realizing that its opponent plays the second strategy, OPVar-Q also plays the second strategy, since B > C, and this results in average rewards of A to FriendQ and B to OPVar-Q.
• NashQ plays the first and the second strategy alternately, since there are two Nash equilibria, (a^1_1, a^2_1) and (a^1_2, a^2_2). OP-Q gets (A + C)/2 and NashQ (B + C)/2. On getting to know that 0.5 is exactly the right optimistic-pessimistic parameter, OPVar-Q also chooses the first strategy and gets the same reward as OP-Q.
• JAL, taking into account OP-Q's (OPVar-Q's) stationary (1, 0) policy, also chooses the first, more profitable for it, action (B > C). OP-Q and JAL respectively get A and B as average rewards. As JAL's policy becomes stationary, OPVar-Q also plays the first strategy (A > C) and gets A as average reward.
4.1.2 Type 2

Table 2. Battle of the sexes: type 2

B,A   C,C
C,C   A,B
After a short exploration phase, OP-Q (and OPVar-Q at first) chooses the second strategy in battle of the sexes type 2.

• Stationary opponent gets 0.75 · C + 0.25 · B as OP-Q (OPVar-Q) plays the second strategy. OP-Q gets on average 0.75 · C + 0.25 · A. OPVar-Q's results are higher because it chooses the action that maximizes its cumulative reward against a stationary opponent.
• Q, PHC, WoLF, JAL and OP-Q (OPVar-Q) play the second strategies and get B and A as average rewards correspondingly.
• MinimaxQ: the same as for type 1.
• FriendQ plays the first strategy while OP-Q chooses the second action. They both get the low average reward C. On getting to know that the opponent permanently plays policy (1, 0), OPVar-Q chooses the first action and gets B as average reward while FriendQ gets A.
• NashQ: the same as for type 1.
Figure 1. Battle of the sexes.
As Figure 1 shows, OPVar-Q managed to get far higher rewards than it would have got playing the Nash equilibrium policy, by forcing the opponent to play the strategy that is more profitable for OPVar-Q and at the same time tuning its policy when facing a stationary opponent.
4.2 Self Play
In self play, OPVar-Q converged to one of the pure Nash equilibria for every class of 2 × 2 repeated games (out of 78 according to Rapoport's classification [8]) where the latter exist.
5 Discussion and Conclusion
This paper addresses the topical problem of extending the reinforcement learning approach to multi-agent systems. An algorithm based on Hurwicz's optimistic-pessimistic criterion is developed. Hurwicz's criterion allows us to embed initial knowledge of how friendly the environment in which the agent is supposed to function will be. A formal proof of the algorithm's convergence is given. Thorough testing of the developed algorithm against Q, PHC, WoLF, MinimaxQ, FriendQ, NashQ and JAL showed that OPVar-Q functions effectively in environments of different levels of amicability by making its opponents follow the policy which is more profitable for it. The variability of Hurwicz's criterion allowed it to converge to best-response against opponents with stationary policies. In self play, for all types (according to Rapoport's classification) of repeated 2 × 2 games, the proposed algorithm converged to a pure Nash equilibrium when the latter existed.
REFERENCES
[1] Kenneth Arrow, 'Hurwicz's optimality criterion for decision making under ignorance', Technical Report 6, Stanford University, (1953).
[2] Michael H. Bowling and Manuela M. Veloso, 'Multiagent learning using a variable learning rate', Artificial Intelligence, 136(2), 215–250, (2002).
[3] Caroline Claus and Craig Boutilier, 'The dynamics of reinforcement learning in cooperative multiagent systems', in AAAI '98/IAAI '98: Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence, pp. 746–752, Menlo Park, CA, USA, (1998). American Association for Artificial Intelligence.
[4] Junling Hu and Michael P. Wellman, 'Multiagent reinforcement learning: Theoretical framework and an algorithm', in ICML '98: Proceedings of the Fifteenth International Conference on Machine Learning, pp. 242–250, San Francisco, CA, USA, (1998). Morgan Kaufmann Publishers Inc.
[5] Michael L. Littman, 'Markov games as a framework for multi-agent reinforcement learning', in ICML, pp. 157–163, (1994).
[6] Michael L. Littman, 'Friend-or-foe Q-learning in general-sum games', in ICML, eds. Carla E. Brodley and Andrea Pohoreckyj Danyluk, pp. 322–328, Morgan Kaufmann, (2001).
[7] Eugene Nudelman, Jennifer Wortman, Yoav Shoham, and Kevin Leyton-Brown, 'Run the gamut: A comprehensive approach to evaluating game-theoretic algorithms', in AAMAS '04: Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 880–887, Washington, DC, USA, (2004). IEEE Computer Society.
[8] Anatol Rapoport, Melvin J. Guyer, and David G. Gordon, The 2 × 2 Game, Ann Arbor: The University of Michigan Press, 1976.
[9] Herbert Robbins and Sutton Monro, 'A stochastic approximation method', Annals of Mathematical Statistics, 22(3), 400–407, (1951).
[10] Csaba Szepesvári and Michael L. Littman, 'Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms', Technical report, Providence, RI, USA, (1996).
[11] Chris J. C. H. Watkins, Learning from Delayed Rewards, Ph.D. dissertation, King's College, Cambridge, England, 1989.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-438
As Safe As It Gets: Near-Optimal Learning in Multi-Stage Games with Imperfect Monitoring
Danny Kuminov (Technion – Israel Institute of Technology, Haifa, Israel 32000, email: dannykv@tx.technion.ac.il) and Moshe Tennenholtz (Technion – Israel Institute of Technology, Haifa, Israel 32000, email: moshet@ie.technion.ac.il)
Keywords: DMA::Game Theoretic Foundations; PS::Planning with Incomplete Information
Abstract. We introduce the first near-optimal polynomial algorithm for obtaining the mixed safety-level value of an initially unknown multi-stage game, played in a hostile environment, under imperfect monitoring. In an imperfect monitoring setting, all that an agent can observe is the current state and its own actions and payoffs; it cannot observe other agents' actions. Our result holds for any generic multi-stage game with a "reset" action.
1 Introduction
Decision making in adversarial settings is a central topic in both AI and game theory. Assuming a purely adversarial setting, playing the (mixed) safety-level strategy is the best one can hope for. Such a strategy maximizes the expected worst-case payoff of the player, and it can also be computed efficiently. This leaves us with two complementary central problems. One problem is the need to deal with settings which are not purely adversarial. Another challenging issue is the need to deal with incomplete information about the environment. In particular, the game played might be unknown, and therefore guaranteeing the safety-level value may be problematic. This paper deals with the latter issue. Consider a multi-stage game. A multi-stage game consists of finitely many states, each of which is associated with a strategic form game. The actions selected by the agents in a given state determine the payoffs of the agents according to the payoff matrix of the corresponding game. Moreover, as a function of the current state and the selected actions we reach a new state. We will consider the situation where we have two agents, and we care about the payoffs that can be guaranteed by player 1 (which we refer to as the agent), when playing against player 2 (which we refer to as the opponent). If the multi-stage game starts from a given initial state, and is played along T stages, then the agent can guarantee itself a particular, optimal safety-level value. This is a common and highly natural solution for this general class of games. However, when the multi-stage game is unknown it is no longer clear what the best possibility for the agent is. A clever algorithm should attempt to learn the structure of the game, in order to obtain a value which is close to the safety-level value. A central issue in this regard is the type of information available to the agent. In particular, the literature distinguishes between perfect monitoring and imperfect monitoring. In the perfect monitoring setting the agent can recognize the state, and observe both its payoff and the opponent's action after the state-game is played. In the imperfect monitoring setting the agent can only recognize the state and observe its own payoff; it cannot observe the opponent's actions. In both settings the idea is to come up with an algorithm that will guarantee that the average payoff will be close to the safety-level value of the underlying game. Moreover, an important objective is that convergence to this value will be obtained in a polynomial number of iterations. The above challenge fits into the so-called agent-centric approach to learning in games (see e.g. [10, 6]). We consider multi-stage games with incomplete information, where there is strict initial uncertainty about the game being played [8, 1]. In the context of repeated games with incomplete information [3], where the multi-stage game consists of only a single state, Banos [5] and Megiddo [9] proved the existence of an algorithm that converges to the safety-level value in any repeated game, even under imperfect monitoring. The algorithm they present, however, is highly inefficient; an efficient algorithm addressing this problem in repeated games can be found in [2]. These results, however, do not apply to general multi-stage games. (We have considered several ways to reduce a multi-stage game to a representation acceptable by the algorithm in [2], but all of them result in learning time that is exponential in the number of states.) On the other hand, if we allow perfect monitoring, then the R-max algorithm, introduced in [7], provides a near-optimal polynomial algorithm. However, the problem of obtaining the (mixed) safety-level value in multi-stage games with imperfect monitoring was left open. In this paper (we omitted many proofs in this version of the paper, due to lack of space; a full version, with all the proofs and additional discussion, can be found at http://www.technion.ac.il/∼dannykv/ecai08.pdf) we address the above challenge, by presenting the first near-optimal polynomial algorithm for obtaining the safety-level value in generic multi-stage games, with strict initial uncertainty about the game. In a generic multi-stage game, for any given state s, action a of the agent, and actions b1, b2 of the opponent, the agent's payoff for (a, b1) in s is different from its payoff for (a, b2) in s. Although somewhat limiting, this assumption captures many interesting situations, and is quite common in the literature. Namely, given an initially unknown generic multi-stage game, we show an efficient algorithm that after polynomially many iterations (in the game size and accuracy parameters) guarantees (almost) the safety-level value with overwhelming probability. A major challenge that this algorithm addresses is that, given imperfect monitoring, the agent cannot know whether two payoffs it obtains in a given state, when playing actions a1 and a2 respectively, are associated with the same action by the opponent.
2 The setting
In a multi-stage game (MSG) the players play a (possibly infinite) sequence of games from some given set of finite games (in strategic form). After playing each game, the players receive the appropriate payoff, as dictated by that game’s matrix, and move to a new game. In the model we consider here, the identity of this new game is uniquely determined by the previous game and by the players’ actions in it.5 Formally: Definition 1. A fixed-sum, two player, multi-stage game (MSG) M on finite set of states S and finite sets of actions X1 , X2 consists of: • Stage Games: each state s ∈ S is associated with a two-player, fixed-sum game in strategic form, where the action set of each player is X1 , X2 accordingly, and the utility function of player 1 is Us : X1 × X2 → . For brevity, we denote X = X1 × X2 . • Transition Function ftr : S × X1 × X2 → S: ftr (s, x1 , x2 ) is the state to which the game transfers from state s given that the first player (the agent) plays x1 and the second player (the opponent) plays x2 . • Designated initial state. W.l.o.g let us denote it by start. In this work, we assume that player 1 (the row player of the stage games) does not know a priori what the payoff matrices are, neither he is informed after each stage about the action taken by player 2 (but he observes what his payoff at that stage was, and he knows what the current state is before playing the respective game). Player 2 (the column player), however, is fully informed about both the payoff matrices and the history of the game. We use the following definitions: • The set of histories of length t of M isH t = ti=1 (S × X). ∞ • The set of all finite histories is H = k=0 H k (here we use the simplifying notation that H 0 = {e}, where e denotes the empty history). • H∞ = ∞ i=1 (S × X) is the set of all infinite histories. • Given a history h, we will slightly abuse notation and denote by Ui (ht ) the payoff to player i in round t of the history h. • A behavioral policy of the informed player (player 2) in the multistage game is a function p2 : H × S → Δ(X2 ). • The set of histories of the game which is available to the unint t formed player (player 1) at stage t is H = l=1 (S × X1 × ). k • We denote H = ∞ H . k=0 • A behavioral policy of the uninformed player in the multi-stage game is a function p1 : H × S → Δ(X1 ). The S in the function parameters represents the current state of the game. • We denote the set of possible behavioral policies of player i by Pi . A utility function in this setting is a function U˜i : H ∞ → . There are several possible definitions of this function based on the utilities in the one-shot game; we will use the function U˜i (h) = lim inf t→∞ 1t tk=1 Ui (hk ). The true game M together with the players’ behavioral policies generate a probability measure over H ∞ , which can be described uniquely by its values for finite cylinder sets. Given this measure, the expected utility of player i given policies p1 , p2 is defined as U˜i (p1 , p2 ) = lim inf t→∞ h∈H t P rh|(p1 , p2 ) 1t tk=1 Ui (hk ), where P rh|(p1 , p2 ) denotes the probability that a finite history h ∈ H t occurs in the first t stages of the game. In this work, we will assume that all payoffs in the stage games are positive and bounded from above by Umax , that the stage games are 5
⁵ This model is a special case of the well-known stochastic game model.
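To make Definition 1 concrete, here is a minimal sketch of the game as a data structure. The class and field names are our own illustration, not from the paper:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Tuple

@dataclass
class MultiStageGame:
    """A fixed-sum, two-player multi-stage game (Definition 1)."""
    states: list                 # finite state set S
    actions1: list               # X1, the agent's actions
    actions2: list               # X2, the opponent's actions
    # payoff[s][(x1, x2)]: player 1's payoff U_s(x1, x2) at state s
    payoff: Dict[Hashable, Dict[Tuple[Hashable, Hashable], float]]
    # f_tr(s, x1, x2) -> next state (deterministic in this model)
    f_tr: Callable[[Hashable, Hashable, Hashable], Hashable]
    start: Hashable              # designated initial state
```

A play of the game then alternates: look up the stage game at the current state, collect the payoff for the chosen action pair, and move to the state returned by f_tr.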
In this work, we will assume that all payoffs in the stage games are positive and bounded from above by $U_{max}$, that the stage games are generic, and that there is a designated action (w.l.o.g. let us denote it by reset) that, from any state and given any opponent action, transfers the game to the initial state and gives payoff 0 to the agent.⁶ In a generic game, for any given action $a$ of the agent, and actions $b_1, b_2$ of the opponent, the agent's payoff for $(a, b_1)$ is different from its payoff for $(a, b_2)$. We also assume that the above is the only prior information available to the agent. He does not know a priori the exact payoff matrices of the stage games, nor does he know a priori what the transition function is for actions other than reset.

In this work we assume that we are given a time limit $T$, and we limit ourselves only to policies that play for $T$ steps and always play reset on step $T+1$ (and then repeat themselves ad infinitum), both in our learning algorithm and in the optimal policy to which we compare.⁷ Let $V_{p_1,p_2}(T) = E_{p_1,p_2}\left[\frac{1}{T+1}\sum_{t=1}^{T} U_1(h^t)\right]$ denote the expected average payoff guaranteed by such a policy $p_1$ against opponent policy $p_2$, and let $V(T) = \max_{p_1}\min_{p_2} V_{p_1,p_2}(T)$ denote the maximal expected average payoff that can be guaranteed by such a policy. Given this definition, our goal is to develop a policy $p_1$ for player 1 which, given confidence $\delta$, accuracy $\epsilon$ and finite time horizon $T$, guarantees after $\hat{l} = \mathrm{poly}(|X_1|, |X_2|, |S|, \frac{1}{\delta}, \frac{1}{\epsilon}, T)$ rounds an expected average payoff of at least $(1-\epsilon)V(T)$ with probability at least $1-\delta$.⁸ Formally, for any game for which the above assumptions hold and for any policy $p_2$ of the opponent:

$$\Pr{}_{p_1,p_2}\left[\forall l \geq \hat{l} : \frac{1}{l}\sum_{t=1}^{l} U_1(h^t) \geq (1-\epsilon)V(T)\right] \geq 1-\delta.$$

Note that the optimal policy under this criterion can be described as a mapping $S \times \{1, \ldots, T\} \to \Delta(X_1)$. Informally, the policy only has to take into account the current state and the number of steps remaining in the current $T$-step sequence when determining the next action – the specific previous history does not matter. This means that the $T$-step min-max policy can be described concisely (i.e., the size of its representation is polynomial in $T$ and the problem parameters) and that it can be computed efficiently, by combining backward induction with the usual techniques for computing mixed min-max strategies in strategic-form games. In fact, this observation holds for any stochastic game, which is a more general model.

⁶ The reset action ensures that there are no irreversible actions. It can be easily verified that learning is impossible otherwise, since, by trying an unknown action, the agent might trap himself in an inferior subgame, without any possibility for going back.
⁷ This choice is justified in the full version of the paper.
⁸ Note that the average is taken over all stages of the game, including the initial learning period $\hat{l}$.
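To make the final observation above concrete, the following sketch computes a $T$-step min-max policy for a known game of this form by backward induction, solving the zero-sum matrix game at every (state, step) pair with a linear program. It is our own illustration under the stated model (using the MultiStageGame sketch above); the function names and the use of scipy are our assumptions, not the authors' code:

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(A):
    """Max-min value and mixed strategy for the row player of matrix A:
    max_p min_j sum_i p_i * A[i, j]."""
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                 # maximize v  <=>  minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])    # v - p^T A[:, j] <= 0 for all j
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # sum_i p_i = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]

def t_step_minmax(game, T):
    """Backward induction over steps T..1; W[s] is the guaranteed value of
    the remaining game, policy[(s, t)] the mixed action at state s, step t."""
    W = {s: 0.0 for s in game.states}            # value with zero steps left
    policy = {}
    for t in range(T, 0, -1):
        W_next, W = W, {}
        for s in game.states:
            # Stage payoff plus continuation value under the known transition.
            A = np.array([[game.payoff[s][(x1, x2)] + W_next[game.f_tr(s, x1, x2)]
                           for x2 in game.actions2]
                          for x1 in game.actions1])
            W[s], policy[(s, t)] = solve_matrix_game(A)
    return policy, W
```

The representation is one mixed action per (state, step) pair, so its size is polynomial in $T$ and the problem parameters, exactly as the text claims.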
3 The algorithm
The basic idea of the algorithm can be summarized as follows:
• In each iteration, the algorithm constructs an approximate (optimistic) model of the multi-stage game, computes the $T$-step optimal strategy for it and executes it.
• The agent represents its knowledge about the game matrix of each stage game by a partition of the set of opponent's actions. For each element of the partition and each action of the agent, it keeps the set of payoffs associated with that subset of the opponent's actions.
• With some small probability the algorithm will explore in the current iteration – that is, it will draw a number $i \in \{1, 2, \ldots, T\}$ (distributed uniformly and independently) and in round $i$ of the iteration, it will play a random action (distributed uniformly and independently) and count the number of times each distinct payoff
was encountered for each action in each state during the sampling. After sampling, it will play reset and start the next iteration.
• When updating its model, for each stage game, the algorithm tries to find a refinement of the partition of the opponent's actions so that payoffs with sufficiently different counts⁹ are in different groups, and payoffs with similar counts are in the same group (and the new partition is the same for all rows).
• We prove that, with high probability, if there are two groups of actions that the opponent used a sufficiently different number of times, we will be able to separate the respective payoffs correctly in all rows – we will learn something about the game matrix.
• Otherwise, the difference between the number of times that the opponent used actions that are in a given element of the partition is small. Note that when constructing the tentative model, the algorithm treats each element as a single meta-action, takes the payoff for the agent when the opponent plays this meta-action to be the average of the distinct payoffs associated with it, and takes the transition function for this meta-action to be uniformly distributed over the successive states for the payoff values that are associated with it.¹⁰ Given that the above difference is small, the algorithm obtains a sufficiently high payoff when using this model.

We assume that the agent knows a priori the following parameters of the problem:
• $|S|$ – the number of states in the multi-stage game.
• $|X_1|, |X_2|$ – the sizes of the strategy sets (of the agent and the opponent respectively).
• $U_{max}$ – the largest possible payoff for the agent in the game.
• $\epsilon, \delta$ – accuracy parameters.

We will also use the following notation:
• $\beta, \gamma$ – two parameters that control the behavior of the algorithm (to be determined later).
• $S' = (S \times \{1, \ldots, T\}) \cup \{0\}$ is the extended set of states. Note that we add a fictitious state 0 to the model, and we treat being in the same state at different times of the $T$-step sequence as being in different states.¹¹
• Let $C_s$ be a variable that holds the counters that the algorithm maintains for a stage game $s \in S'$. Specifically, $C_s : X_1 \times [0, U_{max}] \to \mathbb{N}$ is a function that maps the distinct payoff values for each row to the number of times they were encountered while sampling that row. We denote by $C$ the set of all such variables.
• Let $\Omega_s$ and $\phi_s$, for $s \in S'$, be two variables that represent the partial knowledge that the algorithm has regarding the game matrix (of stage game $s$). Specifically, $\Omega_s$ is a partition of the opponent's action set and $\phi_s : X_1 \times \Omega_s \to 2^{[0, U_{max}]}$ is a function that maps (for each row) elements of the partition to associated groups of payoff values. Note that the initial state of "no knowledge" is represented by $\forall s \in S' : \Omega_s = \{X_2\}$ and $\forall s \in S', i \in X_1 : \phi_s(i, X_2) = \emptyset$, and a state of complete and accurate knowledge is represented by $\Omega_s = \{\{j\} \mid j \in X_2\}$ and $\phi_s(i, \{j\}) = \{U_{ij}\}$.
• Let $f_{tr} : S' \times X_1 \times [0, U_{max}] \to S'$ be the transition function that the algorithm maintains (since the game is generic, utility values can be used in place of opponent actions).
⁹ We use the word "count" to denote the number of times the algorithm encountered a specific payoff value while sampling, as opposed to the actual number of times the respective action was used by the opponent.
¹⁰ Note that although the real transition function is deterministic, the algorithm uses a stochastic game as a tentative model.
¹¹ This distinction is required, since the optimal policy must be able to treat these states differently.
• Let $l_s(\omega)$, for $\omega \in \Omega_s$, be a variable that holds the number of times a stage game $s \in S'$ has been played with the opponent using an action in $\omega$ since the last time the partition $\Omega_s$ was refined. We denote by $l$ the set of all such variables.

Now we define the algorithm:

Procedure RecordAndReset(s, s', x, u, Ωs, φs, ftr, C, l)
    Let ω' ∈ argmin_{ω ∈ Ωs} min_{u' ∈ φs(x, ω)} Cs(x, u')
    Let φs(x, ω') := φs(x, ω') ∪ {u}
    Let ftr(s, x, u) := s'
    For all s'' ∈ S':
        ∀x ∈ X1, u ∈ [0, Umax]: Cs''(x, u) := 0
        ∀ω ∈ Ωs'': ls''(ω) := 0
    End for
End procedure

// Initialization
For all s ∈ S', x1 ∈ X1 \ {reset}:
    Let ftr(s, x1, Umax) := 0
For all s ∈ S':
    Let ftr(s, reset, Umax) := (start, 1)
For all s ∈ S':
    Ωs := {X2}; ls(X2) := 0
    For all x ∈ X1: φs(x, X2) := ∅
    For all x ∈ X1, u ∈ [0, Umax]: Cs(x, u) := 0
End for

While true   // Endless loop
    // Model update
    For all s ∈ S':
        For each ω ∈ Ωs:
            For each row i ∈ X1:
                Let (u_{i1}, ..., u_{i|ω|}) be the elements of φs(i, ω), ordered in non-decreasing order of Cs(i, u).
                If |φs(i, ω)| < |ω| then
                    // the number of observed payoffs for ω is less than the number of actions in ω
                    Add (|ω| − |φs(i, ω)|) entries of Umax + 1 to φs(i, ω) (with count 0).¹²
                End if
            End for
            If there exists 1 < k ≤ |ω| such that ∀i ∈ X1: Cs(i, u_{ik}) − Cs(i, u_{i(k−1)}) > 2γ ls(ω)^{3/4} then
                // Here we refine the partition
                Split ω = {y1, ..., y_{|ω|}} into ω1 = {y1, ..., y_{k−1}} and ω2 = {y_k, ..., y_{|ω|}}.
                Replace ω with ω1, ω2 in Ωs.
                Modify φs so that ∀i ∈ X1: φs(i, ω1) = {u_{i1}, ..., u_{i(k−1)}} \ {Umax + 1} and φs(i, ω2) = {u_{ik}, ..., u_{i|ω|}} \ {Umax + 1}
            End if
        End for
        Repeat the previous loop until no more splits are made.
        If any split was made:
            ∀x ∈ X1, u ∈ [0, Umax]: Cs(x, u) := 0
            ∀ω ∈ Ωs: ls(ω) := 0
    End for
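As a small illustration of the split test in the model-update loop above (our own sketch; `find_split` is a name we introduce): the payoffs of a partition element are ordered by their counters in every row, and the element is split at the first index where the counter gap exceeds $2\gamma\, l_s(\omega)^{3/4}$ in all rows simultaneously.

```python
def find_split(row_counts, gamma, l_omega):
    """row_counts[i] holds the counters of row i's payoffs in one partition
    element, sorted in non-decreasing order.  Returns the split index k
    (1 <= k < size), or None when no gap clears the threshold in every row."""
    threshold = 2.0 * gamma * l_omega ** 0.75
    size = len(row_counts[0])
    for k in range(1, size):
        if all(row[k] - row[k - 1] > threshold for row in row_counts):
            return k
    return None
```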
    Build a stochastic game in which:
        S' is the set of states.
        The game matrix U's ∈ ℝ^{|X1|×|Ωs|} for each state s ∈ S' is:
            For each i ∈ X1, ω ∈ Ωs:
                Let U'_{iω} = (1/|ω|) ( Σ_{u_i ∈ φs(i, ω)} u_i + (|ω| − |φs(i, ω)|) Umax )   // See ¹³
        The game matrix for the state 0 is U'_0 ∈ ℝ^{|X1|×|X2|}:
            For all x1 ∈ X1, x2 ∈ X2: U'_{x1, x2} = Umax
        For all states s ∈ S', t ∈ S' \ {0}, agent action x ∈ X1 and opponent meta-action ω ∈ Ωs, the transition probability is:
            Pr(s, x, ω, t) = (1/|ω|) · |{u ∈ φs(x, ω) : ftr(s, x, u) = t}|
        For s ∈ S', x ∈ X1, ω ∈ Ωs, the probability of transition to state 0 is:   // See ¹⁴
            Pr(s, x, ω, 0) = (|ω| − |φs(x, ω)|) / |ω|
        For the state 0, the transition function is:
            ∀x1 ∈ X1, ω ∈ Ω0: Pr(0, x1, ω, 0) = 1
            ∀x1 ∈ X1, ω ∈ Ω0, s ∈ S': Pr(0, x1, ω, s) = 0
    Compute the T-step mixed safety-level strategy for this stochastic game.
    Let explore be a random boolean value with P(explore = true) = β
    Let i be an integer selected from [1, T] with uniform probability
    Repeat for t from 1 to T:
        Let s denote the current state.
        If explore = true and t = i:
            Let x ∈ X1 be an action selected at random with uniform probability
            Execute action x → let u be the observed payoff and s' the new state.
            Let Cs(x, u) := Cs(x, u) + 1
            If there is no ω ∈ Ωs with u ∈ φs(x, ω):
                Call RecordAndReset(s, s', x, u, Ωs, φs, ftr, C, l)
                Break   // T-step Repeat
            End if
        Else:
            Let x be the action prescribed by the safety-level strategy for the current state and step.
            Execute action x → let u be the observed payoff and s' the new state.
            If there is no ω ∈ Ωs with u ∈ φs(x, ω) then
                Call RecordAndReset(s, s', x, u, Ωs, φs, ftr, C, l)
                Break   // T-step Repeat
            End if
        End if
    End   // T-step Repeat
    Play reset
End while

¹² Those values are just placeholders for unknown values – we could use any impossible value here.
¹³ Here we again make an optimistic assumption that payoffs yet unobserved are equal to Umax.
¹⁴ Here we make an optimistic assumption that transitions yet unobserved lead to the "heaven" state 0.
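The optimistic estimates used when building the tentative stochastic game can be sketched as follows (our own code, with invented names; cf. footnotes 13 and 14): unobserved payoffs count as $U_{max}$ and unobserved transitions send their probability mass to the "heaven" state 0.

```python
def meta_action_payoff(observed_payoffs, omega_size, u_max):
    """Optimistic payoff of a meta-action: observed payoffs averaged with
    U_max standing in for each payoff not yet observed (footnote 13)."""
    missing = omega_size - len(observed_payoffs)
    return (sum(observed_payoffs) + missing * u_max) / omega_size

def meta_action_transitions(observed_payoffs, f_tr, s, x, omega_size):
    """Optimistic successor distribution of a meta-action: each recorded
    payoff votes for its known successor; the unobserved remainder goes
    to the fictitious state 0 (footnote 14)."""
    probs = {0: (omega_size - len(observed_payoffs)) / omega_size}
    for u in observed_payoffs:
        t = f_tr(s, x, u)
        probs[t] = probs.get(t, 0.0) + 1.0 / omega_size
    return probs
```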
4 The analysis
Let $\tilde{H}^\infty = (X \times \{\mathrm{true}, \mathrm{false}\} \times X_1)^{\mathbb{N}}$ be the set of all infinite histories of the game that includes information about the realization of the random decision variables used by the algorithm (whether an exploration has been done in a specific round and the row chosen for exploration). The true multi-stage game $M$ together with both players' policies generates a probability measure over $\tilde{H}^\infty$, which can be described uniquely by its values for finite cylinder sets. All random variables that we use in this analysis are derived from this probability measure. We show:

Theorem 1. Given a multi-stage game that conforms to the requirements set in Section 2, and $\epsilon > 0$, $\delta > 0$, there exists $\hat{l} = \mathrm{poly}(|X_1|, |X_2|, T, |S|, U_{max}, \frac{1}{\delta}, \frac{1}{\epsilon})$ such that for any policy of the opponent, the above algorithm achieves in every round $l \geq \hat{l}$ an expected average (over all rounds since the start of the game) payoff of at least $(1-\epsilon)V(T)$ with probability at least $1-\delta$, where $V(T)$ is the maximal expected average payoff that can be guaranteed after playing $T$ steps.

To prove the theorem, we need the following notation:
1. Let $(l_1, l_2, \ldots)$ be the indices of the rounds of the multi-stage game at which the algorithm updates the partition and/or records a new payoff value for one of the states (note that there can be at most $|X_2||S|T + |X_1||X_2||S|T$ such rounds), and let us divide the rounds of the game into epochs $((0, \ldots, l_1-1), (l_1, \ldots, l_2-1), (l_2, \ldots, l_3-1), \ldots)$.
2. For brevity, we will denote by $Q_1, Q_2, \ldots$ values that are polynomial with respect to the problem parameters and are constant throughout the execution of the algorithm. In particular, we will denote by $Q_1 = (|X_1|+1)|X_2||S|T + 1$ the maximal number of epochs.
3. For a given stage game $s \in S'$ and epoch $e = (l_i, \ldots, l_{i+1}-1)$, let $C_{s,e}^{l}$ be the counter function that the algorithm maintains at round $l$ of the epoch (i.e., at round $l_i + l$ of the game) for the stage game $s$.
4. For an epoch $e$ and for each $j \in X_2$, let $F_{s,e}^{l}(j)$ be the number of times that the stage game $s$ was played in the first $l$ rounds of the epoch and the opponent played $j$. Note that, by definition, the value of $l_s(\omega)$ at round $l$ of the epoch equals $\sum_{j \in \omega} F_s^{l}(j)$.
5. When the epoch under consideration is clear from context, we will omit the subscript $e$.
6. Note that all of the above are random variables.
7. Note that the probability that sampling occurs in a given round of the multi-stage game is $\frac{\beta}{T}$ and the probability that a specific action is sampled is $\frac{\beta}{T|X_1|}$, independent of any other random variables or the actions of the opponent.
8. Therefore, for any stage game $s$ and actions $i \in X_1$, $j \in X_2$, the expected value of the counter maintained by the algorithm, $C_s^{l}(i, U_{ij})$, given the value of $F_s^{l}(j)$, is $\frac{\beta F_s^{l}(j)}{T|X_1|}$.
The following lemma shows that the counters maintained by our algorithm represent in an adequate manner the frequency with which actions are used by the opponent.

Lemma 1 (Counter accuracy). Let us examine a specific epoch $e = (l_k, \ldots, l_{k+1}-1)$. There exists $\gamma = \mathrm{poly}(|S|, T, \frac{1}{\delta}, |X_1|, |X_2|)$ such that for any policy of the opponent and any $0 < \beta < 1$:

$$\Pr\left[\exists l \in \mathbb{N},\ \exists i \in X_1,\ \exists s \in S',\ \exists j \in \omega \in \Omega_s :\ \left|C_s^{l}(i, U_{ij}) - \frac{\beta F_s^{l}(j)}{T|X_1|}\right| \geq \gamma\, l_s(\omega)^{3/4}\right] \leq \frac{\delta}{Q_1}$$
The intuition here is that, since the sampling is independent of any action of the adversary and any other action of the algorithm, the counters collected by the algorithm result from a representative sample of the opponent's actions, and therefore yield a reliable estimate of the number of times the opponent used the respective action. Technically, it is proved using the Azuma bound ([4]). The proof is omitted due to lack of space, and appears in the full version.

Given the above, from now on, we will assume as given that for all epochs in which the game is not yet fully known:

$$\forall l \in \mathbb{N},\ \forall i \in X_1,\ \forall s \in S',\ \forall j \in \omega \in \Omega_s :\ \left|C_s^{l}(i, U_{ij}) - \frac{\beta F_s^{l}(j)}{T|X_1|}\right| < \gamma\, l_s(\omega)^{3/4} \qquad (1)$$

(that is, the negation of the inequality in the above lemma holds for all states and rounds of play in the epoch). Using this assumption, we show that the algorithm achieves the required expected average payoff against any policy of the opponent with probability 1. The following pair of lemmas shows that (under the above assumption) the algorithm refines the information structure appropriately.

Lemma 2 (Sufficient condition for split). Given Eq. (1), if at any round $l \in \mathbb{N}$ in a given epoch there exist a stage game $s$ and two actions $y, y' \in \omega \in \Omega_s$ of the opponent such that $\left|F_s^{l}(y) - F_s^{l}(y')\right| > 4\gamma\, l_s(\omega)^{3/4}\, \frac{|X_1||X_2|}{\beta}$, then the algorithm must split $\omega$ in this round.

The intuition here is that since the sampling process is representative, the counters collected in different rows for payoffs that result from the same (hidden) action by the adversary must have similar values. In particular, if the frequencies with which the opponent used two of his actions are sufficiently different, the counters for the respective payoff values will have significantly different values – in all rows. Therefore, the algorithm can safely conclude that those payoff values result from distinct actions by the opponent. The proof is omitted due to lack of space, and appears in the full version.

Lemma 3 (Split correctness). Given Eq. (1), the algorithm never makes a mistake in assigning the payoffs in the "split" phase. Formally, a mistake would mean that in partitioning $\omega$ into $\omega_1$ and $\omega_2$ at round $l$, there are two payoff values $u_1 \in \phi_s^{l}(i_1, \omega)$ and $u_2 \in \phi_s^{l}(i_2, \omega)$ that belong to the same column in the true game matrix (i.e. $\exists j \in X_2 : u_1 = U_{i_1 j}^{s},\ u_2 = U_{i_2 j}^{s}$) and the algorithm assigns $u_1$ to $\phi_s^{l}(i_1, \omega_1)$ and $u_2$ to $\phi_s^{l}(i_2, \omega_2)$.

The intuition here is that given the error margin asserted by Eq. (1), the algorithm cannot, while refining the partition, mistakenly assign a payoff value to a partition element that does not contain the respective opponent action. This is so since the algorithm relies on the counter values when assigning the payoffs, and the counter values are representative of the actual opponent actions so far. The proof is omitted due to lack of space, and appears in the full version.

Lemma 4. Suppose that in a given epoch of length $l$, in all stage games, for any two opponent strategies $j_1, j_2 \in \omega \in \Omega_s$ (which are in the same part of the partition $\Omega_s$) in a given stage game $s \in S'$ it holds that $|F_s^{l}(j_1) - F_s^{l}(j_2)| \leq \frac{\epsilon}{4T|X_2|}\, l_s(\omega)$. Then the expected average payoff of the algorithm in this epoch is at least $\left(1 - \frac{\epsilon}{4}\right)V(T)$.

The intuition here is that as long as the opponent uses some of his actions (roughly) the same number of times, the fact that the algorithm cannot distinguish which payoff belongs to which action (in this set of actions) does not decrease its payoff – the assumption that the payoff for each of those actions is the numerical average of the set of payoffs works well enough. The proof is omitted due to lack of space, and appears in the full version.

The following lemmas deal with the situation where nothing is learned for a "long time", and show that in this case the agent will get a high payoff. The proofs of these lemmas are omitted due to lack of space, and appear in the full paper. Let us denote $Q_2 = \left(\frac{4\gamma \cdot 4T|X_1||X_2|^2}{\beta\epsilon}\right)^{4}$.

Lemma 5. If, during some epoch, there is a stage game $s \in S'$ and two strategies $j_1, j_2 \in \omega \in \Omega_s$ (which are in the same part of the partition $\Omega_s$) such that $l_s(\omega) > Q_2$ and $|F_s^{l}(j_1) - F_s^{l}(j_2)| > \frac{\epsilon}{4T|X_2|}\, l_s(\omega)$, then a split will occur (and the epoch will end).

Lemma 6. Let $l$ denote the length (in rounds) of an epoch. Suppose that $l \geq \frac{1}{1-\beta}\, \frac{4U_{max}}{\epsilon}\, Q_2 |S| T^2 |X_2|$; then the expected average payoff in this epoch is at least $(1-\beta)\left(1 - \frac{\epsilon}{2}\right)V(T)$.

The intuition here is that, if the algorithm did not refine any of the partitions for a long time, then, for each partition element, the opponent must have used the different actions in this partition element a similar number of times. The key observation is that the bound on the difference in frequency of use of the actions that is implied by Lemma 2 is $O(l_s(\omega)^{3/4}) < O(l_s(\omega))$, and therefore, if an epoch is longer than some polynomial, the relative difference in frequency of use (relative to the overall length of the epoch) will become small enough for Lemma 4 to hold. The proof is omitted due to lack of space, and appears in the full version.

Let us denote $Q_3 = \frac{1}{1-\beta}\, \frac{4U_{max}}{\epsilon}\, Q_2 |S| T^2 |X_2|$. Combining the above, we can now prove our main theorem:

Proof. Let us select $\beta = \epsilon/4$. It follows from the previous lemmas that the expected average payoff of any epoch that is longer than $Q_3$ is at least $\left(1 - \frac{3\epsilon}{4}\right)V(T)$. Recall that there are at most $Q_1$ epochs and therefore the maximal total length of epochs that contain fewer than $Q_3$ rounds is $Q_1 Q_3$. This means that if the algorithm runs for at least $\hat{l} = \frac{4}{\epsilon} Q_1 Q_3$ rounds, the expected average payoff is at least $\frac{\hat{l} - Q_1 Q_3}{\hat{l}}\left(1 - \frac{3\epsilon}{4}\right)V(T) = \left(1 - \frac{\epsilon}{4}\right)\left(1 - \frac{3\epsilon}{4}\right)V(T) \geq (1-\epsilon)V(T)$.
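As a quick check of the closing arithmetic (our own expansion, with $\hat{l} = \frac{4}{\epsilon} Q_1 Q_3$):

```latex
\frac{\hat{l}-Q_1Q_3}{\hat{l}}\left(1-\frac{3\epsilon}{4}\right)V(T)
  = \left(1-\frac{\epsilon}{4}\right)\left(1-\frac{3\epsilon}{4}\right)V(T)
  = \left(1-\epsilon+\frac{3\epsilon^{2}}{16}\right)V(T)
  \;\geq\; (1-\epsilon)\,V(T).
```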
REFERENCES
[1] I. Ashlagi, D. Monderer, and M. Tennenholtz, 'Robust learning equilibrium', in Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI 2006), (2006).
[2] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire, 'The non-stochastic multi-armed bandit problem', SIAM J. Comput., 32, 48–77, (2002).
[3] R. Aumann and M. Maschler, Repeated Games with Incomplete Information, MIT Press, 1995.
[4] K. Azuma, 'Weighted sums of certain dependent random variables', Tôhoku Math. Journal, 19, 357–367, (1967).
[5] A. Banos, 'On pseudo games', The Annals of Mathematical Statistics, 39, 1932–1945, (1968).
[6] M. Bowling and M. Veloso, 'Rational and convergent learning in stochastic games', in Proc. 17th IJCAI, pp. 1021–1026, (2001).
[7] R. I. Brafman and M. Tennenholtz, 'R-max – a general polynomial time algorithm for near-optimal reinforcement learning', Journal of Machine Learning Research, 3, 213–231, (2002).
[8] N. Hyafil and C. Boutilier, 'Regret minimizing equilibria and mechanisms for games with strict type uncertainty', in Proceedings of the 20th Annual Conference on Uncertainty in Artificial Intelligence (UAI-04), pp. 268–277, Arlington, Virginia, (2004). AUAI Press.
[9] N. Megiddo, 'On repeated games with incomplete information played by non-bayesian players', Int. J. of Game Theory, 9, 157–167, (1980).
[10] R. Powers and Y. Shoham, 'New Criteria and a New Algorithm for Learning in Multi-Agent Systems', in NIPS 2004, (2004).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-443
A Heuristic Based Seller Agent for Simultaneous English Auctions
Patricia Anthony¹ and Edwin Law²

Abstract. The popularity of online auctions stems from the flexibility and convenience that they offer to consumers. Sellers offer a variety of items for sale with the aim of obtaining more profit. To obtain a reasonable profit, the reserve price for an item must be determined before the item is put up for sale. However, setting the price too high may result in no sale, while setting the price too low may yield a lower profit. In real auctions, this is the main selling problem, since most sellers fail to place a strategic reserve price for a given item, which results in a lower profit. In this work, we develop a seller agent that proposes the item's reserve price based upon several selling constraints: the number of competitors, the number of bidders, the duration of the auction and the degree of profit that the seller desires when disposing of the item. This paper describes the detailed design and implementation of our agent's strategy. The seller strategy is evaluated across a diverse and varying selling environment using a simulated auction marketplace.
1 INTRODUCTION
Online auctions have been increasingly used as a medium to sell a variety of items via the Internet. Among the most popular auction houses are eBay, UBid, Bidbay, Yahoo, and Amazon.com. However, eBay has become the most popular auction site and emerged as the online market leader in 2007, reporting the highest growth rate with sales totalling USD$16.2 billion.³ An online auction is the process of buying and selling goods by offering them up for bid, taking bids and then selling the item to the winning bidder.⁴ There are four main types of auction used for single-object auctions, where a single unit of an item is offered. These are the English auction, the Dutch auction, the first-price sealed-bid auction and the second-price sealed-bid auction (also known as the Vickrey auction [9]). Among the frequent selling formats used for these auctions are the no reserve price auction (the item being auctioned does not have any reserve price), the public reserve price auction (the reserve price for the item is publicly announced), the private reserve price auction (the item's reserve price is known only to the seller) and the buy-it-now auction (the user can purchase the item at a fixed price while the auction is progressing) [3]. Since there can be multiple auctions trading the same item, the pricing of the good is the most essential factor that must be considered if the seller wishes to get more profit. Many sellers have found that setting a high price for an item may not result in a sale. However, setting a low price may result in the item being disposed of at
¹ Universiti Malaysia Sabah, Malaysia, email: panthony@ums.edu.my
² Universiti Malaysia Sabah, Malaysia, email: jazedwin@gmail.com
³ http://news.ebay.com/fastfacts ebay marketplace.cfm
⁴ http://en.wikipedia.org/wiki/Auction
a low price, and there is a possibility that the item is sold at a value below the market price. Since the auction environment is highly complex, dynamic and unpredictable, setting a strategic reserve price is not a straightforward process. There are several factors that need to be considered when deciding on the single optimal reserve price [1]. Firstly, we are uncertain of the number of competitors who are competing in selling identical objects. Secondly, we cannot know the number of bidders who may participate in each auction, since there are a number of auctions that run simultaneously in which the bidders can choose to participate. Thirdly, each ongoing auction has a different duration. Fourthly, each auction imposes a different reserve price for the item being offered. To date, researchers have worked on generating the seller's reserve price, since the role of the reserve price has received much attention both in single isolated auctions and in cases where sellers compete against each other. Gerding et al. worked on optimizing sellers' profit using the strategy of shill bidding, where sellers submit a shill bid to increase their profit and to ensure that they do not sell the item at a low price [2]. However, shill bidding is undesirable and has become a common form of Internet auction fraud that undermines the trust and sales revenues of institutions. Morris et al. developed a reserve price strategy and a seat releasing strategy based on past history in a sealed-bid auction [7]. These two strategies work simultaneously: the reserve price for the seats is determined based on the number of seats available, and today's reserve price is computed based on yesterday's reserve price. However, the strategy only works well if demand remains constant, since it does not cater for demand prediction and movements. Moreover, poor sales may occur if the first day's reserve price was inaccurately computed. Min et al. proposed an agent-based system to generate the reserve price based on the case similarity of information retrieval theory and the moving average of time series analysis [6]. This technique has a tendency to produce an unreasonable reserve price that fails to reflect the recent trend of auction prices, which may harm sellers. Most of the previous work is targeted at obtaining a high winning price, while other selling aspects such as the selling rate and the true market value of the item being auctioned have been ignored. To solve these problems, we developed an intelligent selling agent that is able to generate a reserve price for the seller by taking into account the selling constraints. In this work, we focus on the English auction using the private reserve price format, since this kind of auction is commonly practiced on the eBay auction site [5, 8]. The main purpose of our work is to generate a reserve price that guarantees a reasonable profit within a given time frame. The remainder of the paper is structured as follows. Section 2 explains the simulated marketplace used in our experiment. Section 3
describes the implementation and the design of the selling strategy that generates the reserve price. In Section 4, we describe the experimental evaluations and finally, Section 5 concludes.
2 THE SIMULATED ONLINE AUCTION MARKETPLACE
The electronic marketplace simulation supports three types of protocol: English, Dutch and Vickrey. However, for this particular work, it is configured to run multiple auctions using the English protocol only. The simulated marketplace serves as a platform to simulate and replicate the real online auction environment, in which there are multiple buyers and sellers participating in the marketplace. This platform is also used to measure and evaluate the appropriateness and suitability of our seller agent's strategy. In this work, the market is set up to run in continuous selling rounds where, in each round, there are a number of auctions running simultaneously until the global time for the market is reached. It is assumed that each auction is offering the same identical item and that only one single unit is being offered. The number of auctions is generated randomly, based on a standard probability distribution, between 2 and 30. This number constitutes the number of competing sellers offering the same identical item for sale. For each auction, there are between 2 and 15 bidders. Each auction has a selling duration, randomly generated between 1 and 30. The reserve price (the minimum price at which the seller is willing to sell the item) is randomly generated between 50 and 90. Each bidder has their own private valuation, which is the maximum price they are willing to pay for the desired item; this private valuation is randomly generated between 50 and 90. All these values are randomly drawn from a standard probability distribution. Each auction has a finite start time and a finite end time. The auction starts with an opening price and each bidder bids for the item by raising the bid price. The bidders follow the standard dominant bidding strategy, in which they will only bid slightly higher than the current price, as described in auction theory [4]. It is also assumed that the English auction is a private reserve auction, in which the bidders are informed when the current bid has exceeded the reserve price. When this information is announced, the bidders change their tactics by bidding the smallest possible price to avoid overpaying for the item. This scenario is similar to an eBay auction. The marketplace remains active until the global time is reached and all auctions are closed. There are several events that can happen once an auction is closed. If the closing price is less than the reserve price of the item, the auction is closed with no trade. Otherwise, the winner and the auction closing price are announced and a trade takes place between the seller and the winning bidder.
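A minimal sketch of one selling round under the ranges above (our own code; we read "standard probability distribution" as uniform draws, which is an assumption):

```python
import random

def new_selling_round():
    """Draw the parameters of one selling round of the simulated marketplace."""
    num_auctions = random.randint(2, 30)   # competing sellers, identical item
    auctions = []
    for _ in range(num_auctions):
        num_bidders = random.randint(2, 15)
        auctions.append({
            "duration": random.randint(1, 30),
            "reserve_price": random.uniform(50, 90),   # private reserve
            "valuations": [random.uniform(50, 90)      # bidders' private values
                           for _ in range(num_bidders)],
        })
    return auctions
```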
3 DESIGNING THE AGENT'S SELLING STRATEGY

There are several factors that need to be considered when deciding the reserve price for the item. Rationally, the price should be determined based on the supply and demand in the market. The first factor is the number of competitors offering the same item at the same time. The key determinant of what price to offer is how many competitors are selling the same item in the marketplace. In reality, a high price must be set when there are only a few competitors (low supply) in the market. However, a low price must be considered when there are many competitors (high supply). The second factor that needs to be considered is the number of bidders participating in each auction. In economic terms, a low price is imposed when there are few bidders (low demand) and a high price is set when there are many bidders (high demand) in the market. The third factor that affects the price setting is the selling duration of each auction. An auction with a longer duration allows the possibility of eliciting a higher bid price. As such, the seller should impose a higher reserve price for the item being auctioned. However, a low price must be considered for an auction that has a shorter duration, since this limits the chances of selling the item at a higher bid value. The last factor being considered is the level of profit that the seller desires. If the seller's intention is to get rid of the item, a lower pricing strategy is inevitable to optimize the chance of selling. However, a high pricing strategy must be considered if the seller intends to obtain a higher profit and the best price for the item being sold.

The set of considerations comprising the number of competitors, the number of bidders, the level of profit the seller desires, and the auction duration is referred to as the selling constraints. More formally, let $C$ be the set of considerations that the agent takes into account when generating the reserve price, and let $j$ represent an individual selling constraint, where $j \in 1..|C|$. For each constraint $j \in 1..|C|$, there is a corresponding function $f_j$ which suggests a value of the reserve price based on that particular constraint. At a given time, the agent may consider any of the selling constraints individually, or it may combine them depending on the situation. If the agent combines multiple selling constraints, it allocates weights to denote their relative importance. Here, the weights are rated on a scale $0 \le w_j \le 1$ with $\sum_j w_j = 1$, where $w_j$ is the weight allocated to constraint $j$. Given the set of constraints $C$, the reserve price $v$ is calculated as

$$v = \sum_{j \in C} w_j f_j \qquad (1)$$

3.1 The Competitor Function

Assume that $n$ is the number of competitors at a given selling round $r$. Let $f_c$ be the function that determines a single price based on the number of sellers, where $p$ is the mean price for a given number of competitors. $f_c$ is then defined as

$$f_c(n) = p(n) \qquad (2)$$

3.2 The Bidder Function

Similarly, assume that $n$ is the number of bidders in a given auction. Let $f_b$ be the function that determines a single price based on the number of participating bidders, where $p$ is the mean price for a given number of bidders. $f_b$ is then defined as

$$f_b(n) = p(n) \qquad (3)$$

3.3 The Time Function

Assume that $t$ is the auction length for a given auction. Let $f_t$ be the function that determines a single price based on the duration for which the auction will be held. In a real auction, this information is stated by the seller. The parameter $p$ is the mean price for a given auction length, and hence $f_t$ is defined as

$$f_t(t) = p(t) \qquad (4)$$
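Equations (1)-(4) combine mechanically; a small sketch (with names of our own choosing) of the weighted combination in Eq. (1):

```python
def reserve_price(weights, functions, inputs):
    """Eq. (1): v = sum_j w_j * f_j, with the weights summing to one.
    weights, functions and inputs are keyed by constraint name."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(w * functions[j](inputs[j]) for j, w in weights.items())

# Example: equal weighting of all four constraints.
# weights = {"competitor": 0.25, "bidder": 0.25, "time": 0.25, "profit": 0.25}
```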
3.4 The Profit Function
Assume that $n$ is the percentage of profit that the seller desires. Let $f_p$ be the function that determines a single price based on a given percentage between 0% and 20%. This percentage is kept small to capture the range of profit percentages under a more realistic assumption. The single price generated using this function is formed in two stages: the reserve price is estimated from the past auction history, and the price is then inflated according to the percentage of profit that the seller desires. Here, we define $f_p$ as the function that determines a single price based on the profit that the seller desires, $r$ is the reserve price, and $n$ is the percentage of profit desired. In order to generate the reserve price $r$, we analyzed the bidding history of all the successful auctions that were completed with a sale. For all closed auctions with a sale, the lower bound price, which is the first minimum bid that has met the reserve price, is defined as the Lowest Traded Price $\lambda$, whereas the closing price is termed the Highest Traded Price $\rho$. We then calculate the mean prices $\chi$ and $\delta$ by accumulating the $\lambda$ and $\rho$ values and dividing by the total number of successful auctions $M$:

$$\chi = \frac{1}{M}\sum_{i=1}^{M} \lambda_i \qquad (5)$$

$$\delta = \frac{1}{M}\sum_{i=1}^{M} \rho_i \qquad (6)$$

Assume that $N \le M$ is the number of auctions which recorded $\lambda$ and $\rho$ that exceeded $\chi$ and $\delta$. Here, $\alpha$ (defined as the estimated minimum price once the reserve price is met) is calculated by accumulating all the $\lambda$s among these outstanding auctions $N$ where $\forall \lambda \ge \chi$. To calculate $\beta$ (the fraction of price that lies between the maximum price and the minimum price once the reserve price is met), the price difference between the $\lambda$ and $\rho$ of each auction is calculated and divided by the $\lambda$, and this is accumulated over all $N$ auctions where $\forall \rho \ge \delta$:

$$\alpha = \sum_{i=1}^{N} \lambda_i \qquad (7)$$

$$\beta = \sum_{i=1}^{N} \frac{\rho_i - \lambda_i}{\lambda_i} \qquad (8)$$

Finally, the reserve price $r$ is calculated as

$$r = \frac{1}{N^2} \times \alpha \times (N - \beta) \qquad (9)$$

The reserve price is then inflated based on the percentage of profit desired, $n$:

$$f_p(n) = r \times (1.00 + n) \qquad (10)$$
In each round, a different reserve price r is generated and this information is updated with each successful auction.
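Putting Eqs. (5)-(10) together (a sketch of ours; how exactly the $N$ outstanding auctions are selected for Eqs. (7) and (8) is our reading of the text):

```python
def profit_function(history, n):
    """Eqs. (5)-(10): reserve price from past successful auctions, inflated
    by the desired profit fraction n (0 <= n <= 0.20).
    history is a list of (lowest_traded, highest_traded) price pairs."""
    M = len(history)
    chi = sum(lam for lam, _ in history) / M            # Eq. (5)
    delta = sum(rho for _, rho in history) / M          # Eq. (6)
    outstanding = [(lam, rho) for lam, rho in history
                   if lam >= chi and rho >= delta]
    N = len(outstanding)
    if N == 0:
        return 0.0                                      # no outstanding auctions yet
    alpha = sum(lam for lam, _ in outstanding)                     # Eq. (7)
    beta = sum((rho - lam) / lam for lam, rho in outstanding)      # Eq. (8)
    r = alpha * (N - beta) / N ** 2                                # Eq. (9)
    return r * (1.0 + n)                                           # Eq. (10)
```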
4 EXPERIMENTAL EVALUATION
To evaluate the performance of our agent using the selling strategy described above, we performed several experimental evaluations. The objective of these experimental evaluations is to examine the efficiency and effectiveness of our seller’s strategy in achieving and
delivering the desired selling aims. In order to evaluate the effectiveness of the selling strategy, four different measures are used. Firstly, the success rate, which is the number of times the seller agent is able to sell the item. Secondly, the total profit made by the agent, measured as a percentage. The third measure is the average winning price over all the auctions won by the seller agent. The last measure is the percentage of gain/loss with respect to the market price for the item being sold by the seller agent. The gain/loss is calculated by taking the closing price of a given auction minus the average closing price.

In this experiment, the performance of the seller agent across diverse and varied selling environments is also evaluated, in order to investigate the suitability of our strategy in various environments. The seller agent is subdivided into six individual agents. The difference from one seller agent to another is in the distribution of weights over each of the four functions. The purpose of configuring the experimental setup this way is to identify which strategy is best suited to a given environment.

The environment is classified into seven different settings, categorized as LC, MC, LB, MB, ST, LT and Rand, based on the number of competitors, the number of bidders and the duration of the auction (summarized in the sketch below). The first environment is categorized as less competitors (LC) and has between 2 and 15 competitors; in other words, there are between 2 and 15 auctions selling the identical item in the marketplace. For this environment, the total number of bidders is drawn randomly between 2 and 15, while the auction duration is drawn randomly between 1 and 30. The second environment is defined as many competitors (MC), in which there are between 16 and 30 competitors running concurrently in the marketplace. As in the first environment, the other parameters are generated randomly. The third environment is defined as less bidders (LB), where the number of bidders for each auction in the marketplace is between 2 and 8. The remaining parameters are drawn randomly, between 1 and 30 for the number of competitors and between 1 and 30 for the auction duration. The fourth environment is defined as many bidders (MB), with 9 to 15 bidders. The fifth environment is defined as short time (ST), in which the duration of each auction is between 1 and 15. Similarly, all other parameters are drawn randomly. The sixth environment is defined as long time (LT), where the duration of each auction in the marketplace is between 16 and 30 and all other parameters are generated randomly. Lastly, in the random environment (Rand), the number of competitors, the number of bidders, and the timing are generated randomly between 2 and 30, 1 and 15, and 1 and 30 respectively.
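The seven settings can be summarized as parameter ranges (our own tabulation; where the text says a parameter is "drawn randomly" without giving a range, we fill in the Section 2 defaults as an assumption):

```python
# (low, high) inclusive ranges for each environment.
ENVIRONMENTS = {
    "LC":   {"competitors": (2, 15),  "bidders": (2, 15), "duration": (1, 30)},
    "MC":   {"competitors": (16, 30), "bidders": (2, 15), "duration": (1, 30)},
    "LB":   {"competitors": (1, 30),  "bidders": (2, 8),  "duration": (1, 30)},
    "MB":   {"competitors": (2, 30),  "bidders": (9, 15), "duration": (1, 30)},
    "ST":   {"competitors": (2, 30),  "bidders": (2, 15), "duration": (1, 15)},
    "LT":   {"competitors": (2, 30),  "bidders": (2, 15), "duration": (16, 30)},
    "Rand": {"competitors": (2, 30),  "bidders": (1, 15), "duration": (1, 30)},
}
```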
We defined six different strategies for our seller agents. These six strategies use a combination of varying weights among the four constraints for each environment. The agents' strategies are categorized as One Constraint, Two Constraints, Three Constraints, Four Constraints, Equal Constraints and Unequal Constraints. Agent I uses the One Constraint strategy, where only a single function (with a weighting of 1.0) is used. For example, if the environment is LC/MC, the competitor function is selected; for the LB/MB setting, the bidder function is picked; and similarly the timing function is used for the ST/LT environment. For Agent II, which uses the Two Constraints strategy, a combination of two functions is picked as the strategy. Similar to Agent I, Agent II deploys the strategy that matches the current environment. Agent III (Three Constraints) and Agent IV (Four Constraints) use combinations of three functions and four functions respectively. The Equal Constraints strategy of Agent V uses equal weights for the four constraints (in this case each function is assigned
a weight of 0.25). Agent VI uses the Unequal Constraints strategy, in which it uses all four constraints but the weight for each constraint is varied according to the current environment. The performance of these agents is compared against the performance of a control agent that deploys the No Constraint strategy: it generates a random price between 50 and 90 without considering any selling constraint. Our experiment consists of 2000 runs for the seller agents and the control agent. Running the marketplace 2000 times means that the agents have 2000 chances of selling the item. The performance of each agent is then summed and averaged over these 2000 runs. Figure 1 shows the success rate achieved by the agents. It can be seen that the seller agents (Agent I, Agent II, Agent III, Agent IV, Agent V, Agent VI) outperformed the control agent. All the seller agents produced a high success rate in all environments, achieving a success rate 15% higher than the control agent in every case. This is because the seller agents consider the selling constraints when generating the reserve price for the item, and they are able to auction off the item 80% of the time. However, there is no difference in the performance of the seller agents, indicating that varying the weight of each constraint does not have a significant effect on the success rate of the agents. On the other hand, the control agent failed to achieve a satisfactory success rate, with approximately 60% success in all cases, because it does not take the selling environment (constraints) into account at all.
Figure 1. The Agents’ Success Rate
The profit obtained each time the agent is able to sell is shown in Figure 2. Again, the seller agents were superior in delivering a greater profit compared to the control agent. The seller agents in the MB and LT settings recorded higher profits of 18% and 16% respectively, because a higher price can be elicited in an auction held over a longer duration. In addition, a market with higher demand (MB, in this case more bidders in a given auction) tends to generate stiffer competition, resulting in a higher closing price, which in turn leads to a higher profit. The LB and ST environments recorded lower profits of 10% and 12%, while the LC, MC and Rand settings recorded an average profit of approximately 13%. The strategy of Agent I, which uses a single function, recorded the highest profit in all situations except Rand. This is partly due to the tuning of the strategy to the auction environment. In the Rand setting, the highest profit was recorded by Agent VI, using unequal weights for all four constraints: because the environment is randomized (unknown), the best strategy to deploy is a combination of the strategies with varying weights. In contrast, the control agent recorded the lowest profit in all cases, since it is not sensitive to what is going on around it.
Figure 2. The Agents’ Selling Profit
Figure 3 shows the average winning price (closing price) obtained by all the sellers. The seller agents are able to obtain a higher winning price than the control agent. In this experiment, the seller agents performed best in the LC, MB and LT environments, with winning prices of 79, 81 and 80, compared to the MC, LB and ST settings with only 78, 76 and 78. With decreasing market supply (LC), there is a possibility that the bidders will bid higher, and thus a higher winning price is obtained. Similarly, bidders tend to bid higher due to greater competition when market demand is increasing (MB), resulting in a higher bid price as well. As claimed, an auction with a longer time (LT) is able to elicit a higher bidding price, and thus a higher closing price is observed. This result also implies that a seller agent with a high success rate will most probably obtain a higher profit and a higher winning price. As expected, the control agent failed to obtain a high winning price.

Figure 3. The Agents' Average Winning Price

We measure the average gain/loss of the closing price with respect to the market price, as shown in Figure 4. Agent V, which deploys the Equal Constraints strategy, recorded the highest gain overall across all settings when compared to the other strategies. This implies that, if a seller wishes to sell above the market price, all constraints must be considered with equal weighting. The seller agents in the MC, MB, ST and Rand environments recorded a gain above 1.5%. With increased market supply (MC), bidders tend to bid more slowly, and as a result this may lower the market value. As expected, the seller agents in the Rand setting recorded a high average gain across all the environments. The seller agents in LC and LB recorded the lowest gain, below 0.5%. This could be due to the tendency of the bidders to bid higher when only a few auctions (LC) are available, which raises the market price; the winning price obtained is then very close to the market price, resulting in a low gain. The result obtained in the LB environment could be due to the possibility that only a few bidders are participating and they are not able to raise the price due to low competition, thus lowering the gain. The control agent recorded a
significant loss for all environments except for the MB setting. This indicates that the control agent’s item is always sold at a price below the market value.
Figure 4. The Agents' Average Gain/Loss

In summary, we can conclude that all our seller agents outperformed the control agent in all experiments. The four constraints that we have identified should be considered when deciding the reserve price in order to achieve the selling goals. The results obtained show that in order to achieve optimal performance of the agents' strategies, the weights should be tuned according to the auction environment. Our findings also illustrate that our seller agents were able to perform with satisfactory results in all environments.

5 CONCLUSION AND FUTURE WORK

This paper proposes the design and development of a seller agent that tackles the problems encountered when offering an item for sale by suggesting a strategic reserve price. We propose and establish a pricing strategy based on four selling constraints that involve the competitors, the bidders, the timing and the profit. Each constraint is converted into a function that generates an individual reserve price, and the combination of these values forms the strategy that the seller agent utilizes when auctioning an item. In addition, the strategies that the agents deploy are evaluated under various environments to investigate the appropriateness and suitability of our strategy in broad situations towards wide applicability. The main concern of this work lies in minimizing the tradeoff between delivering an auction with a sale and obtaining profit. There is no direct comparison that can be made between this work and other previous work, in that the measurements we used to evaluate the performance of the agents are entirely different. The performance of the agents was evaluated based on the success rate, the selling profit, the average winning price and the average gain/loss. Based on the results obtained in the experimental evaluation, our seller agents outperformed the control agent in all measurements. This shows that the seller agent's strategy is effective and efficient in generating a reserve price that will guarantee a sale with some profit within a fixed duration. The experimental evaluation clearly demonstrated that all six seller agents produced a higher success rate, a higher profit, a higher winning price and a higher market gain across all the environments when compared to the control agent. Therefore, our selling strategy could be considered as a model for a single-object auction that utilizes the private reserve price under the English auction protocol. In this work, we assume that the seller agent knows the number of bidders and the number of sellers that enter the marketplace. In reality, this information is not known, and this complicates the decision process in generating the reserve price. For future work, a prediction model will be used to estimate and predict the number of bidders and sellers that participate in the market, since this information (supply and demand) is required in computing the reserve price.

ACKNOWLEDGEMENTS

We wish to acknowledge the Ministry of Science, Technology and Innovation Malaysia (MOSTI) for funding this research.

REFERENCES
[1] P. Anthony and J. Dargham, ‘Seller agent for online auctions’, in Proceedings of the Second International Conference on Innovations in Information Technology (IIT’05), (2005). [2] E. H. Gerding, A. Rogers, R. K. Dash, and N. R. Jennings, ‘Sellers competing for buyers in online markets: Reserve prices, shill bids, and auction fees’, in Proceedings of Twentieth International Joint Conference on Artificial Intelligence (IJCAI-07), pp. 1287–1293, (2007). [3] R. Katkar and D. Lucking-Reiley, ‘Public versus secret reserve prices in ebay auctions: Results from a pokemon field experiment’, NBER Working Papers 8183, National Bureau of Economic Research, Inc, (March 2001). available at http://ideas.repec.org/p/nbr/nberwo/8183.html. [4] P. Klemperer, ‘Auction theory: a guide to the literature’, Journal of Economic Surveys, 13(3), 227–286, (1999). [5] D. Lucking-Reiley, ‘Auctions on the internet: What’s being auctioned, and how?’, Journal of Industrial Economics, 48(3), 227–252, (2000). [6] J. K. Min and K. L. Yong, ‘Reserve price recommendation by similaritybased time series analysis for internet auction systems’, LNAI, 4251, 292–299, (2006). [7] J. Morris, P. Ree, and P. Maes, ‘Sardine: dynamic seller strategies in an auction marketplace’, in Proceedings of the 2nd ACM Conference on Electronic Commerce (EC-00), pp. 128–134, (2000). [8] E. J. Pinker, A. Seidman, and Y. Vakrat, ‘Managing online auctions: Current business and research issues’, Management Science, 49(11), 1457– 1484, (2003). [9] W. Vickrey, ‘Counterspeculation, auctions, and competitive sealed tenders’, The Journal of Finance, 16(1), 8–37, (1961).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-448
A Truthful Two-Stage Mechanism for Eliciting Probabilistic Estimates with Unknown Costs Athanasios Papakonstantinou and Alex Rogers and Enrico H. Gerding and Nicholas R. Jennings1 Abstract. This paper reports on the design of a novel two-stage mechanism, based on strictly proper scoring rules, that motivates selfish rational agents to make a costly probabilistic estimate or forecast of a specified precision and report it truthfully to a centre. Our mechanism is applied in a setting where the centre is faced with multiple agents, and has no knowledge about their costs. Thus, in the first stage of the mechanism, the centre uses a reverse second price auction to allocate the estimation task to the agent who reveals the lowest cost. While, in the second stage, the centre issues a payment based on a strictly proper scoring rule. When taken together, the two stages motivate agents to reveal their true costs, and then to truthfully reveal their estimate. We prove that this mechanism is incentive compatible and individually rational, and then present empirical results comparing the performance of the well known quadratic, spherical and logarithmic scoring rules. We show that the quadratic and the logarithmic rules result in the centre making the highest and the lowest expected payment to agents respectively. At the same time, however, the payments of the latter rule are unbounded, and thus the spherical rule proves to be the best candidate in this setting.
1 INTRODUCTION

In a world where information can be distributed over systems owned by different stakeholders and accessed by multiple users, it is important to develop processes that will evaluate this information and give some guarantees about its quality. This is particularly important in cases where the information in question is a probabilistic estimate or forecast whose generation involves some cost. Examples include estimates of quality of service within a reputation system, or forecasts of future events such as weather conditions, where such costs could represent the computational task of accessing and evaluating previous interaction records, or that of running a large-scale weather prediction model. Now, when the provider of such information is a rational selfish agent, it may have an incentive to misreport its estimate, or to allocate less costly resources to its generation, if it can increase its own utility by doing so (e.g. by being rewarded for a more precise estimate than it actually provides). Thus, a centre attempting to elicit such information is presented with three challenges. First, it must identify the agent who can provide an estimate of the required precision at the lowest cost. Second, it must incentivise this agent to allocate sufficient costly resources in order to provide an estimate of the required precision. Finally, it must incentivise this agent to truthfully report the estimate that has been generated. Against this background, a number of researchers have proposed the use of 'strictly proper scoring rules' to address these challenges
¹ School of Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, UK, email: {ap06r,acr,eg,nrj}@ecs.soton.ac.uk
[1, 5]. Mechanisms using these rules reward accurate estimates or forecasts by making a payment to agents based on the difference between an event's predicted and actual outcome (observed at some later stage). Such mechanisms have been shown to incentivise agents to truthfully report their estimates in order to maximise their expected payment [6]. More recently, strictly proper scoring rules have been used in computer science to promote the honest exchange of beliefs between agents [7], and within reputation systems to promote truthful reporting of feedback regarding the quality of a service experienced [2]. Furthermore, Miller et al. have shown that when the agents' costs are known, it is possible to use an appropriately scaled strictly proper scoring rule to induce agents to commit costly resources to generate estimates of any required precision [4]. While these approaches are effective in the specific cases that they consider, they all rely on the fact that the cost of the agent providing the estimate or forecast is known by the centre. This is not the case in our scenario, where these costs represent private information known only to each individual agent (since they are dependent on the specific computational resources available to the agent). Thus, in addressing this shortcoming, we contribute to the state of the art by presenting a novel two-stage mechanism which relaxes this assumption. The first stage of the mechanism incentivises agents to truthfully reveal their costs to the centre, thus allowing it to select the agent with the lowest cost. The second stage then incentivises this agent to generate an estimate with a minimum required precision, and to truthfully report this estimate to the centre. In more detail, in this paper we extend the state of the art in the following ways:
• We describe a novel two-stage mechanism in which a centre uses a reverse second-price auction in the first stage to elicit the true costs of agents, and hence identify the agent that can provide an estimate with a specified precision at the lowest cost. An appropriately scaled strictly proper scoring rule is then used in the second stage of the mechanism to incentivise this agent to generate and truthfully report the estimate.
• We formally prove that this mechanism is incentive compatible in both the costs and the estimates revealed, and that it is individually rational. That is, agents will truthfully report both costs and estimates to the centre, and willingly participate in the mechanism.
• We empirically evaluate our mechanism by comparing the quadratic, spherical and logarithmic scoring rules in a setting where costs depend linearly on precision. We show that while the logarithmic rule results in the centre making the lowest expected payment to the agent, this payment is unbounded. The other rules are bounded, but result in higher expected payments. Hence, we find that the spherical rule is preferred in our setting.
The rest of this paper is organised as follows: In section 2 we describe our model, and in section 3 we present background on strictly
proper scoring rules. In section 4 we detail our mechanism and formally prove its economic properties, before empirically evaluating it in section 5. We conclude and discuss future work in section 6.

Table 1. Comparison of Quadratic, Spherical and Logarithmic Scoring Rules

Quadratic:
  S(x0; x, θ) = 2N(x0; x, 1/θ) − (1/2)√(θ/π)
  S(θ) = (1/2)√(θ/π)
  S′(θ) = 1/(4√(πθ))
  α = 4c′(θ0)√(πθ0)
  β = c(θ0) − 2θ0 c′(θ0)

Spherical:
  S(x0; x, θ) = (4π/θ)^(1/4) N(x0; x, 1/θ)
  S(θ) = (θ/(4π))^(1/4)
  S′(θ) = (1/4)(4πθ³)^(−1/4)
  α = 4c′(θ0)(4πθ0³)^(1/4)
  β = c(θ0) − 4θ0 c′(θ0)

Logarithmic:
  S(x0; x, θ) = log(N(x0; x, 1/θ))
  S(θ) = (1/2) log(θ/(2π)) − 1/2
  S′(θ) = 1/(2θ)
  α = 2θ0 c′(θ0)
  β = c(θ0) − 2θ0 c′(θ0) ((1/2) log(θ0/(2π)) − 1/2)

2 INFORMATION ELICITATION PROBLEM

We now describe our model in more detail. Specifically, we assume that there is a centre interested in acquiring a probabilistic estimate or forecast (such as an expected quality of service within a reputation system, or a forecast temperature in a weather prediction setting) with a minimum precision θ0, henceforth referred to as the required precision.2 We assume that there are N ≥ 2 rational, risk-neutral agents who can provide the centre with an unbiased but noisy estimate or forecast, x, of precision θ. We model the agents' private estimates as Gaussian random variables such that x ∼ N(x0, 1/θ), where x0 is the true state of the parameter being estimated. Note that this true state is unknown to both the centre and the agents at the time that the estimate is requested, but becomes available to the centre at some time in the future. For example, in a reputation system the actual quality of service received is only known once the service has been procured, and in a weather forecasting setting the actual weather that occurs is observed by the centre at some later date. The agents incur a cost in producing their estimate, and we assume that this cost is a function of the precision of the estimate, c(θ). While the centre has no information regarding the agents' cost functions, we assume that all cost functions are convex (i.e. c″i(θ) ≥ 0), and we note that this is a realistic assumption in all cases where there are diminishing returns as the precision increases. We do not assume that all agents use the same cost function, but we do demand that the costs of different agents do not cross (i.e. the cost ordering of the agents is the same over all precisions). Given this model, the challenge is to design a mechanism that enables the centre to identify the agent that can provide the estimate or forecast at the lowest cost, and to provide a payment to this agent such that it is incentivised to generate the estimate or forecast with a precision at least equal to the required one, and to report it truthfully.

2 Note that we assume that the centre derives no additional benefit if the estimate is of precision greater than θ0.

3 STRICTLY PROPER SCORING RULES

As discussed in the introduction, the problem described above has previously been addressed through the use of strictly proper scoring rules as payments in the case that the agents' cost functions are known to the centre [2, 4]. Before we proceed to the analysis of our mechanism, which is designed for cases where the centre has no knowledge about the costs, we give a brief description of strictly proper scoring rules. As described earlier, such rules are used to calculate a payment to an agent depending on the difference
between an event's predicted and actual outcome. Much of the literature on strictly proper scoring rules concerns three specific rules, the quadratic, spherical and logarithmic rules, given by:

1. Quadratic: S(x0|r(x0)) = 2r(x0) − ∫_{−∞}^{∞} r²(x) dx
2. Spherical: S(x0|r(x0)) = r(x0) / (∫_{−∞}^{∞} r²(x) dx)^(1/2)
3. Logarithmic: S(x0|r(x0)) = log r(x0)

In each case, S(x0|r(x0)) is the payment given to an agent after it has reported its estimate (represented as a probability density function r(x)) and x0 is the actual outcome observed.
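As a concrete illustration, the following Python sketch (our own, with hypothetical function names) evaluates the three rules for a reported Gaussian density N(x; 1/θ) against an observed outcome x0, using the closed form ∫r²(x)dx = √(θ/(4π)) for a Gaussian with precision θ.

```python
import math

def gaussian_pdf(x0, x, theta):
    # density of a Gaussian with mean x and precision theta (variance 1/theta)
    return math.sqrt(theta / (2 * math.pi)) * math.exp(-theta * (x0 - x) ** 2 / 2)

def quadratic_score(x0, x, theta):
    # 2 r(x0) - integral of r^2, where the integral equals sqrt(theta/(4*pi))
    return 2 * gaussian_pdf(x0, x, theta) - math.sqrt(theta / (4 * math.pi))

def spherical_score(x0, x, theta):
    # r(x0) divided by the square root of the integral of r^2
    return gaussian_pdf(x0, x, theta) / (theta / (4 * math.pi)) ** 0.25

def logarithmic_score(x0, x, theta):
    # log r(x0)
    return math.log(gaussian_pdf(x0, x, theta))
```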
3.1 An Incentive Compatible Mechanism
It is a standard property of strictly proper scoring rules that an agent will maximise its expected score (and hence the payment it receives) by reporting its true probabilistic estimate to the centre [1, 3]. Thus, mechanisms based upon them are incentive compatible. Using this result, we can calculate the score that the agent expects to receive, given that it has generated an estimate of precision θ and has truthfully reported it to the centre (as it is incentivised to do). To do so, we first note that, in our case, where estimates are represented by Gaussian distributions, we can replace r(x0) with N(x0; x, 1/θ), and derive new expressions for each of the three scoring rules shown above (these are presented in the first row of table 1). We can then simply integrate over the expected outcome to derive the agent's expected score, S(θ). These results are shown in the second row of table 1, and form the basis of the calculations and proofs that we present in the following sections.
3.2 Eliciting Effort with Known Costs
It should now be noted that the above scoring rules will still be incentive compatible if they undergo an affine transformation. Indeed, Miller et al. show that by using appropriate scaling parameters, and given knowledge of an agent's costs, it is possible to induce an agent to make and truthfully report an estimate with a specified precision, θ0 [4]. In this case, an agent's expected payment, P(θ), is given by:

P(θ) = αS(θ) + β  (1)

and the expected utility of the agent is given by:

U(θ) = αS(θ) + β − c(θ)  (2)

The centre can now choose the value of α such that the agent's utility (its payment minus its costs) is maximised when it produces and truthfully reports an estimate of the required precision, θ0. To do so, it solves dU/dθ|θ0 = 0 to give:

α = c′(θ0) / S′(θ0)  (3)
In rows three and four of table 1 we present this result, and the derivative of the expected score that is required to calculate it, for each of the three strictly proper scoring rules presented earlier.
3.3 An Individually Rational Mechanism
Finally, we now note that in order for an agent to incur the cost of producing an estimate, it must expect to derive positive utility from doing so. Thus, the centre can use the constant β to ensure that it makes the minimum payment to the agent, while still ensuring that the mechanism is individually rational. When costs are known, the centre can do so by making the agents indifferent between producing the estimate or not, by ensuring that U(θ0) = 0, thus giving:

β = c(θ0) − [c′(θ0) / S′(θ0)] S(θ0)  (4)
Again, row five of table 1 shows this result for each scoring rule.
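As an illustration, a small sketch (ours; the function name is hypothetical) computes α and β for the logarithmic rule with a linear cost c(θ) = cθ, using S(θ) = (1/2)log(θ/(2π)) − 1/2 and S′(θ) = 1/(2θ) from Table 1.

```python
import math

def log_rule_scaling(c, theta0):
    # alpha = c'(theta0) / S'(theta0); for c(theta) = c*theta and the
    # logarithmic rule, S'(theta0) = 1/(2*theta0), so alpha = 2*c*theta0
    alpha = 2 * c * theta0
    # beta = c(theta0) - alpha * S(theta0),
    # with S(theta0) = 0.5*log(theta0/(2*pi)) - 0.5
    s_theta0 = 0.5 * math.log(theta0 / (2 * math.pi)) - 0.5
    beta = c * theta0 - alpha * s_theta0
    return alpha, beta
```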
4 TRUTH ELICITATION MECHANISM FOR UNKNOWN COSTS

In the previous section we discussed how the centre can motivate agents to make a probabilistic estimate or a measurement of a specific precision. However, this analysis assumed the agents' costs are known. In this section we relax this assumption and present a novel two-stage mechanism which first incentivises the agents to reveal their true costs to the centre, and then, based on this information, induces an agent to produce an estimate of at least the required precision. In more detail, in the first stage the centre asks the agents to submit their cost functions and then it assigns the estimation task to the agent with the lowest cost. Then, in the second stage, the centre uses a strictly proper scoring rule as before, but now uses the second-lowest cost reported by the agents to scale the scoring rule (i.e., to set α and β). This is akin to a reverse second-price or Vickrey auction, where the agents' rewards are equal to the second-lowest reported costs. However, in this case the reward is determined by the scoring rule, and hence depends on the actual estimate produced. In particular, this requires the scaling parameters α and β to be chosen carefully in order to incentivise the agents to reveal their true costs in the first stage. In more detail, our mechanism proceeds as follows:

1. First Stage
• The centre announces that it needs an estimate of required precision θ0, and asks all agents i ∈ {1, . . . , N}, where N ≥ 2, to report their cost functions ci(θ).3
• The centre assigns the forecast or estimate to the agent who reported the lowest cost at the required precision, i.e., agent i such that ci(θ0) = min_{k∈{1,...,N}} ck(θ0).

2. Second Stage
• The centre announces a scoring rule αS(x0; x, θ) + β, where: (1) S(x0; x, θ) is a strictly proper scoring rule, (2) S(θ) is strictly concave as a function of the precision θ,4 and (3) α and β are determined using equations 3 and 4 respectively, but now based on the second-lowest reported cost function (i.e. cj(θ) such that cj(θ0) = min_{k≠i} ck(θ0)).
• The agent selected in the first stage produces an estimate x with precision θ and reports x and θ to the centre.
• Once the actual outcome has been observed, the centre then gives the following payment to the agent:

P(x0; x, θ) = αS(x0; x, θ) + β  (5)

3 We note that in practice the centre only requires ci(θ0) and c′i(θ0). However, for notational convenience we request the agents to reveal their entire cost functions.
4 We note that the quadratic, spherical, and logarithmic scoring rules satisfy both of these properties (see row 2 of table 1).
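A minimal sketch of the first stage follows (our illustration; names such as run_first_stage are hypothetical). The winner is the agent with the lowest reported cost at θ0, while the scoring rule is scaled using the second-lowest reported cost function, as in a reverse Vickrey auction.

```python
def run_first_stage(reported_costs, theta0):
    # reported_costs: list of callables c_i(theta), assumed non-crossing;
    # requires N >= 2 agents, as in the mechanism above
    order = sorted(range(len(reported_costs)),
                   key=lambda i: reported_costs[i](theta0))
    winner = order[0]                         # lowest reported cost at theta0
    scaling_cost = reported_costs[order[1]]   # second-lowest, used to set alpha, beta
    return winner, scaling_cost
```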
4.1 Economic Properties of the Mechanism
Having detailed the two stages of the mechanism, we now identify and prove its economic properties. Specifically, we show that:
1. The mechanism is incentive compatible in the first stage w.r.t. the costs. Specifically, truthful revelation of agents' cost functions is a weakly dominant strategy.
2. The mechanism is incentive compatible w.r.t. the selected agent's reported measurement and precision in the second stage.
3. The mechanism is individually rational.
4. The centre motivates the selected agent to make an estimate with a precision which is at least as high as θ0, the precision required by the centre. We refer to the actual precision produced as the 'optimal precision' (from the perspective of the agent), θ∗.

We now formally prove these properties. To do so, we first derive two lemmas which are then used in the proofs that follow. The first lemma shows that, if the true costs of the agent performing the measurement are less than the costs which are used to scale the scoring rule, the optimal precision θ∗ will be greater than θ0. Let these cost functions be denoted by ct(θ) and cs(θ) respectively. More formally:

Lemma 1. If ct(θ0) < cs(θ0), where ct(θ) is the agent's true cost function, and cs(θ) is the cost function used to scale the scoring function, then θ∗ > θ0.

Proof. By scaling the scoring function using equations 3 and 4 and cs(θ), the agent's expected utility becomes:

U(θ) = [c′s(θ0) / S′(θ0)] (S(θ) − S(θ0)) + (cs(θ0) − ct(θ))  (6)
Now, the optimal precision θ∗ which maximises the agent's expected utility is formally denoted by θ∗ = argmax_θ U(θ). Therefore, U′(θ∗) = 0, and thus we have:

S′(θ∗) / S′(θ0) = c′t(θ∗) / c′s(θ0).  (7)
Let f(θ) = S′(θ)/S′(θ0) and g(θ) = c′t(θ)/c′s(θ0). Since S(θ) is (strictly) concave, it is easy to show that f′(θ) ≤ 0 for θ ≥ θ0 and f′(θ) < 0 for θ > θ0. Furthermore, since ct(θ) is convex, g′(θ) ≥ 0 for θ ≥ θ0. Now, since f is decreasing and g is increasing, when ct(θ0) = cs(θ0) clearly the only point which satisfies equation 7 is where θ∗ = θ0. If ct(θ0) < cs(θ0), on the other hand, it is easy to verify that g(θ0) < 1, since we assumed the cost functions to be non-crossing. Hence, since f(θ0) = 1, the only solution where the two functions meet is where θ > θ0, and thus θ∗ > θ0. □

The next lemma shows that, if the true costs of the agent doing the measurement are higher than the costs used for the scaling of the scoring function, then the agent's utility will always be negative.

Lemma 2. If ct(θ) > cs(θ) then U(θ) < 0 for any θ.
Table 2. Comparison of Quadratic, Spherical and Logarithmic Scoring Rules

Quadratic:   θ∗ = (c2/c1)² θ0;      P(θ0) = c2θ0 (2(c2/c1) − 1)
Spherical:   θ∗ = (c2/c1)^(4/3) θ0; P(θ0) = c2θ0 (4(c2/c1)^(1/3) − 3)
Logarithmic: θ∗ = (c2/c1) θ0;       P(θ0) = c2θ0 (1 + log(c2/c1))

Note that costs are given by linear functions, c(θ) = cθ, and c1 and c2 are the lowest and second lowest costs.
Proof. Concavity of the expected score S(θ) implies:

S′(θ0)(θ − θ0) ≥ S(θ) − S(θ0)

Similarly, convexity of the cost function cs(θ) gives:

c′s(θ0)(θ − θ0) ≤ cs(θ) − cs(θ0).

By performing basic manipulations this results in:

[c′s(θ0) / S′(θ0)] (S(θ) − S(θ0)) + cs(θ0) − cs(θ) ≤ 0

Furthermore, since ct(θ) > cs(θ), the following holds for any θ:

U(θ) = [c′s(θ0) / S′(θ0)] (S(θ) − S(θ0)) + cs(θ0) − ct(θ) < 0  □
Having presented these two key lemmas, we now proceed to prove the four economic properties of our mechanism.

Theorem 1. Truthful revelation of agents' cost functions in the first stage of the mechanism is a weakly dominant strategy.

Proof. We prove this by contradiction. Let ct(θ) and c̃(θ) denote an agent's true and reported cost functions respectively. Furthermore, let cs(θ) denote the cost function used to scale the scoring function if the agent wins (i.e. if c̃(θ0) < cs(θ0)). Now, suppose that the agent misreports, but this does not affect whether the agent wins or not. If the agent loses then the payoff is always zero. If the agent wins the payoff is unaffected, since it is calculated from the second-lowest cost. Therefore, there is no incentive to misreport. Suppose that the agent misreports, and now it does affect whether the agent wins or not. There are now two cases: (1) ct(θ0) > cs(θ0) and c̃(θ0) < cs(θ0) (the agent wins by misreporting but would have lost when truthful), and (2) ct(θ0) < cs(θ0) and c̃(θ0) > cs(θ0) (the agent loses by misreporting but would have won when truthful). Case (1). Since the true cost ct(θ0) > cs(θ0), it follows directly from lemma 2 that the expected utility U(θ) is strictly negative, irrespective of θ. Therefore, the agent could do strictly better by reporting truthfully, in which case the expected utility is zero. Case (2). In this case the agent would have won by being truthful, but now receives a utility of zero. To show that this type of misreporting is suboptimal, we need to show that, when ct(θ0) < cs(θ0), an agent benefits from being selected and generating the (optimal) estimate (i.e. U(θ∗) > 0 when ct(θ0) < cs(θ0)). Now, since θ∗ is optimal by definition, U(θ∗) ≥ U(θ0). From the expected utility in equation 6 we have U(θ0) = cs(θ0) − ct(θ0) > 0 when ct(θ0) < cs(θ0), and hence U(θ∗) > 0 when reporting true costs. □

Theorem 2. The mechanism is incentive compatible w.r.t. the agent's reported measurement and precision in the second stage.
Proof. The proof for this theorem follows directly from the definition of the strictly proper scoring rules (see section 3).

Theorem 3. The two-stage mechanism is individually rational.

Proof. From theorem 1 we can assume that agents report their true cost functions in the first stage. Since agents who do not win in the first stage receive zero utility, we only need to consider the case of the selected agent with cost function ct(θ) ≤ cs(θ). From equation 6, it follows that U(θ0) = cs(θ0) − ct(θ0) ≥ 0. Lemma 1 shows that the agent may produce an estimate of precision θ∗ > θ0. Since θ∗ is optimal by definition, U(θ∗) ≥ U(θ0), and thus U(θ∗) ≥ 0.

Theorem 4. For the agent selected in the first stage of the mechanism, it is optimal to produce an estimate with a precision equal to or higher than the precision required by the centre, i.e., θ∗ ≥ θ0.

Proof. This proof follows directly from Lemma 1. In more detail, given that the agents reveal their true cost functions, we have ct(θ) ≤ cs(θ). Therefore, from lemma 1 it follows that θ∗ ≥ θ0.

Note that these proofs indicate that the two stages of the mechanism are inextricably linked and cannot be considered in isolation from one another. Indeed, apparently small changes to the second stage of the mechanism can destroy the incentive-compatibility property of the first stage. For example, it is important to note that our mechanism is more precisely known as interim individually rational, since the utility is positive in expectation. In any specific instance, the payment could actually be negative if the prediction turns out to be far from the actual outcome. An alternative choice for the second stage of the mechanism would be to set β such that the payments are always positive, thus making the mechanism ex-post individually rational. However, this would then violate the incentive-compatibility property, since the agents could then receive positive payoffs by misreporting their cost functions. Likewise, it might be tempting to imagine that the centre could use the revealed costs of the agents in order to request a lower precision, confident in the knowledge that the selected agent will actually produce an estimate of the required precision. However, by effectively using the lowest revealed cost within the payment rule in this way, the incentive-compatibility property of the mechanism would again be destroyed.
5 EMPIRICAL EVALUATION Having proved the economic properties of the mechanism in the general case with any convex cost function, we now present empirical results for a specific scenario in which costs are linear functions, given by ci(θ) = ciθ, where the value of ci is drawn from a uniform distribution ci ∼ U(1, 2) and θ0 = 1. Within this scenario our intention is to compare the performance of the three scoring rules presented earlier. To this end, for a range from 2 to 20 agents participating in the mechanism, we simulate the mechanism 10^6 times and, for each iteration, record the payment made to the agent who provided the estimate
Figure 1. The mean payment made by the centre (mean payment P versus number of agents N, for the quadratic, spherical and logarithmic rules; the levels c1θ0 and c2θ0 are shown for reference).
Figure 2. The mean optimal precision of agents' estimates (mean optimal precision θ∗ versus number of agents N, for the quadratic, spherical and logarithmic rules).
and the precision of this estimate. In figures 1 and 2 we present the means of these results (and note that the standard error in both means is much smaller than the symbol size). Consider first figure 1, which shows the mean payment made by the centre. We note that, as expected, as the number of agents increases, the mean payment decreases toward the lower limit of the uniform distribution from which the costs were drawn. Furthermore, note that there is a fixed ordering over the entire range, with the payment resulting from the quadratic scoring rule being the highest, and that of the logarithmic scoring rule being the lowest. In this figure, we also show the mean of the lowest and second lowest costs evaluated at the required precision θ0 (denoted by c1θ0 and c2θ0 respectively). The first cost represents the minimum payment that could have been made if the costs of the agents were known to the centre. The second represents the payment that would have been made had the agent produced an estimate of the required precision rather than its own optimal precision. The gap between c1θ0 and c2θ0 represents the 'information rent' that must be paid in the case that costs are unknown. The gap between c2θ0 and the mean payment of any particular scoring rule represents the loss that the centre has to cover due to the agent making a more precise estimate than required. The goal in selecting scoring rules is clearly to minimise this gap, and it can be seen that the logarithmic scoring rule is closest to achieving this goal. The reason for this can be seen in figure 2, where the precision of the estimates that were actually made is shown. Note that in this figure the logarithmic scoring rule is shown to induce agents to produce estimates closer to the required precision than both the spherical and the quadratic scoring rules. The same ordering as observed in these figures (when averaged over costs drawn from a uniform distribution) is also seen in analytical results for any
specific values of the lowest and second lowest costs (see table 2). Based solely on these results, the logarithmic scoring rule would appear to be the best choice for the centre in this case. However, it is important to note that the logarithmic scoring rule is unbounded. That is, in the event that the agent's estimate is far from the actual outcome, a payment based on the logarithmic scoring rule will go to −∞, since the agent's probability density function goes to 0 in this case (see row 1 of table 1). Thus, given this additional observation, it is clear that the spherical scoring rule represents a better choice, since its payments are only slightly greater than those of the logarithmic rule, but it has finite bounds.
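For reference, the experiment can be reproduced with a short Monte Carlo sketch (ours, not the authors' code) that draws the linear costs and applies the closed-form payments of Table 2.

```python
import math
import random

# Expected payment given lowest cost c1 and second-lowest c2 (Table 2)
PAYMENTS = {
    "quadratic":   lambda c1, c2, t0: c2 * t0 * (2 * c2 / c1 - 1),
    "spherical":   lambda c1, c2, t0: c2 * t0 * (4 * (c2 / c1) ** (1 / 3) - 3),
    "logarithmic": lambda c1, c2, t0: c2 * t0 * (1 + math.log(c2 / c1)),
}

def mean_payment(rule, n_agents, iters=100_000, theta0=1.0):
    # costs are linear, c_i(theta) = c_i * theta, with c_i ~ U(1, 2)
    total = 0.0
    for _ in range(iters):
        draws = sorted(random.uniform(1, 2) for _ in range(n_agents))
        total += PAYMENTS[rule](draws[0], draws[1], theta0)
    return total / iters
```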
6 CONCLUSIONS In this paper we introduced a novel two-stage mechanism based on strictly proper scoring rules that motivates selfish rational agents to make a costly probabilistic estimate or forecast of a specified precision and report it truthfully to a centre. We applied the mechanism in a setting in which the centre is faced with multiple agents but has no knowledge about their costs, and we proved that it was incentive compatible and individually rational. We also empirically evaluated our mechanism, and in comparing the quadratic, spherical and logarithmic scoring rules, showed that the logarithmic one minimises the centre's expected payment, but is unbounded. Thus, we proposed the use of the spherical rule as the best compromise between minimising payments and keeping them bounded. Our future work consists of two main tracks. First, we would like to explore the design of alternative strictly proper scoring rules, with the intention of minimising the loss that the centre has to cover as a result of agents making an estimate of precision higher than the required one. In this respect, the value of c2θ0, shown in figure 1, represents a bound on the ultimate performance of the mechanism. Second, we would like to extend our mechanism to the case where the centre procures estimates from more than one agent, and then fuses them together. When costs are convex, procuring several low precision estimates may be more cost effective than procuring a single high precision estimate. Indeed, Miller et al. have shown how scoring rules can be used to score one agent's estimate against another's, and thus in this case there is no need to wait until the actual event's outcome is revealed before making payments to agents [4]. However, in such a case, it is an open question as to whether it is possible to design a mechanism that incentivises multiple agents to truthfully reveal their costs and estimates.
ACKNOWLEDGEMENTS This research was undertaken as part of the EPSRC funded project on Market-Based Control (GR/T10664/01). This is a collaborative project involving the Universities of Birmingham, Liverpool and Southampton and BAE Systems, BT and HP.
REFERENCES
[1] A. D. Hendrickson and R. J. Buehler, 'Proper scores for probability forecasters', The Annals of Mathematical Statistics, 42(6), 1916–1921, (1971).
[2] R. Jurca and B. Faltings, 'Reputation-based service level agreements for web services', in Proceedings of the International Conference on Service Oriented Computing (ICSOC), pp. 396–409, (2005).
[3] J. E. Matheson and R. L. Winkler, 'Scoring rules for continuous probability distributions', Management Science, 22(10), 1087–1096, (1976).
[4] N. Miller, P. Resnick, and R. Zeckhauser, 'Eliciting honest feedback: The peer prediction method', Management Science, 51(9), 1359–1373, (2005).
[5] L. J. Savage, 'Elicitation of personal probabilities and expectations', Journal of the American Statistical Association, 66(336), 783–801, (1971).
[6] R. Selten, 'Axiomatic characterization of the quadratic scoring rule', Experimental Economics, 1(1), 43–61, (1998).
[7] A. Zohar and J. S. Rosenschein, 'Robust mechanisms for information elicitation', in Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), pp. 1202–1204, (2006).
Goal Generation and Adoption from Partially Trusted Beliefs
Célia da Costa Pereira and Andrea G. B. Tettamanzi1
Abstract. A rational agent adopts (or changes) its goals when new information (beliefs) becomes available or its desires (e.g., tasks it is supposed to carry out) change. In this paper we propose a non-conventional approach to goal adoption which takes the degree of trust in the sources of information into account. Beliefs, desires, and goals, as a consequence, are gradual. Incoming information may be any propositional formula. Two algorithms for updating the mental state of an agent in this new setting are proposed. The first algorithm handles the update when a new piece of information arrives; the second handles the update when a new desire arises.
1 Introduction and Motivation
Changes in the mental attitudes of a BDI agent [14] may influence its behavior when deciding which information to believe, which desires/goals to generate/adopt, which action to perform, and so on. The goals to be adopted in a given situation may depend on the agent's beliefs, desires and obligations. However, most works on goal change and generation do not build on results on belief change, e.g., [2, 16, 15]. One of the first approaches in this line is Thomason's [17], whose objective is to describe a formalism designed to integrate reasoning about desires and planning. The work by Broersen and colleagues [3] introduces the BOID architecture, in which goals are generated from conditional beliefs, obligations, intentions, and desires. Also the approach by Dignum and colleagues [9] and, more recently, the one proposed in [6] are very much in this line. However, these works consider the notion of belief as an all-or-nothing concept: either the agent believes something, or it does not. Parsons and Giorgini [13] proposed to treat beliefs as degrees of evidence. Hansson [11] pointed out that there are two notions of degree of belief. The first is the static concept of degree of confidence. In this sense, the higher an agent's degree of belief in a sentence, the more confidently it entertains that belief. The other notion is the dynamic concept of degree of resistance to change. In that sense, the higher an agent's degree of belief in a sentence, the more difficult it is to change that belief. In this work, we just consider beliefs, desires, and goals, not intentions, but, in addition, we allow an agent to believe a piece of information to a degree. Such a degree depends on the trust the agent has in the source of information. That way, we make it possible to represent the fact that if a piece of information comes from a completely trusted source, like in traditional approaches, the degree of resistance to change of the agent is null and, therefore, the agent revises its beliefs and (completely) adopts the new belief. Instead, if the

1 Università degli Studi di Milano, Italy, email: {pereira,tettamanzi}@dti.unimi.it
agent does not trust the source at all, its belief will not change. The interesting case is when information comes from a partially trusted source. In this case, we will show that the relative shift of an agent's belief degrees depends only on the degree of trust of the source. Our aim is not to compute such trust degrees; we are just interested in how they influence the agent's beliefs and, as a consequence, the choice of which set of goals, among the possible ones, it will adopt. This work is an extension of one of the first attempts to study the impact of trusted beliefs on desires and goals [7]. The extension consists in allowing all kinds of information, including disjunctive information. To explain the kind of issues we want to address, let us consider the following example, which we will refer to in the rest of the paper. You go for dinner to a new restaurant. You like to have meat (hm) but, if you find fresh fish (ff), you'd rather have fish (hf). Also, you like to have red wine (rw) with meat and white wine (ww) with fish. When you go to a new place, you assume fish is not fresh, unless you find evidence to the contrary; however, you leave some room for doubt. Now, your friend, who already knows the place and whom you trust pretty much, albeit not completely, tells you they usually have fresh fish or, when they don't, their escargots are great (ge), which you would be curious to try (he) in case you decided not to have fish. In this paper, we attempt to take into account this kind of considerations on beliefs and desires in goal generation/adoption. The paper is organized as follows. Section 2 presents the fuzzy logic-based formalism which will be used throughout the paper. Section 3 illustrates how changes due to the arrival of new information and/or a new desire influence the agent's beliefs and desires. In Section 4, the notion of goal set is defined and requirements for goal set adoption are underlined. Section 5 concludes.
2 The Formalism
The formalism which will be used throughout the paper is inspired by the one used in [18]. However, unlike [18], the objective of our formalism is to analyze, not to develop, agent systems. Precisely, our agent must single out an optimal set of goals to be adopted.
2.1 Basic Considerations
Fuzzy sets, introduced by Zadeh [19], are a generalization of classical sets obtained by replacing the characteristic function of a set A with a membership function μA, which can take any value in [0, 1]. The value μA(x) or, more simply, A(x) is the membership degree of element x in A, i.e., the degree to which x belongs in A. The support of A, supp(A), is the set of all x such that A(x) > 0. The usual set-theoretic operations of union, intersection, and complement can be defined as a generalization of their counterparts
on classical sets by introducing two families of operators, called triangular norms and co-norms. In practice, it is usual to employ the min norm for intersection and the max co-norm for union. Given two fuzzy sets A and B, and an element x, (A ∪ B)(x) = max{A(x), B(x)}, (A ∩ B)(x) = min{A(x), B(x)}, and Ā(x) = 1 − A(x).

Definition 1 (Fuzzy Interpretation) A fuzzy interpretation is an assignment of truth degrees in [0, 1] to all atomic propositions (or atoms, for short) defined in the problem domain. Given a set of atoms A, a fuzzy interpretation is a function I : A → [0, 1], which assigns a truth degree I(p) ∈ [0, 1] to all atoms p ∈ A.

Note that a fuzzy interpretation is, in all respects, a fuzzy set of atoms.
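A fuzzy interpretation can thus be represented as a mapping from atoms to degrees; the sketch below (ours, not the authors' code) implements the min/max/complement operations just described.

```python
def fuzzy_union(A, B):
    # (A ∪ B)(x) = max{A(x), B(x)}
    return {x: max(A.get(x, 0.0), B.get(x, 0.0)) for x in set(A) | set(B)}

def fuzzy_intersection(A, B):
    # (A ∩ B)(x) = min{A(x), B(x)}
    return {x: min(A.get(x, 0.0), B.get(x, 0.0)) for x in set(A) | set(B)}

def fuzzy_complement(A, atoms):
    # complement(A)(x) = 1 - A(x), over a fixed universe of atoms
    return {x: 1.0 - A.get(x, 0.0) for x in atoms}
```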
2.2 Formalism's Components
An agent's belief is a piece of information that the agent believes in. An agent's desire is something (not always material) that the agent would like to possess or perform. Desires (or motivations) are necessary but not sufficient conditions for action. When a desire is met by other conditions that make it possible for an agent to act, that desire becomes a goal. Therefore, given this technical definition of a desire, all goals are desires, but not all desires are goals. The main distinction we make here between desires and goals is in line with that made by Thomason [17] and other authors: goals are required to be consistent whereas desires need not be.

Definition 2 (Language) Let A be a set of atomic propositions and let L be the propositional language such that A ∪ {⊤, ⊥} ⊆ L, and, ∀φ, ψ ∈ L, ¬φ ∈ L, φ ∧ ψ ∈ L, φ ∨ ψ ∈ L.

Our formalism accounts for formulas (beliefs and desires), and the trust degree of information sources. The formalism is a fuzzy extension of that proposed in [6] in two ways: (i) incoming information may be any propositional formula, not only atoms or literals, and (ii) the degree of trust in the sources of information is taken into account. The first extension makes the formalism more general with respect to previous proposals, in that it allows one to express all kinds of incoming information, including disjunctive information. Thanks to the second extension, it is possible to represent how strongly the agent believes in a given piece of information. We suppose that this trust degree depends on how reliable the source of the piece of information is. Here, we are not interested in the computation of such reliabilities; we merely assume that, for an agent, a belief has a trust degree in [0, 1]. An approach to the problem of assigning fuzzy trust degrees to information sources can be found, for example, in previous work by Castelfranchi and colleagues [4]. Consequently, if we take into account the fact that here the notion of belief is not conceived as an all-or-nothing concept but as a "fuzzy concept", the relations among beliefs and desires are also fuzzy. The fuzzy counterpart of a desire-generation rule defined in [6] is defined as follows:

Definition 3 (Desire-Generation Rule) A desire-generation rule is an expression of the form βR, ψR ⇒+D d, where βR, ψR ∈ L and d ∈ {a, ¬a} with a ∈ A.2
2 The unconditional counterpart of this rule is δ ⇒+D d, which means that the agent (unconditionally) desires d to degree δ.
Intuitively this means that "an agent desires d as much as it believes βR and desires ψR". Unlike most conventional approaches, e.g. [3], in which the authors do not consider disjunctive information at all, here, for the sake of simplicity, we make this restriction only for generated desires, i.e. those in the right-hand side of the rules: a generated desire is then represented by a literal. Given a desire-generation rule R, we shall denote by rhs(R) the literal on the right-hand side of R. The preferences and habits of the gourmet in the example may be described by means of the following rules:

R1: ff, ⊤ ⇒+D hf      R4: ⊤, hf ⇒+D ww
R2: ge, ¬hf ⇒+D he    R5: ⊤, hf ⇒+D ¬hm
R3: ⊤, hm ⇒+D rw      R6: 0.7 ⇒+D hm

2.2.1 Agent's State
In this section, we define the mental state of an agent and the semantics of belief and desire formulas. The state of an agent is completely described by a triple S = ⟨B, RJ, J⟩, where
• B is a fuzzy interpretation on A;
• RJ is a set of desire-generation rules, such that, for each desire d, RJ contains at most one rule of the form δ ⇒+D d;
• J is a fuzzy set of literals.
B is the fuzzy interpretation which defines the degree to which the agent believes each atom in A. Representing the beliefs as a fuzzy interpretation on A guarantees by construction that the agent's beliefs are consistent, i.e., for all atoms a we have B(a) = 1 − B(¬a). RJ contains the rules which generate desires from beliefs and other desires (subdesires). J contains all literals (positive and negative forms of atoms in A) representing desires which may be deduced from the agent's desire-generation rules. We suppose that an agent can have inconsistent desires, i.e., for a desire d we can have J(d) + J(¬d) > 1. In the gourmet example, your initial state when you step into the restaurant might be described by B(ff) = 0.2, B(ge) = 0, and J(rw) = J(hm) = 0.7, J(ww) = J(¬hm) = J(hf) = 0.2, J(he) = 0. By extension, we can compute the truth degree of any belief and desire formula in L.

Definition 4 (Degree of fuzzy belief and desire formulas) Let S = ⟨B, RJ, J⟩ be the state of the agent and φ, ψ ∈ L be formulas. We can extend B to arbitrary formulas in L by defining:

B(⊤) = 1,  (1)
B(⊥) = 0,  (2)
B(¬φ) = 1 − B(φ),  (3)
B(φ ∧ ψ) = min{B(φ), B(ψ)},  (4)
B(φ ∨ ψ) = max{B(φ), B(ψ)}.  (5)
The extension of J is obtained in the same way, except that Equations 2 and 3 do not hold for J because J may be inconsistent. They are replaced by J(⊥) = δ ∈ [0, 1]. Besides, if φ is a literal, J(φ) is directly given by the state of the agent. Note that since J need not be consistent, the De Morgan laws do not hold, in general, for desire formulas.
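Definition 4 amounts to a straightforward recursive evaluator; a possible sketch (ours, with formulas encoded as nested tuples) is:

```python
def eval_belief(B, phi):
    # B maps atoms to degrees; phi is "T", "F", an atom, or a tuple
    # like ("and", ("not", "ff"), ("or", "ge", "hm"))
    if phi == "T":
        return 1.0
    if phi == "F":
        return 0.0
    if isinstance(phi, str):
        return B[phi]
    op = phi[0]
    if op == "not":
        return 1.0 - eval_belief(B, phi[1])
    if op == "and":
        return min(eval_belief(B, phi[1]), eval_belief(B, phi[2]))
    if op == "or":
        return max(eval_belief(B, phi[1]), eval_belief(B, phi[2]))
    raise ValueError("unknown connective: %s" % op)
```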
Definition 5 (Degree of Activation of a Rule) Let R be a desire-generation rule. The degree of activation of R, Deg(R), is given by Deg(R) = min(B(βR), J(ψR)), and for its unconditional counterpart R = δ ⇒+D d: Deg(R) = δ.

Definition 6 (Degree of Justification) The degree of justification of desire d is defined as J(d) = max_{R∈RJ : rhs(R)=d} Deg(R). This represents how rational it is for an agent to desire d.
3 Changes in the Agent's State

The acquisition of a new consistent piece of information with a given degree of trust in state S may cause changes in both the degrees of belief and the justification degrees in the agent's belief and desire sets respectively. Likewise, the arising of a new desire with a given degree may also cause changes in the desire set J.

3.1 Changes Caused by a New Belief

3.1.1 Changes in the Agent's Belief Set

To account for changes in the belief set B caused by the acquisition of a new piece of information, we define a new operator for belief change, noted ∗, which is an adaptation of the well-known AGM operator for belief revision [1] to the fuzzy belief setting. We consider the disjunctive normal form (DNF) of the new piece of information β (this is not a restriction, because any well-formed formula of propositional logic has an equivalent DNF expression), i.e., β = K1 ∨ K2 ∨ . . . ∨ Kn, with, for all i, Ki = l1^i ∧ l2^i ∧ . . . ∧ lm^i and lj^i ∈ {aj^i, ¬aj^i}, where aj^i ∈ A. We suppose that β is consistent. This allows us to dispense with dealing with cases in which inconsistent beliefs make it possible to deduce all formulas and, therefore, to believe everything. If a new piece of information β arrives with degree of trust α, n alternatives are possible: either the agent trusts K1 with degree α, or K2 with degree α, and so on. The value α corresponds to how strongly the agent trusts β. Here, we make a choice, motivated by the Minimal Change Principle [12]: we suppose that the agent chooses the alternative which produces the smallest change in its beliefs. We measure such changes for each disjunct of the incoming formula, thanks to the belief change operator defined below.

Definition 7 (Belief Change Alternatives) Let a ∈ A and let Ki be one of the disjuncts of β, the incoming information, whose degree of trust is α. Let B be the agent's fuzzy belief set. The ith alternative new fuzzy set of beliefs Bi = B ∗ (α/Ki) is such that, for all a ∈ A,

Bi(a) = B(a) · (1 − α) + α, if Ki ⊨ a;  B(a) · (1 − α), if Ki ⊨ ¬a;  B(a), otherwise.  (6)

This operator allows us to update the degree of the agent's belief in each atom a ∈ A, with respect to both the disjunct Ki of the incoming information and the trust degree of its source.

Observation 1 If the agent completely trusts (α = 1) a source which provides a piece of information confirming (contradicting) a, then the agent will believe a completely (will not believe a at all anymore), no matter what its previous degree of belief in a was.

This observation underlines the fact that, in case of a completely trusted source, our operator obeys the Primacy of New Information Principle [8]. We measure the amount of change in beliefs by means of a fuzzy version dH of the Hamming distance between interpretations: given two fuzzy interpretations I1 and I2,

dH(I1, I2) = Σ_{a∈A} |I1(a) − I2(a)|.  (7)

As explained previously, based on the minimal change principle, we suppose that the agent chooses the disjunct (or one of the disjuncts in case of a tie) with the smallest total amount of change. More formally:

Definition 8 (Belief Change Operator) Let β = K1 ∨ . . . ∨ Kn be the incoming information with trust degree α. The new set of beliefs is given by B ∗ (α/β) = B_{i∗}, with

i∗ = argmin_i dH(Bi, B),

where Bi is the ith alternative revision as per Definition 7. If there is more than one i such that dH(Bi, B) is minimal, one is chosen arbitrarily.

In the gourmet example, when your friend, whom you trust to a degree α = 0.8, tells you "ff ∨ ge", you would change your beliefs B as follows: K1 = ff, K2 = ge; B1(ff) = B(ff) · 0.2 + 0.8 = 0.84; B1(ge) = B(ge) = 0; B2(ff) = B(ff) = 0.2; B2(ge) = B(ge) · 0.2 + 0.8 = 0.8; dH(B1, B) = 0.64; dH(B2, B) = 0.8; therefore, i∗ = 1 and B ∗ (0.8/(ff ∨ ge)) = B1.

Proposition 1 If Ki∗ ⊨ a, i.e., Ki∗ confirms a to a certain degree, then applying the operator ∗ never causes the belief degree of a to decrease, i.e., B′(a) ≥ B(a).

Proof: If B(a) = 0 the result is obvious. Otherwise, if B(a) > 0, we have B′(a) − B(a) = α · (1 − B(a)) ≥ 0. □

Proposition 2 If Ki∗ ⊨ ¬a, i.e., Ki∗ contradicts a to a certain degree, then applying the operator ∗ never causes the belief degree of a to increase, i.e., B′(a) ≤ B(a).

Proof: We have B′(a) − B(a) = −α · B(a) ≤ 0. □

The semantics of our belief change operator is defined by the following properties. Here B represents a fuzzy belief set, β the incoming trusted information with degree of trust α, supp is the support of a fuzzy set, and ∪, ∩, ⊆ and ⊇ are fuzzy operators.

• (P∗1) (Stability) The result of applying ∗ to B with β is always a fuzzy set of beliefs: B ∗ (α/β) is a fuzzy set of beliefs.
• (P∗2) (Expansion) If Ki∗ contains only positive atoms, then the fuzzy set B expands: supp(B ∗ (α/β)) ⊇ supp(B).
• (P∗3) (Shrinkage) If Ki∗ contains only negated atoms, then the fuzzy set B shrinks: supp(B ∗ (α/β)) ⊆ supp(B).
• (P∗4) (Invariance) If the new information is completely untrusted, i.e., α = 0, invariance holds: (α = 0) ⇒ (B ∗ (α/β) = B).
• (P∗5) (Predictability) The result of applying ∗ contains all belief atoms in supp(B ∪ {α/β}): supp(B ∗ (α/β)) ⊇ supp(B ∪ {α/β}).
• (P∗6) (Identity) The result of applying ∗ does not depend on the particular form of the information: if β1 ≡ β2 and α1 = α2, then B ∗ (α1/β1) = B ∗ (α2/β2).
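The following sketch (ours; literals are encoded as (atom, polarity) pairs) implements Definitions 7 and 8 and reproduces the gourmet example above.

```python
def revise_alternative(B, K, alpha):
    # Definition 7: revise B by one disjunct K (a set of literals) with trust alpha
    Bi = dict(B)
    for atom, positive in K:
        if positive:
            Bi[atom] = B.get(atom, 0.0) * (1 - alpha) + alpha  # K entails atom
        else:
            Bi[atom] = B.get(atom, 0.0) * (1 - alpha)          # K entails not-atom
    return Bi

def revise(B, disjuncts, alpha):
    # Definition 8: keep the alternative at minimal Hamming distance from B
    def d_hamming(B1, B2):
        return sum(abs(B1.get(a, 0.0) - B2.get(a, 0.0)) for a in set(B1) | set(B2))
    return min((revise_alternative(B, K, alpha) for K in disjuncts),
               key=lambda Bi: d_hamming(Bi, B))

# Gourmet example: revising B(ff) = 0.2, B(ge) = 0 by "ff or ge" with alpha = 0.8
B = {"ff": 0.2, "ge": 0.0}
B_new = revise(B, [{("ff", True)}, {("ge", True)}], 0.8)
# B_new == {"ff": 0.84, "ge": 0.0}: the first disjunct is chosen (dH = 0.64 < 0.8)
```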
When all beliefs are crisp and the trust in the new information is complete (α = 1), our fuzzy belief-change operator satisfies the six basic AGM revision rationality postulates K∗1–K∗6 [10]. In order to show that, let us consider the standard definition of expansion of a crisp set of formulas B with a formula φ ∈ L as B + φ = {ψ : B ∪ {φ} ⊢ ψ}.

Proposition 3 If B is crisp and φ is new information whose trust is α = 1, the following hold:
1. B′ = B ∗ φ is a crisp interpretation (K∗1);
2. B′(φ) = 1 (K∗2);
3. B′ ⊆ B + φ (K∗3);
4. if B(¬φ) = 0, then B + φ ⊆ B′ (K∗4);
5. if φ ≡ ψ, then B ∗ φ = B ∗ ψ (K∗6).
For the convenience of the reader, the corresponding AGM rationality postulate has been indicated in parentheses for each thesis. Note that Postulate K∗5, which in our formalism would be "B ∗ φ = L iff φ = ⊥", is not relevant to our discussion, since we have made the assumption that new information is never inconsistent; therefore, it has not been considered.

Proof: To prove Thesis 1, we observe that, when α = 1, for all atoms a, B′(a) ∈ {0, B(a), 1}; but B(a) ∈ {0, 1}, since B is crisp; therefore, B′(a) ∈ {0, 1} as well. To prove Thesis 2, let us consider Ki∗, the chosen alternative disjunct of φ: a sufficient condition for B′(φ) = 1 is that B′(Ki∗) = 1; now, Ki∗ = l1∗ ∧ . . . ∧ lm∗; it is easy to verify that, according to Definition 7, B′(li∗) = 1 for all i = 1, . . . , m; therefore, B′(Ki∗) = min_i{B′(li∗)} = 1 and the thesis follows. As for Thesis 3, it follows trivially if B + φ = L, i.e., if B(¬φ) = 1. In all other cases, i.e., when B(¬φ) = 0, we have B(φ) = 1, and, because of the minimal change principle, B′ = B, which proves the thesis. The proof of Thesis 4 is similar: B(¬φ) = 0 implies B(φ) = 1, whence one concludes B′ = B; furthermore, B + φ = B, which verifies the thesis. Finally, to prove Thesis 5, we recall that, if φ ≡ ψ, their DNFs are identical; therefore, B ∗ φ = B ∗ ψ by definition. □
3.1.2 Changes in the Agent's Desire Set
The acquisition of a new belief may induce changes in the justification degrees of some desires. More generally, the acquisition of a new belief may induce changes in the belief set of an agent which, in turn, may induce changes in its desire set. Let β be a new belief trusted to degree α, denoted (α/β). To account for the changes in the desire set caused by this new acquisition, we have to, recursively: (i) calculate the new activation degree of each rule R ∈ RJ by considering B′, and (ii) update the justification degree of all desires in their right-hand sides (rhs(R)).
The new desire set J′ is obtained by executing the algorithm in Figure 1 with the following inputs: B′ = B ∗ (α/β), RJ, and A. The algorithm propagates changes until a fixpoint is reached; Ck is the set of desires whose justification degree changes in step k, i.e., ∀d ∈ {a, ¬a}, with a ∈ A, d ∈ Ck ⇒ Jk(d) ≠ Jk−1(d). Step 1 updates B with respect to the incoming information (α/β), and initializes to empty the set C0 of desires whose justification degrees directly change with the arrival of (α/β). Step 2 updates C0. Steps 3 and 4
1. B′ ← B ∗ (α/β); k ← 1; C0 ← ∅;
2. for each d ∈ {a, ¬a} with a ∈ A do
   (a) consider all Ri ∈ RJ such that rhs(Ri) = d;
   (b) calculate Deg(Ri) by considering B′;
   (c) J0(d) ← max_{Ri} Deg(Ri);
   (d) if J0(d) ≠ J(d) then C0 ← C0 ∪ {d}.
3. repeat
   (a) Ck ← ∅;
   (b) for each d ∈ Ck−1 do
       i. for all Rj ∈ RJ such that ψRj ⊨ d do
          A. calculate Deg(Rj) considering Jk−1(d);
          B. Jk(rhs(Rj)) ← max_{Ri : rhs(Ri) = rhs(Rj)} Deg(Ri);
          C. if Jk(rhs(Rj)) ≠ Jk−1(rhs(Rj)) then Ck ← Ck ∪ {rhs(Rj)}.
       ii. k ← k + 1.
   until Ck−1 = ∅.
4. for all d, J′(d) is given by the following equation:

J′(d) = J(d), if d ∉ C; Ji(d), otherwise,  (8)

where i is such that d ∈ Ci and, ∀j ≠ i, if d ∈ Cj then j ≤ i; i.e., the justification degree of a "changed" desire is the last degree it takes, and C = ∪_{k=0}^{∞} Ck is the set of "changed" desires.

Figure 1. An algorithm to compute the new desire set upon arrival of a new belief.
update desire degrees which are indirectly changed by the incoming information. Of course, the set RJ does not change. In the gourmet example, learning β = ff ∨ ge with α = 0.8, which has you change your beliefs to B′ such that B′(ff) = 0.84 and B′(ge) = 0, makes J change to J′ such that J′(ww) = J′(hf) = 0.84, J′(he) = 0, J′(rw) = J′(hm) = 0.7, and J′(¬hm) = 0.84.

Proposition 4 If the chosen disjunct Ki∗ does not contain negated atoms, then J′ = ∪_{k=0}^{∞} Jk.

Proof: According to Proposition 1, for all a we have B′(a) ≥ B(a). Therefore, the degree of all desires d in the new desire set J′ may not decrease, i.e., for all k, Jk(d) ≥ Jk−1(d). □

Proposition 5 If the chosen disjunct Ki∗ only contains negated atoms, then J′ = ∩_{k=0}^{∞} Jk.

Proof: According to Proposition 2, for all a we have B′(a) ≤ B(a). Therefore, the degree of all desires d in the new desire set J′ may not increase, i.e., for all k, Jk(d) ≤ Jk−1(d). □
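Our reading of the propagation loop of Figure 1 can be sketched as follows (a hypothetical encoding: each rule is a pair of an activation function and its right-hand-side literal, and negation is written with a leading "~"):

```python
def update_desires(rules, B, atoms, max_iter=1000):
    lits = [l for a in atoms for l in (a, "~" + a)]
    J = {d: 0.0 for d in lits}
    for _ in range(max_iter):
        changed = False
        for d in lits:
            # J(d) is the maximal activation degree over rules concluding d
            deg = max((act(B, J) for act, rhs in rules if rhs == d), default=0.0)
            if deg != J[d]:
                J[d], changed = deg, True
        if not changed:          # fixpoint reached
            break
    return J

# The gourmet rules R1-R6, with B(ff) = 0.84 and B(ge) = 0 after revision:
rules = [
    (lambda B, J: B["ff"], "hf"),                  # R1
    (lambda B, J: min(B["ge"], J["~hf"]), "he"),   # R2
    (lambda B, J: J["hm"], "rw"),                  # R3
    (lambda B, J: J["hf"], "ww"),                  # R4
    (lambda B, J: J["hf"], "~hm"),                 # R5
    (lambda B, J: 0.7, "hm"),                      # R6
]
J = update_desires(rules, {"ff": 0.84, "ge": 0.0},
                   ["hf", "he", "rw", "ww", "hm"])
# yields J(hf) = J(ww) = J(~hm) = 0.84 and J(hm) = J(rw) = 0.7, as in the text
```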
3.2 Changes Caused by a New Desire
The acquisition of a new desire may cause changes in the fuzzy desire set and in the desire-generation rule base. In this work, for the sake of simplicity, we consider only new desires which do not depend on beliefs and/or other desires. A new desire d′, justified with degree δ′, implies the addition of the desire-generation rule δ′ ⇒+D d′ into RJ, resulting in the new base R′J. By definition of a desire-generation rule base, R′J must not contain a δ ⇒+D d′ with δ ≠ δ′. How does S change with the arising of the new desire d′?
• Any rule δ ⇒+D d′ with δ ≠ δ′ is retracted from RJ;
• δ′ ⇒+D d′ is added to RJ.

It is clear that the arising of a new desire does not change the belief set of the agent. The new fuzzy set of desires, J′, is computed by the algorithm in Figure 2:

1. if {δ ⇒+D d′} ∈ RJ then R′J ← (RJ \ {δ ⇒+D d′}) ∪ {δ′ ⇒+D d′}; else R′J ← RJ ∪ {δ′ ⇒+D d′};
2. k ← 1; C0 ← {d′}; J0(d′) ← δ′;
3. repeat
   (a) Ck ← ∅;
   (b) for each d ∈ Ck−1 do
       i. for all Rj ∈ R′J such that ψRj ⊨ d do
          A. calculate their respective degrees Deg(Rj) considering Jk−1(d);
          B. Jk(rhs(Rj)) ← max_{Ri : rhs(Ri) = rhs(Rj)} Deg(Ri);
          C. if Jk(rhs(Rj)) ≠ Jk−1(rhs(Rj)) then Ck ← Ck ∪ {rhs(Rj)}.
       ii. k ← k + 1.
   until Ck−1 = ∅.
4. for all d, J′(d) is given by Equation 8.

Figure 2. An algorithm to compute the new desire set upon the arising of a new desire.
4 Goal Adoption
Goals serve a dual role in the deliberation process, capturing aspects of both intentions and desires. Besides expressing desirability, when an agent adopts a goal, it also makes a commitment to pursue the goal. Here, we concentrate exclusively on the second role served by a goal. For more information about intentions see for example Cohen and Levesque [5]. The main point about desires is that we expect a rational agent to try to manipulate its surrounding environment to fulfil them. In general, considering a problem P to solve, not all generated desires can be adopted at the same time, especially when they are not feasible at the same time. We assume we dispose of a P-dependent function FP which, given a fuzzy set of beliefs B and a fuzzy set of desires J, returns a degree γ which corresponds to the certainty degree of the most certain feasible solution found. We may call γ the degree of feasibility of J given B, i.e., FP(B, J) = γ.

Definition 9 (γ-Goal Set) A γ-goal set, with γ ∈ [0, 1], in state S is a fuzzy set of desires G such that:
1. G is justified: G ⊆ J, i.e., ∀d ∈ {a, ¬a}, a ∈ A, G(d) ≤ J(d);
2. G is γ-feasible: FP(B, G) ≥ γ;
3. G is consistent: ∀d ∈ {a, ¬a}, a ∈ A, G(d) + G(¬d) ≤ 1.

In the gourmet example, J is inconsistent, in that J(hm) + J(¬hm) = 1.54 > 1; on the other hand, consistency requires that G(hm) + G(¬hm) ≤ 1; therefore, one possible choice for G could be such that G(hm) = 0.45 and G(¬hm) = 0.55, or even G(hm) = 0 and G(¬hm) = 0.84. In general, given a fuzzy set of desires J, there may be more than one possible γ-goal set G. However, a rational agent in state S = ⟨B, RJ, J⟩, for practical reasons, may need to elect one precise set of goals, G∗, to pursue, which depends on S. The choice of one γ-goal set over the others may be based on a preference relation ≽ on
desire sets, as proposed in [7], where it is required that a goal election function Gγ is such that:
• ∀S, Gγ(S) is a γ-goal set, i.e., it does indeed return a γ-goal set; and
• ∀S, if G is a γ-goal set, then Gγ(S) ≽ G, i.e., the γ-goal set returned by the function Gγ and then adopted by the agent is "optimal".
The issue of defining a specific goal election function is a critical part of constructing a rational agent framework; this issue falls outside the scope of this work.
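For illustration, the three conditions of Definition 9 can be checked mechanically; the sketch below (ours) treats F_P as an opaque feasibility function supplied by the problem at hand.

```python
def is_gamma_goal_set(G, J, B, gamma, feasibility, atoms):
    # 1. justified: G(d) <= J(d) for every literal d
    justified = all(G.get(d, 0.0) <= J.get(d, 0.0) for d in G)
    # 2. gamma-feasible: F_P(B, G) >= gamma
    feasible = feasibility(B, G) >= gamma
    # 3. consistent: G(a) + G(~a) <= 1 for every atom a
    consistent = all(G.get(a, 0.0) + G.get("~" + a, 0.0) <= 1.0 for a in atoms)
    return justified and feasible and consistent
```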
5 Summary
We have investigated how trust in a source of information can influence the degree of an agent's beliefs, and how these graded beliefs influence the agent's generated desires and then its adopted goals. We proposed a new fuzzy belief change operator to deal with this new kind of information, and two algorithms for updating the agent's desire set: one after the arrival of a new, possibly only partially trusted, piece of information, and one after the arising of a new unconditional desire. Finally, requirements for goal adoption have been stated.
REFERENCES
[1] C. E. Alchourrón, P. Gärdenfors, and D. Makinson, 'On the logic of theory change: Partial meet contraction and revision functions', J. Symb. Log., 50(2), 510–530, (1985).
[2] J. Bell and Z. Huang, 'Dynamic goal hierarchies', in PRICAI '96: Proceedings from the Workshop on Intelligent Agent Systems, Theoretical and Practical Issues, pp. 88–103, London, UK, (1997). Springer-Verlag.
[3] J. Broersen, M. Dastani, J. Hulstijn, and L. van der Torre, 'Goal generation in the BOID architecture', Cognitive Science Quarterly Journal, 2(3–4), 428–447, (2002).
[4] C. Castelfranchi, R. Falcone, and G. Pezzulo, 'Trust in information sources as a source for trust: a fuzzy approach', in Proceedings of AAMAS'03, pp. 89–96, (2003).
[5] P. R. Cohen and H. J. Levesque, 'Intention is choice with commitment', Artif. Intell., 42(2–3), 213–261, (1990).
[6] C. da Costa Pereira and A. Tettamanzi, 'Towards a framework for goal revision', in Proceedings of BNAIC'06, pp. 99–106, (2006).
[7] C. da Costa Pereira and A. Tettamanzi, 'Goal generation with relevant and trusted beliefs', in Proceedings of AAMAS'08, pp. 397–404, (2008).
[8] M. Dalal, 'Investigations into a theory of knowledge base revision', in AAAI, pp. 475–479, (1988).
[9] F. Dignum, D. N. Kinny, and E. A. Sonenberg, 'From desires, obligations and norms to goals', Cognitive Science Quarterly, 2(3–4), 407–427, (2002).
[10] P. Gärdenfors, 'Belief revision: A vademecum', in Meta-Programming in Logic, 1–10, Springer, Berlin, (1992).
[11] S. O. Hansson, 'Ten philosophical problems in belief revision', Journal of Logic and Computation, 13(1), 37–49, (February 2003).
[12] H. Katsuno and A. O. Mendelzon, 'Propositional knowledge base revision and minimal change', Artif. Intell., 52(3), 263–294, (1991).
[13] S. Parsons and P. Giorgini, 'An approach to using degrees of belief in BDI agents', in Information, Uncertainty, Fusion, eds., B. Bouchon-Meunier, R. R. Yager, and L. A. Zadeh, Kluwer, Dordrecht, (1999).
[14] A. S. Rao and M. P. Georgeff, 'Modeling rational agents within a BDI-architecture', in Proceedings of KR'91, pp. 473–484, (1991).
[15] S. Shapiro, Y. Lespérance, and H. J. Levesque, 'Goal change', in Proceedings of IJCAI'05, pp. 582–588, (2005).
[16] J. Thangarajah, L. Padgham, and J. Harland, 'Representation and reasoning for goals in BDI agents', in Proceedings of CRPITS'02, pp. 259–265, (2002).
[17] R. H. Thomason, 'Desires and defaults: A framework for planning with inferred goals', in Proceedings of KR'00, pp. 702–713, (2000).
[18] M. Birna van Riemsdijk, Cognitive Agent Programming: A Semantic Approach, Ph.D. dissertation, University of Utrecht, 2006.
[19] L. A. Zadeh, 'Fuzzy sets', Information and Control, 8, 338–353, (1965).
Adaptive play in Texas Hold'em Poker
Raphaël Maîtrepierre, Jérémie Mary and Rémi Munos1
Abstract. We present a Texas Hold'em poker player for limit heads-up games. Our bot is designed to adapt automatically to the strategy of the opponent and is not based on Nash equilibrium computation. The main idea is to design a bot that builds beliefs on its opponent's hand. A forest of game trees is generated according to those beliefs and the solutions of the trees are combined to make the best decision. The beliefs are updated during the game according to several methods, each of which corresponds to a basic strategy. We then use an exploration-exploitation bandit algorithm, namely UCB (Upper Confidence Bound), to select a strategy to follow. This results in a global play that takes into account the opponent's strategy, and which turns out to be rather unpredictable. Indeed, if a given strategy is exploited by an opponent, the UCB algorithm will detect it using change point detection, and will choose another one. The resulting program, called Brennus, participated in the AAAI'07 Computer Poker Competition in both the online and equilibrium competitions and ranked eighth out of seventeen competitors.
1 INTRODUCTION
Games are an interesting domain for Artificial Intelligence research. Computer programs are better than humans in many games, including Othello, chess [10] and checkers [14]. Those games are perfect information games, in the sense that all information useful for predicting the outcome of the game is common knowledge of all players. Moreover, those games are deterministic. On the contrary, Poker is an incomplete-information and stochastic game: players know neither the cards their opponents are holding nor the community cards remaining to come. These aspects make Poker a challenging domain for AI research [7]. Thanks to Nash's work [13] we know that an equilibrium strategy (Nash equilibrium) exists for the two-player game of Poker. Now, since this is a zero-sum game, if a player plays according to this strategy, on average he will not lose. But a good Poker player should also be able to adapt to his opponent's game in order to exploit possible weaknesses, since the goal of Poker is to win the maximum of chips. Studies have been done in the domain of opponent exploitation for the game of RoShamBo [3, 4]. For this game, a Nash equilibrium consists in playing all three actions (Rock, Paper, Scissors) uniformly at random; this strategy does not lose (on average) since it is unpredictable, but it does not win either! Indeed, against a player who always plays "Paper", for example, the expected payoff of the Nash strategy is null (1/3 lose, 1/3 win, 1/3 draw), whereas a player who could exploit his opponent would quickly find out that playing "Scissors" all the time is the best decision.
1 INRIA Lille Nord Europe, France, email: {raphael.maitrepierre,jeremie.mary,remi.munos}@inria.fr
In the game of poker, this idea remains true to some extent: if an opponent bluffs too often we should call him more often and, on the other hand, if he rarely bluffs, we should be very careful. It therefore appears necessary to model our opponent's strategy if we want to exploit his weaknesses and maximize our income. In the last few years, Poker research has received a large amount of interest. One of the first approaches to building a Poker-playing bot was based on simulations [8]. The logical next step was to compute Nash equilibria for Poker. In a zero-sum game, it is possible to compute an equilibrium of the sequence form of the game using linear programming methods. But in Poker the state space is so huge that one needs to use abstraction methods (which gather similar states) before solving the sequence-form game. Such powerful methods have been used in [5, 11, 12, 16] to reduce the size of the game and compute a near-optimal Nash equilibrium. Some research has also been conducted on opponent modeling for game-tree search [6], and the resulting program, named Vexbot, is available in the Poker-Academy software. In this paper, we present a new method for building an adaptive Poker bot based on belief updates and strategy selection. Our contribution is two-fold. First, for opponent modeling we consider a belief update on the opponent's hand based on Bayes' rule, which, combined with different opponent models, yields different basic strategies. Second, we consider a strategy selection procedure based on a bandit algorithm (namely the Upper Confidence Bounds algorithm introduced in [2]) which performs a good trade-off between exploitation (choosing a strategy that has performed well against the opponent) and exploration (trying another, apparently sub-optimal, strategy in order to learn more about the opponent). The paper is organized as follows: after briefly recalling the rules of Hold'em Poker, we present our contributions in Section 3, with the description of the forest of game trees, the belief update rule for opponent modeling, and the bandit algorithm for strategy selection. We conclude with experimental results.
2
Rules of the game
In this paper we consider the two-player version of Texas Hold'em Poker called heads-up. A good introduction to the rules can be found in [7]. The betting structure used is limit poker; this is the structure used in the AAAI Computer Poker Competition. A hand of Hold'em consists of four stages, each one followed by a betting round. The game begins with two forced bets called the blinds. The first player puts in the small blind (half a small bet) and the other player puts in the big blind (one small bet). Each player is dealt two hidden cards and a first betting round occurs; this is the preflop stage. Then, three community cards (called the board) are dealt face up; this is the flop stage. In the third stage, the turn, a fourth card is added to
the board. A last card is added to the board in the river stage. After the last betting round the showdown occurs: the remaining players compare their hands (their hole cards), and the player with the best five-card combination formed with his two cards and the community cards wins the pot (the amount of chips bet by all players). In limit Poker, two sizes of bet are used: in the first two stages the bet is called the small bet, and in the last two it is called the big bet and is worth two small bets. In the betting rounds, the player who acts has three possibilities:

• He may fold, in which case he loses the game and the chips he has put in.
• He may call, in which case he puts in the same amount of chips as his opponent; if no chips have been bet in the round, this action is called check.
• He may raise, in which case he puts in one bet more than his opponent has bet; if no chips have been bet, this action is called bet.

3
OUR APPROACH
The approach studied in this paper is close to the human way of playing poker: our bot tries to guess which hands his opponent may hold, based on the opponent's previous decisions in the game. For that purpose, we assign to each possible hand of the opponent the probability that he holds this hand given what he has played before. This association of hands with probabilities represents the beliefs of our bot, and is saved in a table. These probabilities are updated after each action taken by the opponent using a simple Bayes rule (see subsection 3.2). Then, given those beliefs, we compute a "forest" of Min-Max trees (where each tree corresponds to a possible hand assignment to both players, based on the current beliefs of our bot about his opponent) to evaluate the current situation, and we make our decision based on a weighted combination of the solutions returned by the Min-Max trees. We describe this step in the next subsection. This method is used to make decisions after the flop; for preflop play we use precalculated tables from [15].
3.1
Forest of Game Trees
A forest of trees is composed of a set of Min-Max game trees where each game tree is defined by two couples of hands, one for each player. A couple of hands represents a player's point of view: his real hand and his belief about his opponent's hidden cards. For the AI player, the real hand is the two actual hidden cards dealt to him, and the opponent's hands are chosen randomly according to the current belief table of probabilities about his opponent's cards. For the opponent player, his real hand is chosen (randomly) according to the belief table (independently of the choice of the AI opponent's hand) and his belief about our bot's hand is generated uniformly at random (i.e., currently, there is no model of the opponent's belief about our bot's cards). The beliefs about the opponent's hands are fixed within a tree. To each leaf and node of each game tree, 2 values are assigned, one for each player (Vp1 and Vp2). One represents the expected outcome from the point of view of the AI bot: the result of the game between his hand and the current belief about his opponent (Vp1). The other value is the expected outcome from the point of view of the opponent (Vp2). Since the possible hands of our opponent have different probabilities, we build a "forest" of such trees in order to evaluate the current situation. Once all the game trees have been solved, the value of each possible action is given by the convex combination of the values of all trees, weighted by the probability of the hands used in the trees (the belief that each tree corresponds to the true situation). Those probabilities are given by the beliefs table. Each game tree is solved as follows. There are 3 kinds of nodes:

• Action nodes: nodes representing actions of players.
• Chance nodes: nodes representing chance events (cards dealing).
• Leaves: nodes representing the end of the game. The value of a leaf depends on the last selected action:

• If it is a "fold", the value corresponding to the player who made this action is 0, while his opponent's value equals the amount of chips in the pot at this point of the tree.
• If it is a "call", the value of each player is the amount of chips won (or lost) against his opponent's hand.

Now, concerning the computation of action node values: the value of the active player (the one who takes an action in that node) is defined as the maximal value (for the same player) of the 3 children nodes (corresponding to the 3 possible actions) minus the cost of the action (the amount of chips added to the pot by the action). His opponent's value is the value of the child corresponding to the action chosen by the active player. Figures 1 and 2 illustrate the action node value update. In case of equality, when choosing the max value for the active player, we choose the most aggressive action (i.e. "raise" rather than "call", "call" rather than "fold"), the reason being that in heads-up poker, playing aggressively is usually better than playing passively.
Figure 1. Update of action nodes: here the active player is player 1 (black nodes) and his value is Vp1. The values shown on the edges are the children node values minus the corresponding action cost. The active player chooses the action corresponding to the maximum edge value; here the rightmost action is chosen (Vp1 = 40). His opponent's value (Vp2) is the value corresponding to the action chosen by player 1: Vp2 = 10.
The players' values at a chance node are the means over the children of the node. For example, at the "turn" stage in a two-player game, there are 45 remaining possible cards to be dealt, so the values for the players are:

Vp1 = (1/45) Σi=1..45 Vp1^i ;  Vp2 = (1/45) Σi=1..45 Vp2^i
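To make the two-value solving procedure concrete, here is a minimal Python sketch of it (ours, not the authors' code); the node structure (kind, children, active player, action costs) is an illustrative assumption.

def solve(node):
    """Return (Vp1, Vp2) for a game-tree node."""
    if node.kind == "leaf":
        return node.values                       # fixed by the fold/call rules above
    if node.kind == "chance":
        vals = [solve(child) for child in node.children]
        n = len(vals)                            # e.g. 45 possible turn cards
        return (sum(v[0] for v in vals) / n,
                sum(v[1] for v in vals) / n)
    # action node: children maps 'raise'/'call'/'fold' to subtrees; iterating
    # in this aggressive-first order with a strict '>' breaks ties as described
    me = node.active - 1                         # index of the active player
    best_val, best_pair = None, None
    for action in ("raise", "call", "fold"):
        if action not in node.children:
            continue
        pair = solve(node.children[action])
        mine = pair[me] - node.cost[action]      # subtract the chips this action adds
        if best_val is None or mine > best_val:
            best_val, best_pair = mine, pair
    result = list(best_pair)
    result[me] = best_val                        # the opponent keeps the child value
    return tuple(result)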
Since computing whole trees takes too long for online play, we use an approximation for computing tree values at chance nodes: instead
of computing chance node values nine times at each stage of the game (once for each sequence of actions leading to a next stage), we compute the values for the first chance node encountered, and for chance nodes in the same subtree (resulting from the same card dealt) we use the value of the first chance node, modified to account for the change of pot size.

Figure 2. Here the active player is player 2 (white nodes); he chooses the max value among the Vp2 (this corresponds to the second action). The corresponding Vp1 and Vp2 are updated.
3.2
Belief Update
The AI's beliefs about his opponent's hidden cards (H) are the probabilities that he really holds these cards given his past actions. At the beginning of a hand2, each possible couple of cards is assigned a uniform probability, since no information has been revealed by our opponent yet. After each action of our opponent, we update those beliefs according to a model of play of the opponent, expressed in terms of the probabilities of choosing an action given his game. Actually, we consider several possible such models, each of which defines a specific style of play. A model of play of our opponent is defined by the probabilities P(a|H, It) of choosing an action a given his hidden cards H and the information set It, where It represents all the information available to both players at time t (e.g. the flop, the bets of the players up to time t, ...). Now, once the opponent has chosen an action a at time t, the beliefs P(H|It) about his hidden cards H are updated according to Bayes' rule: P(H|It) = P(H|It−1)P(a|H, It−1), where It = It−1 ∪ {a}.
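A minimal sketch (ours) of this belief-table update follows; the function model approximates P(a|H, It), and we renormalize the table explicitly, which the formula above leaves implicit.

def update_beliefs(beliefs, action, info, model):
    """beliefs: dict mapping each candidate hand H to P(H | I_{t-1})."""
    for hand in beliefs:
        beliefs[hand] *= model(action, hand, info)   # Bayes' rule, unnormalized
    total = sum(beliefs.values())
    if total > 0:                                    # renormalize to a distribution
        for hand in beliefs:
            beliefs[hand] /= total
    return beliefs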
2 Here, hand means the game from preflop to showdown, if it occurs.
Thus a simple belief update is performed after each action, based on a model of play P(a|H, It) of the opponent. We now explain our choice of such models. To define the style (or model) of play, Poker theorists [9] usually consider two attributes:
• whether the player plays tight (plays very few hands) or loose (plays a lot of hands);
• whether the player is aggressive (raises and bluffs) or passive (calls other players' bets).
We selected three features to define the relevant properties of a game state. The first one is the stage S of the game (Flop, Turn, River), since it determines the size of the bets and the number of community cards still to come. The second one is the hand strength F (probability of winning) of the hand. The third one is the size of the pot C, since it greatly influences the way of playing. We thus model the strategy of the opponent P(a|H, It) using these three features (F, C, S) of H and It, and write P(a|F, C, S) for the corresponding model (an approximation of P(a|H, It)). In our implementation, we model two basis strategies: one is tight/aggressive and the other is loose/passive. A model is defined as follows: for each possible stage of the game (Flop, Turn, River) we have a table that gives the probability of choosing each action as a function of the hand strength and the pot size. Hand strength is discretized into 5 possible values and pot size is discretized in steps of 2 big blinds. For example, at the Flop stage the table is composed of 5 × 4 = 20 values. Tables at the other stages are bigger since the maximum size of the pot is bigger. Those tables were generated using expert knowledge. They are not detailed in the paper for space reasons but are available at http://sequel.futurs.inria.fr/maitrepierre/basis-strategies-tables. At the beginning we only consider one strategy (tight/aggressive); after several hands against an opponent, we are able to identify some weaknesses in our strategy, so we add new strategies. A new strategy is a convex combination of the two basis strategies (see the sketch below). For example, since the initial strategy is very tight, adding a looser strategy and selecting which one to use (by the method described in the next section) improves the global behavior. Figure 3 shows the improvement brought by adding new strategies in games against Vexbot [6]. In the version that participated in the AAAI'07 competition we considered 5 different strategies built from the 2 basis strategies.
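A sketch (ours) of building such a combined strategy; the table layout, with keys (S, F, C) and values that are action distributions, is an assumption for illustration.

def mix_strategies(model1, model2, alpha):
    """Return P(a | F, C, S) = alpha * model1 + (1 - alpha) * model2."""
    mixed = {}
    for key, probs1 in model1.items():           # key = (stage, strength, pot)
        probs2 = model2[key]
        mixed[key] = {a: alpha * probs1[a] + (1 - alpha) * probs2[a]
                      for a in probs1}           # actions: fold / call / raise
    return mixed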
3.3
Strategy Selection
We have seen in the previous section that the different styles of play attributed to the opponent yield different belief updates, which in turn define different basic strategies. We now have to select a good one. To do that we use a bandit algorithm called UCB (for Upper Confidence Bounds), see [2]. This algorithm allows us to find a good trade-off between exploitation (using what is believed to be the best strategy) and exploration (selecting another, apparently sub-optimal, strategy in order to get additional information about the opponent). The UCB algorithm works by defining a confidence bound for each possible strategy and selecting the strategy that has the highest upper bound. In our version we use a slightly modified variant of the algorithm, named UCB-tuned [1], which takes into account the empirical variance of the obtained rewards. For strategy i, the bound is defined as:

Bi(n) = σi √(2 ln n / ni)

where:
• n is the number of hands played;
• ni is the number of times strategy i was played;
• σi is the empirical standard deviation of the rewards of strategy i.

The UCB(-tuned) algorithm consists in selecting the strategy i which has the highest upper bound x̄i(n) + Bi(n), where x̄i(n) is the average reward won by strategy i up to time n. This version of UCB assumes that the rewards corresponding to each strategy are independent and identically distributed samples of fixed random variables. However, in Poker, our opponent may change his style of play and search for a counter-strategy which adapts to ours. In order to detect possible changes in the opponent's strategy, we combine the UCB selection policy with a change-point detection technique, which should detect an abrupt decrease in the rewards when using the best strategy (this would correspond to an adaptation of the opponent to our strategy). For this purpose, we define a lower bound on each strategy:

Li(n) = x̄i(n) − Bi(n),

and we compute the moving average of the rewards, written x̄i(n − 200 : n), on a window corresponding to the last 200 hands played with each strategy. We say that there is a change-point detection if, for the current best strategy i, it happens that x̄i(n − 200 : n) ≤ Li(n) (i.e. the average reward obtained over a certain time period is actually worse than the current lower bound on the expected reward). In that case we give the interpretation that this strategy is starting to be less effective against the opponent (the opponent adapts to it), and we decide to forget the period when strategy i was the best, recomputing the bounds and the average rewards for each strategy over the last 200 hands only. Change-point detection is illustrated in Figure 4: near the 370th hand, the average income of strategy 1 has decreased below the lower confidence bound, so we recompute new averages and bounds.

Figure 3. Performance of one, four, and five strategies against Vexbot (which is an adaptive bot). We observe that the resulting meta-strategy is stronger because it adapts automatically to the opponent and is less predictable.

Figure 4. Change-point detection. After hand 370 the average reward of strategy 1 goes under UCB's bound, so the history of hands is reset and we recompute new bounds for each strategy.
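A compact sketch (ours) of this selection-and-reset loop; the stats bookkeeping and variable names are our assumptions, not the authors' implementation, and every strategy is assumed to have been played at least once.

import math

def ucb_bound(sigma_i, n, n_i):
    return sigma_i * math.sqrt(2 * math.log(n) / n_i)

def select_strategy(stats):
    """stats[i]: dict with count 'n', running 'mean', std 'std', 'rewards' list."""
    n = sum(s["n"] for s in stats)
    return max(range(len(stats)),
               key=lambda i: stats[i]["mean"]
                             + ucb_bound(stats[i]["std"], n, stats[i]["n"]))

def change_point(stats, i):
    """True if the 200-hand moving average of strategy i fell below L_i(n)."""
    recent = stats[i]["rewards"][-200:]
    if len(recent) < 200:
        return False
    n = sum(s["n"] for s in stats)
    lower = stats[i]["mean"] - ucb_bound(stats[i]["std"], n, stats[i]["n"])
    return sum(recent) / 200 <= lower   # if so: reset history to the last 200 hands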
4
Numerical results
We tested our bot against Sparbot [5] and Vexbot [6], which are the current best bots in limit heads-up Poker, as well as against an AlwaysCall bot and an AlwaysRaise bot (two deterministic bots which always play the same action). The tests were sessions of 1000 hands, and we tested our bot against each opponent over 10 sessions. Vexbot's and our bot's memories were reset after each session. Results are presented in Table 1.

              Our bot  Vexbot  Sparbot  AlwCall  AlwRaise
Our bot          —     +0.05    +0.02    +1.01    +1.87
Vexbot        -0.05      —     +0.056    +1.04    +2.98
Sparbot       -0.02   -0.056      —      +0.47    +1.34
AlwaysCall    -1.01    -1.04    -0.47      —       0.00
AlwaysRaise   -1.87    -2.98    -1.34     0.00      —

Table 1. Matches against different bots, over 10 sessions of 1000 hands. Results are expressed in small blinds won per hand for the row player against the column player.

We have studied UCB's behavior over the course of a match against Vexbot; studying this match seems the most interesting to us since Vexbot is the only bot which adapts to his opponent's behavior. Figure 5 shows the use of the different strategies over the match. We can see that some strategies are favored over others during certain periods: between hands 1500 and 2500, strategies 1 and 2 are used very often; afterwards, strategies 3 and 4 are used over the following 1000 hands. This shows our opponent's capacity for adaptation, and the fact that UCB, thanks to change-point detection, detects this adaptation and changes the current strategy. We can see that strategy 5 is rarely used during the match, but its addition improves the performance of our bot: Figure 3 shows the performance difference before and after the addition of strategy 5. We must keep in mind that UCB not only performs a choice over the strategies but also gives us a strategy which is a mix of the basic ones. So Vexbot defeats all our basic strategies but is defeated by the meta-strategy. Also note that Sparbot, which plays a pseudo-equilibrium, is defeated. This is very interesting because equilibrium players, since they have no weaknesses to exploit, are a nightmare for adaptive play. It means that, even without taking care to compute an equilibrium play over our basis strategies, the meta-strategy can adapt to end up not so far from an equilibrium.
Figure 5. These curves show the number of uses of each strategy over the hands played. The reference curve represents a uniform use of each strategy. Plateaus represent periods during which a strategy is not used, whereas slopes show heavy use of it.

We registered our bot for the AAAI'07 Computer Poker Competition. It took part in two competitions, the online learning competition and the equilibrium one. Results can be viewed at http://www.cs.ualberta.ca/~pokert. Even though our approach is able to defeat all the AAAI'06 bots, we did not perform very well in this competition (not in the top 5 bots). We see several reasons for this. Firstly, our approach requires a lot of computer time during the match, so we had to limit the Monte Carlo exploration in order to comply with the time limit. Secondly, the strategies of the top competitors are really very close to a Nash equilibrium, and as our different strategies are not computed to be Nash equilibria, our aggressive play is defeated. In fact, in a future version we think that the meta-strategy obtained by uniformly choosing one of our basis strategies should be made to approach a Nash equilibrium. Doing so would ensure that we do not lose chips during the exploration stage, because at the very beginning UCB performs a near-uniform exploration of all strategies. Moreover, it would offer a good response to a Nash equilibrium player: another Nash equilibrium. Another future improvement will be to update the expectations of several arms of the UCB at the same time. This is possible because there are correlations between the rewards of the arms: if a slightly aggressive style does not work, a very aggressive one will probably fail too. This will allow us to add more basic strategies and to make more subtle attempts at exploitation.

5
CONCLUSIONS
We presented a Texas Hold'em limit poker player which adapts its playing style to its opponents. It combines belief update methods to obtain different strategies. The use of the UCB algorithm enables fast adaptation to modifications of the opponent's playing style. For human players, the resulting bot seems more pleasant to play against than equilibrium ones, since it tries different strategies against its opponent. Moreover, due to the UCB selection, the style of play varies very quickly, which sometimes gives the illusion that the computer tried to trap the opponent. Using different strategies and choosing the right one depending on the opponent's playing style seems to be a promising idea and should be adapted to multi-player games.

REFERENCES
[1] J.-Y. Audibert, R. Munos, and C. Szepesvári. Use of variance estimation in the multi-armed bandit problem. NIPS Workshop on On-line Trading of Exploration and Exploitation, Vancouver, 2006.
[2] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, ‘Finite-time analysis of the multiarmed bandit problem’, Machine Learning, 47(2/3), 235–256, (2002).
[3] D. Billings, ‘The first international RoShamBo programming competition’, The International Computer Games Association Journal, 1(23), 42–50, (2000).
[4] D. Billings, ‘Thoughts on RoShamBo’, The International Computer Games Association Journal, 1(23), 3–8, (2000).
[5] D. Billings, N. Burch, A. Davidson, R. Holte, J. Schaeffer, T. Schauenberg, and D. Szafron. Approximating game-theoretic optimal strategies for full-scale poker, 2003.
[6] D. Billings, A. Davidson, T. Schauenberg, N. Burch, M. Bowling, R. Holte, J. Schaeffer, and D. Szafron, ‘Game-tree search with adaptation in stochastic imperfect-information games’, Computers and Games: 4th International Conference, 21–34, (2004).
[7] Darse Billings, Aaron Davidson, Jonathan Schaeffer, and Duane Szafron, ‘The challenge of poker’, Artificial Intelligence, 134(1-2), 201–240, (2002).
[8] Darse Billings, Lourdes Pena, Jonathan Schaeffer, and Duane Szafron, ‘Using probabilistic knowledge and simulation to play poker’, in AAAI/IAAI, pp. 697–703, (1999).
[9] Doyle Brunson, Super System: A Course in Power Poker, Cardoza Publishing, 1979.
[10] Murray Campbell, A. Joseph Hoane Jr., and Feng-hsiung Hsu, ‘Deep Blue’, Artificial Intelligence, 134, 57–83, (2002).
[11] Andrew Gilpin and Tuomas Sandholm, ‘Finding equilibria in large sequential games of imperfect information’, in EC ’06: Proceedings of the 7th ACM Conference on Electronic Commerce, pp. 160–169, New York, NY, USA, (2006). ACM.
[12] Andrew Gilpin and Tuomas Sandholm, ‘Better automated abstraction techniques for imperfect information games, with application to Texas Hold’em poker’, in AAMAS ’07: Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 1–8, New York, NY, USA, (2007). ACM.
[13] J. F. Nash. Equilibrium points in n-person games, 1950.
[14] J. Schaeffer and R. Lake. Solving the game of checkers, 1996.
[15] A. Selby. Optimal strategy for heads-up limit hold’em.
[16] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione, ‘Regret minimization in games with incomplete information’, in Advances in Neural Information Processing Systems 20, eds., J.C. Platt, D. Koller, Y. Singer, and S. Roweis, MIT Press, Cambridge, MA, (2008).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-463
Theoretical and Computational Properties of Preference-based Argumentation Yannis Dimopoulos1 and Pavlos Moraitis2 and Leila Amgoud3 Abstract. In recent years, argumentation has been gaining increasing interest for modeling different reasoning tasks of an agent. Many recent works have acknowledged the importance of incorporating preferences or priorities in argumentation. However, relatively little is known about the theoretical and computational implications of preferences in argumentation. In this paper we introduce and study an abstract preference-based argumentation framework that extends Dung's formalism by imposing a preference relation over the arguments. Under some reasonable assumptions about the preference relation, we show that the new framework enjoys desirable properties, such as coherence. We also present theoretical results that shed some light on the role that preferences play in argumentation. Moreover, we show that although some reasoning problems are intractable in the new framework, the preference relation appears to have a positive impact on the complexity of reasoning.
1
Introduction
Argumentation has been an Artificial Intelligence keyword for the last fifteen years, especially in sub-fields such as nonmonotonic reasoning [8] and agent technology (e.g. [4]). Argumentation is a promising reasoning model based on the interaction of different arguments for and against some statement. This interaction between arguments is typically based on a notion of attack, which can take different forms according to the form the arguments have. For example, when an argument takes the form of a logical proof, arguments for and against a statement can be put across, and in this case the attack relation expresses logical inconsistency. Argumentation can therefore be considered as a reasoning process involving the construction and evaluation of interacting arguments. Several interesting argumentation frameworks have been proposed in the literature (see e.g. [3, 14, 12]). The majority of these systems are based on the abstract argumentation framework of Dung [8], where no assumption is made about the nature of arguments or the properties of the attack relation (i.e. the attack relation can be any binary relation on the set of arguments). Some recent works have proposed argumentation systems (see e.g. [2, 1, 5]) that are based on a defeat relation (corresponding to the attack relation in Dung's framework) that is composed from a conflict relation on the set of arguments and a preference relation between arguments, reflecting the fact that arguments may not have equal strengths. However, till now, relatively little is known about the
1 University of Cyprus, 75 Kallipoleos Str. 1678, Nicosia, Cyprus
2 Paris Descartes University, 45 rue des Saints-Pères, 75270 Paris, France
3 Paul Sabatier University, 118 route de Narbonne, 31062 Toulouse, France
theoretical and computational properties of abstract preference-based argumentation systems. This paper is an attempt towards understanding the effects of a preference relation on an argumentation system. More precisely, it investigates the impact of the preference relation between arguments within a new abstract argumentation framework. The attack relation is the composition of a conflict relation with the preference relation, both defined on the set of arguments. The framework is abstract and general in the sense that the only assumptions made are that the conflict relation is symmetric and irreflexive, and the preference relation is a partial pre-order (i.e. reflexive and transitive). Under these reasonable and general assumptions, we show that the new framework enjoys desirable properties for an argumentation system, such as coherence. It turns out that the preference relation on the arguments translates into a preference relation on the powerset of these arguments. Moreover, the stable extensions of the preference-based argumentation theories correspond to the most preferred sets of arguments that are conflict-free. We also investigate the computational properties of the new framework and demonstrate that a transitive preference relation on the set of arguments can mitigate the computational burden of some reasoning tasks. Indeed, computing a stable extension of a preference-based argumentation theory can be performed in polynomial time. Furthermore, enumerating all stable extensions of such a theory without incomparability between arguments can be carried out with polynomial delay. Moreover, if in addition the theory does not contain indifferent arguments, finding its unique stable extension is also a polynomial computation. On the negative side, some other reasoning tasks are intractable. More specifically, deciding whether an argument is a credulous conclusion of a preference-based argumentation theory is NP-hard, while deciding whether it is a skeptical one is coNP-hard. The paper is organized as follows. We first review the basics of argumentation as introduced in [8]. Then, we present the abstract preference-based argumentation framework we propose, and investigate some of its properties. We then present algorithms for reasoning in the new framework, along with some complexity results. The last section concludes with some remarks and perspectives.
2
Basics of argumentation
Argumentation is a reasoning model based on the following main steps: i) constructing arguments and counter-arguments, ii) defining the strengths of those arguments, and iii) concluding or defining the justified conclusions. Argumentation systems are built around an underlying logical language and an associated notion of logical consequence, defining the notion of argument. The argument construction is a monotonic process: new knowledge cannot rule out an argument
464
Y. Dimopoulos et al. / Theoretical and Computational Properties of Preference-Based Argumentation
but only gives rise to new arguments which may interact with the first argument. Arguments may be conflicting for different reasons.

Definition 1 (Argumentation system [8]) An argumentation system is a pair T = (A, R), where A is a set of arguments and R ⊆ A × A is an attack relation. We say that an argument a attacks an argument b iff (a, b) ∈ R.

Among all the arguments, it is important to know which arguments to keep for inferring conclusions. In [8], different acceptability semantics have been proposed. The basic idea behind these semantics is the following: for a rational agent, an argument ai is acceptable if he can defend ai against all attacks. All the arguments acceptable to a rational agent are gathered in a so-called extension. An extension must satisfy a consistency requirement and must defend all its elements.

Definition 2 (Conflict-free, Defence [8]) Let B ⊆ A, and ai ∈ A.
• B is conflict-free iff there are no ai, aj ∈ B s.t. (ai, aj) ∈ R.
• B defends ai iff ∀aj ∈ A, if (aj, ai) ∈ R, then ∃ak ∈ B s.t. (ak, aj) ∈ R.

The main semantics introduced by Dung are summarized in the following definition.

Definition 3 (Acceptability semantics [8]) Let B be a conflict-free set of arguments.
• B is admissible iff it defends every argument in B.
• B is a preferred extension iff it is a maximal (w.r.t. ⊆) admissible extension.
• B is a stable extension iff it is a preferred extension that attacks every argument in A \ B.

Now that the acceptability semantics are defined, we are ready to define the status of any argument.

Definition 4 (Argument status) Let T = (A, R) be an argumentation system, and E1, . . . , Ex its stable extensions, with x ≥ 1. Let a ∈ A.
• a is a skeptical conclusion of T iff a ∈ Ei for every i = 1, . . . , x.
• a is a credulous conclusion of T iff ∃Ei such that a ∈ Ei.
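For concreteness, these definitions can be checked directly on a finite system; the sketch below is ours, with A and B as Python sets and R as a set of attack pairs, and it uses the standard characterization of stable extensions as conflict-free sets that attack every outside argument.

def conflict_free(B, R):
    return not any((a, b) in R for a in B for b in B)

def defends(B, a, A, R):
    # every attacker of a is in turn attacked by some member of B
    return all(any((c, b) in R for c in B)
               for b in A if (b, a) in R)

def is_admissible(B, A, R):
    return conflict_free(B, R) and all(defends(B, a, A, R) for a in B)

def is_stable(B, A, R):
    # conflict-free and attacking every argument outside B
    return conflict_free(B, R) and all(
        any((a, b) in R for a in B) for b in A - B)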
3
A Preference-based Argumentation Framework
In [1] the basic argumentation framework of Dung was extended into preference-based argumentation theory (PBAT). The basic idea of a PBAT is to consider two binary relations between arguments:
1. A conflict relation, denoted by C, that is based on the logical links between arguments.
2. A preference relation, denoted by ⪰, that captures the idea that some arguments are stronger than others. Indeed, for two arguments a, b ∈ A, a ⪰ b means that a is at least as good as b.
The relation ⪰ is assumed to be a partial pre-order (that is, reflexive and transitive). The relation ≻ denotes the corresponding strict relation: a ≻ b iff a ⪰ b and not b ⪰ a. The two relations are combined into a unique attack relation, denoted by R, and Dung's semantics are applied to the resulting framework. In what follows, we will study a particular class of PBATs, where the conflict relation C is irreflexive and symmetric.
Definition 5 (Preference-based Argumentation Theory (PBAT)) Given an irreflexive and symmetric conflict relation C and a preference relation ⪰ on a set of arguments A, a preference-based argumentation theory (PBAT) on A is an argumentation system T = (A, R), where (a, b) ∈ R iff (a, b) ∈ C and not b ≻ a.

It follows directly from the definition that if (a, b) ∈ C, a ⪰ b and not b ⪰ a, then (a, b) ∈ R. Moreover, if (a, b) ∈ C and a, b are either indifferent or incomparable in ⪰, then (a, b) ∈ R and (b, a) ∈ R. Also note that if (a, b) ∈ C, then either (a, b) ∈ R or (b, a) ∈ R. Finally, if (a, b) ∈ R and (b, a) ∉ R, then a ≻ b. The following example illustrates some features of PBATs.

Example 1 Let A = {a, b, c, d} be a set of arguments, and C the conflict relation on A defined as C = {(a, b), (b, a), (b, c), (c, b), (c, d), (d, c)}. Moreover, let the preference relation ⪰ contain the transitive closure of the set of pairs a ⪰ b, b ⪰ c, c ⪰ d, and d ⪰ c. The corresponding PBAT is T = (A, R), where R = {(a, b), (b, c), (c, d), (d, c)}. Theory T has two stable extensions, E1 = {a, c} and E2 = {a, d}.

We note here that, although it seems that combining the conflict and preference relations can be done in many different ways other than the one proposed in Definition 5, all of these combinations lead to counterintuitive results and properties. A detailed analysis of these possibilities will appear in an extended version of this paper.
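A small sketch (ours) of Definition 5 and Example 1 follows, where the pre-order is represented as the set of pairs (x, y) with x ⪰ y; the closure helper, which adds reflexivity and transitivity, is our own.

def closure(pairs, elems):
    geq = set(pairs) | {(x, x) for x in elems}
    changed = True
    while changed:
        changed = False
        for (x, y) in list(geq):
            for (u, v) in list(geq):
                if y == u and (x, v) not in geq:
                    geq.add((x, v))
                    changed = True
    return geq

def strictly_preferred(x, y, geq):
    return (x, y) in geq and (y, x) not in geq          # x ≻ y

def build_attacks(C, geq):
    # Definition 5: a attacks b iff they conflict and b is not strictly better
    return {(a, b) for (a, b) in C if not strictly_preferred(b, a, geq)}

A = {"a", "b", "c", "d"}
geq = closure({("a", "b"), ("b", "c"), ("c", "d"), ("d", "c")}, A)
C = {("a", "b"), ("b", "a"), ("b", "c"), ("c", "b"), ("c", "d"), ("d", "c")}
print(build_attacks(C, geq))   # {('a','b'), ('b','c'), ('c','d'), ('d','c')}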
4
Basic Properties of PBATs
In this section we present some basic properties of PBATs. To facilitate the discussion and the presentation of the results of this section, as well as those of other parts in the remainder of this paper, we use some basic notions from graph theory. Indeed, as with every binary relation on a set, an argumentation system T is associated with a directed graph (digraph) GT whose nodes are the different arguments, and whose edges represent the attack relation defined on them. The identification of graph-theoretical structures has led to useful results regarding the properties of argumentation systems (e.g. [9]). Let G = (N, E) be a digraph and n ∈ N a node of G. The in-degree of n in G is the number of nodes n′ of G such that (n′, n) ∈ E. A (strongly connected) component C of a digraph G is a maximal subgraph C of G such that for every pair of nodes x, y ∈ C, there is a path from x to y in C. If each component of a digraph G is contracted to a single node, the resulting graph is a directed acyclic one, called the components graph of G. A top component of a digraph G is one that has in-degree 0 in the components graph of G. Our first result characterizes the cycles of the graph of a PBAT.

Proposition 1 Let GT be the graph associated with a PBAT T = (A, R). Every cycle of GT has at least two symmetric edges.

Proof We prove by case analysis that a cycle of GT can have neither zero nor one symmetric edges. Let a1, a2, . . . , an be a cycle of GT. This means that ∀i < n, (ai, ai+1) ∈ R and (an, a1) ∈ R. Let us assume that this cycle has no symmetric edges, i.e. ∀i < n, (ai+1, ai) ∉ R and (a1, an) ∉ R. Since ∀i < n, (ai, ai+1) ∈ R and (ai+1, ai) ∉ R, it holds that ∀i < n, ai ≻ ai+1. By transitivity, a1 ≻ an, meaning (a1, an) ∈ R, a contradiction. Assume now that a1, a2, . . . , an is a cycle of GT such that (an, a1) is the only symmetric edge of the cycle. Assume first that the two arguments an, a1 are incomparable w.r.t. the underlying preference relation ⪰. The transitivity of the preference relation requires that
a1 ⪰ an, which contradicts the incomparability of the two arguments. Assume now that a1 ⪰ an and an ⪰ a1. Since an ⪰ a1 and a1 ⪰ a2, by transitivity an ⪰ a2. On the other hand, we have a2 ⪰ a3, . . ., an−1 ⪰ an, and by transitivity a2 ⪰ an. Hence the cycle must also contain a symmetric edge between a2 and an. Therefore every cycle of GT has at least two symmetric edges.

Doutre [6] has shown that the kernels of the associated graph of an argumentation theory correspond exactly to its stable extensions. A kernel of a directed graph G = (N, E) is a set of nodes K ⊆ N such that (a) K is an independent set, that is, there is no pair of nodes ni, nj ∈ K s.t. (ni, nj) ∈ E or (nj, ni) ∈ E, and (b) for all n ∈ N \ K there is a node n′ ∈ K s.t. (n′, n) ∈ E. Moreover, Duchet [7] proved that every digraph whose cycles all have at least two symmetric edges has a kernel. By combining these two results we obtain the following theorem.

Theorem 1 Every PBAT has a stable extension.

We show now that the graph associated with a PBAT has no elementary cycles of length greater than 2. The notion of elementary cycle is defined as follows.

Definition 6 (Elementary cycle) Let T = (A, R) be a PBAT and X = {a1, . . ., an} be a set of arguments of A. X is an elementary cycle of T iff:
1. ∀i ≤ n − 1, (ai, ai+1) ∈ R and (an, a1) ∈ R;
2. there is no X′ ⊂ X such that X′ satisfies condition 1.

Proposition 2 Let T = (A, R) be a PBAT on an underlying pre-order ⪰. Then, R has no elementary cycle of length greater than 2.

Proof Let a1, . . . , an be arguments of A, with n > 2, and assume that they form an elementary cycle, i.e. ∀i < n, (ai, ai+1) ∈ R, and (an, a1) ∈ R. Since the cycle is elementary, there are no ai, ai+1 such that (ai, ai+1) ∈ R and (ai+1, ai) ∈ R. Thus, ai ≻ ai+1, ∀i < n. Therefore, a1 ≻ a2 ≻ . . . ≻ an ≻ a1, a contradiction.

A direct consequence of the above property is that PBATs have no elementary odd-length cycles. By the results of [10], this implies that PBATs are coherent, i.e., their preferred and stable extensions coincide.

Theorem 2 Every PBAT is coherent.

In the remainder of this section we investigate the impact of the preference relation on an argumentation system. We first define a relation on the powerset of the arguments of a PBAT T = (A, R) (we denote by P(A) the powerset of A), and then show that the stable extensions of T correspond to the most preferred elements of P(A) w.r.t. this relation.

Definition 7 Let T = (A, R) be a PBAT built on an underlying pre-order ⪰. If A1, A2 ∈ P(A), with A1 ≠ A2, then A1 ⊒ A2 iff one of the following holds:
• A1 ⊃ A2;
• for all a, b such that a ∈ A1 \ A2 and b ∈ A2 \ A1, it holds that a ≻ b.

The following result states the relation between ⊒ and stable extensions, and hence sheds some light on the connection between preference and argumentation.
Theorem 3 Let T = (A, R) be a PBAT built on an underlying pre-order ⪰ and a conflict relation C. E is a stable extension of T iff there are no arguments a, b ∈ E s.t. (a, b) ∈ C, and for all A ∈ P(A) such that A ⊒ E, there are a1, a2 ∈ A such that (a1, a2) ∈ C.

Proof Let E be a stable extension of T. Then, by definition, it contains no pair of arguments a, b s.t. (a, b) ∈ R. Hence, E cannot contain arguments a, b s.t. (a, b) ∈ C. We prove by case analysis that for all A ∈ P(A) such that A ⊒ E there exists a pair of arguments a1, a2 ∈ A s.t. (a1, a2) ∈ C. Assume first a set A with A ⊃ E. Since E is a stable extension, for all a ∈ A \ E there is b ∈ E, and because A ⊃ E also b ∈ A, s.t. (b, a) ∈ R. Therefore there exist a, b ∈ A s.t. (a, b) ∈ C. Assume now that A ⊒ E and A ⊅ E. Again, for all a ∈ A \ E there is b ∈ E s.t. (b, a) ∈ R. Since A ⊒ E, by Definition 7 it follows that for all a ∈ A \ E and c ∈ E \ A, it holds that a ≻ c and hence (c, a) ∉ R. Therefore, it must be the case that b ∈ E ∩ A, which means that A contains a pair a, b such that (b, a) ∈ R, and therefore (a, b) ∈ C.

Let now E be a set of arguments that contains no pair of elements a, b s.t. (a, b) ∈ C, and such that for all A ∈ P(A) with A ⊒ E, there are a1, a2 ∈ A such that (a1, a2) ∈ C. We prove that E is a stable extension. We show first that E is admissible. Observe that since E contains no pair of elements a, b s.t. (a, b) ∈ C, it cannot contain a pair a, b s.t. (a, b) ∈ R. Assume that there exist a ∈ E and b ∈ A \ E s.t. (b, a) ∈ R and there is no c ∈ E such that (c, b) ∈ R. Hence b ≻ a. Then define D(b) = {d | (b, d) ∈ R and d ∈ E}, and construct the set E′ = (E \ D(b)) ∪ {b}. Then it is the case that E′ ⊒ E, and furthermore there is no pair a1, a2 ∈ E′ such that (a1, a2) ∈ R, and therefore no pair such that (a1, a2) ∈ C, a contradiction. Assume now that there exists b ∈ A \ E s.t. for all a ∈ E it holds that (a, b) ∉ R. Clearly, (b, a) ∉ R, because otherwise E is not admissible. Then again, E ∪ {b} ⊒ E, and furthermore there is no pair a1, a2 ∈ E ∪ {b} such that (a1, a2) ∈ C, a contradiction.

The example below highlights the link between the relation ⊒ and the stable extensions.

Example 2 Let T = (A, R) be a PBAT with A = {a, b, c} and R composed from the conflict relation C = {(a, b), (b, a), (a, c), (c, a)} and a preference relation that contains the pairs a ⪰ b and a ⪰ c. The relation ⊒ on P(A) induced by ⪰ is depicted in Figure 1. Since the sets {a, b, c}, {a, b}, {a, c} are ruled out by C, the set E = {a} is the stable extension of T.
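Definition 7 translates directly into code; the sketch below (ours) ranks two sets of arguments under ⊒, with the pre-order again given as the set of pairs (x, y) such that x ⪰ y.

def ranks_above(A1, A2, geq):
    """True iff A1 ⊒ A2 in the sense of Definition 7."""
    strict = lambda x, y: (x, y) in geq and (y, x) not in geq   # x ≻ y
    if A1 == A2:
        return False
    if A1 > A2:                      # A1 is a proper superset of A2
        return True
    return all(strict(a, b) for a in A1 - A2 for b in A2 - A1)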
5
Reasoning in PBATs
This section contains a preliminary investigation of the computational properties of the new argumentation framework. We start by presenting the algorithm stable_extension below, which computes a stable extension of a PBAT in polynomial time. Recall that finding a stable extension of a general argumentation system is an intractable task (see e.g. [9]).

stable_extension(A, R)
  A′ = A; E = ∅
  While (A′ ≠ ∅) do
    Compute a top component C of the theory (A′, R)
    Select a node n ∈ C such that for all n′ ∈ A′ with (n′, n) ∈ R it holds that (n, n′) ∈ R
    E = E ∪ {n}
    A′ = A′ − ({n} ∪ {n′ | (n, n′) ∈ R})
  end do
  Return E
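A Python rendering of this algorithm is sketched below; using networkx for the strongly connected components and the components graph is our own choice for brevity, not part of the paper.

import networkx as nx

def stable_extension(A, R):
    G = nx.DiGraph()
    G.add_nodes_from(A)
    G.add_edges_from(R)
    E = set()
    while G.number_of_nodes() > 0:
        comps = nx.condensation(G)                     # components graph (a DAG)
        top = next(c for c in comps.nodes
                   if comps.in_degree(c) == 0)         # a top component
        members = comps.nodes[top]["members"]
        # pick n whose every remaining attacker is counter-attacked
        # (Proposition 1 guarantees such a node exists in a top component)
        n = next(x for x in members
                 if all(G.has_edge(x, y) for y in G.predecessors(x)))
        E.add(n)
        G.remove_nodes_from({n} | set(G.successors(n)))
    return E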
Figure 1. Ranking relation of Example 2, where an edge from A to B means that A ⊒ B.
Notice that by construction the set E returned by the above algorithm does not contain two elements x, y such that (x, y) ∈ R. Moreover, again by construction, for each element x ∈ A that is not included in E, there must be some element y ∈ E such that (y, x) ∈ R. Therefore, the set E returned by the algorithm is a stable extension of the input theory (A, R). The key point of the stable_extension algorithm is that at each iteration it finds a node n in a top component of the input theory such that for all n′ ∈ A′ for which (n′, n) ∈ R, it holds that (n, n′) ∈ R. An informal justification of the existence of such elements is the following. Assume that the algorithm reaches a point where there is a top component C of the theory that contains no node with the above property. This means that for every node n ∈ C there exists some other node n′ ∈ C such that (n′, n) ∈ R and (n, n′) ∉ R. Remove from C all symmetric edges (the edge (x, y) ∈ R is symmetric if (y, x) ∈ R also holds). Then, in the resulting graph all nodes of C must have an incoming edge, which means that C contains a cycle with no symmetric edges, contradicting Proposition 1. Although computing a stable extension of a PBAT can be performed in polynomial time, we prove below that credulous and skeptical reasoning in the new framework are intractable.

Theorem 4 Let T = (A, R) be a PBAT and a ∈ A. Deciding whether a is a credulous conclusion of T is NP-hard.

Proof We prove the claim by a reduction from 3SAT. Let S = {c1, . . ., cn} be a 3SAT theory on a set of clauses c1, . . ., cn. From S we construct a PBAT ST = (A, R). The set of arguments A of ST contains the following elements:
• An argument li for each literal li that appears in S.
• An argument cj for each clause cj of S, 1 ≤ j ≤ n.
• An additional argument t that corresponds to the whole theory S.
The underlying conflict relation C of ST contains the following (symmetric) pairs:
• (li, ¬li), for each argument li that corresponds to a literal li of S;
• (li, cj), if literal li appears in clause cj;
• (ci, t), for 1 ≤ i ≤ n.
Finally, the underlying preference relation ⪰ of ST is defined as ⪰ = {(a, b) | a, b ∈ A, a ≠ b} − {(t, ci) | ci is the argument that corresponds to clause ci}; that is, each argument that corresponds to a clause is preferred to the argument that corresponds to the theory, whereas all other arguments are indifferent to each other. Therefore, R coincides with its underlying conflict relation, with the only difference that it does not contain the pairs (t, ci), for 1 ≤ i ≤ n. We now prove that S is satisfiable iff ST has a stable (admissible) extension that contains argument t. Let M be a satisfying truth assignment of S. We show that the set of arguments E = M ∪ {t} is an extension of ST. First note that for any pair of arguments ai, aj ∈ E, it holds that (ai, aj) ∉ R. Furthermore, for each ci ∈ A that corresponds to a clause of S, there must be some argument lj ∈ E that corresponds to some literal of S such that (lj, ci) ∈ R (otherwise M is not satisfying). Therefore, E is a stable extension of ST. Let now E be a stable extension of ST such that t ∈ E. We prove that the assignment that corresponds to the arguments of E is a satisfying one for S. This assignment does not contain any pair of complementary literals, because these pairs of literals belong to R. Furthermore, since t ∈ E, it must be the case that ci ∉ E for 1 ≤ i ≤ n. Therefore, for each clause ci of S at least one of its literals must belong to E, and hence the assignment that corresponds to E is satisfying.

Proposition 3 Let T = (A, R) be a PBAT and a ∈ A. Deciding whether a is a skeptical conclusion of T is coNP-hard.

Proof Given a propositional theory S, we construct a PBAT TS = (A, R) in a way similar to that of the previous proof, with the difference that A contains an additional argument t′ such that (t, t′) ∈ C, (t′, t) ∈ C, and t ⪰ t′, t′ ⪰ t. It is not difficult to prove that t′ is a skeptical conclusion of TS iff S is unsatisfiable.
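The construction of the Theorem 4 proof can be sketched as follows; the encoding of literals as signed integers and of arguments as tagged tuples is ours, chosen only for illustration.

def pbat_from_3sat(clauses):
    """clauses: list of tuples of nonzero ints, negation written as -x."""
    lits = {l for c in clauses for l in c} | {-l for c in clauses for l in c}
    A = ({("lit", l) for l in lits}
         | {("cl", j) for j in range(len(clauses))}
         | {("t",)})
    C = set()
    for l in lits:                                   # complementary literals conflict
        C |= {(("lit", l), ("lit", -l)), (("lit", -l), ("lit", l))}
    for j, clause in enumerate(clauses):
        for l in clause:                             # literal vs clause it appears in
            C |= {(("lit", l), ("cl", j)), (("cl", j), ("lit", l))}
        C |= {(("cl", j), ("t",)), (("t",), ("cl", j))}
    # all arguments indifferent except each clause argument preferred to t,
    # so R is C without the pairs (t, c_j)
    R = C - {(("t",), ("cl", j)) for j in range(len(clauses))}
    return A, R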
6
Theories without incomparability
In this section we turn our attention to PBATs without incomparability, i.e. theories T = (A, R) such that for each pair of arguments ai, aj ∈ A, either ai ⪰ aj or aj ⪰ ai. More specifically, we present an algorithm that enumerates all stable extensions of a theory in this class with polynomial delay. An algorithm that enumerates the elements of a set S is said to run with polynomial delay if it computes the first element of the set within time polynomial in the size of the input, and furthermore the time taken by the algorithm between computing two consecutive elements of the set is also bounded by some polynomial in the size of the input. The key property of PBATs without incomparability that is exploited by the stable extension computation algorithm is that the strongly connected components of the graph GT of such a theory T contain only symmetric edges, and therefore these components are essentially undirected (sub)graphs. This useful property is proved in the following result.

Proposition 4 Let T = (A, R) be a PBAT without incomparability, and GT its associated digraph. If a, b ∈ A are arguments that belong to the same component of GT and (a, b) ∈ R, then (b, a) ∈ R.

Proof Let a, b ∈ A be arguments that belong to the same component of GT with (a, b) ∈ R. Therefore (b, a) ∈ C, and a ⪰ b. Since
a, b belong to the same component, there must be a path from b to a. Since there is no incomparability, by transitivity we get that b ⪰ a. From this and the fact that (b, a) ∈ C, we conclude that (b, a) ∈ R.

The kernels (recall that kernels correspond to stable extensions) of a graph that contains only symmetric edges are exactly its maximal (w.r.t. set inclusion) independent sets (MISs). To see this, note that it follows from the definition that every kernel is an MIS. On the other hand, since in this case all edges are symmetric, an MIS is also a kernel. This connection between stable extensions, kernels and MISs allows us to employ well-known procedures that enumerate all maximal independent sets of a graph with polynomial delay [11]. The algorithm all_stable_extensions, presented below, enumerates the stable extensions of the input theory by traversing the theory from its top components downwards. Singleton components are handled separately by the first iteration of the algorithm. To enumerate the elements that belong to stable extensions and at the same time to components with more than one node, the algorithm uses a procedure that performs MIS computation with polynomial delay.

all_stable_extensions(A, R)
  A′ = A; E = ∅
  While (A′ ≠ ∅) do
    While (A′ has nodes with in-degree 0) do
      E = E ∪ {a | a ∈ A′ and a has in-degree 0}
      A′ = A′ − (E ∪ {a′ | a ∈ E and (a, a′) ∈ R})
    end do
    Select a top component C of (A′, R)
    For each MIS M of C, computed with polynomial delay, do
      E = E ∪ M
      A′ = A′ − (M ∪ {a′ | a ∈ M and (a, a′) ∈ R})
      call stable_extension(A′, R)
    end do
  end do
  Return E

It is known [13] that the number of MISs of a graph with n nodes is at most 3^(n/3). Therefore, if a PBAT has m components, each of which has at least 2 and at most k nodes, then the theory has at most 3^(mk/3) stable extensions. Hence, the run time of the algorithm is exponential in mk. For "small" values of m and k, the above algorithm can also be used to perform credulous and skeptical reasoning. The idea is to simply enumerate all stable extensions of the input theory, and terminate as soon as the given argument belongs (credulous reasoning) or does not belong (skeptical reasoning) to one of the stable extensions. Consider now a PBAT T = (A, R) where the underlying preference relation ⪰ contains neither incomparability nor indifference. Then, for all pairs of arguments ai, aj ∈ A, either ai ≻ aj or aj ≻ ai holds, but not both. In this case the graph of T is acyclic and T has exactly one stable extension. The first iteration of the algorithm all_stable_extensions above computes this unique stable extension in polynomial time. Obviously, the same procedure can be used for credulous and skeptical reasoning in this restricted class of PBATs.
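For small components, the MIS computation can be illustrated with the naive enumerator below (ours, with edges given as symmetric pairs); the paper instead relies on the polynomial-delay procedure of [11], which this exponential sketch does not reproduce.

from itertools import combinations

def maximal_independent_sets(nodes, edges):
    nodes = list(nodes)
    def independent(S):
        return not any((u, v) in edges for u in S for v in S)
    ind = [set(s) for r in range(len(nodes) + 1)
           for s in combinations(nodes, r) if independent(s)]
    # keep only the sets with no proper independent superset
    return [S for S in ind if not any(S < T for T in ind)]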
7
Conclusion and Future Work
In this paper we presented an abstract preference-based argumentation framework. Although other works in the literature (see e.g. [2, 1, 5]) have also acknowledged the importance of incorporating preferences in argumentation systems, very little has been said about the theoretical and computational properties of such systems. This paper is a step towards filling this gap, proposing a new preference-based argumentation framework and studying its basic properties. We have shown that the theories of the new framework always have stable extensions and are coherent. We also characterized the structure of preference-based argumentation theories, extending previous works that link argumentation and graph theory (see e.g. [9] for a recent example). Moreover, it seems that the transitivity of the underlying preference relation imposes a strong structure on preference-based argumentation theories that can be exploited computationally. Indeed, some computational problems become easier in the new framework, whereas others remain intractable. There are many directions for future research. We plan to investigate more deeply the structural properties of PBATs and further extend the link with graph theory. Moreover, we intend to study the properties of the relation ⊒ and identify its effects on argumentation. Finally, the computational properties of the new framework will be explored more fully in the future.
ACKNOWLEDGEMENTS We thank one of the reviewers for many helpful comments.
REFERENCES
[1] L. Amgoud and C. Cayrol, ‘On the acceptability of arguments in preference-based argumentation framework’, in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 1–7, (1998).
[2] L. Amgoud, Y. Dimopoulos, and P. Moraitis, ‘A unified and general framework for argumentation-based negotiation’, in Proc. 7th International Joint Conference on Autonomous Agents and Multi-Agent Systems, pp. 963–970, ACM Press, (2007).
[3] L. Amgoud and H. Prade, ‘Explaining qualitative decision under uncertainty by argumentation’, in 21st National Conference on Artificial Intelligence, AAAI’06, pp. 16–20, (2006).
[4] T. Bench-Capon and P. Dunne, ‘Argumentation in artificial intelligence’, Artif. Intell., 171(10-15), 619–641, (2007).
[5] T. J. M. Bench-Capon, ‘Persuasion in practical argument using value-based argumentation frameworks’, Journal of Logic and Computation, 13(3), 429–448, (2003).
[6] S. Doutre, Autour de la sémantique préférée des systèmes d’argumentation, PhD thesis, Université Paul Sabatier, Toulouse, France, 2002.
[7] P. Duchet, Représentations, Noyaux en Théorie des Graphes et Hypergraphes, PhD thesis, 1979.
[8] P. M. Dung, ‘On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games’, Artificial Intelligence, 77, 321–357, (1995).
[9] P. Dunne, ‘Computational properties of argument systems satisfying graph-theoretic constraints’, Artif. Intell., 171(10-15), 701–729, (2007).
[10] P. Dunne and T. Bench-Capon, ‘Coherence in finite argument systems’, Artificial Intelligence, 141(1–2), 187–203, (2002).
[11] D. S. Johnson, C. H. Papadimitriou, and M. Yannakakis, ‘On generating all maximal independent sets’, Inf. Process. Lett., 27(3), 119–123, (1988).
[12] A. Kakas and P. Moraitis, ‘Argumentation based decision making for autonomous agents’, in Proc. 2nd International Joint Conference on Autonomous Agents and Multi-Agent Systems, pp. 883–890, (2003).
[13] J. Moon and L. Moser, ‘On cliques in graphs’, Israel Journal of Mathematics, 3, 23–28, (1965).
[14] H. Prakken and G. Sartor, ‘Argument-based extended logic programming with defeasible priorities’, Journal of Applied Non-Classical Logics, 7, 25–75, (1997).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-468
Norm Defeasibility in an Institutional Normative Framework Henrique Lopes Cardoso and Eugénio Oliveira1 Abstract. Normative environments have been proposed to regulate agent interaction in open multi-agent systems. However, most approaches rely on pre-imposed regulations that agents are subject to. Taking a different stance, we focus on a normative framework that assists agents in establishing their own commitment norms by themselves. With that aim in mind, a model of norm defeasibility is presented that enables exploiting and adapting a normative background to different extents. We formalize the normative state using first-order logic and define rules and norms operating on that state. A suitable semantics regarding the use of norms within a hierarchical context structure is given, based on norm activation conflict and defeasibility.
1
INTRODUCTION
Most approaches regarding the use of norms in multi-agent systems (MAS) have addressed one of two ends of a spectrum. On one end, there are systems where norms are pre-imposed on agents, either with no possible deviation [5] or admitting violations [4, 1]. On the other end, the emergence of social norms from agent interaction is also being addressed [19]. In this paper we consider a midway approach to the use of norms in MAS, where norms are consciously adopted by a group of agents. The Electronic Institution (EI) concept has also been studied with this aim in mind [14, 3]. In particular, in [13] a normative framework has been suggested as a core component of an EI. In the present paper, a normative environment assisting agent-based automated contract establishment is formalized. Agents can exploit a supportive normative framework in order to establish their mutual contracts in a more straightforward fashion: contracts [13] can be underspecified, relying on a structured normative environment that fills in any omissions. We define the notion of normative context and context hierarchies, characterize the normative state, and give a representation of norms. We then formalize normative conflicts in our approach and their resolution based on norm activation defeasibility. From the field of law, three normative conflict resolution principles have been defined and traditionally used. The lex superior is a hierarchical criterion and indicates that a norm issued by a more important legal entity prevails when in conflict with another norm (e.g. the Constitution prevails over any other legal body). The lex posterior is a chronological criterion indicating that the most recent norm prevails. The lex specialis is a specificity criterion establishing that the most specific norm prevails. While not firmly adopting any of these options, our approach resembles most the lex specialis principle, because, broadly speaking, a norm defined at a more specific context will typically prevail.
LIACC, DEI / Faculdade de Engenharia, Universidade do Porto, R. Dr. Roberto Frias, 4200-465 Porto, Portugal, email: {hlc, eco}@fe.up.pt
The paper is organized as follows. Section 2 deals with the normative environment, defines the notion of context and sub-context, describes the normative state and gives a representation for rules and norms. Section 3 is devoted to norm semantics and to the norm activation defeasibility approach. In Section 4 we provide some examples that exploit the usage of the normative environment. Finally, Section 5 discusses related work and Section 6 concludes.
2 AN INSTITUTIONAL NORMATIVE ENVIRONMENT
In this section we define the normative environment and present a context-based normative framework. This framework forms the basis for the norm defeasibility model described in Section 3.

Def. 1: Normative Environment NE = ⟨NS, IR, N⟩
The normative environment NE of an EI is composed of a normative state NS, a set IR of institutional rules (see Def. 7) that manipulate that normative state, and a set N of norms, which can be seen as a special kind of rules (see Def. 8).

The role of institutional rules is to maintain the normative state of the system. While norms define the normative positions of each agent, the main purpose of those rules is to relate the normative state with the standing normative positions (see [12] for the use of rules in monitoring those normative positions).
2.1 Contexts
Our model is based on a contextualization of both the normative state and norms. In this subsection we introduce the notion of context and context organization.

Def. 2: Context C = ⟨PC, CA, CI, CN⟩
A context C is an organizational structure within which a set CA of agents commits to a joint activity partially regulated by a set CN ⊆ N of appropriate norms. A context includes a set CI of contextual info that makes up a kind of background knowledge for that context (see Def. 4). PC is the parent context within which context C is formed. Let PCA be the set of agents in context PC: we have that CA ⊆ PCA.

Contexts allow us to organize norms according to a hierarchical normative structure. Norm set N is partitioned among the several contexts that may exist, that is, the sets CN of the different contexts are mutually disjoint. A norm inheritance mechanism (explained later) justifies why set CN only partially regulates the activity of agents in CA. We identify a top-level context within which all other contexts are (directly or indirectly) formed. We now introduce the notion of sub-context.
Def. 3: Sub-context C′ ⊏ C
Context C′ = ⟨PC′, CA′, CI′, CN′⟩ is a sub-context of context C = ⟨PC, CA, CI, CN⟩, denoted C′ ⊏ C, if PC′ = C or if PC′ ⊏ C. When C′ is either a sub-context of C or C itself, we write C′ ⊑ C. From Def. 2 we also have that CA′ ⊆ CA.

A sub-context defines an organizational structure committed to by a subset of the parent context's agents. Notice that the sub-context relationship is an explicit one. Every context is a sub-context of the top context. We now define contextual information as a foundational component of a context.

Def. 4: Contextual info Info^C
Contextual info Info^C is a fully-grounded atomic formula in first-order logic, which comprises founding information regarding a context C = ⟨PC, CA, CI, CN⟩. Info^C ∈ CI.

The CI component in a context definition is therefore composed of first-order logic formulae that provide background information for that context.
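To make Defs. 2 and 3 concrete, here is a minimal sketch in Python (ours, not from the paper); the class and function names are our own illustration of the tuple ⟨PC, CA, CI, CN⟩ and of the sub-context tests.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Context:
    """A context C = <PC, CA, CI, CN> (Def. 2); parent is None for the top context."""
    name: str
    parent: "Context | None" = None       # PC
    agents: frozenset = frozenset()       # CA
    info: frozenset = frozenset()         # CI: fully-grounded atomic formulae
    norms: tuple = ()                     # CN

def is_subcontext(c1: Context, c2: Context) -> bool:
    """C1 strict sub-context of C2 (Def. 3): follow parent links upwards."""
    ctx = c1.parent
    while ctx is not None:
        if ctx is c2:
            return True
        ctx = ctx.parent
    return False

def subcontext_or_self(c1: Context, c2: Context) -> bool:
    return c1 is c2 or is_subcontext(c1, c2)

top = Context("top")
sa3 = Context("sa3:sa", parent=top, agents=frozenset({"jim", "sam", "tom"}))
assert is_subcontext(sa3, top) and subcontext_or_self(sa3, sa3)
```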
2.2 Normative State
The normative state is organized through contexts, and concerns the description of what is taken for granted in a model of institutional reality. Therefore, we call every formula in NS an institutional reality element, or IRE. Each institutional reality element refers to a specific context within which it is relevant.

Def. 5: Contextual institutional reality element IRE^C
A contextual institutional reality element IRE^C is an IRE regarding context C. We distinguish the following kinds of IRE^C, with the following meanings:
• ifact^C(f, t): institutional fact f has occurred at time t
• time^C(t): instant t has elapsed
• obl^C(a, f, t): agent a is obliged to bring about fact f until deadline t
• fulf^C(a, f, t): agent a has fulfilled, at time t, his obligation to bring about f
• viol^C(a, f, t): agent a has violated, at time t, his obligation to bring about f

Note that the use of context C as a superscript is only a syntactical convenience – both contextual info and institutional reality elements are first-order formulae (C could be used as the first argument of each of these formulae). While contextual info is confined to background information that is part of the context definition, contextual institutional reality elements represent occurrences taking place after the context's creation, during its lifetime.

We consider institutional facts as agent-originated, since they are obtained as a consequence of some agent action. The remaining elements are environment events, asserted in the process of norm activation and monitoring [13]. Our model of institutional reality is based on a discrete model of time. The time elements are used to signal instants that are relevant to the context at hand. Obligations are deontic statements, and we admit both their fulfillment and violation.

Def. 6: Normative State NS = {IRE1^C1, IRE2^C2, ..., IREn^Cm}
The normative state NS is a set of fully-grounded atomic formulae IREi^Cj, 1 ≤ i ≤ n, in first-order logic.
The normative state will contain, at each moment, all elements that characterize the current state of affairs in every context. In that sense, NS could be seen as being partitioned among the several contexts, as is the case with norms; however, IRE's are not part of a context's definition, since they are obtained at a later stage, during the context's operation. Some of the IRE's are interrelated: for instance, a fulfillment connects an obligation to bring about a fact with its achievement as an institutional fact. These interrelations are captured with institutional rules.
2.3 Rules and Norms
Given the "contextualization" of the normative state, we are now able to define rules and norms. Institutional rules allow us to maintain the normative state of the system. They are not contextualized, but they operate on contextual IRE's.

Def. 7: Institutional rule R ::= Antecedent → Consequent
An institutional rule R defines, for a given set of conditions, what other elements should be added to the normative state. The rule's Antecedent is a conjunction of patterns of IRE^C (see Def. 5), which may contain variables; restrictions may be imposed on such variables through relational conditions. We also allow the use of negation (as failure):
Antecedent ::= IRE^C | ¬Antecedent | RelCondition | Antecedent ∧ Antecedent
The rule's Consequent is a conjunction of IRE^C which are not deontic statements (IRE^–C), and which are allowed to contain bounded variables:
Consequent ::= IRE^–C | IRE^–C ∧ Consequent

When the antecedent matches the normative state using a first-order logic substitution Θ, and all the relational conditions over variables hold, the atomic formulae obtained by applying Θ to the consequent of the rule are added to the normative state as fully-grounded elements. Besides institutional reality elements, the norms themselves are also contextual.
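To illustrate Def. 7, here is a sketch (ours, under our own encoding, not the authors' implementation): IREs are nested tuples, strings beginning with an upper-case letter act as variables, and a purely conjunctive rule is applied by matching its antecedent against the normative state; negation as failure and relational conditions are left out.

```python
def is_var(t):
    """Variables are strings starting with an upper-case letter, as in the paper's examples."""
    return isinstance(t, str) and t[:1].isupper()

def unify(pattern, fact, theta):
    """Extend substitution theta so that pattern matches fact, or return None."""
    if is_var(pattern):
        if pattern in theta:
            return theta if theta[pattern] == fact else None
        return {**theta, pattern: fact}
    if isinstance(pattern, tuple):
        if not isinstance(fact, tuple) or len(pattern) != len(fact):
            return None
        for p, f in zip(pattern, fact):
            theta = unify(p, f, theta)
            if theta is None:
                return None
        return theta
    return theta if pattern == fact else None

def substitute(pattern, theta):
    """Apply a substitution, yielding a fully-grounded element."""
    if is_var(pattern):
        return theta[pattern]
    if isinstance(pattern, tuple):
        return tuple(substitute(p, theta) for p in pattern)
    return pattern

def apply_rule(antecedent, consequent, ns):
    """One forward-chaining step: add the grounded consequents of every antecedent match."""
    matches = [{}]
    for pat in antecedent:
        matches = [t2 for t in matches for f in ns
                   for t2 in [unify(pat, f, t)] if t2 is not None]
    return ns | {substitute(c, t) for t in matches for c in consequent}

ns = {("ifact", ("order", "tom", "r1", 5, "jim"), 1)}
ant = [("ifact", ("order", "A1", "Res", "Qt", "A2"), "T")]
con = [("time-of-order", "A2", "T")]   # a made-up, non-deontic consequent
print(apply_rule(ant, con, ns))        # adds ("time-of-order", "jim", 1)
```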
Def. 8: Norm N^C ::= Situation^C′ → Prescription^C′
A norm N^C is a rule with a deontic consequent, defined in a specific context C. The norm is applicable to a context C′ ⊑ C. The norm's Situation^C′ is a conjunction of patterns of Info^C′ and IRE^–C′ (no deontic statements). Both kinds of patterns are allowed to contain variables; restrictions may be imposed on such variables through relational conditions:
Situation^C′ ::= Info^C′ | IRE^–C′ | RelCondition | Situation^C′ ∧ Situation^C′
The norm's Prescription^C′ is a (possibly empty) conjunction of deontic statements (obligations) which are allowed to contain bounded variables and are assigned to the same context C′:
Prescription^C′ ::= ε | OblConj^C′
OblConj^C′ ::= obl^C′(...) ∧ OblConj^C′ | obl^C′(...)
Conceptually, the norm's Situation^C′ can be seen as being based on two sets of elements: background (Sb) and contingent (Sc). Background elements are those that exist at context creation (the founding contextual info), while contingent elements are those that are added to the normative state at a later stage. This distinction will be helpful when describing norm semantics.
Observe the distinction between the context where the norm is defined, and the context to which the norm applies. While, in order to make the model as simple as we can, we define a norm as being applicable to a specific context, in Section 3.1 we relax this assumption, which will in part clarify the usefulness of the model.
3 NORM SEMANTICS
In this section we define the semantics of norms and formalize a model for norm defeasibility in the ambit of a supportive normative framework. We start by exploring norm applicability according to the normative state. For that, we make use of the notion of substitution in first-order logic. We denote by f·Θ the result of applying substitution Θ to atomic formula f.

Def. 9: Norm activation
Norm N^C = S^C → P^C, applicable to a context C′ = ⟨PC′, CA′, CI′, CN′⟩, is activated if there is a substitution Θ such that:
• ∀c ∈ Sc: c·Θ ∈ NS, where Sc is the set of contingent conjuncts (IRE^–C′ patterns) in S^C; and
• ∀b ∈ Sb: b·Θ ∈ CI′, where Sb is the set of background conjuncts (Info^C′ patterns) in S^C; and
• all the relational conditions in S^C over variables hold.

We are now able to define the notion of conflicting norm activations, as follows.

Def. 10: Norm activation conflict
Let Act1 be the activation of norm N1^C1 = S1^C1 → P1^C1 obtained with substitution Θ1, and Act2 the activation of norm N2^C2 = S2^C2 → P2^C2 obtained with substitution Θ2. Let NS1 = {c·Θ1 | c ∈ Sc1} and NS2 = {c·Θ2 | c ∈ Sc2}, where Sc1 and Sc2 are the sets of contingent conjuncts of S1^C1 and S2^C2, respectively. Both NS1 and NS2 represent fractions of the whole normative state NS. Norm activations Act1 and Act2 are in conflict if NS1 = NS2 and either C1 ⊏ C2 or C2 ⊏ C1.

Succinctly, we say there is a norm activation conflict if we have two applicable norms activated with the same fraction of the normative state and defined in different contexts. Notice that the fact that both norms are activated with the same contextual IRE's already dictates that the norm contexts, if different, have a sub-context relationship (there is no multiple inheritance mechanism in our normative structure). This becomes clearer when taking into account the sub-context (Def. 3) and norm (Def. 8) definitions: a context has a single parent context, and a norm N^C applies to a context C′ ⊑ C. In principle, all norm activations are defeasible, according to the following definition.

Def. 11: Norm activation defeasance
Norm activation Act1 for norm N1^C1 defeats norm activation Act2 for norm N2^C2 if Act1 and Act2 are in conflict and C1 ⊏ C2.

A defeated norm activation is discarded, that is, the defeated activation is not applied to the normative state fraction used for activating the norm. Only undefeated norm activations will be applied: the substitution that activated the norm is applied to its prescription part and the resulting fully-grounded deontic statements are added to the normative state (recall that there are no free variables in the prescription part of norms). Observe that we do not talk about norm defeasance, but rather norm activation defeasance. Thus, the defeasance relationship may only materialize on actual norm applicability.
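The following sketch (ours) shows how Defs. 9–11 can fit together, assuming the unify/substitute helpers from the rule sketch in Section 2.3 and a sub-context test such as the one sketched earlier. As a simplification, the normative-state fraction grounds the whole situation rather than only its contingent conjuncts, and relational conditions are ignored.

```python
def activations(ctx, situation, prescription, ns, ci):
    """Def. 9: all activations of a norm defined in context ctx, as
    (ctx, grounded fraction, grounded prescription) triples."""
    matches = [{}]
    pool = ns | ci                       # contingent conjuncts match NS, background ones CI'
    for pat in situation:
        matches = [t2 for t in matches for f in pool
                   for t2 in [unify(pat, f, t)] if t2 is not None]
    return [(ctx,
             frozenset(substitute(p, theta) for p in situation),
             frozenset(substitute(p, theta) for p in prescription))
            for theta in matches]

def undefeated(acts, is_subcontext):
    """Defs. 10-11: drop any activation whose fraction is also matched by an
    activation of a norm defined in a strictly more specific context."""
    return [(ctx, frac, presc) for (ctx, frac, presc) in acts
            if not any(frac == f2 and is_subcontext(c2, ctx)
                       for (c2, f2, _) in acts if c2 is not ctx)]
```

With the norms of Section 4, a 100-unit order from tom to jim activates both N1^top and N1^sa3:sa on the same fraction of the normative state; undefeated then keeps only the activation of N1^sa3:sa, mirroring the second example of Table 1.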
3.1 Norm Contextual Target

A question that may arise when going through the previous definitions could jeopardize the purpose of having defeasible norms such as those in the model presented. Why should there be norms that, while being applicable to the same context, are defined in different contexts that have a sub-context relationship? Why not have all norms applicable to context C defined inside context C itself? The reason for our approach becomes apparent when considering the stated aim of a supportive normative environment: to have a normative background that can fill in details of sub-contexts that are created later and that can benefit from this setup by being underspecified. This leads us to the subject of "default rules" in the law field [2]. Thus, part of the normative environment's norms will typically be predefined, in the sense that they pre-exist the applicable contexts themselves (which correspond to and result from contracts as they are signed).

What we need is to typify contexts, in order to be able to say that a norm applies to a certain type of context. This way, a norm might be defined at a super-context and applicable to a range of sub-contexts (of a certain type) to be subsequently created. We make this adaptation by considering a context identifier C as a pair id:type, where id is a context identifier and type is a predefined context type. In a norm N^C = S^C → P^C (see Def. 8), patterns of Info and IRE within S^C and P^C are rewritten to accommodate this kind of context reference, eventually using a variable in place of the context id. For instance, an IRE^X:t pattern, where X is a variable, would match IRE's of any sub-context of type t. When activating a norm with this kind of pattern, the substitution Θ (as used in Def. 9) has to bind X to a specific sub-context identifier; every further occurrence of X is thus a bounded variable. This approach allows us to maintain our definitions of norm activation conflict and defeasance, with minor syntactical changes.

4 EXAMPLES
In this section we sketch some examples towards the exploitation of the normative environment. The examples focus on the important aspects of our approach; in the following we adopt the convention that variables begin with an upper-case letter.

Our scenario is based on the following: each of a group of companies (agents) provides different resources that may need to be combined in order to present a value-added offering to third parties. For that, they agree to form a virtual organization (VO). This organization will define a supply agreement that translates into a context sa3:sa in the normative environment, where sa3 is the context id and sa is the context type (see Section 3.1). Notice that sa3:sa ⊑ top, where top is the top context. Suppose we have, at the top context, the following norm:

N1^top = ifact^X:sa(order(A1, Res, Qt, A2), T) ∧ supply-info^X:sa(A2, Res, Pr)
    → obl^X:sa(A2, delivery(A2, Res, Qt, A1), T+2) ∧ obl^X:sa(A1, payment(A1, Qt∗Pr, A2), T+2)
The norm states that, for any supply agreement, when an order is made that corresponds to the supply-info (which is an Info^C element for this type of context) of the receiver, the receiver is obliged to deliver the requested goods and the sender is obliged to make the associated payment. Now, suppose context sa3:sa includes the following norms.

N1^sa3:sa = ifact^sa3:sa(order(A1, Res, Qt, jim), T) ∧ supply-info^sa3:sa(jim, Res, Pr) ∧ Qt > 99
    → obl^sa3:sa(jim, delivery(jim, Res, Qt, A1), T+5) ∧ obl^sa3:sa(A1, payment(A1, Qt∗Pr, jim), T+2)

This norm expresses the fact that agent jim, when receiving orders of more than 99 units, has an extended delivery deadline.

N2^sa3:sa = ifact^sa3:sa(order(sam, Res, Qt, A2), T) ∧ supply-info^sa3:sa(A2, Res, _)
    → obl^sa3:sa(A2, delivery(A2, Res, Qt, sam), T+2)

N3^sa3:sa = fulf^sa3:sa(A2, delivery(A2, Res, Qt, sam), T) ∧ supply-info^sa3:sa(A2, Res, Pr)
    → obl^sa3:sa(sam, payment(sam, Qt∗Pr, A2), T+2)

These two norms express the privileged position of agent sam who, unlike other agents, only pays after receiving the merchandise. Suppose we have the following founding contextual info for context sa3:sa:

supply-info^sa3:sa(jim, r1, 1)
supply-info^sa3:sa(sam, r2, 1)
supply-info^sa3:sa(tom, r3, 1)

Table 1 shows what might happen in different normative states. The second entry of each example shows which norm activation conflicts come about (and how they are resolved) when the institutional reality elements of the first entry are present. Notice that in the first example there is no conflict, since norm N1^sa3:sa is not activated because of a variable restriction. The third entry shows the normative state after applying the defeating norm activation. For instance, in the second example NS′ contains NS together with the prescriptions of norm N1^sa3:sa (after applying the substitution that activated the norm). The third and fourth examples illustrate sam's advantage in being obliged to pay only after the delivery has been fulfilled. In each case we rely on refraction (a principle used in rule-based systems) to avoid firing a defeating norm more than once on the same activation (which would otherwise happen, since our normative state is monotonic).

Table 1. Different normative states and norm activation conflicts.

NS: ifact^sa3:sa(order(tom, r1, 5, jim), 1)
Conflict: none, N1^top applies
NS′: ifact^sa3:sa(order(tom, r1, 5, jim), 1), obl^sa3:sa(jim, delivery(jim, r1, 5, tom), 3), obl^sa3:sa(tom, payment(tom, 5, jim), 3)

NS: ifact^sa3:sa(order(tom, r1, 100, jim), 1)
Conflict: N1^sa3:sa defeats N1^top
NS′: ifact^sa3:sa(order(tom, r1, 100, jim), 1), obl^sa3:sa(jim, delivery(jim, r1, 100, tom), 6), obl^sa3:sa(tom, payment(tom, 100, jim), 3)

NS: ifact^sa3:sa(order(sam, r3, 5, tom), 1)
Conflict: N2^sa3:sa defeats N1^top
NS′: ifact^sa3:sa(order(sam, r3, 5, tom), 1), obl^sa3:sa(tom, delivery(tom, r3, 5, sam), 3)

NS: ifact^sa3:sa(order(sam, r3, 5, tom), 1), obl^sa3:sa(tom, delivery(tom, r3, 5, sam), 3), fulf^sa3:sa(tom, delivery(tom, r3, 5, sam), 2)
Conflict: none, N3^sa3:sa applies
NS′: the above together with obl^sa3:sa(sam, payment(sam, 5, tom), 4)

The norm activation defeasibility model is very flexible, allowing us to easily specify different contracting situations that exploit and adapt the normative background to different extents. Also, although the examples do not show this, it may be the case that a VO created by a group of agents defines norms to be applied in sub-contexts of a certain type. This would make up a three-level norm inheritance structure, where a subset of the VO's agents could make further contracts that are covered by the VO's agreement.

5 RELATED WORK
From a theoretical logical stance, norm defeasibility has mainly been guided by deontic reasoning [16], where conflicts regard the deontic operators themselves. Our approach is centered instead on the applicability of norms, not on their prescriptions. More practical approaches (e.g. in the B2B domain) to normative conflict resolution have also been developed. The application of business rules in e-commerce has been studied in [11], where courteous logic programs allow for an explicit definition of priorities among rules. An extension based on defeasible logic [15] has been advanced in [10]. Also, [9] addresses defeasible reasoning in the e-contracts domain, based on default logic and on the definition of dynamic priorities among rules.

The work in [7] addresses the issue of conflict resolution in a structured setup of compound activities. These resemble our context and sub-context relationships. However, they model deontic conflicts (e.g. an action being obliged and prohibited at the same time), while we model norm (activation) conflicts. They study the inheritance of normative positions (obligations, permissions, prohibitions), based on an explicit stamping of each of them with a priority value and a timestamp; the specificity criterion is based on the compound activities' structure. We address the inheritance of norms and provide a means to override norm activations based on their defeasibility.

Our approach of context and sub-context definitions, together with the presented norm defeasibility model, is similar to the notion of supererogatory defeasibility in [18], which models defeasibility in terms of role and sub-role definitions. In fact, [18] also considers express defeasibility, which is based on the specificity of conditions for norm applicability, an approach that has been followed by several others. We should also point out that [8] presents a grammar for rules that combines both our rule and norm definitions. However, our concern is to distinguish a priori rule definition, as a normative state maintenance concern, from norm definition, as a contracting activity. Furthermore, in [8] there is no attempt to solve any disputes related with possibly conflicting norms.
6 CONCLUSIONS
In this paper we formalized a normative environment with a hierarchical normative framework, including norm inheritance as a mechanism to facilitate contract establishment. Contexts were used as a means to organize norms and, more importantly, to guide their inheritance to new contexts. For that, we distinguished the context where a norm is defined from the context(s) to which it can be applied. In order to allow the system to be extended, and applied in different contracting scenarios, a model of norm activation defeasibility was designed, allowing an exploitation of the normative framework to different extents. Each signed contract generates a new context. A contract can include norms that defeat some of the norms of its super-contexts (which would otherwise be inherited), thus adapting the normative background to a specific situation.

Considering normative conflict resolution from the law field, as discussed in the introduction, our approach has some similarities with the lex specialis principle. However, the defeating norms are more specific in the sense that they are defined at (as opposed to applied to) a more specific context (a kind of "lex inferior"). The lex specialis flavor comes from the fact that in most cases a defeating norm also applies to a narrower context set. These properties of our norm defeasance approach result from the fact that the original aim is not to impose predefined regulations on
agents, but instead to help them in building contractual relationships by providing a normative background (which can be exploited in a partial way through adaptation). A feature of our approach that exposes this aim is that all norms are defeasible. In this respect we follow the notion from law theory of "default rules" [2]. We leave for future work the possibility of defining non-defeasible norms, that is, norms that are not to be overridden.

The notion of "default rules" might be misleading: it has no direct correspondence with default logic formalizations [17]. We do not handle the defeasibility of conclusions of default rules in that sense, but instead model defeasibility of the application of the rules themselves (which are called norms).

Although we are primarily concerned with deadline obligations, the inclusion of permissions or prohibitions as possible deontic statements prescribed by norms demands no changes in our norm activation defeasibility approach. We do not rely on conflicts between the contents of deontic statements (which are deontic conflicts), but instead on norm activation conflicts. These are closely related to the notion of conflict set (or agenda) in rule-based forward-chaining systems (e.g. [6]). In those systems, a conflict is a possible application of more than one rule at the same time, and a conflict resolution strategy decides which rule to apply in each step of the process.

Some open issues in our research include, as already mentioned, the possibility of defining non-defeasible norms, which might be important in certain contracting domains. The development of multiple-inheritance mechanisms within our contextual framework is also an interesting issue, although it poses additional problems regarding norm defeasibility.
ACKNOWLEDGEMENTS The first author is supported by FCT (Fundação para a Ciência e a Tecnologia) under grant SFRH/BD/29773/2006.
REFERENCES
[1] A. Artikis, J. Pitt, and M. Sergot, 'Animated specifications of computational societies', in Int. J. Conf. on Autonomous Agents and Multi-Agent Systems, eds., C. Castelfranchi and W. L. Johnson, pp. 1053–1062, Bologna, Italy, (2002). ACM, New York, USA.
[2] R. Craswell, 'Contract law: General theories', in Encyclopedia of Law and Economics, eds., B. Bouckaert and G. De Geest, volume III: The Regulation of Contracts, 1–24, Edward Elgar, Cheltenham, (2000).
[3] F. Dignum, 'Autonomous agents with norms', Artificial Intelligence and Law, 7(1), 69–79, (1999).
[4] F. Dignum, 'Abstract norms and electronic institutions', in International Workshop on Regulated Agent-Based Social Systems: Theories and Applications (RASTA'02), Bologna, Italy, (2002).
[5] M. Esteva, B. Rosell, J. A. Rodríguez-Aguilar, and J. L. Arcos, 'AMELI: An agent-based middleware for electronic institutions', in Third Int. J. Conf. on Autonomous Agents and Multi-Agent Systems, volume 1, pp. 236–243. IEEE Computer Society, (2004).
[6] E. Friedman-Hill, Jess in Action, Manning Publications Co., 2003.
[7] A. García-Camino, P. Noriega, and J. A. Rodríguez-Aguilar, 'An algorithm for conflict resolution in regulated compound activities', in Seventh Int. Workshop Engineering Societies in the Agents World (ESAW'06), (2006).
[8] A. García-Camino, J. A. Rodríguez-Aguilar, C. Sierra, and W. Vasconcelos, 'Norm-oriented programming of electronic institutions: A rule-based approach', in Coordination, Organizations, Institutions, and Norms in Agent Systems II, 177–193, Springer, (2007).
[9] G. K. Giannikis and A. Daskalopulu, 'Defeasible reasoning with e-contracts', in IEEE/WIC/ACM International Conference on Intelligent Agent Technology, pp. 690–694, (2006).
[10] G. Governatori, 'Representing business contracts in RuleML', International Journal of Cooperative Information Systems, 14(2-3), 181–216, (2005).
[11] B. N. Grosof, 'Representing e-commerce rules via situated courteous logic programs in RuleML', Electronic Commerce Research and Applications, 3(1), 2–20, (2004).
[12] H. Lopes Cardoso and E. Oliveira, 'A context-based institutional normative environment', in AAMAS'08 Workshop on Coordination, Organization, Institutions and Norms in agent systems (COIN), pp. 119–133, Estoril, Portugal, (2008).
[13] H. Lopes Cardoso and E. Oliveira, 'A contract model for electronic institutions', in Coordination, Organizations, Institutions, and Norms in Agent Systems III, LNAI 4870, 27–40, Springer, (2008).
[14] H. Lopes Cardoso and E. Oliveira, 'Electronic institutions for B2B: Dynamic normative environments', Artificial Intelligence and Law, 16(1), 107–128, (2008).
[15] D. Nute, 'Defeasible logic', in Handbook of Logic in Artificial Intelligence and Logic Programming, eds., D.M. Gabbay, C.J. Hogger, and J.A. Robinson, volume 3, 353–395, Oxford University Press, (1994).
[16] D. Nute, Defeasible Deontic Logic, volume 263 of Synthese Library, Kluwer Academic Publishers, 1997.
[17] R. Reiter, 'A logic for default reasoning', Artificial Intelligence, 13(1/2), 81–132, (1980).
[18] Y. U. Ryu, 'Relativized deontic modalities for contractual obligations in formal business communication', in 30th Hawaii International Conference on System Sciences (HICSS), pp. 485–493, Hawaii, USA, (1997).
[19] S. Sen and S. Airiau, 'Emergence of norms through social learning', in Twentieth International Joint Conference on Artificial Intelligence, pp. 1507–1512, Hyderabad, India, (2007).
8. Constraints and Search
SLIDE: A Useful Special Case of the CARDPATH Constraint

Christian Bessiere1 and Emmanuel Hebrard2 and Brahim Hnich3 and Zeynep Kiziltan4 and Toby Walsh5

Abstract. We study the CARDPATH constraint. This ensures a given constraint holds a number of times down a sequence of variables. We show that SLIDE, a special case of CARDPATH where the slid constraint must hold always, can be used to encode a wide range of sliding sequence constraints, including CARDPATH itself. We consider how to propagate SLIDE and provide a complete propagator for CARDPATH. Since propagation is NP-hard in general, we identify special cases where propagation takes polynomial time. Our experiments demonstrate that using SLIDE to encode global constraints can be as efficient and effective as specialised propagators.
1 INTRODUCTION

In many scheduling problems, we have a sequence of decision variables and a constraint which applies down the sequence. For example, in the car sequencing problem, we need to decide the sequence of cars on a production line. We might have a constraint on how often a particular option is met (e.g. 1 out of 3 cars can have a sun-roof). As a second example, in a nurse rostering problem, we need to decide the sequence of shifts worked by nurses. We might have a constraint on how many consecutive night shifts any nurse can work. Such constraints have been classified as sliding sequence constraints [7]. To model such constraints, we can use the CARDPATH constraint. This ensures that a given constraint holds a number of times down a sequence of variables [5].

We identify a special case of CARDPATH, which we call SLIDE, that is interesting for several reasons. First, many sliding sequence constraints, including CARDPATH, can easily be encoded using this special case. SLIDE is therefore a "general-purpose" constraint for encoding sliding sequence constraints. This is an especially easy way to provide propagators for such global constraints within a constraint toolkit. Second, we give a propagator for enforcing generalised arc consistency on SLIDE. By comparison, the previous propagator for CARDPATH given in [5] does not prune all possible values. Third, SLIDE can be as efficient and effective as specialised propagators in solving sequencing problems.
1 LIRMM (CNRS / U. Montpellier), France, email: bessiere@lirmm.fr. Supported by the ANR project ANR-06-BLAN-0383-02.
2 4C, UCC, Ireland, email: ehebrard@4c.ucc.ie.
3 Izmir Uni. of Economics, Turkey, email: brahim.hnich@ieu.edu.tr. Supported by the Scientific and Technological Research Council of Turkey (TUBITAK) under Grant No. SOBAG-108K027.
4 CS Department, Uni. of Bologna, Italy, email: zeynep@cs.unibo.it.
5 NICTA and UNSW, Sydney, Australia, email: toby.walsh@nicta.com.au. Funded by the Australian Government's Department of Broadband, Communications and the Digital Economy, and the ARC.
2 CARDPATH AND SLIDE CONSTRAINTS

A constraint satisfaction problem consists of a set of variables, each with a finite domain of values, and a set of constraints specifying allowed combinations of values for given sets of variables. We use capital letters for variables (e.g. X), and lower case for values (e.g. d). We write D(X) for the domain of variable X. Constraint solvers typically explore partial assignments enforcing a local consistency property. A constraint is generalised arc consistent (GAC) iff, when a variable is assigned any value in its domain, there exist compatible values in the domains of all the other variables of the constraint.

The CARDPATH constraint was introduced in [5]. If C is a constraint of arity k then CARDPATH(N, [X1, ..., Xn], C) holds iff C(Xi, ..., X_{i+k−1}) holds N times for 1 ≤ i ≤ n−k+1. For example, we can count the number of changes in the type of shift with CARDPATH(N, [X1, ..., Xn], ≠). Note that CARDPATH can be used to encode a range of Boolean connectives, since N ≥ 1 gives disjunction, N = 1 gives exclusive or, and N = 0 gives negation. We shall focus on a special case of the CARDPATH constraint where the slid constraint holds always. SLIDE(C, [X1, ..., Xn]) holds iff C(Xi, ..., X_{i+k−1}) holds for all 1 ≤ i ≤ n−k+1; that is, it is a CARDPATH constraint in which N = n−k+1. We also consider a more complex form of SLIDE that applies only every j variables. More precisely, SLIDEj(C, [X1, ..., Xn]) holds iff C(X_{ij+1}, ..., X_{ij+k}) holds for 0 ≤ i ≤ ⌊(n−k)/j⌋. By definition, SLIDEj for j = 1 is equivalent to SLIDE.

Beldiceanu and Carlsson have shown that CARDPATH can encode a wide range of constraints like CHANGE, SMOOTH, AMONGSEQ and SLIDINGSUM [5]. As we discuss later, SLIDE provides a simple way to encode such sliding sequence constraints. It can also encode many other more complex sliding sequence constraints like REGULAR [16], STRETCH [13], and LEX [7], as well as many types of channelling constraints like ELEMENT [19] and optimisation constraints like the soft forms of REGULAR [20]. More interestingly, CARDPATH can itself be encoded into a SLIDE constraint. In [5], a propagator for CARDPATH is proposed that greedily constructs upper and lower bounds on the number of (un)satisfied constraints by posting and retracting (the negation of) each of the constraints. This propagator does not achieve GAC. We propose here a complete propagator for enforcing GAC on SLIDE. SLIDE thus provides a GAC propagator for CARDPATH. In addition, SLIDE provides a GAC propagator for any of the other global constraints it can encode. As our experimental results reveal, SLIDE can be as efficient and effective as specialised propagators.

We illustrate the usefulness of SLIDE with the AMONGSEQ constraint, which ensures that values occur with some given frequency. For instance, we might want that no more than 3 out of every sequence of 7 shift variables are a "night shift". More
precisely, AMONGSEQ(l, u, k, [X1, ..., Xn], v) holds iff between l and u variables in every sequence of k variables take a value in the ground set v [8]. We can encode this using SLIDE. More precisely, AMONGSEQ(l, u, k, [X1, ..., Xn], v) can be encoded as SLIDE(D_{l,u}^{k,v}, [X1, ..., Xn]) where D_{l,u}^{k,v} is an instance of the AMONG constraint [8]: D_{l,u}^{k,v}(Xi, ..., X_{i+k−1}) holds iff between l and u of its variables take values in the set v. For example, suppose 2 of every 3 variables along a sequence X1...X5 should take the value a, where X1 = a and X2, ..., X5 ∈ {a, b}. This can be encoded as SLIDE(E, [X1, X2, X3, X4, X5]) where E(Xi, X_{i+1}, X_{i+2}) ensures two of its three variables take a. This SLIDE constraint ensures that E(X1, X2, X3), E(X2, X3, X4) and E(X3, X4, X5) all hold. Note that each ternary constraint is GAC. However, enforcing GAC on the SLIDE constraint sets X4 = a, as there are only two satisfying assignments and neither has X4 = b.
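The extra pruning can be checked by brute force. The sketch below (ours, purely illustrative) enumerates the supports of the SLIDE constraint in the example above and collects the values that survive in each domain.

```python
from itertools import product

def among_e(window, value="a", count=2):
    """E(Xi, Xi+1, Xi+2): exactly two of the three variables take the value a."""
    return sum(v == value for v in window) == count

def slide_supports(domains, constraint, k):
    """All assignments satisfying SLIDE(constraint, [X1..Xn]) with window size k."""
    n = len(domains)
    return [tup for tup in product(*domains)
            if all(constraint(tup[i:i + k]) for i in range(n - k + 1))]

domains = [["a"], ["a", "b"], ["a", "b"], ["a", "b"], ["a", "b"]]
sols = slide_supports(domains, among_e, 3)
print(sols)                                      # two supports survive
print([{t[i] for t in sols} for i in range(5)])  # X4 takes only 'a' in both
```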
3 SLIDE WITH MULTIPLE SEQUENCES

We often wish to slide a constraint down two or more sequences of variables at once. For example, suppose we want to ensure that two vectors of variables, X1 to Xn and Y1 to Yn, differ at every index. We can encode such a constraint by interleaving the two sequences and sliding a constraint down the single sequence with a suitable offset. In our example, we simply post SLIDE2(≠, [X1, Y1, ..., Xn, Yn]).

As a second example of sliding down multiple sequences of variables, consider the constraint REGULAR(A, [X1, ..., Xn]). This ensures that the values taken by a sequence of variables form a string accepted by a deterministic finite automaton A [16]. This global constraint is useful in scheduling, rostering and sequencing problems to ensure certain patterns do (or do not) occur over time. It can be used to encode a wide range of other global constraints including AMONG [8], CONTIGUITY [15], LEX and PRECEDENCE [14]. To encode the REGULAR constraint with SLIDE, we introduce variables Qi to record the state of the automaton. We then post SLIDE2(F, [Q0, X1, Q1, ..., Xn, Qn]) where Q0 is set to the starting state, Qn is restricted to accepting states, and F(Qi, X_{i+1}, Q_{i+1}) holds iff Q_{i+1} = δ(Qi, X_{i+1}), where δ is the transition function of the automaton. If we decompose this encoding into the conjunction of slid constraints, we get a set of constraints similar to [6]. Enforcing GAC on this encoding ensures GAC on REGULAR and, by exploiting the functionality of F, takes O(ndq) time, where d is the number of values for Xi and q is the number of states of the automaton. This is asymptotically identical to the specialised REGULAR propagator [16]. This encoding is highly competitive in practice with the specialised propagator [2].

One advantage of this encoding is that it gives explicit access to the states of the automaton. Consider, for example, a rostering problem where workers are allowed to work for up to three consecutive shifts. This can be specified with a simple REGULAR constraint. Suppose now we want to minimise the number of times a worker has to work for three consecutive shifts. To encode this, we can post an AMONG constraint on the state variables to count the number of times we visit the state representing three consecutive shifts, and minimise the value taken by this variable. As we shall see later in the experiments, the encoding also gives an efficient incremental propagator. In fact, the complexity of repeatedly enforcing GAC on this encoding of the REGULAR constraint down a whole branch of a backtracking search tree is just O(ndq) time.
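A minimal sketch (ours) of the encoding just described, with a toy automaton standing in for A: the ternary constraint F checks one transition, and threading the state variables through F accepts a full assignment exactly when the DFA does.

```python
# Toy DFA over shifts "n"/"d": state q counts consecutive nights, at most 2 in a row.
delta = {(q, s): (q + 1 if s == "n" else 0)
         for q in range(3) for s in ("n", "d")
         if not (q == 2 and s == "n")}   # missing transition = rejection

def F(q, x, q_next):
    """Ternary slid constraint F(Qi, Xi+1, Qi+1): Qi+1 = delta(Qi, Xi+1)."""
    return delta.get((q, x)) == q_next

def regular_as_slide(xs, q0=0, accepting=frozenset({0, 1, 2})):
    """REGULAR(A, xs) via SLIDE2(F, [Q0, X1, Q1, ..., Xn, Qn]) on a full assignment."""
    q = q0
    for x in xs:
        q_next = delta.get((q, x))
        if q_next is None or not F(q, x, q_next):   # some F(Qi, Xi+1, Qi+1) is violated
            return False
        q = q_next
    return q in accepting

assert regular_as_slide(list("dnnd"))
assert not regular_as_slide(list("nnnd"))   # three consecutive night shifts
```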
4 SLIDE WITH COUNTERS

We may want to slide a constraint down a sequence of variables computing a count. We can use SLIDE to encode such constraints by incrementally computing the count in an additional sequence of variables. Consider, for example, CARDPATH(N, [X1, ..., Xn], C). For simplicity, we consider k = 2 (i.e., C is binary); the generalisation to other k is straightforward. We introduce a sequence of integer variables Mi in which to accumulate the count. We encode CARDPATH as SLIDE2(G, [M1, X1, ..., Mn, Xn]) where M1 = 0, Mn = N, and G(Mi, Xi, M_{i+1}, X_{i+1}) is defined as: if C(Xi, X_{i+1}) holds then M_{i+1} = Mi + 1, otherwise M_{i+1} = Mi. GAC on SLIDE ensures GAC on CARDPATH.

As a second example, consider the STRETCH constraint [13]. Given variables X1 to Xn taking values from a set of shift types τ, a set π of ordered pairs from τ × τ, and functions shortest(t) and longest(t) giving the minimum and maximum length of a stretch of type t, STRETCH([X1, ..., Xn]) holds iff each stretch of type t has length between shortest(t) and longest(t), and consecutive types of stretches are in π. We can encode STRETCH as SLIDE2(H, [X1, Q1, ..., Xn, Qn]) where Q1 = 1 and H(Xi, X_{i+1}, Qi, Q_{i+1}) holds iff (1) Xi = X_{i+1}, Q_{i+1} = 1 + Qi, and Q_{i+1} ≤ longest(Xi); or (2) Xi ≠ X_{i+1}, ⟨Xi, X_{i+1}⟩ ∈ π, Qi ≥ shortest(Xi) and Q_{i+1} = 1. GAC on SLIDE ensures GAC on STRETCH.
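A sketch (ours) of the counter construction for a binary slid constraint C: the quaternary constraint G either increments or copies the counter, so fixing M1 = 0 and Mn = N pins down the number of times C holds along the sequence.

```python
def make_G(C):
    """G(Mi, Xi, Mi+1, Xi+1): the counter increments exactly when C(Xi, Xi+1) holds."""
    def G(m_i, x_i, m_next, x_next):
        return m_next == m_i + (1 if C(x_i, x_next) else 0)
    return G

def cardpath_check(xs, C, N):
    """CARDPATH(N, xs, C) on a full assignment, via the counter chain."""
    G, m = make_G(C), 0
    for x, x_next in zip(xs, xs[1:]):
        m_next = m + (1 if C(x, x_next) else 0)
        assert G(m, x, m_next, x_next)   # each slid constraint holds by construction
        m = m_next
    return m == N

# Counting shift changes: C is "different", so N counts where the shift type changes.
assert cardpath_check(["d", "d", "n", "n", "e"], lambda a, b: a != b, 2)
```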
5 OTHER EXAMPLES OF SLIDE

There are many other examples of global constraints which we can encode using SLIDE. For example, we can encode LEX [7] using SLIDE. LEX holds iff a vector of variables [X1..Xn] is lexicographically smaller than another vector of variables [Y1..Yn]. We introduce a sequence of Boolean variables Bi to indicate whether the vectors have been ordered by position i−1; hence B1 = 0. We then encode LEX as SLIDE3(I, [B1, X1, Y1, ..., Bn, Xn, Yn]) where I(Bi, Xi, Yi, B_{i+1}) holds iff (Bi = B_{i+1} = 0 ∧ Xi = Yi) or (Bi = 0 ∧ B_{i+1} = 1 ∧ Xi < Yi) or (Bi = B_{i+1} = 1). This gives us a linear-time propagator as efficient and incremental as the specialised algorithm in [12] (a sketch follows at the end of this section). As a second example, we can encode many types of channelling constraints using SLIDE, like DOMAIN [17], LINKSET2BOOLEANS [7] and ELEMENT [19]. As a final example, we can encode "optimisation" constraints like the soft form of the REGULAR constraint, which measures the Hamming or edit distance to a regular string [20]. There are, however, constraints that can be encoded using SLIDE which do not give as efficient and effective propagators as specialised algorithms (e.g. the global ALLDIFFERENT constraint [18]).
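A sketch (ours) of the quaternary constraint I and of how threading the Boolean flags through it decides the ordering; written this way it accepts equal vectors, i.e. it checks the non-strict order.

```python
def I(b_i, x_i, y_i, b_next):
    """I(Bi, Xi, Yi, Bi+1) from the LEX encoding."""
    return ((b_i == 0 and b_next == 0 and x_i == y_i) or
            (b_i == 0 and b_next == 1 and x_i < y_i) or
            (b_i == 1 and b_next == 1))

def lex_leq(xs, ys):
    """Check [X1..Xn] <=_lex [Y1..Yn] by threading the flags through I (B1 = 0)."""
    b = 0
    for x, y in zip(xs, ys):
        b_next = 1 if (b == 1 or x < y) else 0
        if not I(b, x, y, b_next):
            return False
        b = b_next
    return True

assert lex_leq([1, 2, 3], [1, 3, 0])
assert not lex_leq([2, 0], [1, 9])
```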
6 PROPAGATING SLIDE

A constraint like SLIDE is only really useful if we can propagate it efficiently and effectively. The simplest possible way to propagate SLIDEj(C, [X1, ..., Xn]) is to decompose it into a sequence of constraints C(X_{ij+1}, ..., X_{ij+k}) for 0 ≤ i ≤ ⌊(n−k)/j⌋ and let the constraint solver propagate the decomposition. Surprisingly, this is enough to achieve GAC in many cases. For example, we can achieve GAC in this way on the SLIDE encoding of the REGULAR constraint. If the constraints in the decomposition overlap on just one variable, then the constraint graph is Berge acyclic [4], and enforcing GAC on the decomposition of SLIDEj achieves GAC on SLIDEj. Similarly, enforcing GAC on the decomposition achieves GAC on SLIDEj if
the constraint being slid is monotone. A constraint C is monotone iff there exists a total ordering ≺ of the values such that for any two values v, w, if v ≺ w then v can replace w in any support for C. For instance, the constraints AMONG and SUM are monotone if either no upper bound or no lower bound is given.

Theorem 1 Enforcing GAC over each constraint in the decomposition of SLIDEj achieves GAC on SLIDEj if the constraint being slid is monotone.

Proof: For an arbitrary value v ∈ D(X), we show that if every constraint is GAC, then we can build a support for X = v on SLIDEj. For any variable other than X, we choose the smallest value in the total order. This is the value that can be substituted for any other value in the same domain. A tuple built this way satisfies all the constraints being slid since we know that there exists a support for each (they are GAC), and the values we chose can be substituted into this support. □

In the general case, when constraints overlap on more than one variable (e.g. in the SLIDE encoding of AMONGSEQ), we need to do more work to achieve GAC. We distinguish two cases: when the arity of the constraint being slid is not fixed, and when it is fixed. We show that enforcing GAC in the former case is NP-hard.

Theorem 2 Enforcing GAC on SLIDE(C, [X1, ..., Xn]) is NP-hard when the arity of C is not fixed, even if enforcing GAC on C is itself polynomial.

Proof: We give a reduction from 3-SAT with N variables and M clauses. We introduce variables X_i^j for 1 ≤ i ≤ N+1 and 1 ≤ j ≤ M. For each clause j, if the clause is xa ∨ ¬xb ∨ xc, then we set X_1^j ∈ {xa, ¬xb, xc} to represent the values that make this clause true. For each clause j, we set X_{i+1}^j ∈ {0, 1} for 1 ≤ i ≤ N to represent a truth assignment. Hence, we duplicate the truth assignment for each clause. We now build the following constraint SLIDE(C, [X_1^1, .., X_{N+1}^1, .., X_1^j, .., X_{N+1}^j, .., X_1^M, .., X_{N+1}^M]) where C has arity N+1. We construct C(Y1, ..., Y_{N+1}) to hold iff Y1 = xd and Y_{1+d} = 1, or Y1 = ¬xd and Y_{1+d} = 0 (in these two cases, the value assigned to Y1 represents the literal that makes clause j true), or Yi ∈ {0, 1} and Yi = Y_{i+N+1} (in this case, the truth assignment is passed down the sequence). Enforcing GAC on C is polynomial, and an assignment satisfying the SLIDE constraint corresponds to a satisfying assignment of the original 3-SAT problem. □

When the arity of the constraint being slid is small, we can enforce GAC on SLIDE using dynamic programming (DP), in a similar way to the DP-based propagators for the REGULAR and STRETCH constraints [16, 13]. A much simpler method, however, which is just as efficient and effective as dynamic programming, is to exploit a variation of the dual encoding into binary constraints [10] based on tuples of support. Such an encoding was proposed in [1] for a particular sliding constraint; here we show that this method is more general and can be used for arbitrary SLIDE constraints. Using such an encoding, SLIDE can be easily added to any constraint solver. We illustrate the intersection encoding by means of an example. Consider again the AMONGSEQ example in which 2 of every 3 variables of X1...X5 should take the value a, where X1 = a and X2, ..., X5 ∈ {a, b}. We can encode this as SLIDE(E, [X1, X2, X3, X4, X5]) where E(Xi, X_{i+1}, X_{i+2}) is an instance of the AMONG constraint that ensures two of its three variables take a.
If the sliding constraint has arity k, we introduce an intersection variable for each subsequence of k−1 variables of the SLIDE. The first intersection variable V1 has a domain containing all tuples from D(X1) × ... × D(X_{k−1}). The jth intersection variable Vj has a domain containing D(Xj) × ... × D(X_{j+k−2}), and so on until V_{n−k+2}. In our example in Figure 1, this gives D(V1) = D(X1) × D(X2), ..., D(V4) = D(X4) × D(X5).

[Figure 1. Intersection encoding: the intersection variables V1..V4 (each holding pairs of values for two consecutive original variables X1..X5), the channelling constraints linking them to the original variables, and the allowed tuples of the compatibility constraints between consecutive intersection variables.]

We then post binary compatibility constraints between consecutive intersection variables. These constraints ensure that the two intersection variables assign (k−1)-tuples that agree on the values of their k−2 common original variables (like the constraints in the dual encoding). They also ensure that the k-tuple formed by the two (k−1)-tuples satisfies the corresponding instance of the slid constraint. For instance, in Figure 1, the binary constraint between V1 and V2 does not allow the pair ⟨ab, aa⟩ because the second argument of ab for V1 (value b for X2) is in conflict with the first argument of aa for V2 (value a for X2). That same constraint between V1 and V2 does not allow the pair ⟨ab, bb⟩ because the tuple abb is not allowed by E(X1, X2, X3). Enforcing AC on such compatibility constraints prunes aa and bb from V2, ab and bb from V3, and ba and bb from V4.

Finally, we post binary channelling constraints to link the tuples to the original variables. One such constraint for each original variable is sufficient. For example, we can have a channelling constraint between V4 and X4 which ensures that the first argument of the tuple assigned to V4 equals the value assigned to X4. Enforcing AC on this channelling constraint prunes b from the domain of X4. We could instead post a channelling constraint between V3 and X4 ensuring that the second argument in V3 equals X4. The AMONGSEQ constraint is now GAC.

Theorem 3 Enforcing AC on the intersection encoding of SLIDE achieves GAC in O(nd^k) time and O(nd^{k−1}) space, where k is the arity of the constraint to slide and d is the maximum domain size.

Proof: The constraint graph associated with the intersection encoding is a tree. Enforcing AC on it therefore achieves GAC. Enforcing AC on the channelling constraints then ensures that the domains of the original variables are pruned appropriately. As we introduce O(n) intersection variables, and each can contain O(d^{k−1}) tuples, the intersection encoding requires O(nd^{k−1}) space. Enforcing AC on a compatibility constraint between two intersection variables Vi and V_{i+1} takes O(d^k) time, as each tuple in the intersection variable Vi has at most d supports, which are the tuples of V_{i+1} that are equal to Vi on their k−2 common arguments. Enforcing AC on O(n) such constraints therefore takes O(nd^k) time. Finally, enforcing AC on each of the O(n) channelling constraints takes O(d^{k−1}) time, as they are functional. Hence, the total time complexity is O(nd^k). □

Arc consistency on the intersection encoding simulates pairwise consistency on the decomposition. It does this efficiently as intersection variables represent in extension 'only' the intersections. This is sufficient because the constraint graph is acyclic. This encoding is also very easy to implement in any constraint solver (a code sketch is given at the end of this section). It has good
incremental properties: only those constraints associated with a variable whose domain changes need to wake up.

The intersection encoding of SLIDEj for j > 1 is less expensive to build than for j = 1, as we need intersection variables for subsequences of fewer than k−1 variables. For 1 ≤ j ≤ k/2, we introduce intersection variables for subsequences of variables of length k−j starting at indices 1, j+1, 2j+1, ..., whose domains contain (k−j)-tuples of assignments. Compatibility and channelling constraints are defined as with j = 1. If j > k/2, two consecutive intersection variables (for two subsequences of k−j variables) involve fewer than k variables of the SLIDEj; the compatibility constraint between them thus cannot ensure the satisfaction of the slid constraint. We therefore introduce intersection variables for subsequences of length k/2 starting at indices 1, j+1, 2j+1, ..., and for subsequences of length k/2 finishing at indices k, j+k, 2j+k, .... The compatibility constraint between two consecutive intersection variables representing the subsequence starting at index pj+1 and the subsequence finishing at index pj+k ensures satisfaction of the (p+1)th instance of the slid constraint. The compatibility constraint between two consecutive intersection variables representing the subsequence finishing at index pj+k and the subsequence starting at index (p+1)j+1 ensures the consistency of the arguments in the intersection of two instances of the slid constraint.
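The sketch below (ours, for the j = 1 case) builds the intersection variables, runs AC on the chain of compatibility constraints until a fixpoint, and channels the surviving tuples back to the original domains; on the running AMONG example it prunes X4 to {a}.

```python
from itertools import product

def slide_gac(domains, C, k):
    """GAC on SLIDE(C, [X1..Xn]) via the intersection encoding with (k-1)-tuples."""
    n = len(domains)
    # Intersection variables V1..V_{n-k+2}: all (k-1)-tuples over consecutive domains.
    V = [set(product(*domains[i:i + k - 1])) for i in range(n - k + 2)]

    def compatible(t, u):
        # Agree on the k-2 shared variables, and the joined k-tuple satisfies C.
        return t[1:] == u[:-1] and C(t + u[-1:])

    changed = True
    while changed:                       # AC on the chain (a tree, so AC gives GAC)
        changed = False
        for i in range(len(V) - 1):
            left = {t for t in V[i] if any(compatible(t, u) for u in V[i + 1])}
            right = {u for u in V[i + 1] if any(compatible(t, u) for t in V[i])}
            if left != V[i] or right != V[i + 1]:
                V[i], V[i + 1], changed = left, right, True

    # Channelling: X_j keeps the values appearing in some surviving tuple.
    pruned = []
    for j in range(n):
        i = min(j, len(V) - 1)           # an intersection variable covering X_j
        pruned.append({t[j - i] for t in V[i]})
    return pruned

among2of3 = lambda w: sum(v == "a" for v in w) == 2
print(slide_gac([["a"], ["a", "b"], ["a", "b"], ["a", "b"], ["a", "b"]], among2of3, 3))
# -> [{'a'}, {'a', 'b'}, {'a', 'b'}, {'a'}, {'a', 'b'}]: X4 is pruned to {'a'}
```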
7 EXPERIMENTS

We now demonstrate the practical value of SLIDE. Due to space limits, we only report detailed results on a nurse scheduling problem, and summarise the results on balanced incomplete block design generation and car sequencing problems. Experiments are performed with ILOG Solver 6.2 on a 2.8GHz Intel computer running Linux.

We consider a Nurse Scheduling Problem [9] in which we generate a schedule of shift duties for a short-term planning period. There are three types of shifts (day, evening, and night). We ensure that (1) each nurse takes a day off or is assigned to an available shift; (2) each shift has a minimum required number of nurses; (3) each nurse's work load is between specific lower and upper bounds; (4) each nurse works at most 5 consecutive days; (5) each nurse has at least 12 hours of break between two shifts; (6) the shift assigned to a nurse does not change more than once every three days.

We construct four different models, all with variables indicating what type of shift, if any, each nurse is working on each day. We break symmetry between the nurses with LEX constraints. The constraints (1)-(3) are enforced using global cardinality constraints. Constraints (4), (5) and (6) form sequences of respectively 6-ary, binary and ternary constraints. Since (4) is monotone, we simply post the decomposition in the first three models. This achieves GAC by Theorem 1. The models differ in how (5) and (6) are propagated. In decomp, they are decomposed into a conjunction of slid constraints. In amongseq, (5) is decomposed and (6) is enforced using the AMONGSEQ constraint of ILOG Solver (called IloSequence). The combination of (5) and (6) is enforced by SLIDE in slide. Finally, in slidec, we use SLIDE for the combination of (4), (5), and (6).

We test the models using the instances available at http://www.projectmanagement.ugent.be/nsp.php in which nurses have no maximum workload, but a set of preferences to optimise. We ignore these preferences and post a constraint bounding the maximum workload to at most 5 day shifts, 4 evening shifts and 2 night shifts per nurse and per week. Similarly, each nurse must have at least 2 rest days per week. We solve three samples of instances involving 25, 30 and 60 nurses to schedule over 28 days.

We use the same variable ordering for all models so that heuristic choices do not affect results. We schedule the days in chronological order and within each day we allocate a shift to every nurse in lexicographical order. Initial experiments show that this is more efficient than the minimum domain heuristic. However, it restricts the variety of domains passed to the propagators, and thus hinders any demonstration of differences in pruning. We therefore also use a more random heuristic: within each day, we allocate a shift to every nurse randomly with 20% frequency and lexicographically otherwise.

Table 1. Nurse scheduling with lexicographical variable ordering (^1: on instances solved by all methods; ^2: on instances solved by the method).

              #solved   bts^1    time^1   bts^2    time^2
25 nurses, 28 days (99 instances)
decomp        99        301      0.13     301      0.13
amongseq      99        301      0.19     301      0.19
slide         99        301      0.19     301      0.19
slidec        99        295      0.68     295      0.68
30 nurses, 28 days (99 instances)
decomp        68        7101     2.80     15185    5.29
amongseq      67        7101     4.31     7150     4.33
slide         70        3303     1.99     4319     2.53
slidec        75        1047     2.13     11014    10.02
60 nurses, 28 days (100 instances)
decomp        51        5999     4.38     5999     4.38
amongseq      51        5999     7.10     5999     7.10
slide         52        5300     5.61     8479     7.21
slidec        58        2157     7.52     4501     12.07

Table 2. Nurse scheduling with random variable ordering (^1: on instances solved by all methods; ^2: on instances solved by the method).

              #solved   bts^1    time^1   bts^2    time^2
25 nurses, 28 days (99 instances)
decomp        86        35084    7.69     41892    10.06
amongseq      85        35401    14.43    35401    14.43
slide         97        1699     1.00     1547     0.92
slidec        97        457      0.58     438      0.56
30 nurses, 28 days (99 instances)
decomp        20        68834    11.94    69550    12.75
amongseq      20        68834    18.89    69550    19.83
slide         42        378      0.18     8770     7.29
slidec        43        365      0.95     12857    6.76
60 nurses, 28 days (100 instances)
decomp        3         122406   71.06    250427   142.90
amongseq      2         122406   119.40   122406   119.40
slide         27        562      0.65     2367     2.19
slidec        34        542      3.96     1368     6.38
Tables 1 and 2 report the mean runtime and fails to solve the instances, with a 5-minute cutoff. Among the first three models, the best results are due to slide. We solve more instances with slide, and we explore a smaller search tree. By developing a propagator for a generic constraint like SLIDE, we can increase pruning without hurting efficiency. Note that slide always performs better than amongseq. A possible reason is that AMONGSEQ cannot encode constraint (6) as directly as SLIDE: as in previous models, we need to channel into Boolean variables and post AMONGSEQ on them, which may not give as effective and efficient pruning. SLIDE thus offers both modelling and solving advantages over existing sequencing constraints. Note also that slidec solves additional instances within the time limit. This is not surprising, as the model slides the combination of the constraints (4), (5), and (6). Recall that the sliding constraint of (4) is 6-ary. It is pleasing to note that the intersection encoding performs well even in the presence of such a high-arity constraint.

We also ran experiments on Balanced Incomplete Block Designs (BIBDs) and car sequencing. For BIBD, we use the model in [12], which contains LEX constraints. We propagate these either using the specialised algorithm of [12] or the SLIDE encoding. As both propagators maintain GAC, we only compare runtimes. Results on large instances show that the SLIDE model is as efficient as the LEX
model. For car sequencing, we test the scalability of SLIDE on large-arity constraints and large domains using 80 instances from CSPLib. Unlike a model using IloSequence, our SLIDE model does not combine reasoning about the overall cardinality of a configuration with the sequence of AMONG constraints. Hence, it is not as efficient: 26 instances were solved with SLIDE within the five-minute cutoff, compared to 39 with IloSequence. However, 9 of the instances solved with SLIDE were not solved by IloSequence. The memory overhead of the SLIDE propagator was not excessive despite the slid constraints having arity 5 and domains of size 30. The SLIDE model used on average 22Mb of space, compared to 5Mb for IloSequence.
8 RELATED WORK

Pesant introduced the REGULAR constraint, and gave a propagator based on dynamic programming to enforce GAC [16]. As we saw, the REGULAR constraint can be encoded using a simple SLIDE constraint. In this simple case, the dynamic programming machinery of Pesant's propagator is unnecessary, as the decomposition into ternary constraints does not hinder propagation. We have found that SLIDE is as efficient as REGULAR in practice [2]. Furthermore, our encoding introduces variables representing the states. Access to the state variables may be useful (e.g. for expressing objective functions). Although an objective function can be represented with the COSTREGULAR constraint [11], this is limited to the sum of the variable-value assignment costs. Our encoding is more flexible, allowing different objective functions like the min function used in the example in Section 3.

Beldiceanu, Carlsson, Debruyne and Petit have proposed specifying global constraints by means of deterministic finite automata augmented with counters [6]. They automatically construct propagators for such automata by decomposing the specification into a sequence of signature and transition constraints. This gives an encoding similar to our SLIDE encoding of the REGULAR constraint. There are, however, a number of advantages of SLIDE over using an automaton. If the automaton uses counters, pairwise consistency is needed to guarantee GAC (and most constraint toolkits do not support pairwise consistency). We can encode such automata using a SLIDE where we introduce an additional sequence of variables for each counter. SLIDE thus provides a GAC propagator for such automata. Moreover, SLIDE has a better complexity than a brute-force pairwise consistency algorithm based on the dual encoding, as it considers only the intersection variables, reducing the space complexity by a factor of d.

Hellsten, Pesant and van Beek developed a GAC propagator for the STRETCH constraint based on dynamic programming similar to that for the REGULAR constraint [13]. As we have shown, we can encode the STRETCH constraint and maintain GAC using SLIDE. Several propagators for AMONGSEQ are proposed and compared in [21, 3]. Among these propagators, those based on the REGULAR constraint do the most pruning and are often fastest. Finally, Bartak has proposed a similar intersection encoding for propagating a sliding scheduling constraint [1]. We have shown that this method is more general and can be used for arbitrary SLIDE constraints.
9 CONCLUSIONS
We have studied the CARDPATH constraint. This slides a constraint down a sequence of variables. We considered SLIDE, a special case of CARDPATH in which the slid constraint holds at every position. We demonstrated that this special case can encode many global sequencing constraints, including AMONGSEQ, CARDPATH and REGULAR, in a
simple way. SLIDE can therefore serve as a "general-purpose" constraint for decomposing a wide range of global constraints, facilitating their integration into constraint toolkits. We proved that enforcing GAC on SLIDE is NP-hard in general. Nevertheless, we identified several useful and common cases where it is polynomial. For instance, when the constraint being slid overlaps on just one variable or is monotone, decomposition does not hinder propagation. Dynamic programming or a variation of the dual encoding can be used to propagate SLIDE when the constraint being slid overlaps on more than one variable and is not monotone. Unlike the previously proposed propagator for CARDPATH, this achieves GAC. Our experiments demonstrated that using SLIDE to encode constraints can be as efficient and effective as specialised propagators. There are many directions for future work. One promising direction is to use binary decision diagrams to store the supports for the constraints being slid when they have many satisfying tuples. We believe this could improve the efficiency of our propagator in many cases.
REFERENCES
[1] R. Bartak, 'Modelling resource transitions in constraint-based scheduling', in Proc. of SOFSEM 2002: Theory and Practice of Informatics, (2002).
[2] C. Bessiere, E. Hebrard, B. Hnich, Z. Kiziltan, C.-G. Quimper, and T. Walsh, 'Reformulating global constraints: the SLIDE and REGULAR constraints', in Proc. of SARA'07, (2007).
[3] S. Brand, N. Narodytska, C.-G. Quimper, P. Stuckey, and T. Walsh, 'Encodings of the SEQUENCE constraint', in Proc. of CP'07, (2007).
[4] C. Beeri, R. Fagin, D. Maier, and M. Yannakakis, 'On the desirability of acyclic database schemes', Journal of the ACM, 30, 479–513, (1983).
[5] N. Beldiceanu and M. Carlsson, 'Revisiting the cardinality operator and introducing the cardinality-path constraint family', in Proc. of ICLP'01, (2001).
[6] N. Beldiceanu, M. Carlsson, R. Debruyne, and T. Petit, 'Reformulation of global constraints based on constraints checkers', Constraints, 10(4), 339–362, (2005).
[7] N. Beldiceanu, M. Carlsson, and J-X. Rampon, 'Global constraints catalog', Technical report, SICS, (2005).
[8] N. Beldiceanu and E. Contejean, 'Introducing global constraints in CHIP', Mathl. Comput. Modelling, 20(12), 97–123, (1994).
[9] E.K. Burke, P.D. Causmaecker, G.V. Berghe, and H.V. Landeghem, 'The state of the art of nurse rostering', Journal of Scheduling, 7(6), 441–499, (2004).
[10] R. Dechter and J. Pearl, 'Tree clustering for constraint networks', Artificial Intelligence, 38, 353–366, (1989).
[11] S. Demassey, G. Pesant, and L.-M. Rousseau, 'A cost-regular based hybrid column generation approach', Constraints, 11(4), 315–333, (2006).
[12] A. Frisch, B. Hnich, Z. Kiziltan, I. Miguel, and T. Walsh, 'Global constraints for lexicographic orderings', in Proc. of CP'02, (2002).
[13] L. Hellsten, G. Pesant, and P. van Beek, 'A domain consistency algorithm for the stretch constraint', in Proc. of CP'04, (2004).
[14] Y.C. Law and J.H.M. Lee, 'Global constraints for integer and set value precedence', in Proc. of CP'04, (2004).
[15] M. Maher, 'Analysis of a global contiguity constraint', in Proc. of the CP'02 Workshop on Rule Based Constraint Reasoning and Programming, (2002).
[16] G. Pesant, 'A regular language membership constraint for finite sequences of variables', in Proc. of CP'04, (2004).
[17] P. Refalo, 'Linear formulation of constraint programming models and hybrid solvers', in Proc. of CP'00, (2000).
[18] J-C. Régin, 'A filtering algorithm for constraints of difference in CSPs', in Proc. of AAAI'94, (1994).
[19] P. Van Hentenryck and J.-P. Carillon, 'Generality versus specificity: An experience with AI and OR techniques', in Proc. of AAAI'88, (1988).
[20] W-J. van Hoeve, G. Pesant, and L-M. Rousseau, 'On global warming: Flow-based soft global constraints', Journal of Heuristics, 12(4-5), 347–373, (2006).
[21] W-J. van Hoeve, G. Pesant, L-M. Rousseau, and A. Sabharwal, 'Revisiting the sequence constraint', in Proc. of CP'06, (2006).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-480
L. Mandow and J.L. Pérez de la Cruz / Frontier Search for Bicriterion Shortest Path Problems
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-485
Heuristics for Dynamically Adapting Propagation
Kostas Stergiou¹
Abstract. Building adaptive constraint solvers is a major challenge in constraint programming. An important line of research towards this goal is concerned with ways to dynamically adapt the level of local consistency applied during search. A related problem that is receiving a lot of attention is the design of adaptive branching heuristics. The recently proposed adaptive variable ordering heuristics of Boussemart et al. use information derived from domain wipeouts to identify highly active constraints and to focus search on hard parts of the problem, resulting in important savings in search effort. In this paper we show how information about domain wipeouts and value deletions gathered during search can be exploited, not only to perform variable selection, but also to dynamically adapt the level of constraint propagation achieved on the constraints of the problem. First we demonstrate that, when an adaptive heuristic is used, value deletions and domain wipeouts caused by individual constraints largely occur in clusters of consecutive or nearby constraint revisions. Based on this observation, we develop a number of simple heuristics that allow us to dynamically switch between enforcing a weak and cheap local consistency and a strong but more expensive one, depending on the activity of individual constraints. As a case study we experiment with binary problems using AC as the weak consistency and maxRPC as the strong one. Results from various domains demonstrate the usefulness of the proposed heuristics.
1 INTRODUCTION
Building adaptive constraint solvers is a major challenge in constraint programming. One aspect of this goal is concerned with ways to dynamically adapt the level of local consistency applied on constraints during search. Constraint solvers typically maintain (generalized) arc consistency (G)AC, or a weaker consistency property like bounds consistency, during search. Although many stronger local consistencies have been proposed, their practical usage is limited as they are mostly applied during preprocessing, if at all. The main obstacle is the high time and, in some cases, space complexity of the algorithms that can achieve these consistencies. This, coupled with the implicit general assumption that constraints should be propagated with a predetermined local consistency throughout search, makes maintaining strong consistencies an infeasible option, except for some specific CSPs. One way to overcome the high complexity of maintaining a strong consistency while retaining its benefits is to dynamically invoke it during search only when certain conditions are met. There have been some works along this line in the literature, mainly focusing on methods to switch between AC and weaker consistencies [8, 10, 17, 14]. Here we consider methods to selectively apply stronger local consistencies than AC during search.
¹ Department of Information & Communication Systems Engineering, University of the Aegean, Greece (konsterg@aegean.gr).
Recently, Boussemart et al. proposed two adaptive conflict-driven variable ordering heuristics for CSPs called wdeg and dom/wdeg [2]. These heuristics use information derived from conflicts, in the form of domain wipeouts (DWOs) and stored as constraint weights, to guide search. These heuristics, and especially dom/wdeg, are among the most efficient, if not the most efficient, general-purpose heuristics for CSPs. Grimes and Wallace proposed alternative conflict-driven heuristics that consider value deletions as the basic propagation events associated with constraint weights [11]. The efficiency of all the proposed conflict-directed heuristics is due to their ability to learn through conflicts encountered during search. As a result they can guide search towards hard parts of the problem and identify contentious constraints [11]. It has been recognized, for example in [14], that in many problems only a few of the constraint revisions that occur during search are fruitful (i.e. delete values) while, as an extreme case, some constraints do not cause any value deletions at all despite being revised many times. Hence it would be desirable to apply a strong consistency only when it is likely to prune many values, and to avoid using such a consistency when the expected pruning is non-existent or very low. Through weight recording, conflict-driven heuristics are able to identify highly active constraints and focus search on variables involved in such constraints. Given that highly active constraints usually reside in hard parts of the problem, can one take advantage of this information to adapt the level of constraint propagation accordingly? In this paper we show how information about conflicts and value deletions can be exploited, not only to perform variable selection, but also to dynamically adapt the level of local consistency achieved on the constraints of the problem. First we demonstrate that when a conflict-driven heuristic is used on structured problems, constraint activity during search is not uniformly distributed among the revisions of the constraints. On the contrary, it is highly clustered, as value deletions and domain wipeouts caused by individual constraints largely occur in clusters of nearby revisions. Based on this observation, we develop simple heuristics that allow us to dynamically switch between enforcing a weak and cheap local consistency and a strong but more expensive one. The proposed heuristics achieve this by monitoring the activity of the constraints in the problem and triggering a switch between different propagation methods on individual constraints once certain conditions are met. For example, one of the heuristics works as follows. It applies a weak consistency on each constraint c until a revision of c results in a DWO. Then it switches to a strong consistency and applies it on c for the next few revisions. If no further weight update occurs during these revisions, it switches back to the weaker consistency. As a case study we experiment with binary problems using AC as the weak consistency and maxRPC as the strong one. Experimental results from various domains demonstrate the usefulness of the proposed heuristics.
2 BACKGROUND
A Constraint Satisfaction Problem (CSP) is a tuple (X, D, C) where X is a set of n variables, D is a set of domains, one for each variable, and C is a set of e constraints. Each constraint c is a pair (var(c), rel(c)), where var(c) = {x1, ..., xk} is an ordered subset of X, and rel(c) is a subset of the Cartesian product D(x1) × ... × D(xk). In a binary CSP, a directed constraint c, with var(c) = {xi, xj}, is arc consistent (AC) iff for every value ai ∈ D(xi) there exists a value aj ∈ D(xj) s.t. the 2-tuple <(xi, ai), (xj, aj)> satisfies c. In this case (xj, aj) is called an AC-support of (xi, ai) on c. A problem is AC iff there is no empty domain in D and all the constraints in C are AC. A directed constraint c, with var(c) = {xi, xj}, is max restricted path consistent (maxRPC) iff it is AC and for each value (xi, ai) there exists a value aj ∈ D(xj) that is an AC-support of (xi, ai) s.t. the 2-tuple <(xi, ai), (xj, aj)> is path consistent (PC) [5]. A tuple <(xi, ai), (xj, aj)> is PC iff for any third variable xm there exists a value am ∈ D(xm) s.t. (xm, am) is an AC-support of both (xi, ai) and (xj, aj). In this case we say that (xj, aj) is a maxRPC-support of (xi, ai) on c. The revision of a constraint c, with var(c) = {xi, xj}, using a local consistency A is the process of checking whether the values of xi verify the property of A. We say that a revision is fruitful if it deletes at least one value, while it is redundant if it achieves no pruning. A DWO-revision is one that causes a DWO. We will say that a constraint is DWO-active during a run of a search algorithm if it caused at least one DWO. Accordingly, we will call a constraint deletion-active if it deleted at least one value from a domain, and deletion-inactive if it caused no pruning at all.

3 CONSTRAINT ACTIVITY DURING SEARCH
In many, mainly structured, problems some constraints do not cause any DWOs, or are even deletion-inactive, during the run of a search algorithm. For example, when solving the scen11 RLFA problem with MAC+dom/wdeg, only 27 of the 4103 constraints in the problem were DWO-active while 2182 were deletion-active. The activity of the constraints in a problem depends on the structure of the problem, since constraints in difficult local subproblems are more likely to cause deletions and DWOs, especially if a heuristic like dom/wdeg that can identify such subproblems is used. Due to the complex interactions that may exist between constraints, the activity also depends on the search algorithm, the propagation method, the variable ordering heuristic, and on the order in which constraints are propagated. For example, when solving scen11 with an algorithm that maintains maxRPC (MmaxRPC) + dom/wdeg, 29 constraints were DWO-active, with 13 of these identified as DWO-active by both MAC and MmaxRPC. Importantly, many revisions of the constraints that are DWO-active and deletion-active are redundant or achieve very little pruning. Figure 1 demonstrates how DWOs (y-axis) caused by 4 sample constraints are detected as constraint revisions (x-axis) occur throughout search. That is, each data point gives the weight of the constraint at its i-th DWO-revision. The algorithm used is MAC + dom/wdeg and the sample constraints are taken from three structured and one random problem. As we can see, DWOs in structured problems form clusters of successive or very close calls to the revision procedure, with the exception of a few outliers. The same pattern occurs with respect to value deletions (not shown due to lack of space). In contrast, DWOs in the random instance are distributed in a much more uniform way along the line of revisions. Similar results were obtained when MmaxRPC was used in place of MAC. Note that in the structured problems the percentage of DWO-revisions to total revisions is low. There were also many redundant revisions. For example, in the RLFAP instance the sample constraint, which was the most active one in terms of DWOs caused, was revised 3386 times during search, but only 407 of these revisions were fruitful, while only 265 were DWO-revisions. To further investigate these observations we ran the Expectation Maximization (EM) clustering algorithm [7] on the data of Figure 1 (top left). This revealed 20 clusters of DWO-revisions with an average size of 13.25. The mean and median standard deviation (SD) for the DWO-revisions (x-axis) across the clusters were 21.67 and 7.41 respectively. The SD in a cluster is an important piece of information as it represents the average distance of any member of the cluster from the cluster's centroid. That is, it is a measure of the cluster's density. The median SD over the 20 clusters is quite low, which indicates that DWO-revisions are closely grouped together. The mean is higher because it is affected by the presence of outliers. That is, some of the clusters formed by EM may include outliers which increase the cluster's SD.

Figure 1. DWOs caused by sample constraints from the RLFAP instance scen11 (top left), the driver instance driver-08c (top right), the quasigroup completion instance qcp15-120-0 (bottom left), and the forced random instance frb35-17-0 (bottom right). (Each panel plots constraint weight against constraint revisions, with one point per weight update.)

Table 1. Clustering results from benchmark instances.

instance     | #constraints | avg #clusters | avg size | mean SD | median SD
scen11       | 27/4103      | 6.66          | 10.82    | 41.09   | 16.12
driver-08c   | 87/9321      | 2.44          | 12.62    | 38.50   | 25.11
qcp15-120-0  | 554/3150     | 12.87         | 15.26    | 226.12  | 129.28
frb35-17-0   | 233/262      | 7.20          | 19.38    | 1856.70 | 1649.05
Table 1 shows clustering results from the four benchmark instances of Figure 1. For each instance we report the ratio of DWO-active constraints over the total number of constraints, the average number of clusters, the average cluster size, and the mean and median SD for the clusters of DWO-revisions. Averages are taken over 20 sample DWO-active constraints from each problem. The mean and median SD are much lower in structured problems compared to the random one, verifying the observation that in the presence of structure DWO-revisions largely occur in clusters, while in its absence they tend to be uniformly distributed. The question we try to
answer in the following is whether we can take advantage of this to discover dead-ends sooner through strong propagation while keeping cpu times manageable.
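For readers who want to reproduce this kind of cluster-density measurement, here is a rough sketch under our own assumptions: scikit-learn's GaussianMixture stands in for the EM implementation, and the DWO-revision indices below are synthetic, not taken from scen11.

import numpy as np
from sklearn.mixture import GaussianMixture

np.random.seed(0)
# Synthetic indices of DWO-revisions: three tight clusters plus two outliers.
dwo_revisions = np.concatenate([
    np.random.normal(300, 5, 15),
    np.random.normal(1200, 8, 12),
    np.random.normal(2500, 6, 18),
    [700.0, 3300.0],
]).reshape(-1, 1)

em = GaussianMixture(n_components=3).fit(dwo_revisions)
labels = em.predict(dwo_revisions)

# The SD within each cluster measures its density: the average distance
# of members from the centroid. A low median SD means tight clusters.
sds = [dwo_revisions[labels == k].std() for k in range(3)]
print("per-cluster SD:", sds, "median SD:", np.median(sds))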
4 HEURISTICALLY ADAPTING PROPAGATION
We now present four simple heuristics that can be used to dynamically adapt the level of consistency enforced on individual constraints. These heuristics exploit information regarding domain reductions and wipeouts gathered during search. We limit ourselves to the case where dynamic adaptation involves switching between a weak and cheap local consistency and a stronger but more expensive one. In general, it may be desirable to utilize a suite of local consistencies with varying power and properties. The intuition behind the heuristics is twofold: first, to target the application of the strong consistency on areas of the search space where a constraint is highly active, so that domain pruning is maximized and dead-ends are encountered faster; and second, to avoid using an expensive propagation method when pruning is unlikely. The first three heuristics try to take advantage of the clustering that fruitful revisions display in structured problems, while the fourth heuristic simply reacts to any deletions caused by a constraint. Importantly, any heuristic, be it for branching or for adapting the local consistency enforced, must be lightweight, i.e. cheap to compute. As will become clear, the heuristics proposed here are indeed lightweight, as they affect the complexity of the propagation procedure only by a constant factor. The heuristics can be distinguished according to the propagation events they monitor (deletions or DWOs) and also according to the extent of user involvement in their tuning (fully or semi automated). Heuristics based on DWOs (value deletions) may change or maintain the level of local consistency employed on a given constraint by monitoring the DWOs (value deletions) caused by this constraint. There are also hybrid heuristics that may react to both types of propagation events. Fully automated heuristics do not require any tuning, while semi automated ones are parameterized by a bound. This bound specifies the desired number of revisions during which a strong consistency is enforced after a propagation event has been detected. The greater the bound, the longer the strong consistency is applied. In our experiments we have used AC and maxRPC as the weak and strong local consistency respectively. As proved in [5], maxRPC is strictly stronger than AC. That is, it will always delete at least the same values as AC. Also, maxRPC displays a good cpu time to value deletions ratio compared to other strong local consistencies [6]. However, since our approach is generic, when describing the heuristics we will avoid naming specific consistencies and instead refer to switching between a weak (W) and a strong (S) local consistency, where S is strictly stronger than W. For each c ∈ C, the heuristics record the following information:
1) rev[c] is a counter holding the number of times c has been revised, incremented by one each time c is revised.
2) dwo[c] is an integer denoting the revision in which the most recent DWO caused by c occurred.
3) del[c] is a Boolean flag denoting whether the most recent revision of c resulted in at least one value deletion (del[c]=T) or not (del[c]=F).
4) del_S[c] is a Boolean flag denoting whether the most recent revision of c identified and deleted at least one value that is W but not S. The flag becomes T only if a value that is W but not S is deleted. Otherwise, it is set to F.
5) del_W[c] is a Boolean flag denoting whether the current revision of c resulted in at least one value deletion (del_W[c]=T) or not (del_W[c]=F).
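As an illustration, the recorded information might be held in a per-constraint record such as the following sketch (the class and field spellings are ours; del is renamed delete because del is a Python keyword).

from dataclasses import dataclass

@dataclass
class Activity:
    rev: int = 0          # 1) number of times c has been revised
    dwo: int = 0          # 2) revision at which the most recent DWO caused by c occurred
    delete: bool = False  # 3) del[c]: most recent revision deleted at least one value
    del_S: bool = False   # 4) most recent revision deleted a value that is W but not S
    del_W: bool = False   # 5) current revision deleted at least one value under W

activity = {}             # maps each constraint c to its Activity record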
H1(l): semi automated - DWO monitoring. Heuristic H1 monitors and counts the revisions and DWOs of the constraints in the problem. A constraint c is made S if the number of calls to Revise(c) since the last time it caused a DWO is less than or equal to a (user defined) threshold l, that is, if rev[c] - dwo[c] ≤ l. Otherwise, it is made W.

H2: fully or semi automated - deletion monitoring. H2 monitors revisions and value deletions. A constraint c is made S as long as del[c]=T. Otherwise, it is made W. H2 can be semi automated in a similar way to H1 by allowing for a (user defined) number l of redundant revisions after the last fruitful revision. If l is set to 0 we get the fully automated version of H2.

H3: fully or semi automated - hybrid. H3 is a refinement of H2. It monitors revisions, value deletions, and DWOs. A constraint c is made S as long as del_S[c]=T. Otherwise, it is made W. Once the constraint causes a DWO, del_S[c] is set to T and the monitoring of S's effects starts again. If this is not done, then once del_S[c] is set to F the constraint will thereafter be propagated using W. H3 can be semi automated in a similar way to H1 and H2 by allowing for a (user defined) number l of revisions that only delete W-inconsistent values, or no values at all, after the last revision that deleted values that were W but not S.

H4: fully or semi automated - deletion monitoring. H4 monitors value deletions. For any constraint c, H4 applies W until del_W[c] becomes T. In this case c is made S. In other words, if at least one value is deleted from the domain of a variable x ∈ var(c) by W, then S is applied on the remaining available values in D(x). H4 can be semi automated by insisting that S is applied only if a (user defined) proportion p of x's values have been deleted by W during the current revision of c. With high values of p, S will be applied only when it is likely to cause a DWO.

Importantly, the heuristics defined above can be combined either disjunctively or conjunctively in various ways. For example, heuristic H∨_124 applies S on a constraint whenever the condition specified by either H1, H2, or H4 holds. Heuristic H∧_24 applies S when both the conditions of H2 and H4 hold. We can choose a disjunctive or conjunctive combination depending on whether we want S applied to a greater or lesser extent respectively.

Figure 2 describes the implementation of the functions Revise for applying a weak or a strong consistency using the proposed heuristics. They are based on the corresponding functions of coarse-grained algorithms like AC-3. Once a constraint is selected for revision, a function we call Decide (not shown for space reasons) is called to determine how it will be propagated. This function is parameterized by the adaptive propagation heuristic h and the data structures required for the computation of the heuristics. The appropriate function w.r.t. h is called to compute the heuristic and decide on the local consistency to be applied. Thereafter, depending on the selected consistency, the appropriate version of function Revise is called to perform the propagation. The two versions of Revise shown, one for W and one for S, implement H∨_124 or H∨_134. As values are deleted and DWOs are detected, the data structures used by the heuristics are updated. Initially, i.e. before the first revision of c, del[c], del_W[c] and del_S[c] are set to F and rev[c], dwo[c] are set to 0.
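The code of Decide is not given in the paper; the following sketch (ours) shows one plausible reading of how the four conditions, and a disjunctive combination such as H∨_124, could be evaluated from the records above. The semi automated refinements of H2-H4 are omitted, and the del_W flag is in fact set during the current revision itself, inside Revise.

STRONG, WEAK = "S", "W"

def decide(h, a, l=100):
    # a is the Activity record of the constraint under revision.
    if h == "H1":    # semi automated, DWO monitoring
        return STRONG if a.rev - a.dwo <= l else WEAK
    if h == "H2":    # deletion monitoring
        return STRONG if a.delete else WEAK
    if h == "H3":    # hybrid: deletions specific to S
        return STRONG if a.del_S else WEAK
    if h == "H4":    # react to deletions made by W
        return STRONG if a.del_W else WEAK
    raise ValueError("unknown heuristic: " + h)

def decide_disjunctive(hs, a, l=100):
    # Combined heuristic, e.g. hs = ("H1", "H2", "H4") for H∨_124.
    return STRONG if any(decide(h, a, l) == STRONG for h in hs) else WEAK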
function Revise(c, xi, S)
rev[c]++;
for each a ∈ D(xi)
  if a is not W-supported in c then
    delete a from D(xi);
2:  del[c] ← T;
  else if a is not S-supported in c then
    delete a from D(xi);
2:  del[c] ← T;
3:  del_S[c] ← T;
if D(xi) = ∅ then
  dwo[c] ← rev[c];
3: del_S[c] ← T;
2: if no value is deleted then del[c] ← F;
3: if no value that is W is deleted by S then del_S[c] ← F;

function Revise(c, xi, W)
rev[c]++;
for each a ∈ D(xi)
  if a is not W-supported in c then
    delete a from D(xi);
    del_W[c] ← T;
2:  del[c] ← T;
if del_W[c] = T then
  for each a ∈ D(xi)
    if a is not S-supported in c then
      delete a from D(xi);
3:    del_S[c] ← T;
if D(xi) = ∅ then
  dwo[c] ← rev[c];
3: del_S[c] ← T;
2: if no value is deleted then del[c] ← F;
3: if no value that is W is deleted by S then del_S[c] ← F;

Figure 2. The versions of Revise given can apply H∨_124 or H∨_134. Removing the lines labelled with 3 (2) gives H∨_124 (H∨_134).

5 EXPERIMENTS
We implemented and tested the heuristics described in Section 4, as well as a number of combined heuristics. We used d-way branching, dom/wdeg for variable ordering, and lexicographic value ordering. We experimented with the following classes of benchmarks, taken from C. Lecoutre's web page where details about them can be found: radio links frequency assignment (RLFAP), langford, black hole, driver, hanoi, quasigroup completion, quasigroup with holes, graph coloring, composed random, forced random, geometric random. Some classes and many specific instances are very easy (e.g. composed) or very hard (e.g. black hole) for all methods. The results presented below demonstrate that the heuristics retain the efficiency of maxRPC where it is better than AC and improve it where it is worse. Also, we need to point out that for many of the tested classes there exist specialized methods that can solve the specific problems much faster than the generic methods we use. Our aim is only to demonstrate the efficiency of the proposed heuristics in dynamically switching between different local consistencies.

Table 2 shows results from some real-world RLFAP instances. We compare adaptive algorithms that use the heuristics of Section 4, where each algorithm is denoted by the corresponding heuristic, to MAC and MmaxRPC, simply denoted by AC and maxRPC respectively. For H1, and any combined heuristic that includes H1, l was set to 100, while for H2 l was set to 10. These values were chosen empirically and display a good performance across a number of instances.² In these problems maxRPC is too expensive to maintain compared to AC. The adaptive heuristics cut down the size of the explored search space and reduce the run times in most cases. This is more visible in problems where maxRPC visits considerably fewer nodes than AC (e.g. graph08-f11). Importantly, in easy problems, or in problems where maxRPC does not have a considerable effect compared to AC, the heuristics do not slow the search process in a significant way.

² The fully automated version of H2 is competitive but less robust.

Table 2. Nodes (n) and cpu times (t) in seconds from RLFAP instances. The s and g prefixes stand for scen and graph respectively. The best cpu time for each instance is highlighted with bold.

instance |   | AC     | maxRPC | H1    | H2    | H3    | H4    | H∨_14 | H∨_124
s11      | n | 2864   | 1334   | 1175  | 1842  | 1432  | 1678  | 1358  | 1360
         | t | 6.9    | 24.2   | 3.7   | 6.7   | 5.5   | 6.0   | 4.9   | 4.9
s11-f9   | n | 108184 | 37663  | 35102 | 47552 | 39312 | 53338 | 38202 | 37743
         | t | 539.6  | 3478.3 | 170.4 | 335.4 | 183.3 | 274.8 | 205.2 | 212.7
s11-f10  | n | 8576   | 2098   | 2197  | 2675  | 1938  | 3849  | 2462  | 2467
         | t | 30.2   | 93.8   | 11.6  | 18.8  | 10.2  | 13.9  | 11.4  | 11.3
s11-f12  | n | 6678   | 1923   | 1750  | 2804  | 1763  | 3095  | 1953  | 1921
         | t | 19.7   | 101.7  | 8.6   | 14.5  | 9.4   | 14.7  | 11.0  | 10.6
s02-f25  | n | 11998  | 5262   | 3114  | 10802 | 2938  | 12961 | 4367  | 4922
         | t | 9.3    | 65.1   | 5.6   | 16.0  | 5.5   | 15.2  | 9.3   | 10.3
s03-f11  | n | 8314   | 880    | 1047  | 4830  | 2762  | 4518  | 2068  | 1489
         | t | 26.4   | 24.7   | 5.6   | 20.2  | 11.8  | 17.2  | 12.5  | 9.5
g08-f10  | n | 11948  | 6342   | 6650  | 6423  | 9540  | 4863  | 4474  | 4119
         | t | 34.5   | 147.1  | 21.9  | 19.4  | 26.8  | 13.9  | 16.3  | 16.2
g08-f11  | n | 9996   | 629    | 753   | 960   | 748   | 713   | 608   | 619
         | t | 35.9   | 18.7   | 4.3   | 4.5   | 4.8   | 3.6   | 3.6   | 3.6
g14-f27  | n | 11602  | 926    | 10759 | 2237  | 9698  | 2877  | 2750  | 2750
         | t | 13.0   | 2.5    | 15.3  | 3.1   | 17.2  | 3.3   | 3.1   | 3.1

Table 3 shows results, including only some of the heuristics, from instances belonging to the following classes of benchmarks: graph coloring (1st, 2nd), driver (3rd, 4th), quasigroup completion (5th-7th), quasigroups with holes (8th, 9th). In some of these problems maxRPC is much more efficient than AC. The heuristics, except H4, can further improve on the performance of maxRPC, making the adaptive algorithms considerably more efficient than MAC.

Table 3. Nodes (n) and cpu times (t) in seconds from structured instances.

instance      |   | AC      | maxRPC  | H2      | H4      | H∨_24   | H∨_124
queen8-8-8    | n | -       | 1458    | 2807    | -       | 5863    | 4244
              | t | >1h     | 3.15    | 2.9     | >1h     | 5.1     | 2.7
games120-9    | n | 3208852 | 1392922 | 5511126 | 2265133 | 1604133 | 1452449
              | t | 403.7   | 432.3   | 834.3   | 293.7   | 216.1   | 195.9
driverlogw-08 | n | 3814    | 785     | 1003    | 3417    | 855     | 903
              | t | 13.2    | 25.5    | 6.9     | 9.2     | 6.1     | 6.2
driverlogw-09 | n | 14786   | 8342    | 10802   | 10627   | 8859    | 8895
              | t | 239.2   | 265.8   | 152.9   | 167.1   | 137.8   | 141.2
qcp-15-120-0  | n | 108336  | 21926   | 35394   | 101901  | 29990   | 27167
              | t | 98.4    | 43.3    | 39.9    | 83.9    | 33.4    | 28.3
qcp-15-120-5  | n | 387742  | 80424   | 84193   | 370461  | 81269   | 112290
              | t | 422.0   | 201.0   | 118.2   | 369.4   | 117.7   | 147.0
qcp-15-120-10 | n | 1136801 | 52112   | 58325   | 152497  | 76399   | 68046
              | t | 1178.0  | 113.6   | 65.1    | 145.1   | 88.6    | 71.2
qwh-20-166-0  | n | 104288  | 20236   | 15550   | 62993   | 15591   | 24725
              | t | 269.1   | 86.9    | 42.3    | 140.0   | 46.0    | 78.2
qwh-20-166-1  | n | 132842  | 22688   | 29681   | 66775   | 25147   | 39435
              | t | 355.4   | 111.4   | 88.2    | 151.1   | 78.5    | 116.7
The results given in Tables 2 and 3 show that individual heuristics can display considerable variance in their performance from instance to instance. On the contrary, combined heuristics are quite robust. Comparing the heuristics, H2 and the combined ones that include H2 display good performance on a variety of problems. It has to be noted that H∨_24 and H∨_124 were faster than AC in all instances we tried from the classes mentioned at the start of this section, except for some easy instances where they were slightly slower. H1 and H3 are effective on RLFAPs but worse than H2 on quasigroup problems. The fully automated version of H4 displays the worst performance among the individual heuristics, but we have not yet tried semi automated versions of H4. Overall the heuristics offer a good balance between AC and maxRPC. In problems where maxRPC offers significant savings in nodes, they retain this advantage and translate it into considerable savings in run times. In problems where maxRPC offers moderate savings in nodes, the heuristics significantly reduce the run times of maxRPC and are competitive with, and often faster than, AC.
Finally, Table 4 gives results from forced and geometric random problems. As is clear, in such problems that lack structure the heuristics do not reduce the node visits in a significant way and are outperformed by AC. The best heuristic is by far H4. This is because H4 does not target clusters of activity to apply maxRPC but reacts to value deletions wherever they occur. Hence, it is not significantly handicapped by the absence of clusters.

Table 4. Nodes (n) and cpu times (t) in seconds from random instances.

instance    |   | AC     | maxRPC | H2     | H4     | H∨_24  | H∨_124
frb35-17    | n | 23782  | 14920  | 15022  | 21182  | 15064  | 14642
            | t | 13.5   | 107.5  | 47.5   | 16.1   | 48.4   | 46.8
frb40-19    | n | 40058  | 20073  | 24446  | 32393  | 19722  | 22752
            | t | 24.9   | 151.6  | 76.8   | 27.9   | 63.4   | 76.1
geo50-20-75 | n | 227535 | 112785 | 148853 | 221211 | 142416 | 141726
            | t | 218.9  | 2089.4 | 765.7  | 247.1  | 748.3  | 750.1
A final interesting observation is that sometimes the heuristics result in fewer node visits than maxRPC, or in more than AC. This is explained by the interaction between constraint propagation and the variable ordering heuristic. Different propagation methods can lead to different weight increases for the constraints, which in turn can guide dom/wdeg to different variable selections, and hence to different parts of the search space.
6 RELATED WORK
Building adaptive constraint solvers is a topic that has attracted considerable interest in the literature (see for example [1, 15, 9, 12]). Part of this interest has been directed to the dynamic adaptation of constraint propagation during search. The most common manifestation of this idea is the use of different propagators for different types of domain reductions in arithmetic constraints. When handling arithmetic constraints, most solvers differentiate between events such as removing a value from the middle of a domain, removing a value from a bound of a domain, or reducing a domain to a singleton, and apply suitable propagators accordingly. Works on adaptive propagation for general constraints include the following. El Sakkout et al. proposed a scheme called adaptive arc propagation for dynamically deciding whether to process individual constraints using AC or forward checking [8]. Freuder and Wallace proposed a technique, called selective relaxation, which can be used to restrict AC propagation based on two criteria: the distance in the constraint graph of any variable from the currently instantiated one, and the proportion of values deleted [10]. Chmeiss and Sais presented a backtrack search algorithm, MAC(dist k), that also uses a distance parameter k as a bound to maintain a partial form of AC [4]. Schulte and Stuckey proposed a technique for selecting which propagator to apply to a given constraint, among an array of available constraint propagators, using priorities that are dynamically updated [17]. Similar ideas are also implemented in constraint solvers such as Choco [13]. Probabilistic arc consistency is a scheme that can help avoid some consistency checks and constraint revisions that are unlikely to cause any domain pruning [14]. As in [8], the scheme is based on information gathered by examining the supports of values in constraints, which can be very expensive for non-binary constraints. Our work is most closely related to [8], as the aim is to dynamically adapt the level of local consistency achieved on individual constraints. However, neither [8] nor any of the other works use information about failures, captured in the form of constraint weights, to achieve this. Besides, to the best of our knowledge, although many levels of consistency stronger than AC have been proposed, they have not been studied in this context (i.e. invoked dynamically) before.
7 CONCLUSION
We have proposed a number of simple heuristics for dynamically switching between different local consistencies applied on individual constraints during search. These heuristics monitor propagation events like DWOs and value deletions caused by the constraints, and react by changing the propagation method when certain conditions are met. The development of the heuristics was inspired by observing the activity of the constraints when conflict-driven search heuristics are used. As we demonstrated, DWOs and value deletions in structured problems mostly occur in clusters of consecutive or nearby revisions. This can be taken advantage of to increase or decrease the level of consistency applied when a constraint is highly active or inactive respectively. Experimental results from various domains displayed the usefulness of the heuristics. The work presented here is only a first step towards designing heuristics for adaptive constraint propagation using information gathered during search. There are several directions for future work. First of all, we need to further evaluate the heuristics, including their conjunctive combinations. We can also investigate different local consistencies for binary and non-binary problems, try to devise more sophisticated heuristics, and integrate with existing related works (e.g. [14]). Also, it would be interesting to study the interaction of adaptive propagation with adaptive branching heuristics other than dom/wdeg, for example the impact-based heuristics of [16] and the explanation-based heuristics of [3].
REFERENCES
[1] J. Borrett, E. Tsang, and N. Walsh, 'Adaptive Constraint Satisfaction: The Quickest First Principle', in ECAI-96, pp. 160–164, (1996).
[2] F. Boussemart, F. Hemery, C. Lecoutre, and L. Sais, 'Boosting systematic search by weighting constraints', in ECAI-2004, pp. 482–486, (2004).
[3] H. Cambazard and N. Jussien, 'Identifying and Exploiting Problem Structures Using Explanation-based Constraint Programming', Constraints, 11, 295–313, (2006).
[4] A. Chmeiss and L. Sais, 'Constraint Satisfaction Problems: Backtrack Search Revisited', in ICTAI-2004, pp. 252–257, (2004).
[5] R. Debruyne and C. Bessière, 'From restricted path consistency to max-restricted path consistency', in CP-97, pp. 312–326, (1997).
[6] R. Debruyne and C. Bessière, 'Domain Filtering Consistencies', Journal of Artificial Intelligence Research, 14, 205–230, (2001).
[7] A. Dempster, N. Laird, and D. Rubin, 'Maximum Likelihood from Incomplete Data via the EM Algorithm', Journal of the Royal Statistical Society, 39(1), 1–38, (1977).
[8] H. El Sakkout, M. Wallace, and B. Richards, 'An Instance of Adaptive Constraint Propagation', in CP-96, pp. 164–178, (1996).
[9] S. Epstein, E. Freuder, R. Wallace, A. Morozov, and B. Samuels, 'The Adaptive Constraint Engine', in CP-2002, pp. 525–540, (2002).
[10] E. Freuder and R.J. Wallace, 'Selective relaxation for constraint satisfaction problems', in ICTAI-96, (1996).
[11] D. Grimes and R.J. Wallace, 'Sampling Strategies and Variable Selection in Weighted Degree Heuristics', in CP-2007, pp. 831–838, (2007).
[12] 1st International Workshop on Autonomous Search (in conjunction with CP-07), eds., Y. Hamadi, E. Monfroy, and F. Saubion, 2007.
[13] F. Laburthe and Ocre, 'Choco: implémentation du noyau d'un système de contraintes', in JNPC-00, pp. 151–165, (2000).
[14] D. Mehta and M.R.C. van Dongen, 'Probabilistic Consistency Boosts MAC and SAC', in IJCAI-2007, pp. 143–148, (2007).
[15] S. Minton, 'Automatically Configuring Constraint Satisfaction Programs: A Case Study', Constraints, 1(1/2), 7–43, (1996).
[16] P. Refalo, 'Impact-based search strategies for constraint programming', in CP-2004, pp. 556–571, (2004).
[17] C. Schulte and P.J. Stuckey, 'Speeding Up Constraint Propagation', in CP-2004, pp. 619–633, (2004).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-490
Near Admissible Algorithms for Multiobjective Search
Patrice Perny and Olivier Spanjaard¹
Abstract. In this paper, we propose near admissible multiobjective search algorithms to approximate, with performance guarantee, the set of Pareto optimal solution paths in a state space graph. The approximation of Pareto optimality relies on the use of an ε-dominance relation between vectors, significantly narrowing the set of non-dominated solutions. We establish the correctness of the proposed algorithms, and discuss computational complexity issues. We present numerical experiments showing that approximation significantly improves resolution times in multiobjective search problems.
1 INTRODUCTION
Heuristic search in state space graphs was initially considered in the framework of single objective optimization. The value of a path is defined as the sum of the costs of its arcs, and the problem amounts to finding one path with minimum cost among all paths from a source node to the goal. This problem is solved by constructive search algorithms like A* [6], which provide the optimal solution-path. In this case preferences are measured by a scalar cost function inducing a complete weak order over sub-paths. However, preferences are not always representable by a single criterion function. For example, in path planning problems for autonomous agents, the action allowing a transition from one state to another might have an impact in terms of time, distance, energy consumption etc., leading to different points of view, not necessarily reducible to a single overall cost [3]. More generally, multiobjective search is very useful in many applications requiring computer-aided problem solving (e.g., engineering design, preference-based configuration). This justifies the interest in search algorithms like MOA* [12], the multiobjective extension of A*, and its recent refinement by Mandow and Pérez-de-la-Cruz [8]. Besides these works on exact algorithms, several ε-admissible variations of the A* algorithm have been proposed in the literature (e.g. [11, 4]). These algorithms guarantee to find a solution that is within a factor of (1 + ε) of the best solution. They realize a compromise between time and space requirements on the one hand, and optimality of the returned solution on the other hand. These variations have proved to perform well, achieving a significant reduction of the number of iterations (up to 90% for ε = 0.1) on instances of the traveling salesman problem [11, 4]. Near admissible algorithms might also prove their efficiency in multiobjective search. At least, the introduction of tolerance thresholds in dominance concepts is worth investigating, with possibly a twofold benefit: not only might it simplify the search by increasing pruning possibilities, but it might also reduce the size of the potential output (the set of non-dominated elements). This latter point is crucial, as can be seen in the following example derived from Hansen [5].
¹ LIP6, Univ. Pierre and Marie Curie, 104 av. du Président Kennedy, 75016 Paris, France, email: firstname.lastname@lip6.fr. This work has been supported by the ANR project PHAC, which is gratefully acknowledged.
Example 1. Consider a simple biobjective state-space graph with a set N = {0, . . . , q} of nodes, 0 being the initial node and q being the goal node. At each node n ∈ N \ {q}, two actions a1, a2 are feasible: action a1 leads to node n + 1 with cost (2^n, 0), whereas action a2 leads to the same node with cost (0, 2^n). By construction, there exist 2^q distinct solution-paths from 0 to q in this graph, with costs (k, 2^q - 1 - k) for k = 0, . . . , 2^q - 1. For example, the sequence of q times action a1 yields a solution path with cost (2^q - 1, 0), whereas the sequence of q times action a2 yields a solution path with cost (0, 2^q - 1). In that graph, all the paths from 0 to q have the same sum of costs but distinct costs on the first objective (due to the uniqueness of the binary representation of an integer). The images of all these paths in the space of objectives are on the same line (orthogonal to vector (1, 1)), and therefore they are all Pareto-optimal. In such a family of instances, with q nodes and only 2 actions and 2 objectives, we can see that the number of Pareto-optimal paths grows exponentially with q. For instance, if q = 16 we have 65536 Pareto-optimal solution paths.

This example shows that the exact determination of the Pareto set might induce prohibitive computation times. Moreover, producing the entire list of Pareto optimal solutions is probably useless for the Decision Maker. In such cases, two approaches might be of interest: 1) focusing the search on a specific compromise solution; 2) approximating the Pareto set while keeping a good representation of the various possible tradeoffs in the Pareto set. The first approach requires additional preference information from the Decision Maker concerning, for example, the relative importance of criteria, the compensations allowed, and the type of compromise sought. When this information is not available, the second approach is particularly relevant. In this direction, several studies have been proposed, relying on the concept of ε-dominance introduced as an approximation of Pareto dominance in various multiobjective problems [14, 10, 2, 7, 1]. Despite the growing interest for these concepts, the potential of ε-relaxation of dominance concepts has not been investigated, to the best of our knowledge, in the framework of multiobjective search on implicit state space graphs. This is precisely the aim of this paper, which is organized as follows. In the first two sections, we recall some useful results. In Section 2 we introduce formal material for the approximation of the Pareto set. In Section 3 we provide a simple reformulation of a multiobjective search algorithm to determine the exact Pareto set, and we prove its pseudopolynomiality. Then, we show how to modify this algorithm to get more efficient and near admissible versions. Finally, we provide numerical experiments in the last section.
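A few lines of code (ours) verify the combinatorics of Example 1 for a small q:

from itertools import product

q = 4
costs = set()
for actions in product((1, 2), repeat=q):   # choose a1 or a2 at each node n
    c1 = sum(2**n for n, a in enumerate(actions) if a == 1)
    c2 = sum(2**n for n, a in enumerate(actions) if a == 2)
    costs.add((c1, c2))

assert len(costs) == 2**q                        # 2^q distinct cost vectors
assert all(x + y == 2**q - 1 for x, y in costs)  # equal sums: all images lie
                                                 # on one line orthogonal to
                                                 # (1,1), so all Pareto-optimal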
2 PARETO SET AND ITS APPROXIMATION
Considering a finite set of objectives {1, . . . , m}, any solution-path can be characterized by a cost-vector (c1, . . . , cm) ∈ Z^m_+ where ci represents the cost of the path with respect to objective i. Hence, the comparison of paths reduces to the comparison of their cost-vectors. The set of all cost-vectors attached to solution-paths is denoted X. We recall now some definitions linked to dominance concepts:

Definition 1. The weak Pareto dominance relation (≼p-dominance for short) on cost-vectors of Z^m_+ is defined by: x ≼p y ⟺ [∀i ∈ {1, . . . , m}, xi ≤ yi].

Thus x dominates y, which is denoted by x ≼p y, when x is at least as good as y with respect to all objectives. For any dominance relation ≼ defined on a set X, we will use the following definitions:

Definition 2. Any element x ∈ X is said to be ≼-optimal in X if, for all y ∈ X, y ≼ x ⇒ x ≼ y. If x is not ≼-optimal then it is said to be ≼-dominated.

Definition 3. A subset Y ⊆ X is said to be a ≼-covering of X if for all x ∈ X there exists y ∈ Y such that y ≼ x. Whenever no proper subset of Y is a ≼-covering of X, then Y is said to be a minimal ≼-covering of X.

The aim of multiobjective search is to find a ≼p-covering of the set of solution-paths. As shown in Example 1, such a set can be very large. This difficulty can be overcome by resorting to an approximate dominance concept called the ε-dominance relation [5, 14]:

Definition 4. The ε-dominance relation on cost-vectors of Z^m_+ is defined by: x ≼ε y ⟺ x ≼p (1 + ε)y.

As an illustration, consider the left part of Figure 1, concerning a bi-objective problem where every feasible solution is represented by a point x = (x1, x2) in the bi-objective space. Within this space, point p1 (resp. p2) ≼ε-dominates all the points within cone C1 (resp. C2). The notion of ≼ε-covering then arises naturally. Indeed, the set {p1, p2} is a 2-point ≼ε-covering of X since X ⊆ C1 ∪ C2. Note that a smaller ε yields a finer ≼ε-covering of X, as illustrated on the right part of Figure 1, where a 5-point ≼ε-covering of the same set X is given.

Figure 1. ≼ε-coverings for two values of ε. (Both panels plot the points of X in the (x1, x2) objective space; the left panel shows the cones C1 and C2 dominated by p1 and p2.)

Note that, for a given ε, several minimal ≼ε-covering subsets of different sizes exist. For example, consider X = {x, y, z} with x = (800, 950), y = (880, 880) and z = (950, 800), and set ε = 0.1. The set {x, z} is a ≼ε-covering subset of X since 800 ≤ 968 = (1 + 0.1) × 880 and 950 ≤ 968, and thereby x ≼ε y. Furthermore, neither x ≼ε z nor z ≼ε x, and therefore {x, z} is minimal. Note that {y} is also a minimal ≼ε-covering subset. On the one hand we have indeed y ≼ε x since 880 ≤ 880 = (1 + 0.1) × 800 and 880 ≤ 1045 = (1 + 0.1) × 950; on the other hand y ≼ε z for the same reasons. The very interest of ε-dominance lies in the following property: for any fixed number m > 1 of objectives, for any finite ε > 0 and any set X of bounded vectors x such that 1 ≤ xi ≤ M for all i ∈ {1, . . . , m}, there exists a ≼ε-covering subset of X the size of which is polynomial in log M and 1/(log(1 + ε)), see [10, 7]. This can simply be explained by considering a logarithmic scaling function ϕ : Z^m_+ → Z^m_+ on the objective space, defined as follows: ϕ(x) = (ϕ1(x), . . . , ϕm(x)) with ϕi(x) = ⌊log xi / log(1 + ε)⌋. For every component xi, it returns an integer k such that (1 + ε)^k ≤ xi < (1 + ε)^(k+1). Using ϕ we can define a ϕ-dominance relation:

Definition 5. The ϕ-dominance relation on cost-vectors of Z^m_+ is defined by: x ≼ϕ y ⟺ ϕ(x) ≼p ϕ(y).

This relation satisfies the following properties:

Proposition 1. For all vectors x, y, z ∈ Z^m_+, we have: (i) x ≼ϕ y and y ≼ϕ z ⇒ x ≼ϕ z (transitivity); (ii) x ≼ϕ y ⇒ x ≼ε y.

The symmetric part of ≼ϕ, defined by x ≡ϕ y if and only if ϕ(x) = ϕ(y), is therefore an equivalence relation (by transitivity). Clearly, by keeping one element of X for each equivalence class of ≡ϕ, one obtains a ≼ϕ-covering of X [10]. The left part of Figure 2 illustrates this point on the bi-objective example introduced for Figure 1. The dotted lines form a logarithmic grid in which each square represents an equivalence class for ≡ϕ. Hence the set of black points (one per non-empty square) represents a ≼ϕ-covering of all points. Interestingly enough, the resulting ≼ϕ-covering is also a ≼ε-covering by Proposition 1 (ii). Moreover, the size of this ≼ε-covering is upper bounded by the number of equivalence classes of relation ≡ϕ, which is not greater than (1 + ⌈log M / log(1 + ε)⌉)^m [10]. A refined ≼ϕ-covering (which is also a ≼ε-covering) can easily be derived by removing ≼ϕ-dominated elements (we keep only the black points on the right part of Figure 2), which improves the bound to (1 + ⌈log M / log(1 + ε)⌉)^(m-1), see [7]. Coming back to Example 1 with q = 16, a ≼p-covering requires 65536 solution-paths, whereas a ≼ε-covering of this set constructed with ϕ as indicated above (for ε = 0.1) contains at most ⌊log 65536 / log 1.1⌋ + 1 = 117 elements. More generally, it is important to note that, for fixed values of ε and m, the size of the ≼ε-covering grows only polynomially with the size of the instance, even when the Pareto set grows exponentially. In addition, if a set Y ⊆ X is a ≼ε-covering of X, we know (by Definition 3) that any feasible tradeoff achieved in X is approximated with performance guarantee, i.e. it is ≼ε-dominated by at least one element in Y. This enables a more concise and yet representative description of the possible tradeoffs in the Pareto set. The question is whether a ≼ε-covering is computable in polynomial time or not.

Figure 2. Logarithmic grid. (Both panels plot the points of X in the (x1, x2) objective space, with grid lines at 1, (1 + ε), (1 + ε)^2, (1 + ε)^3, (1 + ε)^4 on each axis.)
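The definitions above translate directly into code; the following sketch (ours, assuming integer cost vectors with components in [1, M]) implements the dominance tests and the grid-based covering:

import math

def pareto_dom(x, y):    # x ≼p y
    return all(xi <= yi for xi, yi in zip(x, y))

def eps_dom(x, y, eps):  # x ≼ε y  iff  x ≼p (1 + eps)·y
    return all(xi <= (1 + eps) * yi for xi, yi in zip(x, y))

def phi(x, eps):         # logarithmic scaling: the grid cell of x
    return tuple(math.floor(math.log(xi) / math.log(1 + eps)) for xi in x)

def phi_covering(X, eps):
    # Keep one representative per non-empty grid cell; by Proposition 1 (ii)
    # the result is also a ≼ε-covering of X.
    cells = {}
    for x in X:
        cells.setdefault(phi(x, eps), x)
    return list(cells.values())

# The minimal-covering example above: {y} alone ≼ε-covers X for ε = 0.1.
X = [(800, 950), (880, 880), (950, 800)]
assert all(eps_dom((880, 880), x, 0.1) for x in X)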
3 MULTIOBJECTIVE SEARCH ALGORITHM
We now present a multiobjective extension of A* (a reformulation of the label-expanding version of MOA* [8]), and we prove its pseudopolynomiality, which is directly related to that of the size of the Pareto set. In the following sections, we will then use a logarithmic grid to derive near-admissible algorithms: a first one whose complexity is polynomial in the number of states, and a second one more efficient in practice in spite of a higher theoretical complexity. To our knowledge, this is the first attempt to devise near admissible algorithms for multiobjective search in implicit graphs (the existing near admissible multiobjective algorithms work on explicit graphs). The A* algorithm and its multiobjective extensions explore a state space graph G = (N, A) where N is a finite set of nodes (possible states), and A is a set of arcs representing transitions. Formally, we have A = {(n, n') : n ∈ N, n' ∈ S(n)}, where S(n) ⊆ N is the set of all successors of node n. A cost-vector v(n, n') is attached to each arc (n, n') ∈ A, and the cost-vector of a path P is defined by v(P) = Σ_{(n,n')∈P} v(n, n'). In the sequel, we assume that v(P) ∈ [1, M] for every solution path, where M is a known constant. Then s ∈ N denotes the source of the graph (the initial state), Γ ⊆ N the subset of goal nodes, P(s, Γ) the set of all paths from s to a goal node γ ∈ Γ (solution-paths), and P(n, n') the set of all paths linking n to n', characterized by a list <n, . . . , n'> of nodes. Unlike the scalar case, there possibly exist several ≼p-optimal paths with distinct cost-vectors to reach a given node in a multiobjective problem. Hence, one expands labels ℓ = [nℓ, Pℓ, gℓ] (attached to subpaths) rather than nodes, where nℓ indicates the labeled node, Pℓ the corresponding subpath in P(s, nℓ), and gℓ the cost-vector of Pℓ. As in A*, the set of generated labels is divided into two disjoint sets: a set OPEN of not yet expanded labels and a set CLOSED of already expanded labels. Besides, the ≼p-optimal expanded labels in {ℓ : nℓ ∈ Γ} are stored in a set SOL. Since a node n may be on the path of more than one ≼p-optimal solution, a set H(n) of heuristic cost-vectors is given for each node n, estimating the set {v(P) : P ∈ P(n, Γ)}. For each generated label ℓ0, a set F(ℓ0) of evaluation vectors is computed from all possible combinations {gℓ0 + h, h ∈ H(nℓ0)}. It estimates the set of ≼p-optimal values of solution-paths extending Pℓ0. Initially, OPEN contains only label [s, <s>, 0], while CLOSED and SOL are empty. At each subsequent step, one expands a label ℓ* in OPEN such that F(ℓ*) contains at least one ≼p-optimal vector in ∪_{ℓ∈OPEN} F(ℓ). The process is kept running until OPEN becomes empty. Two pruning rules are used:

Rule R1: discard label ℓ when there exists ℓ' ∈ OPEN ∪ CLOSED s.t. nℓ' = nℓ and gℓ' ≼p gℓ.
Rule R2: discard label ℓ when ∀f ∈ F(ℓ), ∃ℓ' ∈ SOL s.t. gℓ' ≼p f.

These rules ensure the generation of all ≼p-optimal paths in P(s, Γ) provided heuristic H is admissible, i.e. ∀n ∈ N, ∀P ∈ P(n, Γ), ∃h ∈ H(n) s.t. h ≼p v(P). The algorithm is outlined below:

MULTIOBJECTIVE SEARCH ALGORITHM (MOA*)
Input: G, OPEN, CLOSED, SOL
while OPEN ≠ ∅
01   move a label ℓ* from OPEN to CLOSED
02   if nℓ* ∈ Γ
03     then UPDATE(SOL, ∅, ℓ*)
04     else for each node n' ∈ S(nℓ*) do
05       create ℓ0 = [n', <Pℓ*, n'>, gℓ* + v(nℓ*, n')]
06       if ∃f0 ∈ F(ℓ0) s.t. ∀ℓ ∈ SOL not(gℓ ≼p f0)
07         then UPDATE(OPEN(n'), CLOSED(n'), ℓ0)
08         else discard ℓ0
Output: SOL
This algorithm calls procedure UPDATE, which applies to L1, a list of open labels, and L2, a list of closed labels. It possibly updates list L1 with label ℓ as follows:

UPDATE(L1, L2, ℓ)
01   if ∀ℓ' ∈ L1 ∪ L2 not(gℓ' ≼p gℓ) then L1 ← L1 ∪ {ℓ}
02   remove ≼p-dominated labels from L1

We now show that this multiobjective search algorithm is pseudopolynomial for integer costs (for a fixed number m of objectives), with the following worst case complexity analysis. The "while" loop in the main procedure is iterated at most |N|(M + 1)^m times, since this is the maximum number of distinct labels. Indeed there are |N| nodes, and for each of them the number of different cost vectors is upper bounded by (M + 1)^m. Furthermore, at each iteration of the loop the main computational cost is due to line 06, which requires binary comparisons of labels from F(ℓ0) and SOL. With a naive method, this represents (M + 1)^(2m) comparisons. Hence the algorithm executes less than |N|(M + 1)^m loops of cost (M + 1)^(2m). Therefore the overall complexity is within O(|N|^2 M^(3m)).
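As a companion sketch (ours; it reuses pareto_dom from the earlier sketch and stores labels as [node, path, g] lists, following the text), procedure UPDATE can be written as:

def update(open_labels, closed_labels, label):
    g = label[2]
    # Line 01: insert only if no open or closed label ≼p-dominates it.
    if any(pareto_dom(l[2], g) for l in open_labels + closed_labels):
        return
    # Line 02: remove open labels that the new label ≼p-dominates.
    open_labels[:] = [l for l in open_labels if not pareto_dom(g, l[2])]
    open_labels.append(label)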
4 APPROXIMATION ALGORITHMS
We consider now two ways of relaxing the exact version of the multiobjective search algorithm so as to get better efficiency, either by modification of R1 or of R2 (both modifications cannot be performed together without losing the performance guarantee).
4.1 An FPTAS for multiobjective search
In this subsection, we assume that a finite upper bound L on the lengths (numbers of arcs) of all solution-paths in P(n, Γ) is known. Under this assumption, we provide a Fully Polynomial Time Approximation Scheme (FPTAS) for computing an approximation of the Pareto set. For simplicity, we assume throughout this section that the input is a finite graph on |N| nodes. Several FPTASs to compute ≼ε-coverings in multiobjective shortest path problems (MSP) have been proposed in the literature; that is, algorithms that, given an encoding of the graph and an accuracy level ε > 0, yield a ≼ε-covering in time and space bounded by a polynomial in |N| and 1/ε. Hansen [5] and Warburton [14] have proposed methods combining rounding and scaling techniques (i.e., approximating data elements before the execution of an algorithm) with pseudopolynomial exact algorithms (i.e., algorithms that operate in time and space bounded by a polynomial in |N| and the largest data element), in order to keep the size of the auxiliary data computed during the execution polynomially bounded. These methods are particular to biobjective problems and acyclic graphs respectively. Another algorithm is due to Papadimitriou and Yannakakis [10]. It is less specific to MSP, and its interest resides mainly in its generality: it proceeds by computing one solution (if it exists) inside every box of the logarithmic grid of Section 2. The authors show that this can be performed polynomially for a problem A if there is a pseudopolynomial algorithm for the exact version of A (given an instance of A and an integer B, is there a feasible solution with cost exactly B?). Finally, Tsaggouris and Zaroliagis [13] have recently proposed an FPTAS based on a generalized Bellman-Ford algorithm. Except for [10], all the other approaches rely on dynamic programming. We now show how to obtain an FPTAS by applying trimming techniques to the multiobjective search algorithm. The idea is to keep the number of possible labels at each node polynomially bounded, by using a logarithmic grid. Nevertheless, it is not possible to work directly with ≼ϕ in place of ≼p within procedure UPDATE, because we might exceed the desired error threshold (1 + ε) due to error propagation, as shown in the following example.
Example 2. Consider the graph with nodes {s, n, n′, γ} and costs: v(s, n) = (2, 2), v(s, n′) = (1, 1.1), v(n′, n) = (0.9, 1), v(n, γ) = (1, 1), v(n′, γ) = (2.3, 1.8). We set ε = 0.1. We get two labels at node n, ℓ1 = [n, ⟨s, n⟩, (2, 2)] and ℓ2 = [n, ⟨s, n′, n⟩, (1.9, 2.1)]. Since ℓ1 ≼_ε ℓ2, assume that ℓ2 is discarded. We get two labels ℓ3 = [γ, ⟨s, n′, γ⟩, (3.3, 2.9)] and ℓ4 = [γ, ⟨s, n, γ⟩, (3, 3)] at γ. At this point ℓ4 might be discarded since ℓ3 ≼_ε ℓ4. In this case the unique returned solution-path would be ⟨s, n′, γ⟩ with cost (3.3, 2.9). However, it is clear that path ⟨s, n′, n, γ⟩ with cost (2.9, 3.1) is not ε-covered by (3.3, 2.9). Actually we have (3.3, 2.9) ≼_p 1.1·(3, 3) and (3, 3) ≼_p 1.1·(2.9, 3.1), but not (3.3, 2.9) ≼_p 1.1·(2.9, 3.1); we only have (3.3, 2.9) ≼_p 1.1²·(2.9, 3.1).

This example suggests a possible solution relying on the assumption that solution-paths contain at most L arcs: we might replace (1 + ε) by (1 + ε)^(1/L) so as to remain below (1 + ε) under propagation of errors. This idea is implemented in the following revised pruning rule.

Rule R1′: discard label ℓ when there exists ℓ′ ∈ OPEN ∪ CLOSED s.t. n_ℓ′ = n_ℓ and ψ(g_ℓ′) ≼_p ψ(g_ℓ), where ψ : Z₊^m → Z₊^m is a logarithmic scaling function on the objective space, defined as follows:
ψ(x) = (ψ1(x), ..., ψm(x)), with ψi(x) = ⌊log x_i / log (1 + ε)^(1/L)⌋

This leads to replacing procedure UPDATE by:

ψ-UPDATE(L1, L2, ℓ)
01 If ∀ℓ′ ∈ L1 ∪ L2, not(ψ(g_ℓ′) ≼_p ψ(g_ℓ))
02 then L1 ← L1 ∪ {ℓ}
03 Remove ψ-dominated labels from L1

With ψ-UPDATE the multiobjective search algorithm becomes polynomial in |N| and 1/ε, provided 1 ≤ M ≤ 2^p(|N|) where p denotes some polynomial. Indeed, the cost of every solution-path on the logarithmic scale is upper bounded by ⌊log M / log (1 + ε)^(1/L)⌋ ∈ O(L log M / ε). Hence, the global complexity of the algorithm becomes O(|N|² (L log M / ε)^(3m)). Since L ≤ |N| and log M ∈ O(p(|N|)), it is within O((p′(|N|)/ε)^(3m)) for some polynomial p′, and is therefore polynomial in 1/ε and |N|. Now, it remains to show that this version of the algorithm yields an ε-covering subset of the solution-paths. To this end we state the following propositions:

Proposition 2. For all i ∈ {1, ..., L}, ∀x, y, z ∈ X, the following monotonicity property holds: x ≼_p (1 + ε)^(i/L) y ⇒ (x + z) ≼_p (1 + ε)^(i/L) (y + z).

Note that this monotonicity property does not hold for the ψ-dominance relation induced by ψ(x) ≼_p ψ(y).

Proposition 3. Let P ∈ P(s, Γ). At any time before termination, if ∀ℓ ∈ SOL not(g_ℓ ≼_ε v(P)), then there exists ℓ ∈ OPEN and a solution-path P′ extending P_ℓ such that v(P′) ≼_ε v(P).

Proof. Consider a solution-path P = ⟨s, n1, ..., nk⟩ ∈ Γ. By contraposition, assuming that for all ℓ ∈ OPEN no solution-path P′ extending P_ℓ is such that v(P′) ≼_ε v(P), we show that there exists a label ℓ ∈ SOL for which g_ℓ ≼_ε v(P). For that purpose, we exhibit a finite sequence (ℓ_i) of closed labels generated during the search such that g_ℓi ≼_p (1 + ε)^(i/L) v(P_i) (1), where P_i = ⟨s, n1, ..., n_i⟩. We proceed as follows: for i = 0, we set ℓ_0 = [s, ⟨s⟩, 0] and we clearly have g_ℓ0 ≼_p (1 + ε)^(0/L) v(P_0). Inductively, assume now that labels ℓ_0, ..., ℓ_j have been generated and closed (j < k), such that Equation (1) holds for i = 0, ..., j. Let ℓ = [n_(j+1), ⟨P_ℓj, n_(j+1)⟩, g_ℓj + v(n_j, n_(j+1))] be the label of the path from s to n_(j+1) extending P_ℓj. This label has been generated since ℓ_j has been expanded and n_(j+1) ∈ S(n_j). There are two cases:

Case 1. If ℓ ∈ CLOSED, then we set ℓ_(j+1) = ℓ.

Case 2. If ℓ ∉ CLOSED, we cannot have ℓ ∈ OPEN since it would contradict the initial assumption. Indeed, consider solution-path P′ = ⟨P_ℓj, n_(j+1), ..., n_k⟩. We would have: v(P′) = g_ℓj + v(n_j, ..., n_k) ≼_p (1 + ε)^(j/L) v(P_j) + v(n_j, ..., n_k) ≼_p (1 + ε) v(P_j) + (1 + ε) v(n_j, ..., n_k) ≼_p (1 + ε) v(P). Hence, ℓ has been generated, but ℓ ∉ OPEN ∪ CLOSED. Therefore ℓ has been discarded using pruning rule R1′ or R2′:

Case 2.1. If ℓ is discarded by R1′, then there exists ℓ′ ∈ OPEN ∪ CLOSED such that n_ℓ′ = n_ℓ and ψ(g_ℓ′) ≼_p ψ(g_ℓ), which implies g_ℓ′ ≼_p (1 + ε)^(1/L) g_ℓ. We have g_ℓ′ ≼_p (1 + ε)^(1/L) g_ℓ = (1 + ε)^(1/L) (g_ℓj + v(n_j, n_(j+1))) ≼_p (1 + ε)^(1/L) ((1 + ε)^(j/L) v(P_j) + v(n_j, n_(j+1))) ≼_p (1 + ε)^((j+1)/L) v(P_(j+1)). Moreover, by the same reasoning as above with P′ = ⟨P_ℓ′, n_(j+2), ..., n_k⟩, we have v(P′) ≼_ε v(P), and therefore ℓ′ cannot be in OPEN. Hence, ℓ′ ∈ CLOSED and we set ℓ_(j+1) = ℓ′.

Case 2.2. If R2′ prunes ℓ, then the sequence (ℓ_i) is stopped. Whenever case 2.2 stops the sequence ℓ_0, ..., ℓ_j by discarding label ℓ, then for all f ∈ F(ℓ) there exists ℓ′ ∈ SOL s.t. g_ℓ′ ≼_p f (2). Moreover, there exists f ∈ F(ℓ) such that f ≼_ε v(P), as we now show. By admissibility of H, there exists h ∈ H(n_(j+1)) such that h ≼_p v(n_(j+1), ..., n_k). Then, there exists f = g_ℓ + h ∈ F(ℓ) such that f ≼_p g_ℓ + v(n_(j+1), ..., n_k) = g_ℓj + v(n_j, ..., n_k) ≼_p (1 + ε)^(j/L) v(P_j) + v(n_j, ..., n_k) ≼_p (1 + ε) v(P_j) + (1 + ε) v(n_j, ..., n_k) = (1 + ε) v(P). Hence f ≼_p (1 + ε) v(P) (3). From (2) and (3), we get g_ℓ′ ≼_ε v(P) (by transitivity of ≼_p). Whenever case 2.2 does not occur, the sequence continues until j = k. Once label ℓ_k has been expanded at n_k, solution-path P_k has been discovered and SOL includes a label ℓ such that g_ℓ ≼_ε v(P). In all cases, the existence of ℓ ∈ SOL with g_ℓ ≼_ε v(P) is proved. □

From this proposition, it follows that the algorithm cannot terminate as long as the solution-paths stored in SOL do not constitute an ε-covering of P(s, Γ): otherwise OPEN would be nonempty, which contradicts the termination of the algorithm. We can therefore conclude that the algorithm returns an ε-covering subset of solution-paths. Note that this technique is mainly of theoretical interest, since the complexity is quadratic in the number |N| of states, and |N| is usually exponential in the depth of the search. We therefore propose below a simpler technique that also guarantees the approximation and is more efficient in practice, despite a possibly higher worst-case complexity.
4.2
A near admissible version of MOA∗
We now present the MOA*_ε algorithm, which returns an ε-covering of solution-paths without requiring knowledge of an upper bound on the number of nodes that can be expanded. The basic features of the algorithm are essentially the same as in MOA*. The main difference lies in the following pruning rule, which uses ε-dominance:

Rule R2′: discard label ℓ when ∀f ∈ F(ℓ), ∃ℓ′ ∈ SOL s.t. g_ℓ′ ≼_ε f.

This rule allows an early elimination of uninteresting labels while keeping near admissibility of the algorithm, provided heuristic H is admissible. Indeed, if H is admissible, then for all f* ∈ F*(ℓ) there exists f ∈ F(ℓ) such that f = g_ℓ + h ≼_p g_ℓ + h* = f*. Hence g_ℓ′ ≼_ε f implies that g_ℓ′ ≼_ε f*. This pruning rule can be inserted in the multiobjective search algorithm by substituting line 06 with:

06 if ∃f0 ∈ F(ℓ0) s.t. not(f(ℓ) ≼_ε f0) ∀ℓ ∈ SOL

Although MOA*_ε does not provide complexity guarantees, it significantly outperforms the exact version.
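A minimal sketch of the ε-dominance test behind rule R2′, assuming cost vectors are plain tuples and SOL is a collection of solution costs (names are ours):

```python
from typing import Iterable, Sequence, Tuple

Cost = Tuple[float, ...]

def eps_dominates(g: Cost, f: Cost, eps: float) -> bool:
    """g <=_eps f iff g <=_P (1+eps)*f componentwise."""
    return all(gi <= (1.0 + eps) * fi for gi, fi in zip(g, f))

def r2_prunes(f_values: Iterable[Cost], sol_costs: Sequence[Cost],
              eps: float) -> bool:
    """Rule R2': discard a label when every f in F(l) is eps-dominated
    by the cost vector of some solution already stored in SOL."""
    return all(any(eps_dominates(g, f, eps) for g in sol_costs)
               for f in f_values)
```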
Remark 1. The following weaker relaxation of the pruning condition in R2′ can be used in the FPTAS:

06 if ∃f0 ∈ F(ℓ0) s.t. not(f(ℓ) ≼_p (1 + ε)^(k/L) f0) ∀ℓ ∈ SOL

where k is an upper bound on the length of the longest path from n0 to a goal. This is the case in the implemented version.
5
NUMERICAL EXPERIMENTS
To investigate the potential of approximation, we tested our algorithms on two multiobjective combinatorial problems.

Biobjective binary knapsack problem. Given a set {1, ..., n} of items j, each item having a weight w_j and a profit p_ij according to every objective i, one searches for a minimal ≼_p-covering of the combinations of items that can be put into a knapsack of capacity b (i.e., the total weight of the items cannot exceed b):

max Σ_(j=1)^n p_1j x_j,  max Σ_(j=1)^n p_2j x_j
subject to Σ_(j=1)^n w_j x_j ≤ b,  x_j ∈ {0, 1} ∀j ∈ {1, ..., n}

where x_j = 1 iff one chooses to put item j in the knapsack. The state space has been defined so that all solution-paths share the same length n. This makes it possible to apply the FPTAS with L = n. The heuristic evaluations used to order and prune the search derive from the upper bound of Martello and Toth [9] for the single-objective version. The MOA*, FPTAS and MOA*_ε algorithms have been implemented in JAVA and were run on a Pentium 4 3.60GHz PC. Table 1 shows the computation times (in sec) obtained on 35 random instances of size n, with profits and weights randomly drawn in [1, 100], and a capacity b set to 50% of the total weight of the items. These results show that the relaxation of the optimality condition significantly speeds up the search, with faster results when using MOA*_ε. We have also studied the behavior of MOA*_ε when setting, for all j, p_1j = 2j, p_2j = 2n − 2j and w_j = 1, which yields instances where all combinations of b items are non-dominated, with distinct profits on the first objective. In the first line of Table 2, we indicate the execution times of MOA*, and in the second line the number #sol of non-dominated solutions (which grows exponentially with n). Both approximation algorithms return an ε-covering in less than one second for all ε in {0.005, 0.01, 0.05}. For each value of ε we give the size of the returned ε-covering. It shows that the choice of ε allows the size of the output set to be controlled, as well as the computation times.

n              30      40      50      60      70      80
MOA* time      0.397   1.879   11.31   43.66   215.2   457.7
FPTAS
  ε = 0.005    0.353   1.514   7.922   29.90   127.8   226.5
  ε = 0.01     0.297   1.077   4.842   18.29   65.91   97.26
  ε = 0.05     0.046   0.036   0.065   0.331   0.555   0.393
  ε = 0.1      0.003   0.001   0.001   0.002   0.001   0.001
MOA*_ε
  ε = 0.005    0.315   0.940   4.225   18.58   62.37   110.4
  ε = 0.01     0.179   0.364   1.389   9.294   19.75   35.11
  ε = 0.05     0.008   0.007   0.013   0.064   0.065   0.075
  ε = 0.1      0.001   0.001   0.001   0.001   0.001   0.001

Table 1. Numerical results on the biobjective knapsack.

n              15      16      17      18      19      20
MOA* time      0.495   3.328   1.895   44.08   17.55   711.9
#sol           6·10³   1·10⁴   2·10⁴   5·10⁴   9·10⁴   2·10⁵
ε = 0.005      31      28      28      25      25      22
ε = 0.01       16      14      15      13      13      11
ε = 0.05       3       3       4       3       3       3

Table 2. Pareto approximation on pathological instances.
Multiobjective shortest path problem. In order to study the interest of approximation when the number of objectives grows, we performed experiments with MOA*_ε on the multiobjective shortest path problem. We generated different classes of instances by controlling the number of nodes |N| = 1000, 2000, 3000 and the number of objectives m = 2, 5, 10. Costs of arcs are randomly generated within [1, 100]. The approximations have been computed with ε = 0.1. Table 3 gives, for each class of instances, the average execution time (in sec) obtained on 20 different instances. These performances illustrate that approximation remains powerful when the number of objectives grows. As a comparison, the exact determination (with MOA*) of the Pareto set on instances with 1000 nodes and 10 criteria required more than one hour on the same computer.
|N|       1000    2000    3000
2 obj     0.078   0.295   0.751
5 obj     0.175   0.761   1.901
10 obj    0.447   2.474   7.268

Table 3. Times for MOA*_0.1 on the shortest path problem.

6
CONCLUSION
We have proposed two approximation algorithms for multiobjective search. The first one is an FPTAS, which requires that an upper bound on the length of solution-paths is known, while the second one provides no guarantee on the worst-case complexity but performs better in practice, without requiring any information on the length of solution-paths. Both algorithms outperform exact multiobjective search in computation time. Note that the approximate Pareto set can include dominated solutions (although close to optimality). An interesting research direction is therefore to look for algorithms able to compute approximate Pareto sets including only non-dominated solutions. Another possible extension of this work is to study the use of ε-dominance to approximate more involved preference models.
REFERENCES
[1] E. Angel, E. Bampis, and A. Kononov, 'On the approximate tradeoff for bicriteria batching and parallel machine scheduling problems', Theor. Comput. Sci., 306(1-3), 319–338, (2003).
[2] T. Erlebach, H. Kellerer, and U. Pferschy, 'Approximating multiobjective knapsack problems', Manag. Science, 48(12), 1603–1612, (2002).
[3] K. Fujimura, 'Path planning with multiple objectives', IEEE Robotics and Automation Magazine, 3(1), 33–38, (1996).
[4] M. Ghallab, 'Aε: an efficient near admissible heuristic search algorithm', in Proc. of the 8th IJCAI, pp. 789–791, (1983).
[5] P. Hansen, 'Bicriterion path problems', in Multicriteria Decision Making, eds., G. Fandel and T. Gal, (1980).
[6] P.E. Hart, N.J. Nilsson, and B. Raphael, 'A formal basis for the heuristic determination of minimum cost paths', IEEE Trans. Syst. and Cyb., SSC-4(2), 100–107, (1968).
[7] M. Laumanns, L. Thiele, K. Deb, and E. Zitzler, 'Combining convergence and diversity in evolutionary multiobjective optimization', Evolutionary Computation, 10(3), 263–282, (2002).
[8] L. Mandow and J.-L. Pérez-de-la-Cruz, 'A new approach to multiobjective A* search', in Proc. of the 19th IJCAI, pp. 218–223, (2005).
[9] S. Martello and P. Toth, 'An upper bound for the zero-one knapsack problem and a branch and bound algorithm', European J. of Operational Research, 1, 169–175, (1975).
[10] C.H. Papadimitriou and M. Yannakakis, 'On the approximability of trade-offs and optimal access of web sources', in Proc. of the 41st IEEE Symp. on FOCS, pp. 86–92, (2000).
[11] J. Pearl and J.H. Kim, 'Studies in semi-admissible heuristics', IEEE Trans. on PAMI, 4(4), 392–400, (1982).
[12] B.S. Stewart and C.C. White III, 'Multiobjective A*', J. of the Association for Computing Machinery, 38(4), 775–814, (1991).
[13] G. Tsaggouris and C. Zaroliagis, 'Multiobjective optimization: Improved FPTAS for shortest paths and non-linear objectives with applications', in Proc. of the 17th ISAAC, pp. 389–398, (2006).
[14] A. Warburton, 'Approximation of Pareto optima in multiple-objective shortest-path problems', Operations Research, 35(1), 70–79, (1987).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-495
495
Compressing Pattern Databases with Learning

Mehdi Samadi¹ and Maryam Siabani² and Ariel Felner³ and Robert Holte¹

Abstract. A pattern database (PDB) is a heuristic function implemented as a lookup table. It stores the lengths of optimal solutions for instances of subproblems. Most previous PDBs had a distinct entry in the table for each subproblem instance. In this paper we apply learning techniques to compress PDBs using neural networks and decision trees, thereby reducing the amount of memory needed. Experiments on the sliding tile puzzles and the TopSpin puzzle show that our compressed PDBs significantly outperform both uncompressed PDBs and previous compression methods. Our full compression system reduced the amount of memory needed by a factor of up to 63 at a cost of no more than a factor of 2 in the search effort.
1 Introduction and Overview

States in a search space are often represented using a set of state variables. An abstraction of the search space, called the pattern space, can be defined by considering only a subset of the state variables (called the pattern variables). A pattern is a state of the pattern space, i.e., an assignment of values to the pattern variables. A state s in the original space is mapped to a pattern by ignoring the state variables in s that are not pattern variables. A pattern database (PDB) stores the distance of each pattern to the goal pattern. The value stored in the PDB for the pattern of a state s is a lower bound on the distance from s to the goal state, and thus serves as an admissible heuristic for searching in the original search space. A PDB contains one entry for each pattern in pattern space. In general, the more entries a PDB contains, the more accurate it is as a heuristic, and the more efficient the search that uses the PDB as a heuristic. The drawback of large PDBs is the amount of memory they consume. One approach to mitigating this problem is to compress the PDB. For example, Felner et al. [3] compress a PDB by simply merging several highly correlated (usually adjacent) entries into one. They achieved a significant improvement on the 4-peg Towers of Hanoi and the TopSpin problems, but only limited success for the sliding tile puzzles. The main drawback of that work is that the rule for deciding which PDB entries to merge was fixed throughout the entire compression process, and higher degrees of compression significantly degrade performance. We introduce a new, general and flexible compression method for PDBs that is experimentally shown to improve on uncompressed PDBs as well as on the compression methods reported in [3]. Improvement takes the form of either reducing the amount of memory required for the PDB without substantially increasing the number of generated nodes, or reducing both the memory required and the number of generated nodes.
¹ Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8, {msamadi,holte}@cs.ualberta.ca
² Electrical and Computer Engineering Department, Isfahan University of Technology, Isfahan, Iran, siabani@ec.iut.ac.ir
³ Information Systems Engineering Dept., Deutsche Telekom Labs, Ben Gurion University, Beer-Sheva, Israel, felner@bgu.ac.il
The main idea underlying our work is to use techniques from the machine learning literature to compress PDBs. In particular, we train an artificial neural network (ANN) so that it can be used instead of the PDB. The neural network requires almost no memory. However, since the ANN's output is not guaranteed to be less than or equal to the PDB value (i.e., admissible), we use additional storage (in the form of a hash table) for all the patterns whose value is overestimated by the ANN. This basic idea is then improved in two steps: decision trees and a PDB-partitioning method are used to separate the PDB entries into smaller subgroups with similar characteristics, and separate ANNs are then trained for each subgroup. We tested our compression system on three search spaces: the 15-puzzle, the 24-puzzle and TopSpin. Our results show that our full compression system requires up to 63 times less memory than the original PDB while increasing the number of nodes generated by no more than a factor of two. The modest increase in search effort is not a concern because the freed-up memory can be used in ways that are known to substantially speed up search, e.g., for additional PDBs [6], and/or for memory-based search algorithms such as A*, perimeter search or memory-enhanced IDA*. We do not actually implement any of these techniques in this paper, but are confident that they would more than compensate for the small increase in search effort caused by our compression technique.
2 Related Work

Symbolic PDBs [1] use binary decision diagrams (BDDs) to store a PDB, and have been shown, for some search spaces, to significantly reduce the memory needed to store the PDB entries compared to traditional PDB tables. However, a recent unpublished study of symbolic PDBs on a wide range of search spaces has shown that symbolic PDBs do not always result in compression; they sometimes require more memory than a table. In particular, symbolic PDBs for the 15-puzzle require more memory than the traditional PDB representation, whereas the experiments below show that our method greatly reduces the memory required. The idea of using learning and classification techniques in heuristic search has been suggested before. In [9], a feature vector was used to partition the state space into a number of classes, and learning techniques were applied to each class. In [10] the state space was partitioned based on a feature vector, and "generalized heuristic information" was then learned for each class. These ideas were only applied to small domains and, in contrast to our approach, did not find the optimal solution. Recently, a multi-layer ANN was used to represent heuristics for the 15-puzzle [2]. Given a training set of state descriptions together with their optimal solutions, a learning system that predicts the length of the optimal solution for an arbitrary state was built. They biased their predicted values towards admissibility but, unlike our approach, their system returned suboptimal solutions in about 50% of the cases.
Figure 1. The TopSpin and sliding tile puzzles.
3 Search domains

The sliding tile puzzles, such as the 15- and 24-puzzles (shown in Figure 1), have been used as benchmark problems in many previous papers. For clarity, we describe all our methods in the context of the sliding tile puzzle, but our ideas are general and can be applied to other problems as well. In our representation of the sliding tile puzzle, the variables are the tiles and the values are their locations. The best existing method for solving the sliding tile puzzles optimally uses additive PDBs [4]. The tiles are partitioned into disjoint sets, and a PDB is built for each set. The PDB stores the cost of moving the tiles in the pattern set from any given arrangement to their goal positions, counting only the moves of the pattern tiles. Under such circumstances the sum of the values from different disjoint PDBs is an admissible heuristic [4]. We use the notation x-y-z to denote a partitioning of the tiles into three disjoint groups with x, y and z tiles in each group, respectively. The N-TopSpin puzzle has N tokens arranged in a ring. Any set of 4 consecutive tokens can be reversed (rotated 180 degrees in the physical puzzle). Our encoding of this puzzle has N operators, one for each possible reversal. In TopSpin more than one object is moved in each move, so simple additive PDBs are not applicable here. The standard way to build a PDB for this domain is to specify a set of pattern tokens and to treat the remaining tokens as if they were indistinguishable from one another⁴.
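To make the encoding concrete, here is a minimal Python sketch of the TopSpin successor function (the function name is ours; a sketch under our assumptions, not the paper's implementation):

```python
def topspin_successors(state, k=4):
    """All successors of an N-TopSpin state: one operator per position,
    reversing a window of k consecutive tokens (the ring wraps around)."""
    n = len(state)
    succs = []
    for i in range(n):
        idx = [(i + j) % n for j in range(k)]   # window positions, cyclic
        s = list(state)
        vals = [state[j] for j in idx]
        for j, v in zip(idx, reversed(vals)):   # write the window reversed
            s[j] = v
        succs.append(tuple(s))
    return succs

# Example: reversing the first window of (1,2,3,4,5) yields (4,3,2,1,5).
assert (4, 3, 2, 1, 5) in topspin_successors((1, 2, 3, 4, 5))
```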
4 Augmented compression

We introduce a method that compresses PDBs using learning techniques while preserving the admissibility property. Our system includes three independent steps, ANN learning, decision tree classification, and pattern partitioning, and we describe each of them in turn.
4.1 Compression with ANN

Our first idea is to build an ANN that learns the PDB. Assume a PDB for the tile puzzle built over a set of tiles T. The different patterns are the different ways to place the tiles of T in the state space. Each pattern p has an entry PDB(p) which stores its heuristic value. We want to build a learning system that is able to predict PDB(p) for each given pattern p. For this we use Artificial Neural Networks (ANNs) [8], a well-known learning technique. A Multi-Layer Perceptron (MLP) network with the standard modified back-propagation algorithm [8] is used for the prediction. This system is called the basic ANN compression in this paper.
4.1.1 Feature selection

We use two types of features for the ANN:
1) Description of the pattern: each tile in T is a feature, and its position in pattern p is the value for that feature.

⁴ Since this puzzle is cyclic, we can assume that token number 1 is always in a fixed position. Thus, for implementation, the total number of states can be reduced by a factor of N.
2) Heuristic vector: we also construct K smaller PDBs, each for a subset of tiles T_i ⊂ T. We denote the corresponding PDB heuristic for a given pattern p by h_i(p). Note that each h_i(p) is admissible for p. We define the heuristic vector for pattern p as H(p) = ⟨h_1(p), h_2(p), ..., h_K(p)⟩. Each member h_i of the heuristic vector is also used as a feature, and h_i(p) is the value of that feature. For example, for a 6-tile PDB we used two different 2-4 partitionings, for a total of 4 smaller PDBs that are used in the heuristic vector. The heuristic value of p in the original PDB is the target function.
4.1.2 Training and using the ANN

The ANN is trained by iterating over all the entries of the original PDB. For each pattern p we construct its different features and feed them to the ANN together with PDB(p) (the PDB heuristic for p). Once the training process ends, we can delete the original PDB from memory. Only the smaller PDBs which make up the heuristic vector are left in memory; similarly, the ANN itself is also kept in memory. Then, during the search, given a pattern p, we calculate its features (e.g., by looking up the smaller PDBs). The values of these features are given as input to the ANN, and the output, denoted ANN(p), is used as the heuristic value for pattern p.
4.1.3 Correcting overestimations

Training an ANN to predict the exact desired value is NP-complete [7]. Thus, learning systems in general, and ANNs in particular, are not completely accurate by nature, as they can deviate from the real value on many of the instances. With heuristics, downward deviations are not a problem, as an admissible heuristic should be a lower bound. However, if the ANN overestimates, the heuristic is no longer admissible and non-optimal solutions might be returned. We solved this problem as follows. After the ANN was built, we iterated again over the entire set of patterns (as a test set). Each pattern p whose ANN(p) > PDB(p) is inserted into a hash table HT together with its correct heuristic value PDB(p). During the search, we first check whether p ∈ HT. If indeed p ∈ HT, we use its heuristic value stored in HT and do not even consult the ANN. As shown below, for well trained ANNs the set of overestimating patterns is small, and so are the memory requirements of HT. Traditionally, the training phase is stopped when the mean squared error (MSE) of the training data is below a predefined small threshold. For our case it is defined as MSE = Σ_(t∈TR) E(t)² / |TR|, where TR is the training set, E(t) = ANN(t) − PDB(t), PDB(t) is the original PDB value, and ANN(t) is the learned function. E(t)² is symmetric, so overestimation and underestimation have the same cost. Using this function with an ANN results in a heuristic that tries to be close to the optimal value without discriminating between being under (acceptable) or over (undesirable). We modified the error function to penalize positive values of E(t) (overestimation), biasing the ANN towards producing admissible values. We used E′(t) = (a + 1/(1 + exp(−b·E(t)))) · E(t) instead of E(t) in the MSE calculations. The constants a and b were determined experimentally. E′(t) reduces the number of overestimating instances by a factor of 4 (over E(t)) and was used in our experiments.
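As an illustration, here is a minimal NumPy sketch of this asymmetric error function (the values of a and b below are placeholders, not the experimentally tuned constants of the paper):

```python
import numpy as np

def penalized_error(ann_value: np.ndarray, pdb_value: np.ndarray,
                    a: float = 0.5, b: float = 5.0) -> np.ndarray:
    """E'(t) = (a + sigmoid(b * E(t))) * E(t), with E(t) = ANN(t) - PDB(t).
    The sigmoid factor grows when E(t) > 0, so overestimation
    (inadmissible predictions) is penalized more than underestimation."""
    e = ann_value - pdb_value
    return (a + 1.0 / (1.0 + np.exp(-b * e))) * e

def penalized_mse(ann_value, pdb_value, a=0.5, b=5.0) -> float:
    """Mean squared error computed on the penalized residuals E'(t)."""
    return float(np.mean(penalized_error(ann_value, pdb_value, a, b) ** 2))
```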
4.1.4 Experimental results

In this section, we evaluate our compression system on the 6-6-3 additive PDB (of tiles 4-9, 10-15 and 1-3) for the 15-puzzle; additional evaluation of the final system is given in Section 5. The compression technique is applied to the two 6-tile PDBs individually;
Heuristic           AvH     Nodes        Time    Mem     Hash
6-6-3               40.06   6,323,187    2.39    11.00   -
(4-2)²-(4-2)²-3     37.84   50,818,284   19.71   0.18    -
DIV 2               38.88   19,204,184   7.92    5.50    -
basic ANN           39.20   11,676,726   10.44   1.30    8%
ANN+DT              39.75   9,550,754    5.42    0.84    4%
ADP                 39.90   7,285,207    4.62    0.50    2%

Table 1. Results for the 6-6-3 PDBs of the 15-puzzle.
the 3-tile PDB is very small and is left uncompressed. The heuristic vector for each 6-tile PDB contains four values, which are created by using two sets of additive 4- and 2-tile PDBs. Table 1 shows the results. All values shown are averages over the first 100 random initial states used in [4]. The first column is the heuristic used. The next four columns present the average initial heuristic value, the number of nodes generated by IDA*, the average time (in seconds), and the amount of memory used (in Megabytes). The time needed to precompute the PDB and train the ANN is not included in the times reported. This is standard, since these operations are done just once, no matter how many problems are solved. The final column shows the percentage of entries in the original PDB that were stored in the hash tables because the ANNs overestimated their value. The first row presents the results of using the normal 6-6-3 PDB. The second row shows the results of directly using the PDBs that make up the heuristic vector inside our ANN system. The superscript 2 in the heuristic description indicates the use of two sets of 4-2 additive PDBs for each 6-tile PDB in the 6-6-3. The maximum value of the two sets is used instead of the 6-tile PDB value. The third row shows the results of using the DIV 2 method of [3] for compressing the 6-tile PDBs. In this method, adjacent PDB entries are replaced by a single entry. The fourth row (basic ANN) is for the ANN system just described. The total memory for this system is dominated by the memory needed for the hash table; the memory needed for the small PDBs used in the heuristic vector is small (reported in row 2), and the memory needed for the ANN itself is negligible. The last two rows are for the enhanced ANN systems described below. Again, the total memory they need is mostly taken by the hash tables. The direct use of the smaller PDBs that make up the heuristic vector of our ANN (row 2) dramatically reduces the memory but increases the number of generated nodes by an order of magnitude. By contrast, our basic ANN technique reduces the amount of memory by a factor of 9 while increasing the number of generated nodes by a factor of only 1.84. This is a significant improvement over the 2-fold memory reduction of the DIV 2 compression technique [3], which was achieved at the cost of 3 times more generated nodes⁵. In all the results of this paper the constant CPU time per node favors simple PDB construction. While we implemented all our learning techniques efficiently, they could probably be made more efficient and better optimized. We decided to also report the CPU time, but it should be taken with care.
4.2 Using a decision tree to classify data

A major problem of using an ANN for predicting PDB values is the size of the hash table used to store the patterns with overestimating ANN values. To address this, we first construct a decision tree

⁵ In [3] a sparse (multi-dimensional array) mapping was used for the PDBs, and thus the DIV method compressed cliques. Here, we used their more realistic compact mapping (a single-dimensional array). The DIV method does not compress cliques here, and its performance is worse than DIV for sparse mapping. See [3] for more details.
(DT) which classifies the patterns into two types. The ANN is only used for one type, while the other type consults smaller PDBs. As described above, each PDB is partitioned into smaller disjoint PDBs. For example, a 6-tile PDB h6 is partitioned into two disjoint PDBs h2 and h4. We want to classify the 6-tile patterns into two classes: equal and larger. A pattern p is classified as equal if h6(p) = h2(p) + h4(p). It is classified as larger if h6(p) > h2(p) + h4(p). Patterns in the equal class need only consult the smaller 2- and 4-tile PDBs and add their values. For patterns in the larger class, h6 has knowledge about additional moves (over the sum of the two smaller PDBs) that are needed. Thus, the ANN is built to learn these additional moves. The benefit of using the DT before the ANN is twofold. First, it is sufficient to train and use the ANN for patterns in the larger group only. Thus, the ANN can be made more accurate, as it needs to learn the behavior of a special class of patterns only: the ones whose PDB values are larger than the sum of the smaller PDBs. Second, for the equal group, there is no need to pass through the complex network of the ANN; consulting the smaller PDBs is enough. Note that descending a decision tree is rather cheap, as it is usually implemented as a series of nested if-then-else statements. Adding the DT proved useful. For example, for the 6-6-3 PDBs of the 15-puzzle, nearly 58% of the patterns were classified as equal⁶ and only 42% are larger patterns that trained the ANN.
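A minimal sketch of this labeling step, assuming the PDBs are exposed as lookup functions (the names h6, h2 and h4 are ours):

```python
from typing import Callable, Dict, Iterable, Tuple

Pattern = Tuple[int, ...]

def label_patterns(patterns: Iterable[Pattern],
                   h6: Callable[[Pattern], int],
                   h2: Callable[[Pattern], int],
                   h4: Callable[[Pattern], int]) -> Dict[Pattern, str]:
    """Split patterns into the 'equal' and 'larger' classes used to
    train the decision tree: 'equal' when the 6-tile PDB value equals
    the sum of the disjoint 2- and 4-tile PDB values."""
    labels = {}
    for p in patterns:
        labels[p] = "equal" if h6(p) == h2(p) + h4(p) else "larger"
    return labels
```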
4.2.1 Building the Decision Tree

A decision tree is built by examining various attributes of the training data. The entire set of features used by the ANN (described above) was used as attributes for the DT, and the entire set of patterns was used to train and build the DT. We used ID3 [8], a common algorithm for building DTs. Classic ID3 stops growing the DT when each leaf contains items that belong to one class only. Since we had a very large set of patterns, we stopped growing the tree as soon as the percentage of patterns of one of the groups (larger or equal) in the given tree node exceeded a predefined threshold t1 (classic ID3 uses t1 = 100%). The exact value of t1 was determined experimentally for the various domains. Similarly, once the number of patterns in a node was smaller than another threshold t2, we stopped growing the DT. In nodes with mixed patterns we used the majority function to determine the class of the node.
4.2.2 Misclassification of the decision tree

Because of the early stopping condition, some of the patterns can be misclassified by the decision tree. There is no problem if a pattern of the larger group is misclassified as equal: in this case, we use the sum of the smaller PDBs, which is admissible but might be smaller than the real value of the larger PDB. The other direction is more problematic: here equal patterns are misclassified as larger, which causes the ANN to have such patterns in its training set. But recall that, to preserve admissibility, all patterns with overestimated ANN values are stored in a hash table, so admissibility is kept.
4.2.3 Experimental results

Line 5 in Table 1 shows the results of using the ANN+DT to compress the 6-6-3 additive PDB of the 15-puzzle. It shows that augmenting the basic ANN with the DT technique reduces the number

⁶ In fact, as described earlier, we had two sets of smaller PDBs. We classified a pattern as equal if its heuristic was equal to the maximum of the sums of the two sets of smaller disjoint partitionings.
of nodes generated by roughly 20% (from 11,676,726 to 9,550,754) and reduces the memory requirements by 35% (from 1.3 to 0.84 Megabytes). The ANN now only handles the larger patterns. Not only does it have fewer patterns to classify, but these patterns have similar attributes. This allows it to be more accurate for the same amount of training. Consequently, the hash tables can be smaller because fewer patterns have their values overestimated by the ANN. Indeed, the hash table percentage dropped from 8% to 4%.
4.3 Partitioning the patterns into groups (PART)

To properly train the ANN to have a reasonable error range, it is necessary to feed it the entire set of training instances at least 500 times. This can increase the total training time, especially if very large PDBs are used whose data is stored on disk. To address this, we add another step before building the DT. The basic idea is to partition the patterns into smaller groups (for very large PDBs this can be done on disk) and then load each group into memory and build a separate DT+ANN system for it. In order to classify these groups we use smaller heuristics (e.g., members of the heuristic vector), which we call the pivot heuristics. We then classify the patterns according to the values of the pivot heuristics. For example, assume that two members of the heuristic vector, h1 and h2, are used. A pattern p with h1(p) = x and h2(p) = y will belong to the group labeled (x, y). Each such group contains patterns with similar attributes, as they have similar values for the pivot heuristics. Each group has a separate DT+ANN, and the prediction will be more accurate due to the similarity of the patterns inside each group. Another advantage is that a very large PDB, which cannot be stored in memory, can be partitioned into smaller groups which fit in memory; we then build a DT+ANN for each group. Our full system of ANN+DT+Partitioning is referred to as ADP in the remainder of this paper. Line 6 in Table 1 shows the results of using the full ADP system to compress the 6-6-3 PDBs. We used exactly the same heuristic vector as in the previous lines. Partitioning is done based on two heuristic values, each the sum of 2- and 4-tile PDBs. Augmenting the ANN+DT system with the partitioning technique reduces the number of nodes generated by roughly 25% (from 9,550,754 to 7,285,207) and reduces the memory requirements by 40% (from 0.84 to 0.5 Megabytes). Compared to the original 6-6-3 PDB (line 1 in Table 1), ADP compression reduces the memory required by over 95%, while increasing the number of nodes generated by only 15%. It also significantly outperforms the DIV 2 compression method of [3] in all aspects: nodes, time and memory.
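A small sketch of this grouping step, assuming two pivot heuristics given as functions (the names are ours):

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Pattern = Tuple[int, ...]

def partition_by_pivots(patterns: Iterable[Pattern],
                        h1: Callable[[Pattern], int],
                        h2: Callable[[Pattern], int]
                        ) -> Dict[Tuple[int, int], List[Pattern]]:
    """Group patterns by the values of the two pivot heuristics;
    a separate DT+ANN is then trained for each group."""
    groups: Dict[Tuple[int, int], List[Pattern]] = defaultdict(list)
    for p in patterns:
        groups[(h1(p), h2(p))].append(p)
    return groups
```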
4.4 The general framework for ADP

To summarize, the following preprocessing steps should be taken to build the full three-step ADP learning system:
• Create the original PDB.
• Create small PDBs for the heuristic vector and choose the pivot heuristics.
• Partition the patterns of the original PDB into small groups according to the values of the pivot heuristics.
• Create a DT for each group of the partition, classifying patterns as equal or larger.
• Train an ANN on the patterns that were classified as larger.
• Test the ANN and build the hash table for overestimating patterns.
during the search we do
• Extract the values of the heuristic vector for s and find the appropriate group according to the pivot heuristics.
• Traverse the relevant DT and determine whether s reaches a larger or an equal node.
• If it is an equal node, add up the smaller PDB heuristics. If it is a larger node, consult the relevant hash table and the relevant ANN and retrieve the heuristic value.
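A condensed sketch of this lookup, under the assumption that the trained components are available as Python objects (all names are ours; the DT is reduced to a predicate for brevity):

```python
def adp_heuristic(s, group_of, dt_is_larger, ann_of, hash_table, small_pdb_sum):
    """ADP lookup at search time (simplified sketch): the hash table is
    consulted first to preserve admissibility; the group's decision tree
    then routes the pattern either to the sum of the small PDBs ('equal')
    or to the group's trained ANN ('larger')."""
    if s in hash_table:                 # ANN overestimates this pattern
        return hash_table[s]
    g = group_of(s)                     # group given by the pivot heuristics
    if dt_is_larger[g](s):              # 'larger' node
        return ann_of[g](s)
    return small_pdb_sum(s)             # 'equal' node: small PDBs suffice
```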
5 Experimental results

We now present additional experimental results for the full ADP system on the 15- and 24-puzzles and the TopSpin puzzle.
5.1 15-Puzzle

ADP was used to compress the 7- and 8-tile PDBs of the 7-8 additive PDB for the 15-puzzle (used in [4]). The two PDBs were compressed individually. The heuristic vector for each consisted of four values, based on two 6-2 additive PDBs for the 8-tile PDB and on two 6-1 additive PDBs for the 7-tile PDB. These heuristics are also used as the pivot heuristics.

Heuristic           AvH     Nodes       Time   Mem   Hash
7-8                 44.08   157,553     0.07   549   0
7-6-2               41.70   1,486,038   0.54   61    0
DIV 2               42.43   950,473     0.33   274   0
ADP (6-1)²-(6-2)²   43.03   307,332     0.21   46    2.9%
ADP (4-3)²-(4-4)²   41.96   899,516     0.57   16    2.2%

Table 2. ADP compression of the 7-8 additive PDB.
Table 2 presents the results in the same format as Table 1. The first row presents the results when using the normal, uncompressed additive 7-8 PDB. The second row is for an uncompressed 7-6-2 additive PDB. The next row is for the DIV 2 compression of [3]. The next row is for ADP using heuristic vectors containing two 6-1 additive PDBs for the 7-tile PDB and two 6-2 additive PDBs for the 8-tile PDB. The final row is for ADP using heuristic vectors containing two 4-3 additive PDBs for the 7-tile PDB and two 4-4 additive PDBs for the 8-tile PDB. The last two lines show that varying the PDBs used in ADP's heuristic vector produces an interesting time-space tradeoff. However, both of these systems use less memory than the uncompressed 7-6-2 additive PDB and the DIV 2 compression method, and generate significantly fewer nodes. The ADP with (6-1)²-(6-2)² was even faster in CPU time. Compared to the state-of-the-art uncompressed 7-8 PDB, this ADP reduces the memory required by over 90%, at the cost of less than doubling the number of nodes generated.
Figure 2. Nodes generated (in millions, log scale) as a function of memory (in Megabytes) for regular PDBs and PDBs compressed with ANN+DT+PART.
Figure 2 brings together the data for uncompressed PDBs (solid line) and PDBs compressed using ADP (dashed line) from Tables 1 and 2, in order to compare the number of generated nodes as a function of the memory used. It also includes two data points, for 7-7-1 additive PDBs, not shown in those tables. This figure clearly
shows that for any given amount of memory it is far better to use a compressed PDB than a regular uncompressed PDB.
5.2 24-puzzle

The best existing heuristic for the 24-puzzle uses a 6-6-6-6 additive PDB and takes the maximum of the normal PDB lookup (r), its reflection about the main diagonal (r*), the dual lookup (d), and the reflection of the dual (d*) [5]. All values for the regular lookup can be extracted from two 6-tile PDBs. For the dual lookup, we need six additional PDBs [5]. ADP is applied to all these 6-tile PDBs. As in the 15-puzzle, the heuristic vector for each 6-tile PDB contains two additive 4-2 PDBs, which were also used for the partitioning step.

PDB Lookups        Nodes            Time    Mem   Hash
r,r*               43,454,810,045   15,861  244   0
r,r*,d,d*          13,549,943,868   8,441   972   0
r,r* (ADP)         69,527,696,072   31,843  4     1.6%
r,r*,d,d* (ADP)    19,781,408,283   15,971  37    1.9%

Table 3. Results for the 24-puzzle.
Table 3 shows the experimental results. The values are averages over the first 25 random instances used in [5]. Lines 1-2 are for the uncompressed PDBs, lines 3-4 for the compressed PDBs. The first line in each group shows the results when only the regular lookup and its reflection are performed. The second line in each group shows the results when the dual lookup and its reflection are performed in addition to the regular lookups. With two lookups, ADP decreased the size of the PDB by a factor of 63 while increasing the number of nodes generated by only a factor of 1.6. With four lookups, ADP decreased the size of the PDB by a factor of 27 while increasing the number of generated nodes by only a factor of 1.45.
5.3 Top-spin We also applied the ADP system on the -TopSpin puzzle. A PDB of tokens has actually N different ways of being used. A PDB of tokens [1 ... ] can also be used as a PDB of [2... +1], [3... +2], etc. with the appropriate mapping of tokens. Thus, a single PDB allows up to different lookups. In separate experiments we applied ADP to a 9-token PDB and a 10-token PDB for the 17-TopSpin. The heuristic vector in each case contained 3 values corresponding to 3 different lookups in a PDB based on 7 tokens, for the 9-token PDB, and based on 8 tokens for the 10-token PDB. The partitioning of the 9- and 10-token PDBs useed all the PDBs from their heuristic vectors. PDB
AvH
9 8 9 MOD 9 ADP
10.61 9.58 9.30 9.97
9 8 9 MOD 9 ADP 10 10 ADP
10.96 10.01 9.68 10.20 11.94 11.32
Nodes Time 1 Lookup 43,496,120 74.18 394,922,925 589.10 61,709,097 104.38 48,335,470 97.44 2 Lookups 664,966 1.62 5,777,064 11.29 6,489,343 14.71 1,475,642 4.29 84,772 0.21 194,252 0.92
Table 4.
Mem
Hash
494 54 54 48
0 0 0 2.6%
494 54 54 48 3,959 484
0 0 0 2.6% 0 2.4%
Results for (17,4)-TopSpin.
The experimental results are shown in Table 4, where each value is an average over a set of 100 random instances. Lines 1-4 show
the results of solving 17-TopSpin if just one lookup is made in the PDB, while rows 5-8 show the results if two lookups are made. The first two lines in each group show the results of using an uncompressed 9-token or 8-token PDB. The third line shows the results of the best compression technique used in [3] for 17-TopSpin, which compresses the table for the 9-token PDB using the MOD operator. The final row in each group is for our ADP compression technique. For both one and two lookups, ADP clearly generates fewer nodes than the other techniques with a similar amount of memory (the 8-token PDB and the 9-token PDB compressed by the MOD operator). With two lookups it was even faster in CPU time. In fact, when two lookups are made, the MOD method actually generates more nodes than an uncompressed PDB of the same size, the 8-token PDB. The last two rows show the results of using the regular and compressed 10-token PDB with two lookups. ADP reduces the memory required by 87% while increasing the number of generated nodes by a factor of 2.3. The compressed version of the 10-token PDB requires slightly less memory than the uncompressed 9-token PDB but generates only 30% of the nodes and takes 56% of the time.
6 Summary and Conclusions

We presented a new technique that better utilizes memory by compressing PDBs with learning techniques, and we applied it to different domains. A three-step mechanism to construct the system was introduced, but any subset of the steps can be used separately. A significant reduction in memory was achieved over the uncompressed PDB at the cost of a small increase in search effort. Furthermore, our compression idea usually outperforms previous compression techniques in both memory and number of nodes, and often in CPU time as well. For a given amount of memory it is beneficial to use our compression technique over an uncompressed PDB of the same size. An advantage of our system is that PDBs much larger than the available memory can be generated on disk and compressed to fit in memory. In fact, we used this method to compress the 10-token PDB for 17-TopSpin. Future work will continue these ideas as follows. First, we would like to compress much larger PDBs and to try to solve larger versions of these puzzles. Second, other classifier techniques (like oblique trees and SVMs [8]) might perform better than those in the ADP system. Finally, this approach can be applied to compressing PDBs in planning domains.
REFERENCES
[1] S. Edelkamp, 'Symbolic pattern databases in heuristic search planning', AIPS, 274–293, (2002).
[2] M. Ernandes and M. Gori, 'Likely-admissible and sub-symbolic heuristics', in ECAI, pp. 613–617, (2004).
[3] A. Felner, R. Korf, R. Meshulam, and R. Holte, 'Compressed pattern databases', JAIR, 30, 213–247, (2007).
[4] A. Felner, R.E. Korf, and S. Hanan, 'Additive pattern database heuristics', JAIR, 22, 279–318, (2004).
[5] A. Felner, U. Zahavi, R. Holte, and J. Schaeffer, 'Dual lookups in pattern databases', in Proc. IJCAI, pp. 103–108, (2005).
[6] R.C. Holte, A. Felner, J. Newton, R. Meshulam, and D. Furcy, 'Maximizing over multiple pattern databases speeds up heuristic search', Artificial Intelligence, 170, 1123–1136, (2006).
[7] J.S. Judd, Neural network design and the complexity of learning, MIT Press, Cambridge, MA, USA, 1990.
[8] T. Mitchell, 'Machine learning and data mining', Communications of the ACM, 42(11), 30–36, (1999).
[9] G. Politowski, On the construction of heuristic functions, Ph.D. dissertation, University of California at Santa Cruz, 1986.
[10] S. Sarkar, P. Chakrabarti, and S. Ghose, 'A framework for learning in search-based systems', IEEE Transactions on Knowledge and Data Engineering, 10(4), 563–575, (1998).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-500
A Decomposition Technique for Max-CSP

Hachémi Bennaceur, Christophe Lecoutre, Olivier Roussel¹

Abstract. The objective of the Maximal Constraint Satisfaction Problem (Max-CSP) is to find an instantiation which minimizes the number of constraint violations in a constraint network. In this paper, inspired by the concept of inferred disjunctive constraints introduced by Freuder and Hubbe, we show that it is possible to exploit the arc-inconsistency counts associated with each value of a network in order to avoid exploring useless portions of the search space. The principle is to reason from the distance between the two best values in the domain of a variable, according to such counts. From this reasoning, we can build a decomposition technique which can be used throughout search in order to decompose the current problem into easier sub-problems. Interestingly, this approach does not depend on the structure of the constraint graph, contrary to what is usually proposed. Alternatively, we can dynamically post hard constraints that can be used locally to prune the search space. The practical interest of our approach is illustrated, using this alternative, with an experimentation based on a classical branch and bound algorithm, namely PFC-MRDAC.
1
Introduction
The Constraint Satisfaction Problem (CSP) is the task of determining whether a given constraint network is satisfiable or not, i.e. whether it is possible to assign a value to all variables in order to satisfy all constraints. When no solution can be found, it may be interesting to identify a complete instantiation which satisfies the greatest number of constraints (or, equivalently, which minimizes the number of violated constraints). This is called the Maximal Constraint Satisfaction Problem (Max-CSP). During the last decade, much work has been carried out to solve this problem (and its direct extension, WCSP). The basic (complete) approach is to employ a branch and bound mechanism, traversing the search space in a depth-first manner while maintaining an upper bound, the best solution cost found so far, and a lower bound on the best possible extension of the current partial instantiation. When the lower bound is greater than or equal to the upper bound, backtracking (or filtering) occurs. Lower bound computations of constraint violations have been improved repeatedly, over the years, by exploiting inconsistency counts [7, 11, 1, 10], disjoint conflicting sets of constraints [13], or cost transfers between constraints [3, 4, 2]. Alternative approaches (usually) combine branch and bound search with dynamic programming or structure exploitation. On the one hand, Russian Doll Search [14] and variable elimination [9] can be considered as dynamic programming methods, whose principle is to solve successive sub-problems, one per variable of the initial problem. On the other hand, structural decomposition methods [8, 12, 5]
¹ Université Lille-Nord de France, Artois, F-62307 Lens – CRIL, F-62307 Lens – CNRS UMR 8188, F-62307 Lens – IUT de Lens – {bennaceur,lecoutre,roussel}@cril.univ-artois.fr
exploit the structure of the problems in order to establish conditions under which decompositions are possible. Such methods are based on tree decomposition, provide interesting theoretical time complexities which depend on the width of the decomposition (tree-width), and are becoming increasingly successful. In [6], Freuder and Hubbe proposed to exploit, for constraint satisfaction, the principle of inferred disjunctive constraints: given a satisfiable binary constraint network P, for any pair (X, a) where X is a variable of P and a a value in the domain of X, if there is no solution containing a for X, then there is a solution containing a value (for another variable) which is not compatible with (X, a). Using this principle, the authors show that it is possible to dynamically and iteratively decompose a problem. In this paper, we generalize this approach to Max-CSP (including the non-binary case) by exploiting the arc-inconsistency counts associated with each value of the problem. The arc-inconsistency count (aic for short) of a pair (X, a) corresponds to the number of constraints that do not support (X, a). The aic gap associated with the variable X is the absolute difference between the two lowest arc-inconsistency counts of values of X (plus 1). We show that it is possible to reason from the aic gap to obtain a condition under which we have the guarantee of obtaining an optimal solution while avoiding the exploration of some portions of the search space. From this reasoning, we can build a decomposition technique which can be used throughout search to decompose the current problem into simpler sub-problems, generalizing for Max-CSP the approach of [6]. It is important to remark that, unlike usual decomposition methods, this approach does not depend on the structure of the constraint graph, since the decomposition can always be applied, whatever the structure of the constraint graph is. Alternatively, we can dynamically post hard constraints that can be used locally to prune the search space. Depending on the implementation, these hard constraints can participate in constraint propagation, or just impose backtracking. The paper is organized as follows. After some technical background, we introduce the central result of this paper. Then, we present its two main exploitations: decomposition and pruning. After the presentation of some experimental results, we conclude.
2
Background
In this paper, we are dealing with the discrete CSP (Constraint Satisfaction Problem) framework. Each CSP instance P corresponds to a constraint network which is defined by a finite set of n variables {X1 , X2 , . . . , Xn } and a finite set of e constraints {C1 , C2 , . . . , Ce }. Each variable X must be assigned a value from its associated discrete domain dom(X), and each constraint C involves an ordered subset scp(C) of variables of P , called its scope, and specifies the set rel(C) of combinations of values allowed for
the variables of its scope. |scp(C)| is called the arity of C, and C is binary if its arity is 2. A CSP instance is binary if it only contains binary constraints, and normalized if it does not contain two constraints with the same scope. Two variables are neighbours iff they both belong to the scope of a constraint. A complete instantiation is the assignment of a value to each variable. Let s denote a complete instantiation; s(X, a) is the complete instantiation obtained from s by replacing the value assigned to X in s by a. A constraint C is violated (or unsatisfied) by a complete instantiation s iff the projection of s over scp(C) does not belong to rel(C). A solution is a complete instantiation that satisfies every constraint. In some cases, the CSP instance may be over-constrained, and thus admits no such solution. We can then be interested in finding a complete instantiation that best respects the set of constraints. In this presentation, we consider the Max-CSP problem, where the goal is to find an optimal solution, i.e. a complete instantiation satisfying as many constraints as possible. A Max-CSP instance is also represented by a constraint network. Given a constraint C with scp(C) = {Xi1, ..., Xir}, any tuple in dom(Xi1) × ... × dom(Xir) is called a valid tuple on C. A value a for the variable X is often denoted by (X, a). A constraint C supports the value (X, a) (equivalently, a value (X, a) has a support on C) iff either X ∉ scp(C) or there exists a valid tuple on C which belongs to rel(C) and which contains the value a for X. When every value is supported by a constraint, this constraint is said to be (generalized) arc-consistent. For the binary normalized case, we say that a variable Y supports the value (X, a) iff either no constraint involves both X and Y, or such a constraint supports (X, a). For a binary constraint C such that scp(C) = {X, Y}, a value (X, a) is compatible with a value (Y, b) iff (a, b) belongs to rel(C). The arc-inconsistency count of a value (X, a), denoted by aic(X, a), is the number of constraints (variables for the binary normalized case) which do not support (X, a).
3
Main Theorem
In this section, we present the main result of this paper, generalizing the approach of [6] developed in the context of binary CSP.

Definition 3.1 Let P be a Max-CSP instance and X be a variable of P. An aic best value of X is a value a ∈ dom(X) such that aic(X, a) is minimal, i.e. ∀c ∈ dom(X), aic(X, a) ≤ aic(X, c). An aic second best value of X is a value b ∈ dom(X) such that b ≠ a and ∀c ∈ dom(X) \ {a, b}, aic(X, b) ≤ aic(X, c). The aic gap of X is defined as δ = aic(X, b) − aic(X, a) + 1.

Theorem 3.1 Let P be a Max-CSP instance, X be a variable of P, a be an aic best value of X, δ be the aic gap of X and C1, ..., Cm be the m constraints involving X which support (X, a). There always exists an optimal solution s* of P such that:
• either X is assigned the value a in s*,
• or X is assigned a value different from a in s*, and at least δ constraints among C1, ..., Cm are violated by s*(X, a).

Proof: When P has an optimal solution where X is assigned a, the first condition is obviously satisfied and the theorem is verified. Otherwise, if there is no optimal solution where X = a, let s* = (v1, ..., v, ..., vn) be an optimal solution of P, and let v be the value of X in s*. Let C_X be the set of constraints of P involving the variable X (C_X is a superset of {C1, ..., Cm}).
Assume that s* violates p constraints of C_X and s*(X, a) violates q constraints of C_X. Since there is no optimal solution with X = a, s*(X, a) necessarily violates more constraints of C_X than s*, and therefore q > p. Necessarily, we have p ≥ aic(X, v) and q ≥ aic(X, a), since arc-inconsistency counts computed with respect to P represent lower bounds on the aic counts obtained after assigning all variables of P. Therefore, ∃t ≥ 0, r ≥ 0 s.t. p = aic(X, v) + r and q = aic(X, a) + t. Since q > p, aic(X, a) + t > aic(X, v) + r, or equivalently t > aic(X, v) − aic(X, a) + r. Since v ≠ a, we have aic(X, v) ≥ aic(X, b) (b is an aic second best value of X) and therefore t > aic(X, b) − aic(X, a) + r. As r ≥ 0 and δ = aic(X, b) − aic(X, a) + 1, we obtain t ≥ δ. This means that at least δ constraints of P which support (X, a) involve variables whose values given by s* are not compatible with (X, a). Therefore, the theorem is also verified. □

This theorem can be used in two different ways: it can be used to generate a decomposition of the Max-CSP instance, or it can be exploited as a pruning rule.
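A minimal sketch of the quantities used in Theorem 3.1, assuming the aic counts are available through a function (the names are ours):

```python
from typing import Callable, Sequence, Tuple

def aic_best_and_gap(domain: Sequence[int],
                     aic: Callable[[int], int]) -> Tuple[int, int]:
    """Return (a, delta): an aic best value of variable X and the aic gap
    delta = aic(X, b) - aic(X, a) + 1, where b is an aic second best value.
    Assumes the domain contains at least two values."""
    ordered = sorted(domain, key=aic)
    a, b = ordered[0], ordered[1]
    return a, aic(b) - aic(a) + 1
```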
4 The Decomposition Approach
The decomposition of a Max-CSP instance P around a variable X is defined as follows.

Definition 4.1 Under the hypotheses and with the notations of Theorem 3.1, the decomposition of a Max-CSP instance P around the value a of variable X generates the sub-problems P0, P1, ..., Pk (with k = C(m, δ), where C(m, δ) denotes the binomial coefficient "m choose δ") defined by:
• P0 is derived from P by assigning a to variable X;
• Pi (with i ∈ 1..k) is derived from P by removing a from the domain of X and restricting the assignments of the neighbours of X so that at least δ of the constraints supporting (X, a) in P no longer support (X, a) in Pi.

These sub-problems may be solved independently, and Theorem 3.1 guarantees that at least one of them contains an optimal solution of P. It should be noticed that this decomposition may prune some (equivalent) optimal solutions of P. As described in the definition, the sub-problems are not disjoint, which means that an assignment may be a solution of several sub-problems simultaneously. It is however easy to generate disjoint sub-problems, as will be shown in Section 4.2. With m denoting the number of constraints that support (X, a), this decomposition generates 1 + C(m, δ) sub-problems (when δ = 1, this number is equal to 1 + m and is bounded by n − aic(X, a), with n the number of variables). Although the number of sub-problems is exponential in δ, Section 4.3 proves that the search space of the different sub-problems P0, ..., Pk is exponentially smaller than the search space of the initial problem P, provided that we generate disjoint sub-problems. This means that the decomposition is always beneficial because, even if it may generate many sub-problems, they are always easier to solve globally than the initial problem.
4.1 Example
To illustrate the decomposition technique, let us consider the binary constraint network P built on {X1 , X2 , X3 } and containing the constraints {C12 , C13 , C23 }. We have dom(Xi ) = {1, 2, 3} for i ∈ 1..3, and the constraints are defined by the following tables (allowed tuples):
rel(C12) = {(1, 1), (1, 2), (3, 1)} (pairs (X1, X2))
rel(C13) = {(1, 1), (1, 2), (2, 1), (2, 3), (3, 2)} (pairs (X1, X3))
rel(C23) = {(1, 3), (3, 1)} (pairs (X2, X3))
An optimal solution of this Max-CSP instance violates one constraint. For example, X1 = 1, X2 = 1, X3 = 2 is an optimal solution which violates the constraint C23. To perform the decomposition strategy, we have to select one variable and one of its aic best values. For example, (X1, 1) is an aic best value of X1 since aic(X1, 1) = 0, aic(X1, 2) = 1 and aic(X1, 3) = 0. Here, we have δ = 1. The decomposition around (X1, 1) leads to the following independent sub-problems. P0 is derived from P by assigning X1 = 1: in P0, dom(X1) = {1} and dom(X2) = dom(X3) = {1, 2, 3}. P1 is derived from P by asserting X1 ≠ 1 and restricting the domain of X2 to the values incompatible with (X1, 1): in P1, dom(X1) = {2, 3}, dom(X2) = {3}, dom(X3) = {1, 2, 3}. P2 is derived from P by asserting X1 ≠ 1, restricting the domain of X2 to the values compatible with (X1, 1) (this restriction is enforced to obtain disjoint sub-problems, see Section 4.2) and restricting the domain of X3 to the values incompatible with (X1, 1): in P2, dom(X1) = {2, 3}, dom(X2) = {1, 2}, dom(X3) = {3}. Notice that the sub-problem where dom(X1) = {2, 3}, dom(X2) = {1, 2} and dom(X3) = {1, 2} is pruned, and this sub-problem contains an optimal solution of the whole problem, namely X1 = 3, X2 = 1 and X3 = 2. Now, let us modify slightly the initial problem and assume that the value 3 of X1 is incompatible with all values of X3; then we have:
rel(C13) = {(1, 1), (1, 2), (2, 1), (2, 3)} (pairs (X1, X3))
In this case, for X1 there is only one aic best value (since aic(X1, 1) = 0, aic(X1, 2) = 1 and aic(X1, 3) = 1), and so δ = 2. Thus, the decomposition leads to only two sub-problems P0 and P1. P0 is unchanged and P1 is obtained from P by asserting X1 ≠ 1 and restricting the domains of X2 and X3 to the values incompatible with (X1, 1): in P1, dom(X1) = {2, 3}, dom(X2) = {3}, dom(X3) = {3}. In this case we have discarded the following two sub-problems: P2, where dom(X1) = {2, 3}, dom(X2) = {3} and dom(X3) = {1, 2}, and P3, where dom(X1) = {2, 3}, dom(X2) = {1, 2} and dom(X3) = {1, 2, 3}. The sub-problem P3 contains an optimal solution of P: X1 = 3, X2 = 1 and X3 = 3. For the initial problem, the decomposition prunes 2^3 out of the 3^3 possible complete instantiations, while in the modified problem it prunes 16 of them (more than a half).
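The counts in this example are easy to check mechanically. The following brute-force sketch (ours) enumerates all 27 complete instantiations of the initial network and confirms that an optimal solution violates exactly one constraint:

    from itertools import product

    doms = {"X1": {1, 2, 3}, "X2": {1, 2, 3}, "X3": {1, 2, 3}}
    rels = {("X1", "X2"): {(1, 1), (1, 2), (3, 1)},
            ("X1", "X3"): {(1, 1), (1, 2), (2, 1), (2, 3), (3, 2)},
            ("X2", "X3"): {(1, 3), (3, 1)}}

    def violations(s):
        # number of constraints whose projection of s is disallowed
        return sum((s[u], s[v]) not in allowed for (u, v), allowed in rels.items())

    best = min(violations(dict(zip(doms, vals))) for vals in product(*doms.values()))
    assert best == 1
    assert violations({"X1": 1, "X2": 1, "X3": 2}) == 1   # the solution cited above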
4.2 Enumeration of Sub-problems
For the sake of simplicity, we now assume that constraints are binary and normalized (i.e. they all have different scopes), but the method is easy to generalize (this restriction just ensures that reducing the domain of a neighbour of X will affect only one constraint on X; otherwise, some variables would have to be taken into account more than once). When constraints are binary, ensuring that a constraint C with scp(C) = {X, Y} does not support (X, a) simply amounts to reducing the domain of Y to the values incompatible with (X, a).

Enumerating all the sub-problems in the decomposition and ensuring that these problems are disjoint is as simple as enumerating the values of a binary counter under the constraint that at least δ of its bits must be 0. Let I_Y^{X=a} be the values of dom(Y) which are incompatible with (X, a) and C_Y^{X=a} be the values of dom(Y) which are compatible with (X, a). By definition, dom(Y) = I_Y^{X=a} ∪ C_Y^{X=a} and I_Y^{X=a} ∩ C_Y^{X=a} = ∅. Clearly, the sub-domains I and C form a partition of each domain, and this can be used to decompose the search in a systematic way. Exhaustive search on all values of a variable Y can be performed by first restricting the domain to I_Y^{X=a} and then to C_Y^{X=a}. This is a binary branching. Since this can be done recursively, each branch can be represented by a binary word b_{Y1}, ..., b_{Ym} where b_{Yi} = 0 indicates that the domain of Yi is restricted to I_{Yi}^{X=a} and b_{Yi} = 1 indicates that the domain of Yi is restricted to C_{Yi}^{X=a}. Exhaustive search on all values of all variables Y will enumerate the 2^m binary words (from all 0 to all 1). When X = a is chosen for the decomposition of a problem P, the first sub-problem is P0 where X = a, and the other sub-problems are the ones where X ≠ a and where at least δ variables among the m variables Yi which support (X, a) have their domain reduced to I_{Yi}^{X=a}. A simple solution to avoid any redundant or useless search is to use the binary branching scheme presented above. The restriction that at least δ variables among the m variables Yi have their domain reduced to I_{Yi}^{X=a} translates to the condition "at least δ bits in the binary word representing the branch must be 0". This condition is trivial to enforce in a binary branching.
Y1 Y2 Y3 Y4              Y1 Y2 Y3 Y4
0  ∗  ∗  ∗               0  0  0  ∗
1  0  ∗  ∗               0  0  1  0
1  1  0  ∗               0  1  0  0
1  1  1  0               1  0  0  0
(a) search with δ = 1    (b) search with δ = 3

Figure 1. List of branches to explore for n = 4 and different values of δ
As an example, Figure 1 represents the branches that must be explored for two different values of δ and for n = 4 variables. For clarity, ∗ is used as a joker to represent any 0/1 value.
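The branch lists of Figure 1 can be generated by a short recursion (our sketch): branch on each bit, stop with jokers once the "at least δ zeros" condition is already met, and cut when it can no longer be met:

    def branches(m, delta, prefix=""):
        """Disjoint branches over m binary variables with at least `delta`
        zeros; '*' is the joker of Figure 1 (any 0/1 value)."""
        remaining = m - len(prefix)
        if delta <= 0:                 # condition satisfied: rest is free
            yield prefix + "*" * remaining
            return
        if delta > remaining:          # cannot place enough zeros any more
            return
        yield from branches(m, delta - 1, prefix + "0")   # restrict to I
        yield from branches(m, delta, prefix + "1")       # restrict to C

    assert list(branches(4, 1)) == ["0***", "10**", "110*", "1110"]
    assert list(branches(4, 3)) == ["000*", "0010", "0100", "1000"]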
4.3 Some Complexity Results
Interestingly, this binary branching scheme allows us to draw immediate complexity results. Assume that Y1, ..., Ym are the variables which support (X, a) and that Z1, ..., Zr are the other unassigned variables. Without applying the decomposition, an exhaustive search of the sub-problem where X ≠ a will have to explore the Cartesian product of the domains, which amounts to ∏_{i=1}^{m} |dom(Yi)| · ∏_{i=1}^{r} |dom(Zi)| complete instantiations. When the decomposition is used, at least δ variables Y must have their domain reduced to I_Y^{X=a}. This means that the number of complete instantiations which are not explored amounts to:

∑_{S ⊆ {Y1,...,Ym}, card(S) < δ} ( ∏_{Yi ∈ S} |I_{Yi}^{X=a}| · ∏_{Yi ∉ S} |C_{Yi}^{X=a}| · ∏_{i=1}^{r} |dom(Zi)| )
As an illustration, if all C_{Yi}^{X=a} have the same size c and all I_{Yi}^{X=a} have the same size i, the number of pruned complete instantiations simplifies to:

∑_{j < δ} C(m, j) · i^j · c^{m−j} · ∏_{i=1}^{r} |dom(Zi)|
When δ = 1, the number of pruned complete instantiations is just c^m · ∏_{i=1}^{r} |dom(Zi)| (the single discarded branch is the one where every Yi keeps only values compatible with (X, a)). It roughly corresponds to the size of the so-called consistent sub-problem identified in [6] for the CSP case. In any case, the number of complete instantiations that are explored when the decomposition is applied is smaller than the initial number of complete instantiations to explore (by an exponential factor in the general case).
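For the uniform case the simplified sum is a one-liner; the sketch below (ours) evaluates it and checks it against the counts of the example in Section 4.1 (the extra factor 2 is our reading: it accounts for the two remaining values of X1 once 1 is removed):

    from math import comb

    def pruned(i, c, m, delta):
        """Pruned complete instantiations per the simplified sum above,
        for uniform |I| = i, |C| = c, and taking prod |dom(Zi)| = 1."""
        return sum(comb(m, j) * i**j * c**(m - j) for j in range(delta))

    assert 2 * pruned(1, 2, 2, 1) == 8    # initial problem: 2^3 of the 3^3
    assert 2 * pruned(1, 2, 2, 2) == 16   # modified problem: 16 of the 27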
4.4
Related Work
Classical structural decomposition methods combine tree decomposition of graphs with branch and bound search [8, 12, 5]. A tree decomposition involves computing a pseudo-tree which covers the set of variables by clusters. Two clusters are adjacent in this tree if they share some variables. An important property of tree decomposition is that the sub-problems associated with clusters may be solved independently after assigning values to the shared variables. In practice, the efficiency of decomposition methods highly depends on the structure of the constraint graph. The decomposition approach presented here, inspired by [6], proceeds differently from classical ones, since the principle is to directly decompose the whole problem into independent sub-problems without computing any pseudo-tree or assigning any variable of the problem. Each sub-problem can be solved independently while, at the same time, a portion of the search space of the whole problem is pruned. The downside of this method is that the number of generated sub-problems may be large. However, the decomposition does not rely on the structure of the constraint graph.
5 The Pruning Approach
Another way to exploit Theorem 3.1 is to interpret it as a pruning rule which can be integrated into any tree-search-based method for solving the Max-CSP problem. Assuming here a tree search algorithm employing a binary branching scheme, at each node ν a value (X, a) is selected, and two branches are built from ν: a left one labelled with the variable assignment X = a, and a right one labelled with the value refutation X ≠ a. Considering the current instance at node ν, let a, δ and {Ci} be the aic best value of X, the aic gap of X and the set of constraints supporting (X, a), respectively. As soon as the left branch has been explored, one can post a hard constraint atLeastUnsatisfied(δ, {Ci}, (X, a)) before exploring the right branch of ν. This constraint is violated as soon as it is no longer possible to find, among {Ci}, at least δ constraints which do not support (X, a) any more. Of course, a constraint posted with respect to the right branch of node ν must be removed when the algorithm backtracks from ν. These hard constraints, dynamically added to the instance, can be used to impose backtracking and, consequently, to avoid exploring useless portions of the search space. After each propagation phase, one can simply check that all currently posted hard constraints are still satisfied. If this is not the case, backtracking occurs. We will denote by A-PC (Pruning Constraints) any tree search algorithm A exploiting this approach. Interestingly, except for some particular search heuristics (such as the ones based on constraint weighting), we have the guarantee that A-PC will always visit a tree which is included in the one built by A. On the other hand, the additional hard constraints can also participate in constraint propagation. When, for a constraint atLeastUnsatisfied(δ, {Ci}, (X, a)), we can determine that at
most δ constraints of {Ci} can still be in a position of not supporting (X, a), we can impose that these δ constraints do not support (X, a), thus making new inferences. For example, for a binary constraint of {Ci} among these δ ones, involving X and another variable Y, any value of Y compatible with (X, a) can be removed. Here, we can imagine sophisticated mechanisms to manage propagation, such as the use of lazy structures (e.g. watched literals). Importantly, notice that this pruning approach can be integrated into many search algorithms solving the Max-CSP problem, including hybrid ones that combine tree decomposition with enumeration.
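A minimal sketch (ours) of the satisfiability test for a posted constraint in the binary case; `neighbours` maps the second variable of each Ci to the values compatible with (X, a), and both the name and the encoding are assumptions:

    def at_least_unsatisfied_holds(delta, neighbours, dom):
        """atLeastUnsatisfied(delta, {Ci}, (X, a)) can still be satisfied iff
        at least delta constraints Ci may end up not supporting (X, a), i.e.
        their variable Y keeps a value incompatible with (X, a)."""
        can_fail = sum(any(v not in compat for v in dom[Y])
                       for Y, compat in neighbours.items())
        return can_fail >= delta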
6 Experimental Results
In order to show the practical interest of the approach described in this paper, we have conducted an experimentation on a cluster of 3.0GHz Xeon machines with 1GiB of RAM under Linux, using the benchmark suite of the 2006 competition of Max-CSP solvers (see http://www.cril.univ-artois.fr/CPAI06/). We have used the classical branch and bound PFC-MRDAC algorithm [11], which maintains reversible directed arc-inconsistency counts in order to compute lower bounds at each node of the search tree, and have been interested in the impact of using the PC (Pruning Constraints) approach (see Section 5). We have used here the variant that just imposes backtracking, and have not yet implemented the one that makes inferences. We have not yet implemented the decomposition approach either. Two variable ordering heuristics have been considered. The first one is dom/ddeg, usually considered for Max-CSP, which selects at each node the variable with the lowest ratio of domain size to dynamic degree. The second one, denoted by dom*gap/ddeg, involves the aic gap of the variables. More precisely, the ratio dom/ddeg is multiplied by the aic gap in order to favour variables for which there is a large gap between the best value and the following one. We believe that it may help to quickly find good solutions and, more specifically, to increase the efficiency of our approach. Finally, the value with the lowest aic is always selected. Notice that this can be seen as a refinement of the ic + dac counters usually used to select values. The protocol used for our experimentation is the following: for each instance, we start with an initial upper bound set to infinity (in the experimentation, Max-CSP was considered as the problem of minimizing the number of violated constraints), and record the (cost of the) best solution found (and time-stamp it) within a given time limit (here, 1,500 seconds). Even if this protocol prevents us from getting some useful results for some instances (for example, if the same best solution is found by the different algorithms after a few seconds), it benefits from being easily reproducible and exploitable, whether the optimum value is known or not. First of all, recall that we have the guarantee that PFC-MRDAC-PC always visits a tree which is smaller than the one built by PFC-MRDAC. This makes our experimental comparisons easier. We can then make a first general observation about the results of our experimentation. The overhead of managing PC hard constraints is usually between 5% and 10% of the overall cpu time. Since on random instances our approach only saves a limited number of nodes (as expected), we obtain a similar behaviour with PFC-MRDAC and PFC-MRDAC-PC. This is not shown here, due to lack of space. On the other hand, on structured instances, Table 1 presents the results on representative instances and clearly demonstrates the interest of our approach. These instances belong to the academic and patterned series maxclique (brock, p-hat, san), kbtree (introduced in [5]), dimacs (ssa) and composed, and also to the real-world
series celar (scen, graph) and spot. The ratio introduced in the table corresponds to the cpu time of PFC-MRDAC divided by the cpu time of PFC-MRDAC-PC. It is either an exact value (when both methods have found the same upper bound) or an approximate one (in this case, we use the time limit 1,500 as a lower bound). For example, on instance spot5-404, we obtain 74 as upper bound with PFC-MRDAC and 73 with PFC-MRDAC-PC. Since any node visited by PFC-MRDAC-PC is necessarily visited by PFC-MRDAC, we know that at least 1,500 seconds are required by PFC-MRDAC to find the upper bound 73. We then obtain a speedup ratio which is greater than 1,500/99 = 15.1. Remark that, as expected, the results are more impressive when using the heuristic dom*gap/ddeg (more than two orders of magnitude on some instances), which besides often allows us to find better upper bounds.

                                dom/ddeg                              dom*gap/ddeg
                       ¬PC         PC          ratio         ¬PC         PC          ratio
Academic and Patterned instances
brock-200-1            184/1,490   183/706     > 2.1         184/3       183/57      > 26.3
brock-200-2            191/638     191/92      > 6.9         191/85      190/201     > 7.4
composed-25-1-2-1      3/613       3/332       = 1.8         6/19        3/846       > 1.7
composed-25-1-25-1     4/92        4/72        = 1.2         6/14        3/1,407     > 1
kbtree-9-2-3-5-20-01   6/996       0/1,333     > 1.1         3/0         0/15        > 100
kbtree-9-2-3-5-30-01   13/1,037    13/1,009    = 1.0         14/1,177    4/392       > 3
keller-4               162/36      160/303     > 4.9         162/1       160/149     > 10.0
p-hat300-1             293/396     293/76      = 5.2         293/1,481   293/224     = 6.6
p-hat500-1             493/1,357   493/652     = 2.0         493/33      492/717     > 2.0
san-200-0.9-1          174/1,425   173/1,287   > 1.1         157/0       155/888     > 1.6
sanr-200-0.7           185/426     185/94      = 4.5         185/808     184/324     > 4.6
ssa-0432-003           82/0        73/175      > 8.5         11/46       2/19        > 78.9
ssa-2670-130            392/1      390/56      > 26.7        52/55       49/1,126    > 1.3
Real-world instances
graph6                 342/216     341/935     > 1.6         366/7       365/406     > 1.0
graph8-f11             161/5       159/1,299   > 1.1         160/644     160/56      = 11.5
graph11                576/5       576/5       = 1           620/677     620/70      = 9.6
scen6                  269/69      269/27      = 2.5         211/20      211/14      = 1.4
scen10                 744/34      744/34      = 1           741/623     741/56      = 11.1
scen11-f12             81/729      81/395      = 1.8         66/146      66/35       = 4.1
scenw-06-18            215/8       214/934     > 1.6         133/231     131/442     > 3.3
scenw-06-24            98/686      98/244      = 2.8         121/741     117/400     > 3.7
scenw-07               353/1,239   353/471     = 2.6         525/25      524/8       > 187.5
spot5-28               207/0       206/31      > 48.3        196/1       196/1       = 1
spot5-29               52/25       51/305      > 4.9         49/807      48/29       > 51.7
spot5-42               124/900     124/64      = 14.0        122/1,157   122/6       = 192.8
spot5-404              74/85       73/99       > 15.1        76/0        73/331      > 4.5
Table 1. Best upper bound (ub, number of violated constraints) and cpu time (to reach it) obtained with PFC-MRDAC on structured instances, with (PC) and without (¬PC) the Pruning Constraints method. Each cell reads ub/cpu. The timeout was set to 1,500 seconds per instance.
Finally, for a very limited number of these instances, we succeeded in finding an optimal value and proving optimality, given 20 hours of cpu time per instance. For example, for brock-200-2, optimality is proved when using PC in 13,394 and 29,217 seconds with dom/ddeg and dom*gap/ddeg respectively, while optimality is not proved within 72,000 seconds when PC is not employed. As another example, the instance scenw-06-24 is solved in 18,858 seconds with PFC-MRDAC-PC-dom*gap/ddeg and in 37,405 seconds when PC is not used.
7 Conclusion
In this paper, we have generalized to Max-CSP the principle of inferred disjunctive constraints introduced in [6] for CSP. Using the so-called aic (arc-inconsistency count) gap, we have shown that it is possible to obtain a guarantee about obtaining an optimal solution, while pruning some portions of the search space. Interestingly, this result can be exploited both in terms of decomposition (already addressed for CSP in [6]) and backtracking/filtering (by posting hard constraints). We have shown that our approach, grafted onto a classical branch and bound algorithm, really boosts search when solving structured instances. Indeed, using PFC-MRDAC, we have noticed a speedup that sometimes exceeds one order of magnitude with the heuristic dom/ddeg and two orders of magnitude with the original dom*gap/ddeg. We want to recall that dynamic programming and decomposition methods, which have recently received a lot of attention, still rely on branch and bound search. It means that all these methods may benefit from the approach developed in this paper. Finally, one perspective of this work is to extend it with respect to the Weighted CSP and Valued CSP frameworks.
Acknowledgments. This paper has been supported by the IUT de Lens, the CNRS and the ANR "Planevo" project no. JC05 41940.
REFERENCES
[1] M.S. Affane and H. Bennaceur, 'A weighted arc-consistency technique for Max-CSP', in Proceedings of ECAI'98, pp. 209-213, (1998).
[2] M.C. Cooper, S. de Givry, and T. Schiex, 'Optimal Soft Arc Consistency', in Proceedings of IJCAI'07, pp. 68-73, (2007).
[3] M.C. Cooper and T. Schiex, 'Arc consistency for soft constraints', Artificial Intelligence, 154(1-2), 199-227, (2004).
[4] S. de Givry, F. Heras, M. Zytnicki, and J. Larrosa, 'Existential arc consistency: Getting closer to full arc consistency in weighted CSPs', in Proceedings of IJCAI'05, pp. 84-89, (2005).
[5] S. de Givry, T. Schiex, and G. Verfaillie, 'Exploiting Tree Decomposition and Soft Local Consistency In Weighted CSP', in Proceedings of AAAI'06, (2006).
[6] E.C. Freuder and P.D. Hubbe, 'Using inferred disjunctive constraints to decompose constraint satisfaction problems', in Proceedings of IJCAI'93, pp. 254-261, (1993).
[7] E.C. Freuder and R.J. Wallace, 'Partial constraint satisfaction', Artificial Intelligence, 58(1-3), 21-70, (1992).
[8] P. Jégou and C. Terrioux, 'Hybrid backtracking bounded by tree-decomposition of constraint networks', Artificial Intelligence, 146(1), 43-75, (2003).
[9] J. Larrosa and R. Dechter, 'Boosting search with variable elimination in constraint optimization and constraint satisfaction problems', Constraints, 8(3), 303-326, (2003).
[10] J. Larrosa and P. Meseguer, 'Partition-Based lower bound for Max-CSP', Constraints, 7, 407-419, (2002).
[11] J. Larrosa, P. Meseguer, and T. Schiex, 'Maintaining reversible DAC for Max-CSP', Artificial Intelligence, 107(1), 149-163, (1999).
[12] R. Marinescu and R. Dechter, 'AND/OR Branch-and-Bound for Graphical Models', in Proceedings of IJCAI'05, pp. 224-229, (2005).
[13] J.C. Régin, T. Petit, C. Bessière, and J.F. Puget, 'New lower bounds of constraint violations for over-constrained problems', in Proceedings of CP'01, pp. 332-345, (2001).
[14] G. Verfaillie, M. Lemaitre, and T. Schiex, 'Russian doll search for solving constraint optimization problems', in Proceedings of AAAI'96, pp. 181-187, (1996).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-505
Fast Set Bounds Propagation using BDDs

Graeme Gange and Vitaly Lagoon, Department of Computer Science and Software Engineering, The University of Melbourne, Vic. 3010, Australia
Peter J. Stuckey, NICTA Victoria Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Vic. 3010, Australia

Abstract. Set bounds propagation is the most popular approach to solving constraint satisfaction problems (CSPs) involving set variables. The use of reduced ordered Binary Decision Diagrams (BDDs) to represent and solve set CSPs is well understood and brings the advantage that propagators for arbitrary set constraints can be built. This can substantially improve solving. The disadvantage of BDDs is that creating and manipulating BDDs can be expensive. In this paper we show how we can perform set bounds propagation using BDDs in a much more efficient manner by generically creating set constraint predicates, and using a marking approach to propagation. The resulting system can be significantly faster than competing approaches to set bounds propagation.
1 Introduction
It is often convenient to model a constraint satisfaction problem (CSP) using finite set variables and set relationships between them. A common approach to solving finite domain CSPs is using a combination of a global backtracking search and a local constraint propagation algorithm. The local propagation algorithm attempts to enforce consistency on the values in the domains of the constraint variables by removing values from the domains of variables that cannot form part of a complete solution to the system of constraints. The most common level of consistency is set bounds consistency [4] where the solver keeps track for each set of which elements are definitely in or out of the set. Many solvers use set bounds consistency including ECLiPSe, Gecode, and ILOG SOLVER. Set bounds propagation is supported by solvers since stronger notions of propagation such as domain propagation require representing exponentially large domains of possible values. However, [8] demonstrated that it is possible to use reduced ordered binary decision diagrams (BDDs) as a compact representation of both set domains and of set constraints, thus permitting set domain propagation. A domain propagator ensures that every value in the domain of a set variable can be extended to a complete assignment of all of the variables in a constraint. The use of the BDD representation comes with several additional benefits. The ability to easily conjoin and existentially quantify BDDs allows the removal of intermediate variables, thus strengthening propagation, and also makes the construction of
propagators for global constraints straightforward. Given the natural way in which BDDs can be used to model set constraint problems, it is therefore worthwhile utilising BDDs to construct other types of set solver. Indeed, it has been previously demonstrated [5, 6] that set bounds propagation can be efficiently implemented using BDDs to represent constraints and domains of variables. A major benefit of the BDD-based approach is that it frees us from the need to laboriously construct set bounds propagators for each new constraint by hand. Moreover, correctness and optimality of such BDD-based propagators follow by construction. The other advantages of the BDD-based representation identified above still apply, and the resulting solver performs very favourably when compared with existing set bounds solvers. But set bounds propagation using BDDs still constructs BDDs during propagation, which is a considerable overhead. In this paper we show how we can perform BDD-based set bounds propagation using a marking algorithm that performs linear scans of the BDD representation of the constraint without constructing new BDDs. The resulting set bounds propagators are substantially faster than those using BDDs. We can use the same linear pass to detect elements of the set which can make further difference in propagation, and construct a filter on the propagator to prevent invoking it unless one of the variables that can make a difference changes. To summarize, the benefits of the approach of this paper are:
• efficiency: no new BDDs are constructed during propagation, so it is very fast;
• reuse: we can reuse a single BDD for multiple copies of the same constraint, and hence handle larger problems;
• ordering: we are not restricted to a single global ordering of Booleans for constructing BDDs; and
• filtering: we can keep track of which parts of the set variable can really make a difference, and reduce the amount of propagation.
We illustrate a prototype solver using the approach on well-known set problems, comparing against the state of the art Gecode set bounds propagation solver.
2 Preliminaries
Propagation based approaches to solving set constraint problems represent the problem using a domain storing the possible values of each set variable, and propagators for each constraint, that remove values
from the domain of a variable that are inconsistent with values for other variables. Propagation is combined with backtracking search to find solutions. A domain D is a complete mapping from the fixed finite set of variables V to finite collections of finite sets of integers. The domain of a variable v is the set D(v). A domain D1 is said to be stronger than a domain D2, written D1 ⊑ D2, if D1(v) ⊆ D2(v) for all v ∈ V. A domain D1 is equal to a domain D2, written D1 = D2, if D1(v) = D2(v) for all variables v ∈ V. A domain D can be interpreted as the constraint ⋀_{v∈V} v ∈ D(v). For set constraints we will often be interested in restricting variables to take on convex domains. A set of sets K is convex if a, b ∈ K and a ⊆ c ⊆ b implies c ∈ K. We use interval notation [a, b] where a ⊆ b to represent the (minimal) convex set K including a and b. For any finite collection of sets K = {a1, a2, ..., an}, we define the convex closure of K: conv(K) = [∩_{a∈K} a, ∪_{a∈K} a]. We extend the concept of convex closure to domains by defining ran(D) to be the domain such that ran(D)(x) = conv(D(x)) for all x ∈ V. A valuation θ is a set of mappings from the set of variables V to sets of integer values, written {x1 → d1, ..., xn → dn}. A valuation can be extended to apply to constraints involving the variables in the obvious way. Let vars be the function that returns the set of variables appearing in an expression, constraint or valuation. In an abuse of notation, we say a valuation is an element of a domain D, written θ ∈ D, if θ(vi) ∈ D(vi) for all vi ∈ vars(θ).

Constraints, Propagators and Propagation Solvers. A constraint is a restriction placed on the allowable values for a set of variables. We shall use primitive set constraints such as (membership) k ∈ v, (equality) u = v, (subset) u ⊆ w, (union) u = v ∪ w, (intersection) u = v ∩ w, (cardinality) |v| = k, (upper cardinality bound) |v| ≤ k, (lexicographic order) u < v, where u, v, w are set variables and k is an integer. We can also construct more complicated constraints which are (possibly existentially quantified) conjunctions of primitive set constraints. We define the solutions of a constraint c to be the set of valuations θ on vars(c) that make the constraint true. We associate a propagator with every constraint. A propagator f is a monotonically decreasing function from domains to domains, so D1 ⊑ D2 implies that f(D1) ⊑ f(D2), and f(D) ⊑ D. A propagator f is correct for a constraint c if and only if for all domains D:

{θ | θ ∈ D} ∩ solns(c) = {θ | θ ∈ f(D)} ∩ solns(c)

A propagation solver solv(F, D) for a set of propagators F and a domain D repeatedly applies the propagators in F starting from the domain D until a fixpoint is reached. solv(F, D) is the weakest domain D′ ⊑ D where f(D′) = D′ for all f ∈ F.

Domain and Bounds Consistency. A domain D is domain consistent for a constraint c if D is the smallest domain containing all solutions θ ∈ D of c. We define the domain propagator for a constraint c as

dom(c)(D)(v) = {θ(v) | θ ∈ solns(D ∧ c)} if v ∈ vars(c), and dom(c)(D)(v) = D(v) otherwise.
Then dom(c)(D) is always domain consistent with c. A domain D is (set) bounds consistent for a constraint c if for every variable v ∈ vars(c) the upper bound of D(v) is the union of the values of v in all solutions of c in D, and the lower bound of D(v) is the intersection of the values of v in all solutions of c in D.
We define the set bounds propagator for a constraint c as

sb(c)(D)(v) = conv(dom(c)(ran(D))(v)) if v ∈ vars(c), and sb(c)(D)(v) = D(v) otherwise.

Then sb(c)(D) is always bounds consistent with c.

BDDs. We assume a set B of Boolean variables with a total ordering ≺. We make use of the following Boolean operations: ∧ (conjunction), ∨ (disjunction), ¬ (negation), → (implication), ↔ (bi-implication) and ∃ (existential quantification). We denote by ∃V F the formula ∃x1 ··· ∃xn F where V = {x1, ..., xn}, and by ∃̄V F we mean ∃V′ F where V′ = vars(F) \ V. Reduced Ordered Binary Decision Diagrams are a well-known method of representing Boolean functions on Boolean variables using directed acyclic graphs with a single root. Every internal node n(v, f, t) in a BDD r is labelled with a Boolean variable v ∈ B, and has two outgoing arcs: the 'false' arc (to BDD f) and the 'true' arc (to BDD t). Leaf nodes are either F (false) or T (true). Each node represents a single test of the labelled variable; when traversing the tree the appropriate arc is followed depending on the value of the variable. Define the size |r| as the number of internal nodes in a BDD r, and VAR(r) as the set of variables v ∈ B appearing in some internal node of r. Reduced Ordered Binary Decision Diagrams (BDDs) [1] require that the BDD is: reduced, that is, it contains no identical nodes (nodes with the same variable label and identical 'then' and 'else' arcs) and no redundant tests (no node has both arcs leading to the same node); and ordered: if there is an arc from a node labelled v1 to a node labelled v2, then v1 ≺ v2. A BDD has the nice property that the function representation is canonical up to variable reordering. This permits efficient implementations of many Boolean operations. BDDs can represent an arbitrary Boolean formula over variables B. We shall be interested in stick BDDs, where for every internal node n(v, f, t) exactly one of f or t is the constant F node. Stick BDDs represent exactly the formulae of the form ⋀_{v∈T} v ∧ ⋀_{v∈F} ¬v where T and F are disjoint subsets of B. A Boolean variable v is said to be fixed in a BDD r if either for every node n(v, f, t) ∈ r, t is the constant F node, or for every node n(v, f, t), f is the constant F node. Such variables can be identified in a linear time scan over the domain BDD. For convenience, if φ is a BDD, we write ⌊φ⌋ to denote the BDD representing the conjunction of the fixed variables of φ. Note ⌊φ⌋ is a stick BDD.
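As a small illustration of the convex-closure operation used above (our sketch; sets are plain Python sets):

    def conv(K):
        """conv(K) = [intersection of K, union of K] for a non-empty
        collection of sets, as defined in Section 2."""
        K = [set(a) for a in K]
        return set.intersection(*K), set.union(*K)

    # the convex closure of {{1}, {1,2}, {1,3}} is the interval [{1}, {1,2,3}]
    assert conv([{1}, {1, 2}, {1, 3}]) == ({1}, {1, 2, 3})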
3 Set Propagation using BDDs
The key step in building set propagation using BDDs is to realize that we can represent a finite set domain using a BDD.

Representing domains. If v is a set variable ranging over subsets of {1, ..., N}, then we can represent v using the Boolean variables V(v) = {v1, ..., vN} ⊆ B, where vi is true iff i ∈ v. We will order the variables v1 ≺ v2 ≺ ··· ≺ vN. We can represent a valuation θ using the formula

R(θ) = ⋀_{v∈vars(θ)} ( ⋀_{i∈θ(v)} v_i ∧ ⋀_{i∈{1,...,N}−θ(v)} ¬v_i ).

Then the domain D(v) of a variable v can be represented as ⋁_{a∈D(v)} R({v → a}). This formula can be represented by a BDD.
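A sketch of the Boolean encoding R(θ) (ours; variables are modelled as (v, i) pairs rather than BDD nodes):

    def R(theta, N):
        """Truth values of the Booleans v_i for a set valuation theta:
        (v, i) is true iff i is in theta[v]."""
        return {(v, i): i in s for v, s in theta.items() for i in range(1, N + 1)}

    # {x -> {1}} over N = 2 fixes x1 = true and x2 = false
    assert R({"x": {1}}, 2) == {("x", 1): True, ("x", 2): False}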
Representing constraints. We can similarly model any set constraint c as a BDD B(c) using the Boolean variable representation V(v) of its set variables v. By ordering the variables in each BDD carefully we can build small representations of the formulae. The pointwise order of Boolean variables is defined as follows. Given set variables u ≺ v ≺ w ranging over sets from {1, ..., N}, we order the Boolean variables as u1 ≺ v1 ≺ w1 ≺ u2 ≺ v2 ≺ w2 ≺ ··· ≺ uN ≺ vN ≺ wN. The representation B(c) is simply ⋁_{θ∈solns(c)} R(θ). For primitive set constraints (using the pointwise order) this size is linear in N. For more details see [6]. The BDD representation of x = y ∪ z is shown in Figure 2(a).

BDD-based Set Bounds Propagation. We can build a set bounds propagator, more or less from the definition, since we have BDDs to represent domains and constraints:

φ = B(c) ∧ ⋀_{v′∈vars(c)} D(v′)

sb(c)(D)(v) = ∃̄_{V(v)} ⌊φ⌋
We simply conjoin the domains to the constraint obtaining φ, then extract the fixed variables from the result, and then project out the relevant part for each variable v. The set bounds propagation can be improved by removing the fixed variables as soon as possible. The improved definition is given in [5]. Overall, the complexity can be made O(|B(c)|). The updated set bounds can be used to simplify the BDD representing the propagator. Since fixed variables will never interact further with propagation, they can be projected out of B(c), so we can replace B(c) by ∃_{VAR(⌊φ⌋)} φ.
4 Faster Set Bounds Propagation
While set bounds propagation using BDDs is much faster than set domain propagation (or other variations of propagation for sets), it still creates new BDDs. This is not necessary, as long as we are prepared to give up the simplification of BDDs that is possible in set bounds propagation. We do not represent domains of variables as BDD sticks, but rather as arrays of integer values. A domain D is an array where, for a variable v ranging over subsets of {1, ..., N}: D[vi] = 0 indicates i ∉ v, D[vi] = 1 indicates i ∈ v, and D[vi] = 2 means we do not yet know whether i is in v or not. Hence D(v) = [{i | D[vi] = 1}, {i | D[vi] ≠ 0}]. The BDD representation of a constraint B(c) is built as before. A significant difference is that, since constraints only communicate through the set bounds of variables, we do not need them to share a global variable order; hence we can if necessary modify the variable order used to construct B(c) for each c, or use automatic variable reordering (which is available in most BDD packages) to construct B(c). Another advantage is that we can reuse the BDD for a constraint c(x̄) on variables x̄ for the constraint c(ȳ) on variables ȳ (as long as they range over the same initial sets), that is, the same constraint on different variables. Hence we only have to build one such BDD, rather than one for each instance of the constraint. The set bounds propagator sb(c(x̄)) for constraint c(x̄) is now implemented as follows. A generic BDD representation r of the constraint c(ȳ) is constructed. The propagator copies the domain description of the actual parameters x1, ..., xn onto a domain description E for the formal parameters y1, ..., yn. It constructs an array E
where E[y_i^j] = D[x_i^j]. Let V = {y_i^j | 1 ≤ j ≤ n, 1 ≤ i ≤ N} be the set of Boolean variables occurring in the constraint c(ȳ). The propagator executes the code bddprop(r, V, E) shown in Figure 1, which returns (r′, V′, E′). If r′ = F the propagator returns a false domain; otherwise the propagator copies back the domains of the formal parameters to the actual parameters, so that D[x_i^j] = E′[y_i^j]. We will come back to the V′ argument in the next subsection. The procedure bddprop(r, V, E) traverses the BDD r as follows. We visit each node n(v, f, t) in the BDD in a top-down memoing manner. We record whether, under the current domain, the node can reach the F node, and whether it can reach the T node. If the f child can reach the T node we add support for the variable v taking value 0. Similarly, if the t child can reach T we add support for the variable v taking value 1. If the node can reach both F and T we record that the variable v matters to the computation of the BDD. After the visit we reduce the variable set for the propagator to those variables that matter, and remove values with no support from the domain. The procedure assumes a global time variable which is incremented between propagations, and which is used to memo the marking phase. The top(n, V) function returns the variable in the root node of n, or the largest variable (under ≺) in V if n = T or n = F.

Example 1. Consider the BDD for the constraint x = y ∪ z when N = 2, shown in Figure 2(a). Assuming a domain E where E[y1] = 1 (1 ∈ y) and E[z2] = 1 (2 ∈ z), and the remaining variables take value 2, the algorithm traverses the edges shown with double lines in Figure 2(b). No path from x1 or x2 following the f arc reaches T, hence alive[x1,0] and alive[x2,0] are not marked with the current time. As a result E[x1] and E[x2] are set to 1. Hence we have determined 1 ∈ x and 2 ∈ x. Also, no nodes for z1 are actually visited, the left node for y2 only reaches F and the right node only reaches T. Hence matters[z1] and matters[y2] are not marked with the current time. The set of vars collected by bddprop is empty, since the remaining variables are fixed. □
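The 0/1/2 domain array is just a flat encoding of the set bounds; a small sketch (ours) of the two directions of the translation:

    def bounds_to_array(lb, ub, N):
        """Encode bounds [lb, ub]: 1 = definitely in, 0 = definitely out,
        2 = unknown (i in ub but not in lb)."""
        return [1 if i in lb else (2 if i in ub else 0) for i in range(1, N + 1)]

    def array_to_bounds(D):
        return ({i + 1 for i, d in enumerate(D) if d == 1},
                {i + 1 for i, d in enumerate(D) if d != 0})

    assert bounds_to_array({1}, {1, 2}, 3) == [1, 2, 0]
    assert array_to_bounds([1, 2, 0]) == ({1}, {1, 2})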
4.1 Waking up less often
In practice, a bounds propagation solver does not blindly apply each propagator until fixpoint, but keeps track of which propagators must still be at fixpoint, and only executes those that may not be. For set bounds this is usually managed as follows. To each set variable v is attached a list of the propagators c that involve v. Whenever v changes, these propagators are rescheduled for execution. We can do better than this with the BDD-based propagators. The algorithm bddprop collects the set of Boolean variables that matter to the BDD, that is, that can change the result. If a variable that does not matter is fixed, then set bounds propagation cannot learn any new information. We modify the wakeup process as follows. Each variable x^j stores a list of pairs (f, S) of a propagator f with the subset S of the variables x_i^j which matter to the propagator under the current domain. When the variable changes, we traverse the list of propagators and wake those propagators where the change intersects with S. On executing a propagator, we revise the set S stored in the list for variable x^j to be {x_i^j | y_i^j ∈ vars}, where vars is the set of "interesting" variables returned by bddprop. Note that the same optimization could be applied to the standard approach, but it would require the overhead of computing VAR(r′), which here is folded into bddprop.
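The wakeup filter itself is a one-line intersection test; a sketch (ours), under the assumption that changes and watch sets are sets of Boolean variables:

    def should_wake(changed, watched):
        """Reschedule propagator f only if some changed Boolean variable is
        in its current 'matters' set S."""
        return bool(changed & watched)

    # a change to x2 alone does not wake a propagator watching {x1, x3}
    assert not should_wake({"x2"}, {"x1", "x3"})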
bddprop(r, V, E) {
  (reachf, reacht) = bddp(r, V, E);
  if (¬reacht) return (F, ∅, E);
  vars = ∅;
  for (v ∈ V) {
    for (d ∈ {0, 1})
      if (alive[v, d] < time) E[v] = 1 − d;   // value d has no support
    if (E[v] = 2 ∧ matters[v] ≥ time) vars = vars ∪ {v};
  }
  return (r, vars, E);
}

bddp(node, V, E) {
  switch node {
    F: return (1, 0);
    T: return (0, 1);
    n(v, f, t):
      if (visit[node] ≥ time) return save[node];   // memoed this round
      reachf = 0; reacht = 0;
      if (E[v] ≠ 1) {                              // v may still take value 0
        (rf0, rt0) = bddp(f, V, E);
        reachf = reachf ∨ rf0; reacht = reacht ∨ rt0;
        if (rt0) {
          for (v′ ∈ V, v ≺ v′ ≺ top(f, V))         // skipped variables: both values supported
            alive[v′, 0] = alive[v′, 1] = time;
          alive[v, 0] = time;
        }
      }
      if (E[v] ≠ 0) {                              // v may still take value 1
        (rf1, rt1) = bddp(t, V, E);
        reachf = reachf ∨ rf1; reacht = reacht ∨ rt1;
        if (rt1) {
          for (v′ ∈ V, v ≺ v′ ≺ top(t, V))
            alive[v′, 0] = alive[v′, 1] = time;
          alive[v, 1] = time;
        }
      }
      if (reachf ∧ reacht) matters[v] = time;
      save[node] = (reachf, reacht);
      visit[node] = time;
      return (reachf, reacht);
  }
}
Figure 1. Pseudo-code for BDD propagation.

Figure 2. (a) The BDD representing x = y ∪ z where N = 2. A node n(v, f, t) is shown as a circle around v with a dashed arrow to f and a full arrow to t. (b) The edges traversed by bddprop, when E[y1] = 1, E[z2] = 1 and E[v] = 2 otherwise, are shown doubled. (Diagrams not reproducible in this text.)

5 Experimental Results
We have built a prototype set bounds solver implementing the algorithms described. Currently a Prolog engine takes the definition of the problem and uses an interface to the BDD package CUDD [10] to construct the generic BDDs. It then creates a C file for a backtracking solver with data structures for the BDDs. This prototype is very expensive in terms of compilation time, ranging from 0.36–4.65s for Steiner and 0.52–2.42s for golfers, but the actual BDD creation time is a tiny proportion of this, at most 30ms and usually unmeasurable (0ms). In a direct implementation the compilation time will effectively shrink to the BDD creation time. Experiments were conducted on a 2.66GHz Core2 Duo with 2 GB of RAM running Ubuntu GNU/Linux 7.04. We compare against the state of the art set bounds propagators of Gecode 2.0 [3].

Steiner Systems. A commonly used benchmark for set constraint solvers is the calculation of small Steiner systems. A Steiner system S(t, k, N) is a set X of cardinality N and a collection C of subsets of X of cardinality k (called 'blocks'), such that any t elements of
X are in exactly one block. Any Steiner system must have exactly m = C(N, t)/C(k, t) blocks (Theorem 19.2 of [9]), where C(·, ·) denotes the binomial coefficient. We use the same modelling of the problem as [8], extended for the case of more general Steiner systems. We model each block as a set variable s1, ..., sm, with the constraints:
⋀_{i=1}^{m} (|s_i| = k)  ∧  ⋀_{i=1}^{m−1} ⋀_{j=i+1}^{m} ( (∃u_{ij}. u_{ij} = s_i ∩ s_j ∧ |u_{ij}| ≤ t − 1) ∧ (s_i < s_j) )
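The number of blocks and the quadratic number of pairwise constraints follow directly from this model; a symbolic sketch (ours; constraints are returned as tuples, not BDDs):

    from itertools import combinations
    from math import comb

    def steiner_model(t, k, N):
        m = comb(N, t) // comb(k, t)            # Theorem 19.2 of [9]
        blocks = [f"s{i}" for i in range(1, m + 1)]
        cards = [(b, k) for b in blocks]        # |si| = k
        pairs = list(combinations(blocks, 2))   # |si ∩ sj| <= t-1 and si < sj
        return blocks, cards, pairs

    blocks, cards, pairs = steiner_model(2, 3, 7)   # S(2,3,7)
    assert len(blocks) == 7 and len(pairs) == 21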
To compare the raw performance of the bounds propagators, we performed experiments using a model of the problem with primitive constraints and intermediate variables u_{ij} directly as shown above, equivalent to the Gecode model. The results are shown in the "Split Constraints" section of Table 1. Gecode has slightly better search behaviour than our solver because its set bounds propagators take into account cardinality information. Clearly, the raw propagation speed of the BDD solver is better than Gecode's, except for the cases where N is large. Note that the BDD solver of [6] cannot handle the largest four Steiner problems with split constraints, because there are too many Boolean variables for the BDD package. Of course, the BDD representation permits us to merge primitive constraints and remove intermediate variables, allowing us to model the problem as C(m, 2) binary constraints (containing no intermediate variables u_{ij}) corresponding to the second line above, conjoined with the cardinality constraints for s_i and s_j. Results for this improved model are shown in the "Merged Constraints" section of Table 1. Here the search is reduced and the propagation speed usually significantly increased, though filtering is less beneficial.

Social Golfers. Another common set benchmark is the "Social Golfers" problem, which consists of arranging N = g × s golfers into g groups of s players for each of w weeks, such that no two players play together more than once. Again, we use the same model as [8], using a w × g matrix of set variables v_{ij} where 1 ≤ i ≤ w and 1 ≤ j ≤ g. Gecode is restricted to use separate constraints, while the BDD solver uses merged constraints.
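Since the golfers model of [8] is only referenced here, the following is our reconstruction of its usual shape (variable names and encoding are assumptions): a w × g matrix of group variables with cardinality, weekly-partition and pairwise-overlap constraints:

    from itertools import combinations

    def golfers_model(w, g, s):
        v = [[f"v_{i}_{j}" for j in range(g)] for i in range(w)]
        cards = [(x, s) for row in v for x in row]     # |v_ij| = s
        weeks = [(row, g * s) for row in v]            # each week partitions all g*s players
        groups = [x for row in v for x in row]
        overlaps = list(combinations(groups, 2))       # any two groups share <= 1 player
        return v, cards, weeks, overlaps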
Table 1. Performance results on Steiner Systems: first solution (F) and all solutions (A). Time in seconds for 1000 runs (first solution problems) and one run (all solutions) and number of failures are given for Gecode and the BDD solver for split constraints and the BDD solver for merged constraints. Two times for the BDD set bounds solver are shown: time without filtering and time+f with filtering. A first-fail “element-in-set” labelling strategy is used in all cases. “—” denotes failure to complete a test case within 240 minutes. × denotes a case where our naive trailing implementation for filtering runs out of space.
Problem                       Gecode                Split Constraints                 Merged Constraints
                        time       fails       time       time+f     fails       time      time+f    fails
S(2,3,7)    F           0.41       2           0.30       0.24       2           0.12      0.11      0
S(3,4,8)    F           5.43       14          1.70       1.56       14          0.96      0.90      2
S(2,3,9)    F           47.98      395         37.04      14.50      542         5.05      5.69      121
S(2,4,13)   F           3.33       4           3.58       1.98       4           2.24      2.17      2
S(2,3,15)   F           19.86      6           29.01      16.54      6           15.61     15.66     3
S(3,4,16)   F           1688.81    90          431.53     ×          90          474.22    530.52    58
S(2,5,21)   F           14.97      4           20.50      19.68      4           19.38     19.68     3
S(3,6,22)   F           495.9      118         271.14     243.27     142         554.44    668.509   96
S(2,3,31)   F           1098.82    14          1659.71    1891.04    14          1198.95   1301.64   11
S(2,3,7)    A           0.27       6.10×10^3   0.29       0.22       1.17×10^4   0.01      0.02      1.07×10^3
S(3,4,8)    A           1018.84    6.36×10^6   10108.89   7875.12    1.44×10^7   58.93     58.95     4.32×10^5
S(2,3,9)    A           2593.03    3.15×10^7   —          —          —           287.05    324.10    8.81×10^6
Table 2. First-solution performance results on the Social Golfers problem. Time in seconds for 100 runs and number of failures are given for both solvers. A first-fail "element-in-set" labelling strategy is used in all cases.

            Gecode                  Merged Constraints
Problem     time      fails         time      time+f    fails
2-5-4       0.33      14            0.21      0.14      30
2-6-4       7.71      860           5.77      2.55      2036
2-7-4       34.30     2935          19.58     8.9       4447
3-5-4       0.81      14            0.46      0.44      30
3-6-4       18.57     863           22.91     11.82     2039
3-7-4       93.42     2974          64.06     41.77     4492
4-5-4       0.65      1388          0.30      0.26      2886
4-6-5       225.92    5355          298.43    209.22    12747
4-7-4       142.58    2979          137.80    103.25    4498
4-9-4       10.52     54            7.8       5.49      71
5-5-4       149.73    2495          50.73     28.29     2758
5-7-4       308.61    3062          218.58    190.9     4582
5-8-3       5.07      10            3.29      2.36      14
6-5-3       102.84    1621          35.93     17.05     1615
6-6-3       3.06      4             1.74      1.23      5
Experimental results are shown in Table 2. Interestingly, the merged constraints here are not enough to match the pruning of the set bounds propagators of Gecode, which include cardinality considerations. Notwithstanding the greater search space, the BDD set solver is still substantially faster than Gecode. For these examples filtering is always beneficial, sometimes making the solver twice as fast. If we compare against the BDD solver of [6] on these examples, our new solver is around 30 times faster (although the machines used are not identical).
6 Related Work
BDD based set solvers were introduced by [8], originally for domain propagation, and then extended to bounds, split, lex and cardinality propagation [6]. The combination of BDD based set bounds propagation with nogoods was introduced in [7]. Another approach to automatically constructing set bounds propagators is defined in [12]. A similar approach to using BDDs in propagation was previously defined for solving SAT problems in [2]. This approach informally defines a marking approach to BDD propagation, but does not consider sets, generic constraints, or filtering.
7 Conclusion
In this paper we have improved the BDD-based technique of set bounds propagation. The traversal approach to propagation we presented is at least an order of magnitude faster than the previous technique utilizing BDD operations. The prototype implementation of our method is significantly faster than the state of the art set constraint solver of Gecode. As demonstrated by [7], further improvements in the solver performance can be straightforwardly achieved by incorporating nogoods generation [11].
REFERENCES
[1] Randal E. Bryant, 'Graph-based algorithms for Boolean function manipulation', IEEE Trans. Comput., 35(8), 677-691, (1986).
[2] R.F. Damiano and J.H. Kukula, 'Checking satisfiability of a conjunction of BDDs', in Proceedings of the Design Automation Conference, pp. 818-823, (2003).
[3] Gecode. www.gecode.org. Accessed Jan 2008.
[4] Carmen Gervet, 'Interval propagation to reason about sets: Definition and implementation of a practical language', Constraints, 1(3), 191-246, (1997).
[5] P. Hawkins, V. Lagoon, and P.J. Stuckey, 'Set bounds and (split) set domain propagation using ROBDDs', in 17th Australian Joint Conference on Artificial Intelligence, volume 3339 of LNCS, pp. 706-717, (2004).
[6] P. Hawkins, V. Lagoon, and P.J. Stuckey, 'Solving set constraint satisfaction problems using ROBDDs', Journal of Artificial Intelligence Research, 24, 106-156, (2005).
[7] P. Hawkins and P.J. Stuckey, 'A hybrid BDD and SAT finite domain constraint solver', in Proceedings of the 8th International Symposium on Practical Aspects of Declarative Languages, volume 3819 of LNCS, pp. 103-117, (2006).
[8] V. Lagoon and P.J. Stuckey, 'Set domain propagation using ROBDDs', in Proceedings of the 10th International Conference on Principles and Practice of Constraint Programming, volume 3258 of LNCS, pp. 347-361, (2004).
[9] J.H. van Lint and R.M. Wilson, A Course in Combinatorics, Cambridge University Press, 2nd edn., 2001.
[10] Fabio Somenzi. CUDD: Colorado University Decision Diagram package. http://vlsi.colorado.edu/~fabio/CUDD/. Accessed May 2004.
[11] S. Subbarayan, 'Efficient reasoning for nogoods in constraint solvers with BDDs', in Proceedings of the Tenth International Symposium on Practical Aspects of Declarative Languages, volume 4902 of LNCS, pp. 53-57, (2008).
[12] G. Tack, C. Schulte, and G. Smolka, 'Generating propagators for finite set constraints', in Twelfth International Conference on Principles and Practice of Constraint Programming, volume 4204 of LNCS, pp. 575-589, (2006).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-510
A New Approach for Solving Satisfiability Problems with Qualitative Preferences

Emanuele Di Rosa, Enrico Giunchiglia and Marco Maratea (DIST, Università di Genova, Italy, email: {emanuele,enrico,marco}@dist.unige.it)

Abstract. The problem of expressing and solving satisfiability problems (SAT) with qualitative preferences is central in many areas of Computer Science and Artificial Intelligence. In previous papers, it has been shown that qualitative preferences on literals allow for capturing qualitative/quantitative preferences on literals/formulas, and that an optimal model for a satisfiability problem with qualitative preferences on literals can be computed via a simple modification of the Davis-Logemann-Loveland procedure (DLL): given a SAT formula, an optimal solution is computed by simply imposing that DLL branches according to the partial order on the preferences. Unfortunately, it is well known that introducing an ordering on the branching heuristic of DLL may cause an exponential degradation in its performance. The experimental analysis reported in these papers highlights that such degradation can indeed show up in the presence of a significant number of preferences. In this paper we propose an alternative solution which does not require any modification of the DLL heuristic: once a solution is computed, a constraint is added to the input formula imposing that the new solution (if any) has to be better than the last computed. We implemented this idea, and the resulting system can lead to significant improvements wrt the original proposal when dealing with MIN-ONE/MAX-SAT problems corresponding to qualitative preferences on structured instances.
1 Introduction
The problem of expressing and solving satisfiability problems with qualitative preferences is central in many areas of Computer Science and Artificial Intelligence. For instance, in planning, besides the goals that have to be achieved, it is common to have other "soft" goals that it would be desirable to satisfy: a plan is a solution which achieves all the goals, and an "optimal" plan is one which also achieves as many soft goals as possible. In planning as satisfiability [16] with soft goals [13], the task of finding an optimal plan is reduced to a satisfiability problem with qualitative preferences. Here, for simplicity, we consider qualitative preferences on literals, in which preferences are modeled as a set S of literals, and the relative importance of satisfying each literal in the set S is captured with a partial order on S. In [12, 13], it has been shown that
1. qualitative preferences on formulas and quantitative preferences on literals/formulas can be reduced to qualitative preferences on literals; and
2. it is possible to compute an optimal solution (wrt the expressed preferences) via a simple modification of the Davis-Logemann-Loveland procedure (DLL): in more detail, an optimal solution
is computed by imposing that branching occurs according to the partial order on the literals in the set of preferences. This method for computing an optimal solution has the advantage that it only requires a simple modification of existing state-of-the-art SAT solvers, all of which are based on DLL. However, it is well known that introducing an ordering on the branching heuristic of DLL may cause an exponential degradation in its performance [15]. OPTSAT is the name given to the related system built on top of MINISAT [10]. The experimental analysis reported in [12, 13] highlights that such degradation can show up in the presence of a significant number of preferences. In this paper we propose an alternative solution which does not require any modification of the DLL heuristic, and which thus does not have the above mentioned disadvantage. In a few words, once a solution is computed, a blocking formula is added to the input formula imposing that the new solution (if any) will be better than the last computed wrt the expressed qualitative preference on literals (a sketch of this loop is given at the end of this section). Our approach works with any qualitative preference on literals, and thus (via the reductions described in [12, 13]) with any qualitative/quantitative preference on literals/formulas. We extended OPTSAT in order to incorporate this new method. In the following, we use OPTSAT-HS to refer to OPTSAT when using the method described in [12], and OPTSAT-BF to refer to OPTSAT when using the method described here. To comparatively test the effectiveness of the approach, we consider MAX-SAT and MIN-ONE problems, in their non-partial/partial (in the partial MIN-ONE (resp. MAX-SAT) problem, the optimization has to be performed on a subset of the variables (resp. clauses) of the problem) and qualitative/quantitative versions, as in [12]. Our selection of benchmarks includes problems from the last MAX-SAT evaluation (http://www.maxsat07.udl.es/) and well-known satisfiability planning problems, and does not include problems with a (pseudo-)random structure. Indeed, OPTSAT is based on MINISAT, and MINISAT has been designed to solve large but relatively easy industrial SAT problems (and not small but relatively difficult randomly generated problems). In the qualitative case of (partial) MIN-ONE and MAX-SAT problems, the experimental results show that OPTSAT-BF performs better than OPTSAT-HS. The reasons for the good performance of OPTSAT-BF are: 1. the good quality of the first computed solution, and 2. the few iterations required to reach the optimal solution. In the quantitative case, OPTSAT-BF is also competitive with respect to the other state-of-the-art systems for MAX-SAT, including the best-performing systems in the recent PB and MAX-SAT evaluations. Summing up, the main contributions of the paper are:
• We define a new approach for solving satisfiability problems with qualitative preferences.
• We formally state some properties of our algorithm.
• We extend OPTSAT in order to implement this new approach.
• On (partial) MAX-SAT and MIN-ONE non-(pseudo-)random problems, we show that OPTSAT-BF performs better than OPTSAT-HS in the qualitative case, and that it is competitive wrt other state-of-the-art systems in the quantitative case.
The paper is structured as follows. In Section 2 we review our formalism for expressing preferences. Section 3 is dedicated to the presentation of the algorithm behind OPTSAT-BF and its formal properties. Section 4 presents the experimental analysis we conducted. Section 5 ends the paper with some final remarks.
2 Satisfiability and Qualitative Preferences
Consider a finite set P of variables. A literal is a variable x or its negation ¬x. We assume ¬¬x = x. A clause is a finite disjunction of literals and a formula is a finite conjunction of clauses. As customary in SAT, we also represent clauses as sets of literals and formulas as sets of clauses, and we use ⊤ and ⊥ to denote the empty set of clauses and the empty clause, respectively. For example, given the 4 variables Fish, Meat, RedWine, WhiteWine, the formula

  {{¬Fish, ¬Meat}, {¬RedWine, ¬WhiteWine}}   (1)
models the fact that we cannot have both fish (Fish) and meat (Meat), nor both red (RedWine) and white (WhiteWine) wine. An assignment is a consistent set of literals. If l ∈ μ, we say that both l and ¬l are assigned by μ. An assignment μ is total if each literal l is assigned by μ. A total assignment μ satisfies a formula ϕ if for each clause C ∈ ϕ, C ∩ μ ≠ ∅. A model μ of a formula ϕ is an assignment satisfying ϕ. A formula ϕ entails a formula ψ if the models of ϕ are a subset of the models of ψ. For instance, (1) has 9 models. In the following, we abbreviate a total assignment with the set of variables assigned to true, and we write μ |= ψ to indicate that μ is a model of ψ. For instance, we write {Fish, WhiteWine} as an abbreviation for the total assignment {Fish, ¬Meat, WhiteWine, ¬RedWine}, in which the only variables assigned to true are Fish and WhiteWine, i.e., the situation in which we have fish and white wine. A qualitative preference on literals is a partially ordered set of literals, i.e., a pair ⟨S, ≺⟩ where S is a set of literals (also called the set of preferences), and ≺ is a partial order on S. Intuitively, S represents the set of literals that we would like to have satisfied, and ≺ models the relative importance of our preferences. For example,

  ⟨{Fish, RedWine, WhiteWine}, {WhiteWine ≺ RedWine}⟩   (2)
models the case in which we prefer to have fish and both red and white wine; in the case in which it is not possible to have both red and white wine, we prefer white to red wine. A qualitative preference ⟨S, ≺⟩ on literals can be extended to the set of total assignments as follows: Given two total assignments μ and μ′, μ is preferred to μ′ (μ ≺ μ′) if and only if
1. there exists a literal l ∈ S with l ∈ μ and l ∉ μ′; and
2. for each literal l′ ∈ S ∩ (μ′ \ μ), there exists a literal l ∈ S ∩ (μ \ μ′) such that l ≺ l′.
A model μ of a formula ϕ is optimal if it is a minimal element of the partially ordered set of models of ϕ. For instance, considering the qualitative preference (2), the formula (1) has only one optimal model, i.e., {Fish, WhiteWine}.
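To make the definition concrete, here is a minimal Python sketch of the preference check on total assignments; the encoding (literals as (variable, polarity) pairs, total assignments as their sets of true variables) and all names are ours, not part of the paper.

    # A total assignment is given as the set of variables assigned to true
    # (the abbreviation used in the text); S is a set of preferred literals,
    # and 'order' is the strict partial order as a set of pairs (l, l2)
    # meaning l precedes (is more important than) l2.

    def holds(lit, mu):
        var, positive = lit
        return (var in mu) == positive

    def preferred(mu, mu2, S, order):
        """True iff mu is preferred to mu2 wrt (S, order)."""
        # Condition 1: some preference holds in mu but not in mu2.
        if not any(holds(l, mu) and not holds(l, mu2) for l in S):
            return False
        # Condition 2: every preference gained by mu2 is outweighed by a
        # more important preference gained by mu.
        gained_by_mu2 = [l for l in S if holds(l, mu2) and not holds(l, mu)]
        gained_by_mu = [l for l in S if holds(l, mu) and not holds(l, mu2)]
        return all(any((l, l2) in order for l in gained_by_mu)
                   for l2 in gained_by_mu2)

    # Preference (2): S = {Fish, RedWine, WhiteWine}, WhiteWine before RedWine.
    S = {("Fish", True), ("RedWine", True), ("WhiteWine", True)}
    order = {(("WhiteWine", True), ("RedWine", True))}
    assert preferred({"Fish", "WhiteWine"}, {"Fish", "RedWine"}, S, order)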
We recall that qualitative preferences on formulas can be reduced to qualitative preferences on literals (see [13]), and that, by propositionally encoding the objective function to maximize/minimize, it is also possible to reduce quantitative preferences to qualitative ones, see [12].
3 Solving satisfiability problems with preferences
Consider a formula ϕ and a qualitative preference on literals ⟨S, ≺⟩. The problem of computing an optimal model of ϕ wrt ⟨S, ≺⟩ can be solved by
1. computing a (not necessarily optimal) model μ of ϕ,
2. adding a formula which restricts the subsequent search for models to those which are preferred to μ, and
3. iterating the above two steps up to the point where the last assignment found can no longer be improved.
Crucial for the above procedure is a condition which enables us to say which models are preferred (wrt ⟨S, ≺⟩) to an assignment μ. The preference formula for μ wrt ⟨S, ≺⟩ is

  (∨_{l∈S, ¬l∈μ} l) ∧ (∧_{l′∈S, l′∈μ} ((∨_{l∈S, ¬l∈μ, l≺l′} l) ∨ l′)).   (3)
An assignment μ′ is preferred to μ wrt ⟨S, ≺⟩ iff μ′ satisfies (3), as stated by the following theorem.
Theorem 1 Let μ and μ′ be two total assignments. Let ⟨S, ≺⟩ be a qualitative preference. μ′ is preferred to μ wrt ⟨S, ≺⟩ if and only if μ′ satisfies the preference formula for μ wrt ⟨S, ≺⟩.
As an example of the application of the theorem above, consider the following particular cases:
1. S ⊆ μ (e.g., because there are no preferences, S = ∅): In this case (3) is equivalent to ⊥, meaning that there is no assignment which is preferred to μ, i.e., that μ is already optimal;
2. ⟨S, ≺⟩ = ⟨{l1, ..., ln}, ∅⟩: In this case (3) becomes (∨_{l∈S, ¬l∈μ} l) ∧ (∧_{l′∈S, l′∈μ} l′), meaning that any assignment μ′ with μ′ ≺ μ must be such that μ ∩ S ⊂ μ′ ∩ S.
Considering the preference (2),
1. if μ1 = {Meat, RedWine}, then (3) is ψ1: (Fish ∨ WhiteWine) ∧ (WhiteWine ∨ RedWine);
2. if μ2 = {Meat, WhiteWine}, then (3) is ψ2: (Fish ∨ RedWine) ∧ WhiteWine;
3. if μ3 = {Fish, WhiteWine}, then (3) is ψ3: RedWine ∧ Fish ∧ WhiteWine.
Notice that μ2 ≺ μ1 and μ3 ≺ μ2: As a consequence ψ2 entails ψ1 and ψ3 entails ψ2. Further, as the last example makes clear, it is indeed possible that the preference formula for an assignment is inconsistent with the given set of constraints; this is an obvious consequence of the fact that the definition of (3) does not take the input formula into account. In the case in which the preference formula for an assignment μ is inconsistent with the input set of clauses, μ is optimal.
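Read operationally, Theorem 1 yields a solve-and-block loop. The sketch below (ours, reusing holds from the previous sketch) builds the clauses of (3) and iterates a black-box complete SAT solver; the callable solve is an assumption of ours, not part of OPTSAT.

    def preference_clauses(mu, S, order):
        """Clauses of the preference formula (3) for the total assignment mu
        (given as its set of true variables): any total assignment satisfying
        them is preferred to mu wrt (S, order)."""
        falsified = [l for l in S if not holds(l, mu)]
        clauses = [list(falsified)]                   # left conjunct of (3)
        for l2 in (l for l in S if holds(l, mu)):     # right conjunct of (3)
            clauses.append([l for l in falsified if (l, l2) in order] + [l2])
        return clauses

    def optimal_model(phi, S, order, solve):
        """'solve' is any black-box complete SAT solver taking a clause list
        and returning a total assignment (set of true variables) or None."""
        mu_opt = None
        clauses = list(phi)
        while True:
            mu = solve(clauses)
            if mu is None:
                return mu_opt              # None iff phi is unsatisfiable
            mu_opt = mu
            # By Theorem 3 below, the previous blocking clauses are entailed
            # by the new ones, so they are discarded rather than accumulated.
            clauses = list(phi) + preference_clauses(mu, S, order)

Note that when every preference in S holds in mu, the first blocking clause is empty, so the next call to solve fails and mu is reported optimal, matching particular case 1 above.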
⟨S, ≺⟩ := a qualitative preference on literals; ϕ := the input formula;
ψ := ⊤; μopt := ∅

function PREF-DLL(ϕ ∪ ψ, μ)
1  if (⊥ ∈ (ϕ ∪ ψ)μ) return FALSE;
2  if (μ is total) μopt := μ; ψ := Reason(μ, ⟨S, ≺⟩); return FALSE;
3  if ({l} ∈ (ϕ ∪ ψ)μ) return PREF-DLL(ϕ ∪ ψ, μ ∪ {l});
4  l := ChooseLiteral(ϕ ∪ ψ, μ);
5  return PREF-DLL(ϕ ∪ ψ, μ ∪ {l}) or PREF-DLL(ϕ ∪ ψ, μ ∪ {¬l}).
Figure 1. The algorithm of PREF-DLL.
As we have already said at the beginning of the section, Theorem 1 allows us to use any complete SAT solver as a black box for computing an optimal assignment. Once a model μ of a formula ϕ is found, the formula (3) is computed and added to ϕ, and then the SAT solver can be invoked again: The returned model is ensured to be preferred to μ. However, given that all the state-of-the-art systems are based on DLL, it is possible, following what has been successfully done in various areas of automated deduction (see, e.g., [2]), to add the formula (3) as soon as μ is determined, i.e., during the search. The resulting procedure is represented in Figure 1. In the figure:
• ϕ is the input set of clauses, ⟨S, ≺⟩ is a qualitative preference on literals, μopt is the (current) optimal assignment, ψ is the set of clauses corresponding to the preference formula for μopt wrt ⟨S, ≺⟩, and μ is an assignment;
• (ϕ ∪ ψ)μ is the set of clauses obtained from ϕ ∪ ψ by (i) deleting the clauses C ∈ ϕ ∪ ψ with μ ∩ C ≠ ∅, and (ii) substituting the other clauses C ∈ ϕ ∪ ψ with C \ {l : ¬l ∈ μ};
• Reason(μ, ⟨S, ≺⟩) returns the set of clauses corresponding to the preference formula for μ wrt ⟨S, ≺⟩;
• ChooseLiteral(ϕ ∪ ψ, μ) returns a literal in ϕ ∪ ψ which is unassigned by μ.
It is easy to see that PREF-DLL is exactly the same as DLL, except that once a model μ is determined (see line 2),
1. μ is stored in μopt,
2. the preference formula for μ wrt ⟨S, ≺⟩ is stored in ψ, and
3. FALSE is returned.
Notice that PREF-DLL generalizes DLL in the sense that if there are no preferences (i.e., if S = ∅), PREF-DLL behaves as DLL: Indeed, if S = ∅ then any model is optimal, and as soon as one model μ is found, the preference formula for μ wrt ⟨S, ≺⟩ (i.e., ⊥) determines the termination of PREF-DLL.
Theorem 2 Let ϕ be a formula and ⟨S, ≺⟩ a qualitative preference on literals. PREF-DLL(ϕ, ∅) terminates, and then μopt is empty if ϕ is unsatisfiable, and an optimal model of ϕ wrt ⟨S, ≺⟩ otherwise.
Besides the above, one interesting property of PREF-DLL is its "anytime" property: The sequence of models μ1, μ2, ..., μn computed by PREF-DLL is ensured to be such that μi+1 is preferred to μi, i.e., μi+1 ≺ μi (0 < i < n). Thus, PREF-DLL is as fast as DLL in computing the first model of the input set of clauses, and, time permitting, from that point on it can only improve the quality of the model found.
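Figure 1 can be rendered directly as a recursive Python sketch (ours): no unit-propagation data structures and no learning, with clauses and literals encoded as in the sketches above, and ψ/μopt kept in a mutable state dict to mimic the global updates of line 2.

    def neg(l):
        return (l[0], not l[1])

    def reduce_clauses(clauses, mu):
        """(phi u psi)_mu: drop clauses satisfied by mu and delete from the
        remaining clauses the literals falsified by mu."""
        out = []
        for c in clauses:
            if any(l in mu for l in c):
                continue
            out.append([l for l in c if neg(l) not in mu])
        return out

    def pref_dll(phi, mu, variables, S, order, state):
        clauses = reduce_clauses(phi + state["psi"], mu)
        if any(len(c) == 0 for c in clauses):              # line 1
            return False
        if len(mu) == len(variables):                      # line 2: model found
            state["mu_opt"] = set(mu)
            true_vars = {v for (v, positive) in mu if positive}
            state["psi"] = preference_clauses(true_vars, S, order)  # Reason
            return False
        units = [c[0] for c in clauses if len(c) == 1]     # line 3
        if units:
            return pref_dll(phi, mu | {units[0]}, variables, S, order, state)
        free = [v for v in variables
                if (v, True) not in mu and (v, False) not in mu]
        l = clauses[0][0] if clauses else (free[0], True)  # line 4
        return (pref_dll(phi, mu | {l}, variables, S, order, state)      # line 5
                or pref_dll(phi, mu | {neg(l)}, variables, S, order, state))

    def optimise(phi, variables, S, order):
        state = {"psi": [], "mu_opt": set()}    # psi starts as the empty set
        pref_dll(phi, set(), variables, S, order, state)
        return state["mu_opt"]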
Also notice that in Figure 1 we called Reason the procedure for computing the preference formula (3). Indeed, most of the current SAT solvers (at least those meant for applications) are based on learning: As soon as a clause C becomes empty, C is returned and then used by the learning mechanism of the solver to backjump over irrelevant nodes while backtracking and, with learning, to prune the subsequent search of the solver. Such a clause C is often called a "reason" or conflict clause, and it has the property that it is falsified by the assignment μ which caused C to become empty (i.e., for each literal l ∈ C, ¬l ∈ μ). In our case, with solvers based on learning, as soon as the assignment μ is total and no empty clause is detected, we can return the clause C corresponding to the left conjunct of (3) as conflict clause: Indeed, ∨_{l∈S, ¬l∈μ} l is falsified by μ. However, we must also add the other clauses corresponding to (3) to the input set of clauses, since these are needed to ensure that the search will continue looking for another model μ′ of the input formula with μ′ ≺ μ. Fortunately, the clauses added to the input set of clauses do not need to be retained indefinitely (otherwise PREF-DLL could have an exponential blow-up in space): Once a new model μ′ with μ′ ≺ μ is found, we can discard the clauses added because of μ, since they are entailed by the new clauses added because of μ′, as stated by the following theorem.
Theorem 3 Let ⟨S, ≺⟩ be a qualitative preference. Let μ1, μ2, ..., μn be the sequence of models computed by PREF-DLL, and ψ1, ψ2, ..., ψn be the corresponding preference formulas. For each i, 0 < i < n, ψi+1 entails ψi.
In PREF-DLL (see Figure 1), the preference formula ψi for μi is overwritten as soon as a new model μi+1 is determined (line 2). PREF-DLL is thus guaranteed to work in polynomial space in the size of the input formula and qualitative preference.
4 Implementation and experimental analysis
We extended OPTSAT [12] in order to incorporate these ideas. OPTSAT is built on top of MINISAT [10], the 2005 version, winner of the SAT 2005 competition in the industrial benchmarks category (together with the SAT/CNF minimizer SATELITE [9]): This choice was motivated by our interest in solving, in particular, large structured problems coming from applications. The two versions of OPTSAT, OPTSAT-HS and OPTSAT-BF, are the ones that we consider in the case of qualitative preferences. In the case of quantitative preferences, OPTSAT encodes the objective function using the methods described in [23, 3]: Here we used the one based on [23]. Table 1 shows the results for OPTSAT-HS and OPTSAT-BF on a variety of problems detailed below. The table also shows the results for various other state-of-the-art solvers, included for completeness. In particular we considered both
• dedicated solvers for MAX-SAT problems, like BF [6]; MAXSOLVER [24]; TOOLBAR [21, 17] ver. 3.0; MAXSATZ, version submitted to the 2007 Evaluation [18]; and MINIMAXSAT ver. 1.0 [14], abbreviated MMSAT in the table; and
• generic Pseudo-Boolean solvers, like OPBDP ver. 1.1.1 [4]; PBS ver. 2.1 and ver. 4 [1]; MINISAT+ ver. 1.13 [11], abbreviated MSAT+ in the table; GLPPB ver. 0.2, by the same authors of PUEBLO [22], as submitted to the 2007 Evaluation (http://www.eecs.umich.edu/~hsheini/pueblo/); and BSOLO ver. 3.0.17 [19].
MAXSATZ and MINIMAXSAT were the winners of the recent Max-SAT Evaluation 2007 in the "Max-SAT" and "Partial Max-SAT" categories, respectively. MINISAT+ was the solver able to prove
class                     #I   OPTSAT-HS  OPTSAT-BF | OPBDP       PBS4        MSAT+       BSOLO       MAXSATZ     MMSAT       OPTSAT-HS   OPTSAT-BF
1  Partial MINONE         21   77.99(19)  2.7(21)   | -           223.14(15)  43.32(18)   433.21(16)  -           391.21(12)  74.28(21)   69.89(21)
2  MINONE                 26   0.69(26)   0.2(26)   | 85.37(7)    17.56(19)   7.33(24)    115.73(22)  -           87.21(24)   93.24(24)   23.99(25)
3  MAXSAT                 35   26.68(34)  11.25(35) | 20.89(3)    98.55(10)   130.37(31)  192.56(23)  274.38(22)  229.73(21)  218.86(31)  175.12(31)
4  MAXCUT/spinglass        5   0.01(5)    0.01(5)   | 0.99(1)     66.67(1)    0.86(1)     76.57(1)    33.19(3)    1.09(3)     7.56(1)     7.52(1)
5  MAXCUT/dimacs mod      62   0.01(62)   0.01(62)  | 230.33(5)   0.01(2)     247.54(7)   0.01(2)     59.27(52)   194.52(52)  66.86(4)    21.61(3)
6  PSEUDO/garden           7   0.02(7)    0.01(7)   | 2.2(4)      147.58(4)   0.25(5)     30.18(4)    -           4.75(5)     22.8(5)     36.66(5)
7  PSEUDO/logic-synthesis 17   0.03(17)   0.01(17)  | -           85.88(1)    490.36(5)   -           -           81.93(2)    90.36(3)    338.26(3)
8  PSEUDO/primes         148   4.81(130)  0.19(131) | 16.65(85)   18.08(90)   11.52(104)  22.23(94)   -           62.08(107)  31.8(103)   60.59(109)
9  PSEUDO/routing         15   11.69(15)  3.12(15)  | 81.83(5)    102.75(9)   43.74(15)   373.73(8)   -           109.49(14)  41.49(15)   36.1(15)
10 MAXONE/structured      60   0.96(60)   0.13(60)  | 296.26(35)  11.48(60)   2.02(58)    40.96(60)   -           22.5(60)    293(56)     7.87(58)
11 MAXCLIQUE/structured   62   0.01(62)   0.06(62)  | 70.37(16)   23.79(13)   154.39(22)  248.26(14)  -           61.97(36)   54.14(19)   178.04(23)
Table 1. Results for solving satisfiability problems with qualitative (columns 4-5) and quantitative (columns 6-13) preferences. Problems are: partial MIN-ONE (row 1), MIN-ONE (row 2), MAX-SAT (rows 3-5), and partial MAX-SAT (rows 6-11).
unsatisfiability and optimality for a larger number of instances than all the other solvers that entered the Pseudo-Boolean Evaluation 2005 [20], and the best-performing solver (together with BSOLO) also in the Pseudo-Boolean Evaluation 2006, category OPT-SMALLINT-LIN. BSOLO and GLPPB were the best-performing PB solvers in the OPT-SMALLINT-LIN category of the recent Pseudo-Boolean Evaluation 2007. Considering the dedicated solvers for MAX-SAT, we discarded BF, MAXSOLVER and TOOLBAR after an initial analysis because they seem to be tailored for randomly generated problems, and are thus not suited to the problems we consider here. Among the Pseudo-Boolean solvers, we do not show the results for PBS ver. 2.1 and GLPPB because they are almost always slower than PBS ver. 4.0 and BSOLO, respectively, and, ultimately, they manage to solve only a few of the instances we consider. About the benchmarks, we considered a wide set of instances, mainly coming from real-world applications. In particular, we used SATPLAN 2004, release of 10 Feb. 2006, to generate the partial MIN-ONE problems of row 1: In more detail, we considered several domains from previous International Planning Competitions (IPCs), generated the first satisfiable instances with SATPLAN, and, for each such instance, we considered the partial MIN-ONE problem of minimizing the set of action variables set to true. For the MIN-ONE and MAX-SAT problems, we selected well-known satisfiable and unsatisfiable SAT instances from several domains, i.e., Formal Verification instances from the Beijing'96 competition, planning problems from SATPLAN contributed by Kautz and Selman, Data Encryption Standard (DES) instances, quasi-group instances, bounded model checking (BMC) problems used in the original BMC paper [5], and miter-based circuit equivalence benchmarks by Joao Marques-Silva: Each of the satisfiable instances corresponds to a MIN-ONE problem and the results are presented in row 2, while the unsatisfiable instances correspond to the MAX-SAT problems whose results are in row 3. Finally, we also included in our analysis (partial) MAX-SAT problems from the recent MAX-SAT evaluation, rows 4-11: As emerges from the results of this evaluation (see the slides at http://www.maxsat07.udl.es/ms07-pre.pdf), these benchmarks are hard; the performances of the best solvers differ only by a factor; no solver clearly wins; and it is difficult to solve even a single instance more than the other solvers. Each solver has been run using its default settings. All the experiments have been run on a Linux box equipped with a Pentium IV 3.2GHz processor and 1GB of RAM. CPU time is measured in seconds; the timeout has been set to 1800 seconds. In Table 1,
• column 2 is the class of the problems;
• column 3 is the number of instances in the class;
• columns 4-5 are dedicated to qualitative preferences; and
• columns 6-13 are for the quantitative case.
Results for solvers are cumulatively presented as in the reports of the MAX-SAT Evaluations: Given a set of instances, we show the mean CPU time over the solved instances, and the number of solved instances (in parentheses). MAXSATZ can only deal with MAX-SAT problems, and this is why the corresponding results for MIN-ONE and partial MIN-ONE/MAX-SAT are missing. In the qualitative case we can see that OPTSAT-BF (column 5) is consistently better than OPTSAT-HS (column 4), both in mean CPU time and in solved instances: OPTSAT-BF solves the same number of instances as OPTSAT-HS, or more, and in less time, sometimes dramatically (see, e.g., rows 1 and 8), except for row 11, which is nonetheless solved very easily by both solvers. In the quantitative case, OPTSAT-BF performs well also on these benchmarks. We have to remind the reader that these benchmarks do not include many problems from the last evaluations, because of their (pseudo-)random structure, which is not suited to our solver. For fairness, this also implies that it is not clear whether the problems we selected are suited to the other solvers in our analysis. Indeed, we conducted a preliminary analysis on the (pseudo-)random problems we excluded, and we got a different picture, in which other solvers (and in particular MMSAT) emerge.

class                     T1    Q1      #Sols  Tf     Qf
1  Partial MINONE         2.68  45.5    2.5    2.7    44.1
2  MINONE                 0.19  751.6   2      0.2    751.6
3  MAXSAT                 0.05  8605.2  21.2   11.25  8847.6
4  MAXCUT/spinglass       0.01  770.4   2      0.01   770.4
5  MAXCUT/dimacs mod      0.01  695.9   2.2    0.01   701.9
6  PSEUDO/garden          0.01  496     2      0.01   496
7  PSEUDO/logic-synthesis 0.01  152.2   2      0.01   152.2
8  PSEUDO/primes          0.18  368.4   2      0.19   368.4
9  PSEUDO/routing         3.12  58.7    2      3.12   58.7
10 MAXONE/structured      0.12  240.5   8.4    0.13   249.8
11 MAXCLIQUE/structured   0.06  430.4   2      0.06   430.4
Table 2. CPU time for finding the first (column T1) and the optimal (column Tf) solution; 1 + the number of models computed by OPTSAT-BF (column #Sols); quality of the first (column Q1) and of the optimal (column Qf) solution.
In order to understand the good behavior of our algorithm, Table 2 shows, for each class, the average of the CPU times for finding the first (even if not optimal) solution (column T1) and the optimal solution (column Tf); the average quality of the first (column Q1) and of the optimal (column Qf) solution, where quality is measured as the number of variables assigned to true for (partial) MIN-ONE problems, and as the number of satisfied clauses for (partial) MAX-SAT problems; and the average of 1 + the number of models computed by OPTSAT-BF (column #Sols). Looking at the table, we see that the good performance of OPTSAT-BF can be explained by the following factors:
1. the relative quality of the first solution (i.e., Qf/Q1 for rows 1-2 and Q1/Qf for rows 3-11) is usually very high, greater than 0.96; and
2. the number of intermediate solutions generated before the optimal one is low: For 9 classes out of 11, the number in column #Sols is less than or equal to 2.5. Considering that 2 indicates that the first computed model is already optimal, this means that the algorithm converges to an optimal model very quickly.
Finally, note how, for the two classes in which the first solution is of low quality, i.e., rows 3 and 10 in Table 2, the convergence is very different: For the MAXSAT class in row 3, T1 is negligible, and all CPU time is spent in "filling the gap" to the optimal result; while for the MAXONE/structured class, most of the time is spent looking for the first solution. As a consequence, in MAX-SAT (resp. MAXONE/structured) the optimal solution is reached by a series of relatively difficult (resp. easy) intermediate steps.
5 Conclusions
We have defined and implemented a new approach, based on DLL, for solving satisfiability problems with preferences which does not need any modification to the DLL heuristic. The basic idea is that whenever a solution is found, a formula is added to the input set of clauses ensuring that the new model (if any) will be better than the last computed one. The experimental analysis performed on a wide set of, mainly structured, (partial) MAX-SAT and MIN-ONE benchmarks has shown that it leads in most cases to significant improvements when dealing with qualitative preferences, and that it is also competitive with other state-of-the-art systems in the quantitative case. There is a huge literature on expressing and reasoning with preferences, see, e.g., [8] and the various events on preferences taking place every year. If we do not take into account [12, 13], the closest work to ours seems to be the one on CP-nets [7]: In that paper, the authors show that, by exploring the search space according to the partial order on the values of the variables, the first solution determined is guaranteed to be optimal. CP-nets allow for non-Boolean variables, but on the other hand they only allow preferences between values of the same variable to be expressed: Thus, a preference such as "I prefer a to b", where a and b are distinct propositional variables, cannot be directly captured.
REFERENCES
[1] Fadi A. Aloul, Arathi Ramani, Igor L. Markov, and Karem A. Sakallah, 'PBS: A backtrack search pseudo-Boolean solver', in Proc. SAT, (2002).
[2] Alessandro Armando, Claudio Castellini, Enrico Giunchiglia, Fausto Giunchiglia, and Armando Tacchella, 'SAT-based decision procedures for automated reasoning: a unifying perspective', in Mechanizing Mathematical Reasoning: Essays in Honor of Jörg H. Siekmann on the Occasion of His 60th Birthday, volume 2605 of LNCS, Springer Verlag, (2005).
[3] Olivier Bailleux and Yacine Boufkhad, 'Efficient CNF encoding of Boolean cardinality constraints', in Proc. CP, pp. 108-122, (2003).
[4] P. Barth, 'A Davis-Putnam enumeration algorithm for linear pseudo-Boolean optimization', Technical Report MPI-I-95-2-2003, Max Planck Institute for Computer Science, (1995).
[5] A. Biere, A. Cimatti, E. M. Clarke, M. Fujita, and Y. Zhu, 'Symbolic model checking using SAT procedures instead of BDDs', in Proceedings of the 36th Design Automation Conference (DAC'99), pp. 317-320. Association for Computing Machinery, (1999).
[6] Brian Borchers and Judith Furman, 'A two-phase exact algorithm for MAX-SAT and weighted MAX-SAT problems', J. Comb. Optim., 2(4), 299-306, (1998).
[7] Craig Boutilier, Ronen I. Brafman, Carmel Domshlak, Holger H. Hoos, and David Poole, 'CP-nets: A tool for representing and reasoning with conditional ceteris paribus preference statements', J. Artif. Intell. Res. (JAIR), 21, 135-191, (2004).
[8] Jon Doyle, 'Prospects for preferences', Computational Intelligence, 20(2), 111-136, (2004).
[9] Niklas Eén and Armin Biere, 'Effective preprocessing in SAT through variable and clause elimination', in Theory and Applications of Satisfiability Testing, 8th International Conference, SAT 2005, volume 3569 of Lecture Notes in Computer Science, pp. 61-75. Springer, (2005).
[10] Niklas Eén and Niklas Sörensson, 'An extensible SAT-solver', in Theory and Applications of Satisfiability Testing, 6th International Conference, SAT 2003, Selected Revised Papers, pp. 502-518, (2003).
[11] Niklas Eén and Niklas Sörensson, 'Translating pseudo-Boolean constraints into SAT', Journal on Satisfiability, Boolean Modeling and Computation, 2, 1-26, (2006).
[12] E. Giunchiglia and M. Maratea, 'Solving optimization problems with DLL', in Proc. of 17th European Conference on Artificial Intelligence (ECAI), pp. 377-381, (2006).
[13] Enrico Giunchiglia and Marco Maratea, 'Planning as satisfiability with preferences', in Proc. of 22nd AAAI Conference on Artificial Intelligence, pp. 987-992. AAAI Press, (2007).
[14] Federico Heras, Javier Larrosa, and Albert Oliveras, 'MiniMaxSat: A new weighted Max-SAT solver', in Proc. of Theory and Applications of Satisfiability Testing - SAT 2007, 10th International Conference, volume 4501 of LNCS, pp. 41-55. Springer, (2007).
[15] Matti Järvisalo, Tommi Junttila, and Ilkka Niemelä, 'Unrestricted vs restricted cut in a tableau method for Boolean circuits', Annals of Mathematics and Artificial Intelligence, 44(4), 373-399, (August 2005).
[16] Henry Kautz and Bart Selman, 'Planning as satisfiability', in Proc. ECAI, pp. 359-363, (1992).
[17] Javier Larrosa, Federico Heras, and Simon de Givry, 'A logical approach to efficient Max-SAT solving', Artificial Intelligence, 172, 204-233, (2008).
[18] Chu Min Li, Felip Manyà, and Jordi Planes, 'New inference rules for Max-SAT', Journal of Artificial Intelligence Research (JAIR). To appear, (2007).
[19] V. M. Manquinho and J. P. Marques-Silva, 'On using cutting planes in pseudo-Boolean optimization', Journal on Satisfiability, Boolean Modeling and Computation (JSAT), 2, 209-219, (2006).
[20] Vasco Miguel Manquinho and Olivier Roussel, 'The first evaluation of pseudo-Boolean solvers (PB'05)', Journal on Satisfiability, Boolean Modeling and Computation, 2, 103-143, (2006).
[21] S. de Givry, J. Larrosa, P. Meseguer, and T. Schiex, 'Solving Max-SAT as weighted CSP', in Proc. of 9th International Conference on Principles and Practice of Constraint Programming (CP 2003), volume 2833 of Lecture Notes in Computer Science, pp. 363-376, (2003).
[22] Hossein M. Sheini and Karem A. Sakallah, 'Pueblo: A modern pseudo-Boolean SAT solver', in 2005 Design, Automation and Test in Europe Conference and Exposition (DATE 2005), 7-11 March 2005, Munich, Germany, pp. 684-685. IEEE Computer Society, (2005).
[23] Joost P. Warners, 'A linear-time transformation of linear inequalities into conjunctive normal form', Information Processing Letters, 68(2), 63-69, (1998).
[24] Z. Xing and W. Zhang, 'MaxSolver: An efficient exact algorithm for (weighted) maximum satisfiability', Artificial Intelligence, 164(1-2), 47-80, (2005).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-515
Combining binary constraint networks in qualitative reasoning

Jason Jingshi Li 1,2 and Tomasz Kowalski 1 and Jochen Renz 1,2 and Sanjiang Li 3

Abstract. Constraint networks in qualitative spatial and temporal reasoning are always complete graphs. When one adds an extra element to a given network, previously unknown constraints are derived by intersections and compositions of other constraints, and this may introduce inconsistency to the overall network. Likewise, when combining two consistent networks that share a common part, the combined network may become inconsistent. In this paper, we analyse the problem of combining such binary constraint networks and develop conditions which ensure that combining two networks can never introduce an inconsistency, for a given spatial or temporal calculus. This enables us to maintain a consistent world-view while acquiring new information related to some part of it. In addition, our results enable us to prove other important properties of qualitative spatial and temporal calculi in areas such as representability and complexity.
1 INTRODUCTION
An important ability of intelligent systems is to handle spatial and temporal information. Qualitative calculi such as the Region Connection Calculus (RCC8) [10] or Allen's Interval Algebra (IA) [1] intend to capture such information by representing relationships between entities in space and time. Such calculi have different advantages compared to quantitative spatial and temporal representations such as coordinate systems. They are closer to everyday human cognition, deal well with incomplete knowledge, and can be computationally more efficient than, say, the full machinery of metric spaces. Defining a qualitative calculus is very intuitive. What is required is a domain of spatial or temporal entities, a set of jointly exhaustive and pairwise disjoint (JEPD) relations between the entities of the domain, and a (weak) composition between the relations. These properties are essential for enabling constraint-based reasoning techniques for qualitative calculi [13]. However, not all qualitative calculi that can be defined in this way are equally well suited for representing and reasoning about spatial and temporal information. Consider two consistent sets Θ1, Θ2 of spatial or temporal information. It is clear that if both sets refer to different entities, then combining the two sets will also lead to a consistent set, as there are no potentially conflicting constraints. If the two sets contain information about the same entities, then it is clear that combining the two sets might lead to inconsistencies, as your favourite crime story will amply demonstrate. Here we are interested in a particular kind of combination of sets, namely, combining sets that share only a very small number of entities and where the relationships between the shared entities are identical in both sets. Assume, for example, that Θ1 and Θ2 contain consistent information about the spatial relationships of entities in two adjacent rooms, Θ1 for room 1 and Θ2 for room 2. Assume further that the two rooms are connected by n closed doors such that the relationships between the n doors are exactly the same in Θ1 and Θ2, and the doors are the only entities contained in both sets. Without considering any additional information (e.g. that there is only one computer in total, but both room 1 and room 2 contain the computer according to Θ1 and Θ2), it is common sense that combining both sets Θ1 and Θ2 into Θ = Θ1 ∪ Θ2 cannot lead to an inconsistency. However, as several examples in the literature show [6], there are qualitative calculi where this property is not satisfied and where inconsistencies are introduced when combining two sets that share a small number of entities with identical relations. Such calculi are counterintuitive, and it is questionable whether they should be used for spatial or temporal representation and reasoning at all, as they introduce inconsistencies where there should not be any. Apart from this problem, there are some practically very important advantages of using a qualitative calculus that allows the consistent combination of two consistent sets of information: (1) It opens up the possibility to use divide-and-conquer techniques and to split a large set of qualitative constraints into smaller sets that can be processed independently. This is an essential requirement for hierarchical reasoning and may also speed up reasoning. (2) It becomes possible to ignore or filter additional information if it is clear that it will not affect the information important to us. Unfortunately, there is currently no general way of determining for which qualitative calculi consistent sets can be consistently combined and for which calculi unnecessary inconsistencies are introduced. Some initial results were obtained by Li and Wang [6], where a special case of this problem called one-shot extensibility was analysed. Li and Wang considered the case of consistently extending a consistent atomic set of RCC8 constraints by one additional entity, and showed manually, by an extensive case analysis, that this is always possible for RCC8. Li and Wang also showed that one-shot extensibility is an essential requirement for other important computational properties of a qualitative calculus. In this paper we analyse combinations where two sets share at most two entities and identify a method for automatically testing if this is always possible for a given qualitative calculus. This case is particularly important for different reasons: (1) It provides a purely algebraic and very general proof for one-shot extensibility [6]. (2) It (partially) solves some fundamental questions related to algebraic closure, consistency and (weak) composition, and (3) it provides a purely symbolic test for when a relation algebra is representable.
1 RSISE, The Australian National University, Canberra ACT 0200, Australia, email: jason.li|tomasz.kowalski|jochen.renz@anu.edu.au
2 NICTA Canberra Research Laboratory, Canberra ACT 2601, Australia
3 State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, P.R. China, email: lisanjiang@tsinghua.edu.cn
2 PRELIMINARIES
A qualitative calculus such as RCC8 or the Interval Algebra defines relationships over a given set of spatial or temporal entities, the domain D. The basic relations B form a partition of D × D which is jointly exhaustive and pairwise disjoint, i.e., between any two elements of the domain exactly one basic relation holds [7]. RCC8, for example, uses a topological space of extended regions as the domain and defines eight basic relations DC, EC, PO, EQ, TPP, NTPP, TPPi, NTPPi, which are verbalised as disconnected, externally connected, partial overlap, equal, tangential proper part, non-tangential proper part, and the converse relations of the latter two [10]. In this paper we intensively use the precise mathematical definitions of relations, algebras and different algebraic operators, which we summarize in the following. A more detailed overview can be found in [3, 7, 13]. A nonassociative relation algebra (NA) is an algebraic structure A = (A, ∧, ∨, ;, −, ˘, 1', 0, 1), such that
• (A, ∧, ∨, −, 0, 1) is a Boolean algebra;
• (A, ;, ˘, 1') is an involutive groupoid with unit, that is, a groupoid satisfying the following equations:
  (a) x ; 1' = x = 1' ; x   (b) x˘˘ = x   (c) (x ; y)˘ = y˘ ; x˘
• the operations ; and ˘ are normal operators, that is, they satisfy the following equations:
  - x ; 0 = 0 = 0 ; x
  - 0˘ = 0
  - x ; (y ∨ z) = (x ; y) ∨ (x ; z)
  - (x ∨ y) ; z = (x ; z) ∨ (y ; z)
  - (x ∨ y)˘ = x˘ ∨ y˘
• the following equivalences hold: (x ; y) ∧ z = 0 iff (z ; y˘) ∧ x = 0 iff (x˘ ; z) ∧ y = 0. These are called Peircean laws or triangle identities.
A nonassociative relation algebra is a relation algebra (RA) if the multiplication operation (;) is associative. For more on relation algebras and nonassociative relation algebras see [4] and [8]. Let A be a NA. For any set U, called a domain, let R(U) be the algebra (℘(U × U); ∪, ∩, ◦, −, ⁻¹, Δ, ∅, U × U), where the operations are union, intersection, composition, complement, converse, the identity relation, the empty relation and the universal relation (all with their standard set-theoretical meaning). Notice that since ◦ is associative, R(U) is a RA. We say that A is weakly represented over U if there is a map μ : A → ℘(U × U) such that μ commutes with all operations except ;, for which we require only

  μ(a ; b) ⊇ μ(a) ◦ μ(b).

This property of weak representation gives rise to a notion of weak composition of relations, namely, for R, S ∈ μ[A], we define R ⋄ S to be the smallest relation Q ∈ μ[A] containing R ◦ S. Every NA has a weak representation, for example a trivial one, with U = ∅. Of course, interesting weak representations are nontrivial, and typically injective. A weak representation is a representation if μ is injective and the inclusion above is in fact an equality, that is, if μ is an embedding of relation algebras. In such a case weak composition equals composition [12], and this is expressed by saying that weak composition is extensional. Not every NA, indeed not every RA, is representable.
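To illustrate weak composition, the following Python sketch (ours, not from the paper) computes R ⋄ S as the smallest union of atoms containing R ∘ S, for atoms given as concrete binary relations over a finite domain; the point-algebra example shows that the result can strictly exceed the set-theoretic composition.

    def compose(R, S):
        """Set-theoretic composition of two binary relations."""
        return {(x, z) for (x, y) in R for (y2, z) in S if y == y2}

    def weak_compose(R, S, atoms):
        """R <> S: the smallest union of atoms containing R o S."""
        rs = compose(R, S)
        out = set()
        for a in atoms:
            if a & rs:      # the atom meets R o S, so it must be included
                out |= a
        return out

    # The point algebra over the three-element chain 0 < 1 < 2:
    U = range(3)
    lt = {(x, y) for x in U for y in U if x < y}
    eq = {(x, x) for x in U}
    gt = {(x, y) for x in U for y in U if x > y}
    # Weak composition can strictly exceed composition:
    assert weak_compose(lt, gt, [lt, eq, gt]) == lt | eq | gt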
Although weak representations are not as interesting as representations, curiously, it is the former that gave rise to the notion of a qualitative calculus, which is a triple (A, U, μ) where A is a NA, U is a set, and μ : A → ℘(U × U) is a weak representation of A over U. Since (A, U, μ) is notationally cumbersome, we will later write A for both a NA and a corresponding calculus (A, U, μ), if U and μ are clear from context or their precise form is not important. A calculus (A, U, μ) is extensional if μ is a representation of A. Notice that if (A, U, μ) is extensional, then A is a RA, indeed a representable one. The converse need not hold, as the example of RCC8 demonstrates. All NAs considered in this paper are assumed to be finite (hence atomic) and such that 1' is an atom. These are severe restrictions on the class of NAs, but natural from a qualitative-calculi point of view.
A network N over a NA A is a pair (V, ℓ) where V is a set of vertices (nodes) and ℓ : V² → A is any function. Thus, a network is a complete directed graph labelled by elements of A. Abusing notation a little, we will often write N for the set of vertices of N, if it does not cause confusion. Where double precision is important, we will write VN and ℓN for the set of vertices of N and its labelling function, respectively. For convenience we assume that the set V of nodes is always a set of natural numbers. We will also frequently refer to the label on the edge (i, j) as Rij. A network M is a subnetwork of N if all nodes and labels of M are also nodes and labels of N. We will write M ≤ N in such case. A network M is a refinement of N if VM = VN and ℓM(i, j) ≤ ℓN(i, j) for any i, j ∈ V (where ≤ is the natural ordering among the labels as elements of A). A network is atomic if all the labels are basic relations (atoms) of A. To indicate atomicity we will sometimes use lower-case labels rij. A network N is algebraically closed (a-closed) if the following hold:
1. Rii is the equality relation (identity element of A);
2. Rij ⋄ Rjk ≥ Rik for any i, j, k ∈ N.
Networks may be viewed as approximations to (weak) representations; indeed, if μ is a weak representation of A over a domain U, then μ[A] is an a-closed network over A. An arbitrary network N over A is consistent with respect to a weak representation μ if N is a subnetwork of μ[A]. This paper is mostly concerned with combining a-closed networks without introducing inconsistencies. Let N0, N1, N2 be a-closed networks, such that N0 ≤ N1 and N0 ≤ N2. The triple (N0, N1, N2) is called a V-formation. A V-formation (N0, N1, N2) can be amalgamated if there is an a-closed network M such that N1 ≤ M and N2 ≤ M. Such an M is called an amalgam of N1 and N2 over N0, or just an amalgam if the rest is clear from the context. Notice that we do not formally require VM = VN1 ∪ VN2. However, if an amalgam M exists, its restriction to M′ ≤ M with VM′ = VN1 ∪ VN2 is an amalgam as well, so we can always assume that M only has nodes from N1 and N2.
Definition 1 (Network Amalgamation Property) Let A be a qualitative calculus (NA). A has the Network Amalgamation Property (NAP) if any V-formation (N0, N1, N2) of networks over A can be amalgamated by a network M over A.
Clearly NAP is a hard property to come by, so some restrictions are necessary. One such restriction calls for the common subnetwork N0 to be small in the following sense.
Definition 2 (k-Amalgamation Property) Let A be a qualitative calculus (NA).
A has the k-Amalgamation Property (k-AP) if any V-formation (N0, N1, N2) of networks over A, such that |N0| ≤ k, can be amalgamated by a network M over A.
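The a-closedness conditions translate directly into code. In this sketch (ours), a network is a dict from ordered node pairs to labels (sets of atoms), wc is any weak-composition function lifted to labels (it could, for instance, be tabulated with weak_compose above), and identity is the label of 1'; all names are our assumptions.

    def is_a_closed(nodes, ell, wc, identity):
        """Direct rendering of the two a-closedness conditions."""
        if any(ell[i, i] != identity for i in nodes):       # condition 1
            return False
        return all(ell[i, k] <= wc(ell[i, j], ell[j, k])    # condition 2
                   for i in nodes for j in nodes for k in nodes)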
Figure 1. (a) 3-extensibility and (b) 4-extensibility. Both amalgamate over the edge (1,2); the dashed arrows represent the new edges.
It is obvious that n-AP implies m-AP for n ≥ m. The smallest interesting case for a qualitative calculus is that of 2-AP. We will approach it step by step, beginning with |N1| = |N2| = 3, i.e., the amalgamation of two triangles over a common edge. We will show that this follows from the associativity of A. The next case, namely |N1| = 4 and |N2| = 3 (adding a triangle to a square), turns out to be crucial. We will analyse it in some detail and then show that a certain strong version of this case implies 2-AP for atomic networks.
3 EXTENSIBILITY
In this section we deal with 2-AP for the case with |N2| = 3, which can be seen as extending an a-closed network N1 by a triangle N2 over a common edge. We refer to this as a one-shot extension [6].
Definition 3 ((generic) k-extensibility) Let A be a qualitative calculus (NA) and k a natural number. A is k-extensible if any atomic V-formation (N0, N1, N2) of networks over A, such that |N0| = 2, |N1| = k and |N2| = 3, can be amalgamated by a network M over A. If the Ni (i ∈ {0, 1, 2}) are non-atomic, then A is generically k-extensible (see Figure 1).
Lemma 1 Let A be a RA. If A is associative, then A is generically 3-extensible.
Proof sketch. Let N0 = {1, 2}, N1 = {0, 1, 2} and N2 = {1, 2, 3}. Put R03 = (R01 ⋄ R13) ∩ (R02 ⋄ R23). By associativity, R03 ≠ ∅. We need to show that the triangles {0, 1, 3} and {0, 2, 3} are a-closed. By symmetry it suffices to prove it for {0, 1, 3}, so we need to show three inclusions:

  (R01 ⋄ R13) ∩ (R02 ⋄ R23) ≤ R01 ⋄ R13   (1)
  R13 ≤ R10 ⋄ [(R01 ⋄ R13) ∩ (R02 ⋄ R23)]   (2)
  R01 ≤ [(R01 ⋄ R13) ∩ (R02 ⋄ R23)] ⋄ R31   (3)
The first of these is trivial; the other two follow from relation algebra identities. To show 3-extensibility, put R03 = (r01 ⋄ r13) ∩ (r02 ⋄ r23), where the rij are atoms. Then any refinement of R03 satisfies the inclusions above, so any atomic refinement r03 satisfies them as well.
Since algebras that fail associativity are somewhat pathological, the above lemma is widely applicable. Unlike 3-extensibility, 4-extensibility may fail in associative algebras, indeed even in representable ones. Consider the group Z7 (the integers under addition modulo 7) and for x, y ∈ Z7 define
• xIy if x = y
• xGy if x = y ± 1 (mod 7)
• xBy if x = y ± 2 (mod 7)
• xRy if x = y ± 3 (mod 7)
Figure 2. (a) The RA B9 and (b) the network S that is not 4-extensible.
Then {I, R, G, B} are the atoms of a representable relation algebra. Its representation, using red for R, green for G and blue for B, is shown in Figure 2. This algebra is known as B9 (cf. [5]). Consider the network S = {0, 1, 2, 3} with ℓ(0, 1) = R = ℓ(2, 3), ℓ(0, 3) = B = ℓ(1, 2), ℓ(0, 2) = G = ℓ(1, 3), and ℓ(i, i) = I, ℓ(i, j) = ℓ(j, i). Verifying that S is a-closed but not extensible is left to the reader as an exercise. Since S is atomic, B9 is not 4-extensible. We will return to S twice more in this paper, hence the fancy font. We did not find any equations that would imply 4-extensibility in a manner similar to the role of associativity in Lemma 1. Checking generic 4-extensibility exhaustively takes too long even for a relatively small calculus such as RCC8. However, we could construct RCC8 networks for which generic 4-extensibility fails. Interestingly, all such networks we managed to construct contained relations that are known to be NP-hard (cf. [11]). On the other hand, 4-extensibility can be exhaustively tested by a program that performs checks on all atomic a-closed networks with four nodes.
Theorem 1 If a qualitative calculus (A, U, μ) is extensional and A is not 4-extensible, then a-closure does not decide consistency for networks of atomic relations.
Proof sketch. In an extensional calculus, consistent networks can always be extended one-shot [6]. However, if A is not 4-extensible, then there exists an atomic network N on four nodes that has no a-closed one-shot extension. Therefore N is not consistent. One example of such an algebra is B9, with S in place of N.
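Both claims about S can be machine-checked. The sketch below (ours) derives the atom composition table of B9 from its Z7 representation, verifies that S is a-closed, and searches for an atomic a-closed triangle over the edge (0, 1) that admits no a-closed five-node completion; finding one confirms non-4-extensibility.

    from itertools import combinations, product

    # Atoms of B9 as difference sets modulo 7; I is the identity 1'.
    diff = {"I": {0}, "G": {1, 6}, "B": {2, 5}, "R": {3, 4}}
    atoms = list(diff)

    def wc(a, b):
        """Composition of two atoms, read off the Z7 representation."""
        sums = {(x + y) % 7 for x in diff[a] for y in diff[b]}
        return frozenset(c for c in atoms if diff[c] & sums)

    def wc_label(R, S):
        """Composition lifted to labels (non-empty sets of atoms)."""
        return frozenset().union(*(wc(a, b) for a in R for b in S))

    def key(i, j):
        # All B9 atoms are self-converse, so edges can be stored unordered.
        return (min(i, j), max(i, j))

    def a_closed(n, ell):
        """Condition 2 of a-closedness on all triangles; the diagonal
        (condition 1) is left implicit."""
        return all(ell[key(i, k)] <= wc_label(ell[key(i, j)], ell[key(j, k)])
                   for i in range(n) for j in range(n) for k in range(n)
                   if len({i, j, k}) == 3)

    S = {key(0, 1): "R", key(2, 3): "R", key(0, 3): "B",
         key(1, 2): "B", key(0, 2): "G", key(1, 3): "G"}
    S = {e: frozenset([a]) for e, a in S.items()}
    assert a_closed(4, S)                       # S is a-closed

    labels = [frozenset(c) for m in range(1, 5)
              for c in combinations(atoms, m)]  # all non-empty labels

    def extensible(r04, r14):
        """Does S plus a node 4, with atomic edges r04, r14 to nodes 0
        and 1, admit an a-closed completion (any labels to nodes 2, 3)?"""
        for R24, R34 in product(labels, repeat=2):
            ext = dict(S)
            ext.update({key(0, 4): frozenset([r04]),
                        key(1, 4): frozenset([r14]),
                        key(2, 4): R24, key(3, 4): R34})
            if a_closed(5, ext):
                return True
        return False

    # Atomic a-closed triangles over the edge (0, 1): by the Peircean
    # laws, R in wc(r04, r14) suffices since all atoms are self-converse.
    triangles = [(a, b) for a, b in product(atoms, repeat=2)
                 if "R" in wc(a, b)]
    assert any(not extensible(a, b) for a, b in triangles)

One failing triangle is r04 = r14 = B: the constraints force the edge to node 2 to be labelled R, after which the label of the edge to node 3 is forced to be empty.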
3.1 Strong 4-extensibility
4-extensibility allows two networks of size 3 and 4 respectively to be combined over one edge without introducing inconsistencies. In this section, we show a special case of 4-extensibility that allows us to combine any two atomic networks of arbitrary size over one edge.
Definition 4 (Strong 4-extensibility) Let A be a qualitative calculus (NA). A is strongly 4-extensible if any V-formation (N0, N1, N2) of atomic networks over A, with N0 = {1, 2}, N1 = {0, 1, 2} and N2 = {1, 2, 3, 4}, can be amalgamated by a network M over A such that for all i ∈ N2 \ N0
  Ri0 = (ri1 ⋄ r10) ∩ (ri2 ⋄ r20).
It follows easily by the triangle identities that strong 4-extensibility implies 4-extensibility. The beauty of strong 4-extensibility is that, for a given one-shot extension, the labels for new edges are precisely the
intersections of compositions of labels on existing edges. This property is in fact possessed by both RCC8 and IA, and it can be checked even more efficiently than simple 4-extensibility.
Theorem 2 If a NA A is strongly 4-extensible, then A has the 2-Amalgamation Property if N1, N2 are atomic.
Proof sketch. Let (N0, N1, N2) be a V-formation of atomic networks, with N0 = {0, 1}. Let M = N1 ∪ N2 be the network retaining all the labels from N1 and N2 and with the new labels for edges (x, y), with x ∈ Ni \ Nj and y ∈ Nj \ Ni ({i, j} = {1, 2}), defined by ℓ(x, y) = (rx0 ⋄ r0y) ∩ (rx1 ⋄ r1y). We will show that M is a-closed. Suppose the contrary. Then there is a triangle in M with edges labelled by A, B, C, such that C ≰ A ⋄ B. Now, A, B and C cannot all be edges from Ni (i ∈ {1, 2}), for Ni is a-closed. So at least one of A, B, C is of the form ℓ(x, y) with x ∈ Ni \ Nj and y ∈ Nj \ Ni ({i, j} = {1, 2}). Notice also that at most two of A, B, C can be such (three such edges do not form a triangle). We then have two cases. If there is exactly one such edge among A, B, C, it violates the assumption of 3-extensibility; if there are exactly two such edges, then it violates the assumption of strong 4-extensibility. Thus, M is a-closed as claimed.
The above theorem shows that if the calculus is strongly 4-extensible, then we can amalgamate any two atomic networks over one edge. In the following we show additional benefits of strong 4-extensibility for a qualitative calculus or relation algebra.
Definition 5 (One-Shot Extensibility [6]) A qualitative calculus (A, U, μ) is one-shot extensible if any consistent atomic V-formation (N0, N1, N2) with |N0| = 2 and |N2| = 3 can be amalgamated by a consistent atomic network M.
Corollary 1 If a qualitative calculus A is strongly 4-extensible, and a-closure decides consistency for networks of atomic and universal relations, then A is one-shot extensible.
One-shot extensibility was used in [6] to prove (for certain A) that tractability of a set of relations S is equivalent to tractability of its closure Ŝ under weak composition, intersection and converse. The method from [6] involves numerous manual calculations in the semantics of A. However, if we know that a-closure happens to decide consistency for networks of atomic and universal relations in a qualitative calculus, as it for example does in RCC8 [2], then a simple check of the composition table for strong 4-extensibility is sufficient to prove one-shot extensibility.
Definition 6 (One-Shot Proto-Extensibility) A qualitative calculus (NA) A is one-shot proto-extensible if any atomic V-formation (N0, N1, N2) with |N0| = 2 and |N2| = 3 can be amalgamated by an atomic network M.
One-shot proto-extensibility ensures that the amalgam has an a-closed atomic refinement. Its advantage over one-shot extensibility is that it is a syntactic notion, independent of any (weak) representation. Any representable algebra is trivially one-shot extensible relative to its representation.
Theorem 3 Any one-shot proto-extensible RA is representable.
Proof sketch. Let A be a RA with the required property. We build a representation of A inductively, beginning with any atomic a-closed triangle. At any given stage i, we have constructed an atomic a-closed network Ni. By one-shot proto-extensibility, we can pick any
atomic a-closed triangle T and add it to Ni, in effect amalgamating Ni and T over an edge that they share, obtaining an atomic a-closed network Ni+1. Let N = ∪_{i∈ω} Ni. Define μ by putting μ(a) = {(x, y) : ℓN(x, y) = a} for each atom a ∈ A. By the finiteness of A, each u ∈ A is a join of finitely many atoms. Thus, we can extend μ onto the whole universe of A by setting μ(u) = μ(a1) ∪ · · · ∪ μ(an), where a1, ..., an are the atoms with u = a1 ∨ · · · ∨ an. It can be verified that the μ so defined is a representation of A.
It is not the case that one-shot extensibility implies one-shot proto-extensibility, even for representable algebras. This is connected to the existence of atomic a-closed networks that are not consistent. A counterexample is again provided by B9, which is representable, hence one-shot extensible, but not one-shot proto-extensible, as the network S in Figure 2 shows.
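The construction in the proof of Theorem 2 is mechanical enough to sketch. The following illustration (ours) reuses wc_label and key from the B9 sketch above and assumes, as there, that all atoms are self-converse so that edges can be stored unordered.

    def amalgamate(nodes1, nodes2, ell, shared=(0, 1)):
        """Amalgam of two atomic networks over a shared edge: each new
        cross edge (x, y) is labelled (r_x0 <> r_0y) & (r_x1 <> r_1y)."""
        p, q = shared
        M = dict(ell)
        for x in nodes1 - nodes2:
            for y in nodes2 - nodes1:
                M[key(x, y)] = (wc_label(ell[key(x, p)], ell[key(p, y)])
                                & wc_label(ell[key(x, q)], ell[key(q, y)]))
        return M

By Theorem 2, whenever the algebra is strongly 4-extensible and both input networks are atomic and a-closed, the returned network passes the a-closedness check.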
4 ATOMIC REFINEMENT OF AMALGAMATED NETWORKS
In the previous section we showed that a NA A has the 2-Amalgamation Property over atomic networks if it is strongly 4-extensible. Then, if a calculus (A, U, μ) has the property that a-closure decides consistency for networks of atomic and universal relations, there is always an atomic amalgam of the two networks, hence the calculus is one-shot proto-extensible. However, this is not a satisfactory result, as one-shot proto-extensibility is a purely syntactic concept based on the relation algebra, and we want to be able to prove it without resorting to the semantics of the qualitative calculus. We want a procedure that ensures that the amalgam always has an a-closed atomic refinement. Such a procedure would provide a purely syntactic way to prove one-shot proto-extensibility, and hence representability.
4.1 Flexibility Ordering
Under strong 4-extensibility, each non-atomic relation in the amalgam of two networks over a common edge is precisely the intersection of the two paths from nodes in one network to the other. One way to ensure that there is always an atomic refinement of these relations such that the entire network is a-closed is to have a flexible atom (cf. [9]). A relation algebra with a set of atoms B has a flexible atom a if the following condition holds:

  ∃a ∈ B : ∀b, c ∈ B \ {1'}, a ≤ b ⋄ c

A flexible atom is contained in any composition of two atomic relations, so to make an amalgam atomic and a-closed one would just need to replace all the non-atomic relations in it by the flexible atom. However, requiring a flexible atom is a very strong condition, and we do not know of a qualitative calculus whose associated algebra has this property. Instead, we propose to construct an ordering of atoms that emulates this property when refining amalgams, given that the algebra is strongly 4-extensible. That is, we create a sequence of atomic relations such that for any non-atomic edge R in the amalgam, we can refine it to the first element in the sequence that is contained in R, and the network remains a-closed. Formally, let A be a relation algebra with a set of atoms B and S be a sequence of its atoms. A choice refinement of a non-atomic relation R over S is the first member of S that is a refinement of R.
Definition 7 (Flexibility Ordering) For a strongly 4-extensible relation algebra A, its Flexibility Ordering is a sequence S of atomic relations such that, for any amalgam M of an atomic V-formation
(N0, N1, N2) with |N0| = 2, the non-atomic relations from M can be replaced by their respective choice refinements over S and the resulting network is a-closed.
The idea is that we define a sequence S of atomic relations such that, in any M, when we replace a non-atomic edge R by its choice refinement r over S, it will never be inconsistent with the atomic edges of M, or with atomic edges which arise as choice refinements of other non-atomic relations in M that are prior or equal to r in S. To construct such a sequence, we propose an automated procedure that consists of two parts. First, for a given sequence S that may not cover all cases, we test a new atomic relation r that is not in S to see whether it is compatible with S. That is, for an amalgam M of any two atomic a-closed networks {0, 1, 2} and {1, 2, 3, 4}, in the case that no current member of S is contained in the new edge R03 but r is, we check whether the following hold:
1. If R04 is already atomic, then when we replace R03 with r, the triangle {0, 3, 4} is a-closed.
2. Else, if there exists a choice refinement r04 of R04 over S, then when we replace R03 with r and R04 with r04, the triangle {0, 3, 4} is a-closed.
3. Else, if R04 contains r, then when we replace both R03 and R04 by r, the triangle {0, 3, 4} is a-closed.
If the above hold for all such amalgams M, then r is compatible with S. The second part is the construction of the list itself. Starting from an empty list, we incrementally add atoms that pass the compatibility test with the list, and backtrack when no further candidates can pass the test. It is worth noting that each branch of the search tree may terminate early: e.g., if an atom a is not compatible with the empty ordering, then we do not have to test any orderings with a at their head.
Theorem 4 If a NA is strongly 4-extensible, and it has a Flexibility Ordering, then it is one-shot proto-extensible.
Proof sketch. From Theorem 2 we get a network M that is a-closed, but the new edges between N1 and N2 may not be atomic. However, with a Flexibility Ordering we can refine each of these edges to atomic relations, knowing that similar atomic refinements of other new edges will not introduce an inconsistent triple, since we have checked all possible cases in the construction of the Flexibility Ordering. Therefore the entire network is refined to an atomic and a-closed one, and thus the relation algebra is one-shot proto-extensible.
This general result, together with Theorem 3, allows us to prove the representability of a RA A from its composition table. This means that A can be part of an extensional qualitative calculus (A, U, μ). It also implies that consistency is preserved when amalgamating two atomic a-closed networks over two nodes if we know that a-closure decides consistency for atomic relations.
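A compact rendering (ours) of choice refinement and of the backtracking construction; the predicates compatible (the three-case test above, quantified over all relevant amalgams) and covers (does the ordering refine every non-atomic label that can arise?) are assumed callables left abstract here.

    def choice_refinement(R, ordering):
        """The first atom in the ordering that refines the label R."""
        for a in ordering:
            if a in R:
                return a
        return None

    def find_flexibility_ordering(atoms, compatible, covers, ordering=()):
        """Backtracking construction of a Flexibility Ordering: grow the
        sequence one compatible atom at a time, abandoning a branch as
        soon as no candidate passes the compatibility test."""
        if covers(ordering):
            return list(ordering)
        for a in atoms:
            if a in ordering or not compatible(a, ordering):
                continue
            found = find_flexibility_ordering(atoms, compatible, covers,
                                              ordering + (a,))
            if found is not None:
                return found
        return None     # backtrack: this branch cannot be extended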
4.2 Empirical Evaluations of Flexibility Ordering on RCC8 and Interval Algebra
Both RCC8 and IA are prime candidates on which to test for Flexibility Orderings, as they are well-known and non-trivial calculi in the spatial-temporal domain, and their respective relation algebras are both strongly 4-extensible. For RCC8, the procedure found the Flexibility Ordering (DC, EC, PO, TPP, TPPi), whereas for IA the procedure found (<, di, o, s, oi, f). Hence we have proved from their composition tables that their relation algebras are representable.
Computationally, the worst case of the procedure is O(|B|!). However, this would be extremely rare, as most branches of the search tree terminate earlier than under exhaustive search, trimming away the majority of the potential search space. For IA, with 13 atoms, the procedure found an ordering in 4 seconds on an Intel Core2Duo 2.4GHz processor with 2GB of RAM, and for RCC8 it found a solution in less than a second. Therefore, our procedure is widely applicable.
5 CONCLUSION AND FUTURE WORK
We provided sufficient conditions for amalgamating two atomic networks of any size over a common edge. Hence, for a calculus where a-closure decides consistency for networks that contain only atomic and universal relations, two atomic networks can always be consistently amalgamated. The property of strong 4-extensibility, together with other known results, also tells us when a-closure does not decide consistency for atomic networks. It provides an efficient computational test to check, for a non-extensional calculus, whether complexity results for a set of relations can be transferred to its closure. More importantly, we have provided a procedure that proves that the resulting amalgamated network has an a-closed atomic refinement, independent of any information about the domain of the calculus. This allows us to prove the representability of a relation algebra from its composition table. It also preserves consistency under the amalgamation of two atomic networks over two nodes, if a-closure decides consistency for networks of atomic relations. The first obvious future step is to see whether two atomic a-closed networks can be amalgamated over n nodes for n > 2, and then to extend this to non-atomic networks. It is also interesting to see under what conditions a calculus has the Network Amalgamation Property, that is, under what conditions networks can be combined regardless of the number of shared nodes. Secondly, our proposed notion of one-shot proto-extensibility is a sufficient, but not necessary, condition for the representability of a relation algebra: there are representable relation algebras which are not one-shot proto-extensible. It would be interesting to see if there are any connections between one-shot proto-extensibility and Hirsch-Hodkinson type games [4], and whether Hirsch-Hodkinson games can be interpreted as a sequence of one-shot extensions.
REFERENCES
[1] J.F. Allen, Maintaining knowledge about temporal intervals, Comm. ACM, 26, 832-843, 1983.
[2] B. Bennett, Determining consistency of topological relations, Constraints, 3, 213-225, 1998.
[3] I. Düntsch, Relation algebras and their application in temporal and spatial reasoning, Artif. Intell. Rev., 23-4, 315-357, 2005.
[4] R. Hirsch, I. Hodkinson, Relation algebras by games, Elsevier, 2002.
[5] P. Jipsen, E. Lukács, Minimal relation algebras, Algebra Universalis, 32, 189-203, 1994.
[6] S. Li and H. Wang, RCC8 binary constraint network can be consistently extended, Artificial Intelligence, 170, 1-18, 2006.
[7] G. Ligozat, J. Renz, What is a qualitative calculus? A general framework, Proc. PRICAI'04, 53-64, 2004.
[8] R. Maddux, Relation algebras, Elsevier, 2006.
[9] R. Maddux, Some varieties containing relation algebras, Trans. Amer. Math. Soc., 2, 501-526, 1982.
[10] D.A. Randell, Z. Cui, and A.G. Cohn, A spatial logic based on regions and connection, Proc. KR'92, 165-176, 1992.
[11] J. Renz, Maximal tractable fragments of the Region Connection Calculus: A complete analysis, Proc. IJCAI'99, 448-455, 1999.
[12] J. Renz, G. Ligozat, Weak composition for qualitative spatial and temporal reasoning, Proc. CP'05, 534-548, 2005.
[13] J. Renz, B. Nebel, Qualitative spatial reasoning using constraint calculi, Handbook of Spatial Logics, Springer, 161-215, 2007.
Solving Necklace Constraint Problems
Pierre Flener and Justin Pearson 1
Abstract. Some constraint problems have a combinatorial structure where the constraints allow the sequence of variables to be rotated (necklaces), if not also the domain values to be permuted (unlabelled necklaces), without getting an essentially different solution. We bring together the fields of combinatorial enumeration, where efficient algorithms have been designed for (special cases of) some of these combinatorial objects, and constraint programming, where the requisite symmetry breaking has at best been done statically so far. We design the first search procedure and identify the first symmetry-breaking constraints for the general case of unlabelled necklaces. Further, we compare dynamic and static symmetry breaking on real-life scheduling problems featuring (unlabelled) necklaces.
1 INTRODUCTION
In combinatorics, a necklace of n beads over k colours is the lexicographically smallest element in an equivalence class of the set of k-ary n-tuples under rotations; the underlying symmetry group is the cyclic group Cn acting on the indices. For example, the binary triple 001 is the representative necklace of {001, 010, 100}. Combinatorial objects are enumerated under some chosen total order. For example, under the lexicographic order, the binary 3-bead necklaces are 000, 001, 011, and 111. If the values (colours) of a tuple are interchangeable, then we speak of unlabelled tuples (symmetric group Sk acting on the values) and unlabelled necklaces (product group Cn × Sk). For example, under the lexicographic order, the unlabelled binary 3-tuples are 000, 001, 010, and 011, while the unlabelled binary 3-bead necklaces are 000 (representing the necklaces 000 and 111) and 001 (representing the necklaces 001 and 011). The generating functions for counting (unlabelled) necklaces are given in [6], and the sequences of their counts (for k ≤ 6) can be found in [16].
A constraint satisfaction problem (CSP) is a triple ⟨X, D, C⟩, where X is a sequence of n variables, D is a set of k possible values for these variables and is called their domain, and C is the set of constraints specifying which assignments of values to the variables are solutions. If the constraint set C allows the variable sequence X to be rotated, then a necklace is a combinatorial sub-structure of the CSP and we say that the CSP has rotation variable symmetry. If the domain D contains elements that are interchangeable with respect to C, then we say that the CSP has full value symmetry. Exploiting such symmetry is important in order to solve a CSP efficiently: for example, compare the ternary object counts in Table 1 with 3ⁿ.
CSPs with an (unlabelled) necklace as a combinatorial sub-structure are not unusual. For example, Gusfield [9, page 12] states that “circular DNA is common and important. [sample organisms omitted.] Consequently, tools for handling circular strings may someday be of use in those organisms”. One such problem is studied
1 Department of Information Technology, Uppsala University, Box 337, SE-751 05 Uppsala, Sweden. Email: Firstname.Surname@it.uu.se
in [3]. Necklaces occur in coding theory [7], genetics [7], and music [6], while unlabelled necklaces occur in switching theory [6]. We study a real-life problem with (unlabelled) necklaces in scheduling, different from the one in [8].
In this paper, we propose to bring together combinatorial enumeration and constraint programming (CP). Very efficient combinatorial enumeration algorithms exist for some of the mentioned combinatorial objects, but not for unlabelled necklaces (except over two colours [2]). These algorithms can be used as CP search procedures for CSPs having those objects as combinatorial sub-structures, thereby breaking a lot of symmetry dynamically. This has also been advocated in [13], say, where a generic CP search procedure is proposed for an arbitrary symmetry group acting on the values; however, except for [15], not much dynamic symmetry breaking seems to have been done for groups acting on the variables. Conversely, CP principles can be used for devising enumeration algorithms for the combinatorial objects where efficient algorithms have remained elusive to date.
The contributions of this paper can be summarised as follows:
• Design of an enumeration algorithm, and hence a CP search procedure, for (partially) unlabelled k-ary necklaces (Sections 2 and 4).
• Identification of symmetry-breaking constraints for (partially) unlabelled k-ary necklaces, including filtering algorithms for the identified new global constraints (Sections 3 and 4).
• Experiments on real-world problems validating the usefulness of the proposed dynamic and static symmetry-breaking methods for (partially unlabelled) k-ary necklaces (Section 4).
Finally, in Section 5, we conclude and discuss future research.
In the following, consider a CSP ⟨X, D, C⟩ where X is a sequence of n ≥ 2 variables and D is a set of k ≥ 1 domain values. We assume that D = {0, . . . , k − 1}; this also has the advantage that the order is obvious whenever we require D to be totally ordered.
2 DYNAMIC SYMMETRY BREAKING
Unlabelled Tuples. If the domain values of D are interchangeable, then we impose a total order on D, and the enumeration algorithm of [5], say, can be used to generate all unlabelled tuples (modulo the full value symmetry). We present it as Algorithm 1 in the style of a search procedure in constraint programming (CP), so that it can interact with any problem constraints. The initial call is utuple(1, −1). At any time, j is the index of the next variable to be assigned (and j = n + 1 when none remains) while u is the largest value used so far (and u = −1 when none was used yet). The idea is to try for each variable all the values used so far plus one unused value, since all unused values are still interchangeable at that point. Upon backtracking, the try all construct non-deterministically tries all the alternatives, in the given value order (line 6). Each alternative contains the assignment of the chosen value i to the chosen variable X[j] (line 7) and a recursive call for the next variable (line 8).
1:  procedure utuple(j, u : integer)
2:    var i : integer
3:    if j > n then
4:      return true
5:    else
6:      try all i = 0 to min(u + 1, k − 1) do
7:        X[j] ← i;
8:        utuple(j + 1, max(i, u))
9:      end try
10:   end if

Algorithm 1: Search procedure for unlabelled tuples [5]

1:  procedure necklace(j, p : integer)
2:    var i : integer
3:    if j > n then
4:      return n mod p = 0
5:    else
6:      try all i = X[j − p] to k − 1 do
7:        X[j] ← i;
8:        necklace(j + 1, if i = X[j − p] then p else j)
9:      end try
10:   end if

Algorithm 2: Search procedure for necklaces [2]
1:  procedure uneck(j, p, u : integer)
2:    var i : integer
3:    if j > n then
4:      return n mod p = 0
5:    else
6:      try all i = X[j − p] to min(u + 1, k − 1) do
7:        if probe(j, i, p) then
8:          X[j] ← i;
9:          uneck(j + 1, if i = X[j − p] then p else j, max(i, u))
10:       end if
11:     end try
12:   end if
13: function probe(j, i, p : integer) : boolean
14:   X[j] ← i;
15:   if j = n ∧ n mod (if i = X[j − p] then p else j) = 0 then
16:     return $\bigwedge_{q=2}^{n} \overline{X[q, \ldots, n, 1, \ldots, q-1]} \ge_{lex} X[1, \ldots, n]$
17:   else if j < n then
18:     return $\bigwedge_{q=2}^{j-1} \overline{X[j-q+1, \ldots, j]} \ge_{lex} X[1, \ldots, q]$
19:   else
20:     return false
21:   end if

Algorithm 3: Probing search procedure for unlabelled necklaces ($\overline{Y}$ denotes the minimal renaming of Y, defined in the text)
Note that we have fixed the variable order to be from left to right across X, and the tuples are thus generated in lexicographic order; this is an unnecessary restriction, but the reason for this choice will become clear in a few lines. This algorithm takes constant amortised time and space, and the number of objects generated is actually equal to the number of unlabelled tuples.
Necklaces. If the variable sequence X is circular, then the enumeration algorithm of [2], say, can be used to generate all necklaces (modulo the rotation variable symmetry). We present it as a CP search procedure in Algorithm 2. The initial call is X[0] ← 0; necklace(1, 1), where X[0] is a dummy element. At any time, j is the index of the next variable to be assigned (and j = n + 1 when none remains) while p is the period, explained next. The idea is either to try and keep replicating the values at the previous p positions, or to try all larger values with a new period of j. At any time, the prefix X[1, . . . , j] is a pre-necklace, that is, a prefix of some necklace, which may however be longer than n. The variable order is necessarily from left to right across X, due to the role of p, and the necklaces are thus generated in lexicographic order. This algorithm takes constant amortised time and space, and the number of objects generated is proportional by a constant factor (tending down to (k/(k − 1))² as n → ∞) to the number of necklaces: note that only n-tuples where the period p divides n actually are necklaces (line 4). In other words, not all symmetry is broken at every node of the search tree, and some backtracking is forced (by a constant-time test on p) only at leaf level; at present, loopless necklace enumeration remains elusive. A small executable rendering of this procedure is sketched below.
Unlabelled Necklaces. If the variable sequence X is circular and the domain values of D are interchangeable, then a constant-amortised-time enumeration algorithm [2] only exists for generating all binary (k = 2) unlabelled necklaces (modulo the symmetries). We do not present it here, but instead construct a novel enumeration algorithm for any number of colours. Noting that unlabelled necklaces are a subset of the necklaces (Algorithm 2) that are unlabelled tuples (Algorithm 1), and observing that the control flows of those two algorithms match line by line, the skeleton of an enumeration algorithm for unlabelled necklaces can be obtained simply by “intersecting” those two algorithms, which yields all but lines 7 and 10 of the CP search procedure uneck in Algorithm 3.
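For concreteness, the following Python transcription of Algorithm 2 (our own rendering, with an eager list-collecting style instead of the non-deterministic try all construct) enumerates the k-ary necklaces of length n in lexicographic order:

def necklaces(n, k):
    X = [0] * (n + 1)          # X[0] is the dummy element of the initial call
    found = []
    def rec(j, p):
        if j > n:
            if n % p == 0:     # line 4: only periods dividing n give necklaces
                found.append(tuple(X[1:]))
        else:
            for i in range(X[j - p], k):              # try all i = X[j-p] .. k-1
                X[j] = i
                rec(j + 1, p if i == X[j - p] else j)
    rec(1, 1)
    return found

For example, necklaces(3, 2) returns [(0,0,0), (0,0,1), (0,1,1), (1,1,1)], the four binary 3-bead necklaces listed in the introduction.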
The initial call is X[0] ← 0; uneck(1, 1, −1), where X[0] is a dummy element. We now gradually refine the probe(j, i, p) function (called in line 7), guarding the non-deterministic assignment of value i to the current variable X[j] followed by the continued enumeration.
Leaf Probing. If probe always returns true, then uneck will enumerate a superset of the unlabelled necklaces, as their symmetry group is the product rather than just the union of the symmetry groups for necklaces and unlabelled tuples. For example, the binary 3-necklace 011 will erroneously be returned, even though it can be transformed into the unlabelled necklace 001 (by first rotating the second position of the circular sequence 011 into first position, giving 110, and then minimally renaming its colours, giving $\overline{110} = 001$); however, the necklace 111 will correctly not be returned, since it is not an unlabelled tuple. Consider the left half of Table 1, giving the numbers of various combinatorial objects of length n over 3 colours: column 7 counts the unlabelled tuples (sequence A124302 in [16]); column 6 counts the necklaces (fewer than the unlabelled tuples for n ≥ 7; sequence A1867); column 5 counts the necklaces that are unlabelled tuples, that is, the number of pre-necklaces when probe always returns true; and column 2 counts the unlabelled necklaces (sequence A2076). The difference between columns 5 and 6 (or 7) shows the gain obtained so far for free by Algorithm 3 over Algorithm 2 (or Algorithm 1), but the difference between columns 5 and 2 shows the amount of pruning that leaf probing has to do.
The least that probe(j, i, p) should thus do is to make sure only unlabelled necklaces are enumerated. This is done at the latest when trying to assign the last variable (when j = n) of the CSP: at that moment, the entire circular sequence X is known, so probe must return true if X cannot be transformed (by position rotation and colour renaming) into an object that has already been tried in the enumeration. Since objects are enumerated in lexicographic order (as an inherited feature of the two underlying algorithms), this can be done by checking whether the minimal renaming of every (non-unit) rotation of X is lexicographically larger than or equal to X. Computing the minimal renaming $\overline{Y}$ of an n-tuple Y takes Θ(n) time, and can be merged into the O(n)-time lexicographic comparison; at most n − 1 such renamings and comparisons are done, hence this probing takes O(n²) time at worst.
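A direct, non-incremental Python rendering of this leaf test may help; min_renaming and leaf_probe are our own names, and no attempt is made at the merged Θ(n) renaming-plus-comparison mentioned above:

def min_renaming(t):
    # Relabel the values of tuple t in order of first occurrence.
    ren = {}
    return [ren.setdefault(v, len(ren)) for v in t]

def leaf_probe(X):
    # X is canonical iff the minimal renaming of every non-unit rotation
    # of X is lexicographically >= X itself.
    X = list(X)
    n = len(X)
    return all(min_renaming(X[q:] + X[:q]) >= X for q in range(1, n))

On the example above, leaf_probe([0, 1, 1]) is False, because rotating 011 gives 110, whose minimal renaming 001 is lexicographically smaller than 011.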
n  | unecks (A2076) | n mod p = 0 (int.+leaf) | leaves (int.+leaf) | leaves (leaf only) | necks (A1867) | utuples (A124302) | Algo. 2 time | Cons. (3) time | Algo. 3 time (leaf) | Algo. 3 time (all) | Cons. (1)+(4) time | fails
1  | 1      | 1      | 1      | 1      | 3      | 1       | 0.00  | 0.00  | 0.00  | 0.00  | 0.00   | 0
2  | 2      | 2      | 2      | 2      | 6      | 2       | 0.00  | 0.00  | 0.00  | 0.00  | 0.00   | 0
3  | 3      | 4      | 5      | 5      | 11     | 5       | 0.00  | 0.00  | 0.00  | 0.00  | 0.00   | 0
4  | 6      | 8      | 10     | 13     | 24     | 14      | 0.00  | 0.00  | 0.00  | 0.00  | 0.01   | 2
5  | 9      | 15     | 22     | 36     | 51     | 41      | 0.00  | 0.00  | 0.00  | 0.00  | 0.01   | 6
6  | 26     | 34     | 48     | 97     | 130    | 122     | 0.00  | 0.00  | 0.00  | 0.00  | 0.03   | 9
7  | 53     | 80     | 121    | 268    | 315    | 365     | 0.01  | 0.01  | 0.00  | 0.01  | 0.07   | 29
8  | 146    | 196    | 293    | 732    | 834    | 1094    | 0.01  | 0.02  | 0.02  | 0.02  | 0.18   | 69
9  | 369    | 490    | 744    | 2017   | 2195   | 3281    | 0.04  | 0.04  | 0.06  | 0.06  | 0.50   | 181
10 | 1002   | 1267   | 1920   | 5552   | 5934   | 9842    | 0.11  | 0.11  | 0.20  | 0.16  | 1.48   | 469
11 | 2685   | 3357   | 5104   | 15371  | 16107  | 29525   | 0.24  | 0.30  | 0.63  | 0.49  | 4.54   | 1240
12 | 7434   | 8996   | 13635  | 42624  | 44368  | 88574   | 0.78  | 0.81  | 1.95  | 1.58  | 13.33  | 3298
13 | 20441  | 24403  | 37030  | 118731 | 122643 | 265721  | 2.12  | 2.22  | 6.06  | 4.65  | 41.04  | 8919
14 | 57046  | 66886  | 101354 | 331664 | 341802 | 797162  | 5.91  | 6.24  | 18.82 | 14.50 | 122.46 | 24328
15 | 159451 | 184770 | 279895 | 929883 | 956635 | 2391485 | 16.54 | 17.25 | 58.56 | 44.89 | 374.12 | 66865

Table 1. Numbers of objects of length n over 3 colours, and their enumeration times (in seconds) via dynamic & static (constraint-based) symmetry breaking
Note that a successful probe incurs the highest cost. The algorithmic details are trivial, so we just write a specification into line 16. Lazy evaluation of the conjunction should be made, returning false as soon as one conjunct is false. Also, experiments have revealed that failure is detected earlier on the average if the starting positions of the rotations recede from right to left across X.
An improvement of this leaf probing comes from observing what happens when the lowest value, namely X[j − p], is tried for X[j] when j = n: the recursive call (line 9) then is uneck(n + 1, p, u) and everything hinges on whether n mod p = 0 or not. But the latter check can already be done before probing (in O(n²) time, recall) whether X[j − p] actually is a suitable value for X[n]. For any other tried value i > X[j − p] for X[n], the recursive call (line 9) is uneck(n + 1, n, max(i, u)) and we then know that n mod n = 0. Hence the test in line 15, as well as lines 19 and 20.
Internal Probing. The leaf probing discussed so far assumes that line 18 is replaced by return true. This is unsatisfactory, as no pruning (other than via the p and u parameters) takes place at the internal nodes of the search tree, so that many more leaves are generated than necessary (recall the difference between columns 5 and 2 in Table 1). In the spirit of constraint programming, we ought to perform more pruning when j < n. The idea is the same as for leaves (where j = n) except that only a strict prefix X[1, . . . , j] of the circular sequence X is known, so that we can only check whether the minimal renaming of every suffix of X[1, . . . , j] is lexicographically larger than or equal to the prefix of X[1, . . . , j] of the same length. For example, when searching for a ternary 6-bead unlabelled necklace, assume we have already constructed the pre-necklace 010 and probe(4, 2, 4) is now called to check whether at position j = 4 < 6 = n the variable X[4] can be assigned the (so far unused) value i = 2 = u + 1 = k − 1 under period p = 4, so the following comparisons must be made:

$\overline{2} = 0 \ge_{lex} 0$      (4)
$\overline{02} = 01 \ge_{lex} 01$      (3)
$\overline{102} = 012 \ge_{lex} 010$      (2)
$\overline{0102} = 0102 \ge_{lex} 0102$      (1)
The first and last comparisons will always succeed and can be omitted. Exactly j − 2 such renamings and comparisons of tuples of length O(j − 1) are thus to be done, hence this internal probing also takes O(n²) time at worst, since j = O(n). The algorithmic details are trivial, so we just write a specification into line 18. Again, lazy evaluation of the conjunction should be made. Also, experiments have revealed that failure is detected earlier on the average if the starting positions of the suffixes recede from right to left across X[1, . . . , j], as in the top-down order of the sample comparisons above.
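Under the same assumptions as the leaf-probing sketch above, the internal check can be rendered in Python as follows (min_renaming is repeated to keep the sketch self-contained):

def min_renaming(t):
    ren = {}
    return [ren.setdefault(v, len(ren)) for v in t]

def internal_probe(prefix):
    # Line 18: the minimal renaming of every proper suffix of the known
    # prefix must be >=lex the prefix of X of the same length; the
    # comparisons for q = 1 and q = j always hold and are omitted.
    prefix = list(prefix)
    j = len(prefix)
    return all(min_renaming(prefix[j - q:]) >= prefix[:q] for q in range(2, j))

On the running example, internal_probe([0, 1, 0, 2]) evaluates exactly the non-trivial comparisons (2) and (3) above and returns True.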
To assess the impact of internal probing, consider again the left half of Table 1: column 4 gives the new numbers of pre-necklaces (much lower than in column 5), and column 3 counts the pre-necklaces that are accepted by the test on the period p. The difference between columns 3 and 2 is the amount of pruning that leaf probing now has to do, and the difference between columns 4 and 3 is the amount of pruning done by the period test. Note that the constant-time period test prunes much more than the quadratic-time probing.
Incremental Internal Probing. Empirically, on average, the internal probing just proposed is much more efficient than its O(n²) worst time suggests, due to the nature of unlabelled necklaces. We now optimise this internal probing into an algorithm taking O(n) time at worst, leading to an enumeration that is systematically faster by a constant factor (namely 17% faster in our implementation). The idea is to trade time for space and make the comparisons incremental. Continuing our previous example, having so far constructed the pre-necklace 0102 of a ternary 6-bead unlabelled necklace, probe(5, 1, 5) is eventually called at the next iteration to check whether at position j = 5 < 6 = n the variable X[5] can be assigned the value i = 1 under period p = 5, so the following comparisons must be made:

$\overline{\mathbf{1}} = \mathbf{0} \ge_{lex} \mathbf{0}$      (5′)
$\overline{2\mathbf{1}} = 0\mathbf{1} \ge_{lex} 0\mathbf{1}$      (4′)
$\overline{02\mathbf{1}} = 01\mathbf{2} \ge_{lex} 01\mathbf{0}$      (3′)
$\overline{102\mathbf{1}} = 012\mathbf{0} \ge_{lex} 010\mathbf{2}$      (2′)
$\overline{0102\mathbf{1}} = 0102\mathbf{1} \ge_{lex} 0102\mathbf{1}$      (1′)
Note that the last four comparisons correspond to the ones given earlier, that the considered suffixes of X[1, . . . , j] got longer at the end by the new (boldfaced) value i = 1, and that the minimal renamings of the (non-boldfaced) prefixes remained the same. In other words, only the scalar comparisons of the (boldfaced) last values matter, since the lexicographic ≥lex comparisons of the (non-boldfaced) prefixes have already been made until the previous iteration. If the lexicographic comparison until the previous iteration is =lex , as in formulas (1), (3), and (4), then the scalar comparison operator is ≥ at the current iteration; if the lexicographic comparison until the previous iteration is >lex , as in formula (2), then no scalar comparison need be made at the current iteration. We incrementally maintain a global k × n matrix m, where m[i, j] gives the minimal renaming of value i if the renaming starts at position j. We also incrementally maintain locally to every search-tree node an n-tuple c of Booleans, where c[j] = true if the lexicographic comparison from position j until the previous iteration is =lex , that is if the comparison from j is to continue at the current iteration. For example, since the scalar
comparison in formula (3′) gives 2 > 0, we set c[3] ← false for the next iteration. Using these incremental data structures, the internal probing in line 18 can be replaced by the following specification (the algorithmic details, including the incremental maintenance of c and m, are omitted for space reasons):

return $\bigwedge_{q=2}^{j-1}$ (if c[q] then m[i, q] ≥ X[j + 1 − q] else true)
At most j − 2 scalar comparisons are to be done, hence this incremental internal probing takes O(n) time at worst, since j = O(n) and the incremental maintenance of c[1 . . . j] and m[i, 1 . . . j] takes O(n) time at worst. Lazy evaluation of the conjunction should be made. Failure is detected earlier on the average if the starting positions of the suffixes recede from right to left across X[1, . . . , j], as in the top-down order of the sample comparisons above.
Discussion. An analysis of the amortised complexity of Algorithm 3 is beyond the scope of this paper. Its correctness follows from line 16 capturing the essence of unlabelled necklaces and the correctness of Algorithms 1 and 2. To assess the runtime impact of internal probing, consider the right half of Table 1: the fourth-last and third-last columns give the enumeration times (in seconds) with only leaf probing and with internal probing as well, respectively. (All experiments in this paper were performed under SICStus Prolog v4.0.2 on a 2.53 GHz Pentium 4 machine with 512 MB running Linux 2.6.20.)
3 STATIC SYMMETRY BREAKING
Unlabelled Tuples. To break full value symmetry, it suffices to order the positions of the first occurrences, if any, of each value. Letting firstPos(i) denote the first position, if any, of value 0 ≤ i < k in X under the current assignment, and n + 1 + i otherwise, the following k − 1 binary constraints break full value symmetry [11]: firstPos(0) < firstPos(1) < · · · < firstPos(k − 1). A more efficient filtering algorithm can be designed for the conjunction of these constraints, giving a new global constraint, called

orderedFirstOccurrences(X, D)      (1)
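Purely as a ground-level illustration of what this constraint requires (the paper's filtering works on partial assignments via a DFA, which is not reproduced here), a naive Python checker for fully assigned sequences could read:

def ordered_first_occurrences(X, k):
    # firstPos(i) is the first position of value i in X, or n+1+i if absent.
    n = len(X)
    def first_pos(v):
        return X.index(v) if v in X else n + 1 + v
    return all(first_pos(v) < first_pos(v + 1) for v in range(k - 1))

For instance, ordered_first_occurrences([0, 1, 0, 2], 3) is True, while ordered_first_occurrences([1, 0], 2) is False since value 1 occurs before value 0.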
A checker for this global constraint can be specified as a deterministic finite automaton (DFA) (omitted for space reasons), so that we get a filtering algorithm using the automaton global constraint [1].
Necklaces. To break rotation variable symmetry, we apply the so-called lex-leader scheme [4], which says that any variant of a wanted solution under all the symmetries of the considered symmetry group must be lexicographically larger than or equal to that solution. For necklaces, this means that all the rotations of the sequence X must be lexicographically larger than or equal to X itself:

$\bigwedge_{q=2}^{n} X[q, \ldots, n, 1, \ldots, q-1] \ge_{lex} X[1, \ldots, n]$      (2)
These n − 1 constraints over sequences of exactly n elements have been logically minimised in [8] to the following n − 1 constraints over sequences of at most n − 1 elements:

$\bigwedge_{q=2}^{n} X[q, \ldots, (2q-3) \bmod n + 1] \ge_{lex} X[1, \ldots, q-1]$      (3)
Reading from right to left, this constrains the first q − 1 elements of X to be lexicographically smaller than or equal to the cyclically next q − 1 elements of X, for 2 ≤ q ≤ n. Future work includes designing a more efficient filtering algorithm for the conjunction of these global lexicographic constraints.
Unlabelled Necklaces. The conjunction of the constraints (1) and (3) accepts all necklaces that are unlabelled tuples (just like Algorithm 3 without probing). In fact, the rotation variable symmetry and full value symmetry can be broken by the constraints (1) together with the probing tests in line 16 of Algorithm 3 seen as constraints:

$\bigwedge_{q=2}^{n} \overline{X[q, \ldots, n, 1, \ldots, q-1]} \ge_{lex} X[1, \ldots, n]$      (4)
The difference with (2) and (3) lies in the minimal renaming of the left-hand sides. The logic minimisation of (2) into (3) does not apply to (4). A checker for the required $\overline{A} \ge_{lex} B$ global constraint can be specified as a DFA (omitted for space reasons), so that we get a filtering algorithm using the automaton global constraint [1]. The idea is to augment the classical DFA for ≥lex [1] with variables representing the smallest value used so far and the minimal-renaming bijection on D (encoded by an allDifferent constraint).
Discussion. The proof of correctness and completeness of the introduced symmetry-breaking constraints is omitted for space reasons. To assess the runtimes (in seconds) of dynamic and static symmetry breaking, consider the right half of Table 1; unmentioned numbers of backtracks are zero. For necklaces, columns 8 and 9 reveal a slight advantage of Algorithm 2 over constraints (3). For unlabelled necklaces, the last three columns reveal a huge advantage of Algorithm 3 over constraints (1) and (4). However, these runtimes were obtained in the absence of any problem-specific constraints, and static symmetry breaking usually performs better than dynamic symmetry breaking in the presence of problem-specific constraints. We address this issue in the next section.
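As an executable sanity check on these static constraints, the following self-contained brute-force sketch (all names ours) filters all ternary n-tuples by ground versions of (1) and (4), and reproduces the unlabelled-necklace counts of column 2 of Table 1:

from itertools import product

def min_renaming(t):
    ren = {}
    return tuple(ren.setdefault(v, len(ren)) for v in t)

def sat_1(X, k):   # constraint (1): first occurrences of values in order
    n = len(X)
    pos = lambda v: X.index(v) if v in X else n + 1 + v
    return all(pos(v) < pos(v + 1) for v in range(k - 1))

def sat_4(X):      # constraints (4): minimally renamed rotations >=lex X
    n = len(X)
    return all(min_renaming(X[q:] + X[:q]) >= X for q in range(1, n))

def count_unlabelled_necklaces(n, k=3):
    return sum(1 for X in product(range(k), repeat=n)
               if sat_1(X, k) and sat_4(X))

assert [count_unlabelled_necklaces(n) for n in (3, 4, 5)] == [3, 6, 9]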
4 EXPERIMENTS
We now experimentally compare the proposed dynamic and static symmetry-breaking (SB) methods on real-life scheduling problems containing an (unlabelled) necklace as a combinatorial sub-structure.
Example: Rotating Schedules. Many industries and services need to function 24/7. Rotating schedules, such as the one in Figure 1 (a real-life example taken from [10]), are a popular way of guaranteeing a maximum of equity to the involved work teams. In our example, there are day (d), evening (e), and night (n) shifts of work, as well as days off (x). Each team works at most one shift per day. The scheduling horizon has as many weeks as there are teams. In the first week, team i is assigned to the schedule in row i. For any next week, each team moves down to the next row, while the team on the last row moves up to the first row. Note how this gives almost full equity to the teams, except, for instance, that team 1 does not enjoy the six consecutive days off that the other teams have, but rather three consecutive days off at the beginning of week 1 and another three at the end of week 5. We here assume that the daily workload is uniform. In our example, each day has exactly one team on duty for each work shift, and hence two teams entirely off duty; assuming the work shifts average 8h, each employee will work 7 · 3 · 8 = 168h over the five-week cycle, or 33.6h per week. Daily workload can be enforced by global cardinality (gcc) constraints on the columns. Further, any number of consecutive workdays must be between two and seven, and any change in work shift can only occur after two to seven days off. This can be enforced by stretch constraints [12] on the table flattened row-wise into a sequence. (A filtering algorithm for the stretch constraint, which is not a built-in of SICStus Prolog, was automatically obtained from a DFA model of a constraint checker using
the (built-in) automaton global constraint [1].) We assume that soft constraints, such as full weekends off as numerous and well-spaced as possible, are enforced by manual selection among schedules satisfying the hard constraints. In our example, there are two full weekends off, in the optimally spaced rows 2 and 5.

Week | Mon Tue Wed Thu Fri Sat Sun
  1  |  x   x   x   d   d   d   d
  2  |  x   x   e   e   e   x   x
  3  |  d   d   d   x   x   e   e
  4  |  e   e   x   x   n   n   n
  5  |  n   n   n   n   x   x   x

Figure 1. A five-week rotating schedule with uniform workload

Necklaces. Under the given assumption (uniform workload) and constraints (gcc and stretch), any rotating schedule has the symmetries of necklaces, when we view it flattened row-wise into a sequence. In addition to the classical instance in Figure 1, here denoted 1d, 1e, 1n, 2x, we ran experiments over other instances. For example, instance 2d, 2e, 1n, 2x has the uniform daily workload of 2 teams each on the day and evening shifts, 1 team on the night shift, and 2 teams off-duty. Figure 2 gives the obtained runtimes (in seconds) and numbers of backtracks (fails) over all solutions. The time ratio to all solutions between SB and no-SB is a good indicator of that time ratio to the first optimal solution (say, with the maximum number of full weekends off), as branch-and-bound essentially iterates over many solutions in order to pick the best. On average, when breaking the symmetries statically, the default variable ordering (trying the leftmost variable) is better than first-fail (trying the leftmost variable with the smallest domain) and most-constrained (trying the leftmost variable with the smallest domain that has the most constraints suspended), with the default bottom-up value ordering; hence the runtimes for static symmetry breaking are given for the default orderings. Static symmetry breaking, in the presence of the problem-specific constraints, is now faster than dynamic symmetry breaking.

instance       | unique sol's | Algorithm 2     | Constraints (3) | no SB
               |              | time   fails    | time   fails    | time
1d, 1e, 1n, 2x | 2274         | 7      228823   | 4      9140     | 21
2d, 1e, 1n, 2x | 4115         | 50     959970   | 26     69704    | 158
2d, 2e, 1n, 2x | 4950         | 199    2922846  | 147    408669   | 751
2d, 2e, 2n, 2x | 3444         | 603    7526564  | 558    1587889  | 2581

Figure 2. Performance comparison on necklace schedules

Partially Unlabelled Necklaces. Under the uniform workload assumption, some rotating schedules even have many of the symmetries of unlabelled necklaces. In our instances for 5 and 8 weeks, the constraints do not distinguish between the d, e, n work shifts, so that those values are interchangeable. To break such partial value symmetry dynamically, it suffices to replace line 6 of Algorithm 3 by try all i ∈ {X[j − p], . . . , min(u + 1, k − 2)} ∪ {k − 1} and to make the minimal renamings $\overline{Y}$ in lines 16 and 18 respect the subsets D′ ⊆ D of interchangeable values; in our case the subsets are {d, e, n} and {x}. We denote the resulting search procedure by Algorithm 3′. To break this partial value symmetry statically, it suffices to post one orderedFirstOccurrences(X, D′) for each subset D′:

firstPos(d) < firstPos(e) < firstPos(n)      (5)

Together with an adaptation, denoted (4′), of the constraints (4) where $\overline{Y}$ respects the D′, we have a static symmetry-breaking method for such partially unlabelled necklaces. Figure 3 gives the obtained runtimes (in seconds) and numbers of backtracks (fails) over all solutions. Static symmetry breaking, in the presence of the problem-specific constraints, is still a lot slower than dynamic symmetry breaking.

instance       | unique sol's | Algorithm 3′    | Cons. (5) and (4′)
               |              | time   fails    | time    fails
1d, 1e, 1n, 2x | 402          | 13     35969    | 205     2964
2d, 2e, 2n, 2x | 274          | 703    1380876  | 31193   313587

Figure 3. Comparison on partially unlabelled necklace schedules
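To make the gcc and stretch requirements concrete, here is a small self-contained Python check of the Figure 1 schedule (the string encoding and all names are ours; the run-length test is a simplified rendering of the stretch conditions):

from collections import Counter
from itertools import groupby

# The five-week schedule of Figure 1, one string per week (row).
schedule = ["xxxdddd", "xxeeexx", "dddxxee", "eexxnnn", "nnnnxxx"]

# gcc on the columns: each day has exactly one team on each of the
# d, e, n shifts, and hence two teams off-duty.
for day in range(7):
    assert Counter(week[day] for week in schedule) == \
           Counter({"x": 2, "d": 1, "e": 1, "n": 1})

# Stretch-like check on the row-wise flattened, circular sequence:
# every maximal run of one value (a work stretch in a single shift,
# or a stretch of days off) must have length between 2 and 7.
flat = "".join(schedule)
runs = [(ch, len(list(g))) for ch, g in groupby(flat)]
if len(runs) > 1 and runs[0][0] == runs[-1][0]:   # merge the wrap-around run
    runs = [(runs[0][0], runs[0][1] + runs[-1][1])] + runs[1:-1]
assert all(2 <= length <= 7 for _, length in runs)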
5 CONCLUSIONS
By bringing together the fields of combinatorial enumeration and constraint programming, we have extended existing results for dynamically and statically breaking the rotation variable symmetry of necklaces into new symmetry-breaking methods dealing also with the additional full value symmetry of unlabelled necklaces. On an example, we have also shown how to specialise these methods when the value symmetry of unlabelled necklaces is only partial. In the absence of problem-specific constraints, the dynamic symmetry-breaking methods outperform the static ones, narrowly for necklaces but largely for unlabelled necklaces. On a real-life scheduling problem we have shown that, in the presence of problem-specific constraints, the static method becomes faster for necklaces, but not for partially unlabelled necklaces.
One should be aware of existing enumeration algorithms for special cases, such as the constant-amortised-time algorithms for unlabelled binary necklaces [2], or for necklaces with fixed content [14]. For instance, under the given assumption (uniform workload) and constraints, rotating schedules are necklaces with fixed content, so the algorithm of [14] should be tried instead of Algorithm 2. Future work includes the quest for a constant-amortised-time enumeration algorithm for unlabelled k-ary necklaces.
Acknowledgements. We are supported by grant IG2001-67 of the Swedish Foundation for International Cooperation in Research and Higher Education, and by grant 70644501 of the Swedish Research Council. We thank J. Sawada and V. Vajnovszki for discussions.
REFERENCES
[1] N. Beldiceanu, M. Carlsson, and T. Petit, ‘Deriving filtering algorithms from constraint checkers’, CP'04, LNCS 3258:107-122, Springer.
[2] K. Cattell, F. Ruskey, J. Sawada, M. Serra, and C. R. Miers, ‘Fast algorithms to generate necklaces, unlabeled necklaces, and irreducible polynomials over GF(2)’, Journal of Algorithms 37(2):267-282, (2000).
[3] W. Y. C. Chen and J. D. Louck, ‘Necklaces, MSS sequences, and DNA sequences’, Advances in Applied Mathematics 18(1):18-32, (1997).
[4] J. M. Crawford et al., ‘Symmetry-breaking predicates for search problems’, KR'96, pp. 148-159, Morgan Kaufmann, (1996).
[5] M. C. Er, ‘A fast algorithm for generating set partitions’, The Computer Journal 31(3):283-284, (1988).
[6] E. N. Gilbert and J. Riordan, ‘Symmetry types of periodic sequences’, Illinois Journal of Mathematics 5:657-665, (1961).
[7] S. W. Golomb, B. Gordon, and L. R. Welch, ‘Comma-free codes’, Canadian Journal of Mathematics 10(5):202-209, (1958).
[8] A. Grayland, I. Miguel, and C. Roney-Dougal, ‘Minimal ordering constraints for some families of variable symmetries’, SymCon'07, (2007).
[9] D. Gusfield, Algorithms on Strings, Trees, and Sequences, CUP, 1997.
[10] G. Laporte, ‘The art and science of designing rotating schedules’, Journal of the Operational Research Society 50(10):1011-1017, (1999).
[11] Y. C. Law and J. Lee, ‘Symmetry breaking constraints for value symmetries in constraint satisfaction’, Constraints 11(2-3):221-267, (2006).
[12] G. Pesant, ‘A filtering algorithm for the stretch constraint’, CP'01, LNCS 2239:183-195, Springer, (2001).
[13] C. M. Roney-Dougal et al., ‘Tractable symmetry breaking using restricted search trees’, ECAI'04, pp. 211-215, (2004).
[14] J. Sawada, ‘A fast algorithm to generate necklaces with fixed content’, Theoretical Computer Science 301(1-3):477-489, (2003).
[16] N. Sloane, The on-line encyclopedia of integer sequences, at http://www.research.att.com/~njas/sequences/, 2008.
[15] M. Sellmann and P. Van Hentenryck, ‘Structural symmetry breaking’, IJCAI'05, pp. 298-303, IJCAI, (2005).
Vivifying Propositional Clausal Formulae
Cédric PIETTE 1 and Youssef HAMADI 2 and Lakhdar SAÏS 1
Abstract. In this paper, we present a new way to preprocess Boolean formulae in Conjunctive Normal Form (CNF). In contrast to most of the current preprocessing techniques, our approach aims at improving the filtering power of the original clauses while producing a small number of additional and relevant clauses. More precisely, an incomplete redundancy check is performed on each original clause through unit propagation, leading either to a sub-clause or to a new relevant clause generated by the clause learning scheme. This preprocessor is empirically compared to the best existing one in terms of size reduction and the ability to improve a state-of-the-art satisfiability solver.
1 INTRODUCTION
In recent years, preliminary computations on CNF formulae have been increasingly studied by the SAT community. This renewed interest can be explained by different factors. First, reducing the huge size of the SAT instances encoding real-world problems increases the robustness of SAT solvers. Secondly, these instances contain different kinds of structures that can be handled more efficiently before search. One of the most effective preprocessing techniques (SatElite) is currently integrated in state-of-the-art SAT solvers such as Minisat and Rsat. It is now well acknowledged that the performance of these solvers is usually greatly improved by this particular preprocessing, up to the point where SatElite is often used by SAT competitors. Thus, preprocessing a formula before solving is now known to be an important step, and many preprocessors have already been proposed.
One of the first efficient preprocessing algorithms, called 3-Resolution, was incorporated into the Satz solver [10]. It consists of adding to the formula all resolvent clauses of size less than or equal to 3, until saturation. 2-SIMPLIFY, a less computationally heavy preprocessor, was proposed in [2]. It was developed to better manage real-world benchmarks, which often contain a lot of binary clauses. Roughly, the idea is to use those binary clauses to construct an implication graph, from which unit clauses can be deduced by computing the transitive closure. If unit clauses have been obtained, they are propagated, and this process is iterated until a fixpoint is reached. Later on, HyPre generalized 2-SIMPLIFY by computing hyper-binary resolution to deduce new binary clauses [1]. Moreover, HyPre is able to detect and substitute equivalent literals incrementally. The classical DP
1 Université Lille-Nord de France, Artois, CRIL-CNRS UMR 8188, F-62307 Lens, email: {piette,sais}@cril.fr
2 Microsoft Research, 7 J J Thomson Avenue, Cambridge, United Kingdom, email: youssefh@microsoft.com
procedure, based on variable elimination through resolution, has also been used in a limited way as a preprocessing step. A weaker schema has been adopted by the NiVER procedure [13], which eliminates variables by resolution if this computation does not increase the number of literals of the CNF formula. NiVER was later improved by a so-called substitution rule, together with the use of clause signatures and touched lists, to define the recent SatElite preprocessor [6]. However, only preprocessors that eliminate variables by a limited application of resolution are now grafted onto modern SAT solvers. Indeed, the other kinds of preprocessors aim at modifying the CNF formula through the addition and/or removal of clauses, generally keeping the same set of variables. The main problem of these preprocessors is that it is difficult to measure the relevance of each added or eliminated clause with respect to the resolution step. One can eliminate clauses and derive a harder sub-formula; similarly, adding new clauses might lead to an increase in space complexity without reducing the search space, since the added clauses may only clutter the solver with redundant information.
In this paper, we revisit this kind of preprocessing, using only forms of resolution that aim at substituting existing clauses by more constrained ones. In other words, our main goal is to strengthen, or to vivify, the redundant clauses of the original formula. To this end, we apply a limited redundancy check to each clause of the CNF formula in order to derive or approximate one of its minimally redundant sub-clauses. Interestingly, our proposed approach can also take advantage of the modern learning scheme to produce new resolvents that are conditionally added to the formula.
This paper is organized as follows: in the next section, basic notations and definitions about propositional clausal formulae and SAT are provided. In Section 3, different simplification techniques and their practical usefulness are discussed. Next, particular forms of resolution hidden by unit propagation are presented, together with an incomplete method that can produce them. The resulting preprocessor is detailed and evaluated in Section 4. Finally, we conclude the paper with some perspectives and further work.
2 DEFINITIONS AND NOTATIONS
We briefly state here some definitions and notations used in the rest of this paper. A propositional formula is in conjunctive normal form (CNF for short) if it can be represented using a set (interpreted as a conjunction) of clauses, where a clause is a set (interpreted as a disjunction) of literals, a literal being a propositional variable or its negation. The set of variables that appear in a CNF formula Σ will be denoted
Var(Σ). Lit(Σ) is defined as the set {x, ¬x | x ∈ Var(Σ)}. For a set of literals L, $\bar{L}$ is defined as $\{\bar{l} \mid l \in L\}$. An interpretation ρ of a CNF formula Σ is an application from Var(Σ) to the set of truth values {true, false}. It is called a model iff it gives the value true to Σ (in short ρ |= Σ). SAT is the problem of deciding whether a given CNF formula admits a model or not.
Let $c_a = \{l_{a_1}, \ldots, l_{a_n}, l\}$ and $c_b = \{l_{b_1}, \ldots, l_{b_m}, \neg l\}$ be two clauses. The clause $c = \{l_{a_1}, \ldots, l_{a_n}, l_{b_1}, \ldots, l_{b_m}\}$ is a logical consequence (called resolvent) of $c_a$ and $c_b$. This production rule is called resolution and is denoted ⊗R; we note the resolvent c as $c_a \otimes_R c_b$. Most of the techniques used for solving SAT (e.g. DP-like procedures, unit propagation, learning schemes, etc.) are based on implicit or explicit application of resolution. This is clearly the case for most preprocessors, including the one presented in this paper.
Let c and c′ be two clauses of Σ. We say that a clause c′ (resp. c) subsumes (resp. is subsumed by) c (resp. c′) iff c′ ⊂ c. Subsumed clauses can be removed from Σ while preserving satisfiability. Given x ∈ Lit(Σ), we define Σ|x as the formula simplified by the assignment of x to true. We recursively define UP(Σ) as follows: (1) UP(Σ) = Σ if Σ does not contain unit clauses; (2) UP(Σ) = ⊥ if Σ contains two unit clauses {x} and {¬x}; (3) otherwise, UP(Σ) = UP(Σ|x) with x a literal appearing in a unit clause of Σ. A clause c is implied by unit propagation from Σ, denoted Σ |=UP c, if $UP(\Sigma|_{\bar{c}}) = \perp$.
In the next section, the main preprocessing strategies are discussed, and a limited form of resolution that produces more constrained clauses than the original ones is presented.
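As a tiny executable rendering of the resolution rule just defined (DIMACS-style integer literals, with -v denoting the negation of v; the function name is ours):

def resolve(ca, cb, l):
    # Resolvent of ca and cb on literal l, with l in ca and -l in cb.
    assert l in ca and -l in cb
    return (ca - {l}) | (cb - {-l})

# {1, 2, 5} resolved with {3, -5} on literal 5 gives {1, 2, 3}:
assert resolve(frozenset({1, 2, 5}), frozenset({3, -5}), 5) == frozenset({1, 2, 3})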
3 PREPROCESSING CNF FORMULAE
3.1 Adding and/or removing clauses?
Two main categories of preprocessors have been proposed. The first one aims at eliminating variables through a partial application of the DP procedure [5]: only variables which can be eliminated while keeping the formula within a “reasonable” size (w.r.t. the original size) are exhaustively processed by resolution. SatElite belongs to this category of preprocessors. The principle of the second category is to modify the original formula by adding and/or removing clauses, usually keeping the whole set of variables. Most of the time, the production of new clauses is made by resolution. For instance, HyPre performs hyper resolution to produce binary clauses [1] that are added to the formula. These new clauses represent redundant information with respect to the original CNF formula, and this information seems to generally help solvers. Recently, a new approach introduced in [7] aims at removing from a formula some of the redundant clauses, namely clauses c of Σ s.t. Σ\{c} |= c. Obviously enough, performing such a test is computationally intractable. Therefore, this redundancy is only checked through unit propagation. As a consequence, this approach is incomplete, but it is able to remove some redundant clauses in polynomial time. As with other clause-filtering techniques, the resulting preprocessor can sometimes slow down the whole resolution process because of the removal of some important redundant clauses. The main problem with those techniques is that it is hard to characterize which redundant clauses are useful. Indeed, a trade-off has to be made between the management of a large
number of clauses, which slows down DPLL implementations, and their relevance, namely their ability to trigger propagations. Indeed, it is well known that redundant information can actually help SAT solvers; for instance, the powerful learning scheme, which produces a particular resolvent clause after each conflict, can be viewed as a dynamic addition of redundant clauses during search. This learning strategy is now known to be one of the key features of modern solvers, which proves the interest of redundant information with respect to practical SAT resolution. Nevertheless, a simple experiment which consists of adding all learnt clauses to a CNF formula after its resolution shows that this new redundant information generally makes the formula more difficult to solve. Hence, how can we ensure that a particular clause-adding approach can effectively boost a given SAT solver?
A priori, one interesting option is the efficient generation of sub-clauses from the original CNF. In this way, there is neither addition nor removal of any clause, but the substitution of existing clauses by more constrained ones. In current solvers, this computation would have great advantages: it would not only increase the number of unit propagations with no more clauses to manage, but would also lead to shorter learnt clauses during the search by reducing the reasons of the propagated literals. Several techniques have already been proposed to generate sub-clauses. For instance, it is proposed in [4] to explore the implication graph to generate resolvent clauses and to only take into account the ones which subsume at least one original clause of the CNF formula, in order to substitute this latter clause by the shorter produced one. Actually, this computation is exponential in the worst case, and a weaker polynomial version restricted to a single literal assignment is proposed. In the next section, a new approach that aims at checking more systematically whether a clause can be shortened is presented.
3.2 One answer: shorten existing clauses
The way a problem is encoded in CNF is crucial for its practical resolution, and can lead to exponential differences in resource requirements. Analyzing the different kinds of modelling is now an active path of research (see e.g. [8]). However, even with “good” modelling, some clauses might be redundant. A clause is redundant if it can be inferred from the remaining part of the CNF formula. In our approach, the redundancy check is only used to shorten clauses by eliminating some redundant literals. However, checking whether a clause is redundant is coNP-complete [11]. Hence, an incomplete but linear-time deduction strategy has been adopted: this check is performed with respect to unit propagation only. More formally, a clause c of Σ is redundant modulo unit propagation (in short RedUP(Σ, c)) iff Σ\{c} |=UP c. Obviously, if RedUP(Σ, c′) and c′ ⊂ c, then we also have RedUP(Σ, c); the converse is not true. This observation leads us to a new definition of minimal redundancy of clauses: we say that a clause c of Σ is minimally redundant modulo UP iff there is no c′ ⊂ c s.t. RedUP(Σ, c′). One of the main goals behind our vivification process is to find, for each redundant clause, one of its minimally redundant sub-clauses. In practice, a clause to be checked for shortening is removed from the CNF formula, and the opposites of its literals are assigned one by one according to their lexicographic
ordering. Let Σ be a CNF formula and c = {l1 , l2 , . . . , ln } a clause of Σ, and assume that the order in which the literals are assigned is (¬l1 , . . . , ¬ln ). Two possible cases may occur:
1. ∃i ∈ {1, . . . , n − 1} s.t. Σ\{c} ∪ {¬l1 , . . . , ¬li } |=UP ⊥. In this case, we have Σ\{c} |=UP c′ with c′ = (l1 ∨ . . . ∨ li ). This new clause c′ strictly subsumes c; hence, the original clause can be substituted by the newly deduced one. Obviously, c′ is not necessarily minimally redundant modulo UP: another ordering on the literals {l1 , l2 , . . . , li } might lead to an even shorter sub-clause. Thanks to conflict analysis, the deduced sub-clause c′ can be shortened further: a new clause η can be generated by a complete traversal of the implication graph associated with Σ and the assignments of the literals {¬l1 , . . . , ¬li }. The complete traversal of the implication graph ensures that the clause η contains only literals from c′; thereby, η is a sub-clause of (l1 ∨ . . . ∨ li ).
2. Otherwise, as unit propagation is performed after each assignment, if one of the remaining literals is assigned by this filtering operation, then a sub-clause is produced. Trivially, when this phenomenon occurs, the propagated literal is either assigned positively (it satisfies the removed clause of the CNF formula) or negatively (it is falsified in this clause). Considering i and j with 1 ≤ i < j ≤ n, the two possible cases are:
• Σ\{c} ∪ {¬l1 , . . . , ¬li } |=UP ¬lj . In this case, we can deduce Σ\{c} |=UP (l1 ∨ . . . ∨ li ∨ ¬lj ). Applying resolution between this new clause and c (on the variable lj ), we obtain (l1 ∨ . . . ∨ lj ∨ . . . ∨ ln ) ⊗R (l1 ∨ . . . ∨ li ∨ ¬lj ) = (l1 ∨ . . . ∨ lj−1 ∨ lj+1 ∨ . . . ∨ ln ). This new clause clearly subsumes c; hence, the original clause can be substituted by the newly deduced one.
• Σ\{c} ∪ {¬l1 , . . . , ¬li } |=UP lj . In this case, we can deduce Σ\{c} |=UP (l1 ∨ . . . ∨ li ∨ lj ). Here too, the produced clause subsumes c and makes it possible to “remove” literals from it.
Accordingly, from the iterative assignments of the opposite literals of a clause, a reduced clause can be produced. This computation can clearly be integrated into a modern SAT solver, and can benefit from lazy data structures. Moreover, during such a search, some assignments may lead to a conflict. As explained above, when this occurs, the procedure can use the conflict analysis implemented in current solvers to produce smaller sub-clauses in polynomial time. Using the previous rules and the learning feature of SAT solvers, a CNF formula can be vivified, namely made easier to solve. In the next section, we present the practical implementation that has been made, based on the previous ideas.
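The case analysis above translates almost directly into code. The following self-contained Python sketch (all names ours; the conflict-analysis refinement via η is omitted) vivifies one clause with a naive unit propagator:

def unit_propagate(clauses, lits):
    # Close the literal set under unit propagation; literals are non-zero
    # ints, -v negating v. Returns (ok, lits), ok = False on a conflict.
    lits = set(lits)
    changed = True
    while changed:
        changed = False
        for cla in clauses:
            if any(l in lits for l in cla):
                continue                    # clause already satisfied
            free = [l for l in cla if -l not in lits]
            if not free:
                return False, lits          # conflict: clause falsified
            if len(free) == 1:
                lits.add(free[0])           # unit clause: propagate
                changed = True
    return True, lits

def vivify_clause(clauses, c):
    # Assign the opposite of c's literals one by one and watch what
    # unit propagation produces, as in the two cases above.
    rest = [cl for cl in clauses if cl != c]
    for i in range(1, len(c) + 1):
        ok, lits = unit_propagate(rest, [-x for x in c[:i]])
        if not ok:                          # case 1: UP derives a conflict
            return c[:i]                    # (l1 or ... or li) subsumes c
        for lj in c[i:]:
            if lj in lits:                  # case 2: lj propagated to true
                return c[:i] + (lj,)
            if -lj in lits:                 # case 2: lj propagated to false
                return tuple(x for x in c if x != lj)
    return c                                # no reduction found

For instance, with clauses = [(1, 2), (-1, 3), (2, 3, 4)], vivifying c = (2, 3, 4) assigns ¬2, unit-propagates 1 and then 3, and returns the sub-clause (2, 3): the literal 4 was redundant.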
4 CNF FORMULAE VIVIFICATION
4.1 Technical choices
In this section, different practical parameters are discussed, some of them resulting from extensive experiments. First, the ideas proposed in the previous section imply testing the clauses of a formula in order to shorten some of them. However, if a literal is actually removed from a clause, new propagations can be performed using this clause, meaning that all
the failed tests made on previous clauses could then succeed with this shortened clause. Hence, whenever a test succeeds in producing a sub-clause, all other clauses are checked again in a new iteration of the procedure.
Second, with the presented sub-clause production technique, the order in which the literals are assigned matters. Clearly, to ensure a maximal clause reduction, one would have to check all possible orders of literals. However, this could lead to a pretty heavy computation; hence, an incomplete strategy that consists of trying only one particular order has been adopted. Actually, a variant of the MOMS branching heuristic [9] is used to sort the literals in order to maximize the number of literals implied by unit propagation. Yet, using only this heuristic makes the order very similar from one iteration to the next. As said previously, a clause is tested again only if at least one other clause has been shortened; keeping only the MOMS ordering does not appear to be a good solution, because the procedure would not benefit from the potential multiple iterations made on each clause. To diversify the search, some randomization is used as follows: assuming that the literals of a clause are sorted with respect to MOMS, two of them are selected randomly and exchanged in this ordering.
Finally, when a conflict occurs, the tested clause c = (l1 ∨ . . . ∨ ln ) is substituted by its sub-clause c′ = (l1 ∨ . . . ∨ li ). As mentioned above, a complete traversal of the implication graph could lead to an even more reduced clause, but for efficiency purposes, this computation is not performed. In our implementation, the classical learning scheme is used to generate a nogood η corresponding to the first UIP. If this new clause η subsumes the sub-clause c′, then c is substituted by η; otherwise, η is only added to the formula if its size (in terms of number of literals) is strictly smaller than the size of the original clause. As the results show, this strategy only adds a small number of nogoods (< 5% of the number of original clauses), which prove useful for the future exhaustive search.
Considering these choices, a new polynomial preprocessor called ReVivAl (for pReprocessing based on Vivification Algorithm) has been developed. This method is described in Algorithm 1. Roughly, for each clause c of an input CNF Σ, c is removed from Σ and the opposite of each of its literals is assigned in turn, with unit propagation (loop from line 5 to 29). Moreover, different checks on the remaining literals (which “should” be unassigned) and on the presence of a conflict are performed, as presented in Section 3.2 (tests on lines 11, 13, 17, 19, 23 and 27). The order in which the literals are selected for assignment is given by the function select_a_literal, which selects the highest literal with respect to our randomized MOMS-like score, where two randomly chosen literals have their scores exchanged. As long as one of the clauses has been reduced (change set to true), the process continues with all the other clauses.
Let us note that our implementation has been integrated into a modern SAT solver, which enables the use of the most recent data structures and mechanisms designed for SAT resolution. Hence, the redundancy test of each clause, performed by a series of assignments, takes advantage of the efficiency of watched literals. In the same way, the conditional addition of clauses is achieved through the “classical” learning functions, usually called by the solver after each conflict.
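A sketch of this randomised ordering (moms_score is an assumed scoring function; the exact MOMS variant used by ReVivAl is not detailed in the paper):

import random

def order_literals(clause, moms_score):
    # Sort by decreasing MOMS-like score, then swap two random positions
    # so that successive iterations explore slightly different orders.
    lits = sorted(clause, key=moms_score, reverse=True)
    if len(lits) > 1:
        i, j = random.sample(range(len(lits)), 2)
        lits[i], lits[j] = lits[j], lits[i]
    return lits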
Exploiting those structures and techniques implemented in exhaustive methods not only leads to an easy implementation of our method within most current solvers, but also provides our approach with their effectiveness for the different performed tests. Our approach is thoroughly evaluated in the following section.

Algorithm 1: Vivification of a CNF formula
Input: Σ : a CNF formula
Output: a vivified CNF formula
1   begin
2     change ←− true ;
3     while change do
4       change ←− false ;
5       foreach c ∈ Σ do
6         Σ ←− Σ\{c} ; Σb ←− Σ ;
7         cb ←− ∅ ; shortened ←− false ;
8         while (Not(shortened) And (c ≠ cb)) do
9           l ←− select_a_literal(c\cb) ;
10          cb ←− cb ∪ {l} ; Σb ←− (Σb ∪ {¬l}) ;
11          if ⊥ ∈ UP(Σb) then
12            cl ←− conflict_analyze_and_learn() ;
13            if cl ⊂ c then
14              Σ ←− Σ ∪ {cl} ;
15              shortened ←− true ;
16            else
17              if |cl| < |c| then Σ ←− Σ ∪ {cl} ; cb ←− c ;
18              end if
19              if c ≠ cb then
20                Σ ←− Σ ∪ {cb} ;
21                shortened ←− true ;
22              end if
23          else if ∃(ls ∈ (c\cb)) s.t. ls ∈ UP(Σb) then
24            if (c\cb) = {ls} then
25              Σ ←− Σ ∪ {cb ∪ {ls}} ;
26              shortened ←− true ;
27          if ∃(ls ∈ (c\cb)) s.t. ¬ls ∈ UP(Σb) then
28            Σ ←− Σ ∪ {c\{ls}} ;
29            shortened ←− true ;
30        if Not(shortened) then Σ ←− Σ ∪ {c} ;
31        else change ←− true ;
32    return Σ ;
33  end
4.2 Empirical Evaluation
We have compared ReVivAl against SatElite, the preprocessor currently considered the best approach. The state-of-the-art SAT solver RSAT [12] has been selected, since it was recognized in the last competition as very well suited to structured problems. All our experiments have been conducted on an Intel Xeon 3GHz under Linux CentOS 4.1 (kernel 2.6.9) with a RAM limit of 2GB. For all experiments, a timeout of 3 hours was enforced. We have compared the preprocessors both on their size reduction and on their impact on the efficiency of RSAT. This comparison has been conducted on a very large set of
benchmarks from the SAT competitions, SAT Race, SATLIB and other sources; more than 5000 instances have been used for these experiments, which needed about 600 days of CPU time. A sample of the experiments, whose examples are referred to in the following, is given in Table 1, but the exhaustive results are available at: http://www.cril.fr/~piette/preprocessor.html. The first main part of Table 1 provides the name of the tested problem together with the number of clauses (#cla) and literals (#lit) it contains. The two other parts of the table are similar (one for each preprocessor), and contain the time of preprocessing in seconds, the size of the resulting formula in terms of number of literals and clauses after the corresponding preliminary computation, and the solving time (in seconds) needed to solve the CNF formula after simplification. In addition, for ReVivAl, the numbers of performed iterations and learnt clauses are provided in the columns “#ite” and “#learnt”, respectively. The best preprocessing on a given instance is the one with the best cumulated preprocessing and solving time.
First, let us focus on benchmarks that can actually be solved by preprocessing alone. Such CNF formulae do exist, including some instances proposed for the SAT competitions and/or the SAT Races. Given the features of the presented preprocessing approaches, when one of them (or both) succeeds in proving the (un)satisfiability of a CNF formula, this clearly means that the CNF formula is solvable in polytime (indicated Polynomial in the table). The interest of such formulae for the empirical evaluation of solvers can be questioned, because they do not exhibit any computational difficulty, which should be the key point of comparison between exhaustive procedures. Among the tested formulae, SatElite (resp. ReVivAl) proves 35 (resp. 167) instances polynomial. Moreover, note that for both preprocessors, those computations are most often performed within a few seconds (see e.g. SAT_dat.k1, ezfact16_3).
Second, let us consider the size of CNF formulae after being preprocessed. Some differences can be observed between both approaches. On the one hand, the purpose of SatElite is to eliminate variables without increasing the size of the CNF formula; thus, the resulting CNF formulae can have about the same number of clauses, but they can exhibit a higher number of literals. On the other hand, ReVivAl tries to minimize the size of clauses and to add a limited number of relevant ones. As a consequence, the simplified formulae sometimes contain slightly more clauses than the original ones, but in general the average number of literals per clause is reduced, making them more exploitable by the solver's unit-propagation mechanism. As an example, on the benchmark alu4mul.miter, which exhibits 30465 clauses and 103040 literals (ratio #lit/#cla = 3.38), SatElite eliminates variables keeping about the same number of clauses and literals, whereas ReVivAl returns a CNF formula with fewer clauses (28992) and a ratio equal to 3.11. Cases where SatElite provides a formula with a much bigger ratio can occur (see e.g. 3pipe_3_ooo and 3bitadd_31), but not with ReVivAl. More generally, discarding the instances that cannot be solved using either of the preprocessors in conjunction with RSAT, a time gap of 18.8% can be observed in favour of ReVivAl. Furthermore, using SatElite, RSAT cannot decide the satisfiability of 2508 instances within 3 hours of CPU time (preprocessing and solving).
Table 1. SatElite VS ReVivAl. For each instance, the table gives its name and size (#cla, #lit) and then, for each preprocessor, the preprocessing time in seconds, the size (#cla, #lit) of the simplified formula, and the time in seconds needed by RSAT to solve the formula after simplification; for ReVivAl, the numbers of performed iterations and of learnt clauses are also given in the columns “#ite” and “#learnt”. “Polynomial” marks instances decided by preprocessing alone; “time out” marks instances not solved within 3 hours.
This difference of 51 instances may not look large, but SAT competitions and races are usually settled by even smaller gaps. However, even though ReVivAl generally has a better effect on CNF formulae than SatElite, counter-examples can obviously be exhibited (see e.g. hanoi5u and abb313GPIA-8-cn). Nevertheless, many classes of SAT instances are typically more sensitive to the ReVivAl process, which is better than SatElite at improving RSAT. For example, on the ezfact-* instances, which encode circuit factorization, on the Composite-*BitPrimes instances, which encode composite numbers (suggested as a challenge to SAT solvers in 1997 by Cook and Mitchell [3]), and on the gripper* planning instances, our approach clearly outperforms SatElite.
5 CONCLUSION
In this paper, ReVivAl, a new preprocessor based on limited forms of resolution and conflict analysis, has been proposed. Our approach, called vivification, makes original use of clause-redundancy checking to produce sub-clauses and to add new relevant clauses obtained through the clause-learning scheme. Its efficiency is illustrated through extensive experiments with a state-of-the-art DPLL solver. A comparison with the best known preprocessing technique shows that ReVivAl achieves interesting improvements, especially on circuit factorization, composite number and planning instances. Our results open many interesting directions for future research. It appears that combining several preprocessors often enables even better improvements. Indeed, a combination of SatElite and ReVivAl obtained particularly interesting results at the SAT-Race 2008 (6th of 19 submitted solvers). A dynamic selection of preprocessors based on automated-tuning approaches is thus a path that should be explored. The periodical use of ReVivAl, for example during restarts, is also a promising future direction.
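To make the vivification idea concrete, the following is a minimal sketch of the core loop, assuming a clause set represented as lists of signed integer literals. It captures only the sub-clause derivation by unit propagation, not the conflict analysis and learnt-clause harvesting that ReVivAl adds on top, and all function names are illustrative.

    def unit_propagate(clauses, assumptions):
        """Close the given assumptions under unit propagation.

        Returns the final set of assigned literals, or None if some clause
        is falsified (a conflict). Literals are signed integers."""
        assigned = set(assumptions)
        changed = True
        while changed:
            changed = False
            for clause in clauses:
                if any(lit in assigned for lit in clause):
                    continue  # clause already satisfied
                pending = [lit for lit in clause if -lit not in assigned]
                if not pending:
                    return None  # every literal falsified: conflict
                if len(pending) == 1 and pending[0] not in assigned:
                    assigned.add(pending[0])  # unit clause: force the literal
                    changed = True
        return assigned

    def vivify(clause, other_clauses):
        """Try to shrink `clause` to a sub-clause (simplified vivification).

        The literals of the clause are assumed false one at a time; if unit
        propagation over the remaining clauses already yields a conflict,
        the literals assumed so far form an implied sub-clause."""
        kept = []
        for lit in clause:
            kept.append(lit)
            if unit_propagate(other_clauses, [-l for l in kept]) is None:
                return kept  # sub-clause found: drop the remaining literals
        return kept  # no shrinking possible

For example, vivify([1, 2, 3], [[1, 2]]) returns [1, 2], since assuming ¬x1 and ¬x2 already falsifies the clause (x1 ∨ x2) by propagation.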
REFERENCES
[1] F. Bacchus and J. Winter, ‘Effective preprocessing with hyper-resolution and equality reduction’, in SAT’03, pp. 341–355, (2003).
[2] Ronen I. Brafman, ‘A simplifier for propositional formulas with many binary clauses’, in IJCAI’01, pp. 515–522, (2001).
[3] S.A. Cook and D.G. Mitchell, ‘Finding hard instances of the satisfiability problem: A survey’, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 35, (1997).
[4] S. Darras, G. Dequen, L. Devendeville, B. Mazure, R. Ostrowski, and L. Sais, ‘Using boolean constraint propagation for sub-clause deduction’, in CP’05, pp. 757–761, (2005).
[5] M. Davis and H. Putnam, ‘A computing procedure for quantification theory’, Journal of the ACM, 7(3), 201–215, (1960).
[6] N. Eén and A. Biere, ‘Effective preprocessing in SAT through variable and clause elimination’, in SAT’05, pp. 61–75, (2005).
[7] O. Fourdrinoy, E. Grégoire, B. Mazure, and L. Sais, ‘Eliminating redundant clauses in SAT instances’, in CP-AI-OR’07, pp. 71–83, (2007).
[8] A. Hertel, P. Hertel, and A. Urquhart, ‘Formalizing dangerous SAT encodings’, in SAT’07, pp. 159–172, (2007).
[9] R. G. Jeroslow and J. Wang, ‘Solving propositional satisfiability problems’, Annals of Mathematics and Artificial Intelligence, 1, 167–187, (1990).
[10] C. Li and Anbulagan, ‘Look-ahead versus look-back for satisfiability problems’, in CP’97, pp. 341–355, (1997).
[11] Paolo Liberatore, ‘Redundancy in logic I: CNF propositional formulae’, Artif. Intell., 163(2), 203–232, (2005).
[12] K. Pipatsrisawat and A. Darwiche, ‘RSAT 2.0: SAT solver description’, Technical Report D–153, Automated Reasoning Group, Computer Science Department, UCLA, (2007).
[13] S. Subbarayan and D. Pradhan, ‘NiVER: Non increasing variable elimination resolution for preprocessing SAT instances’, SAT’04, 276–291, (2004).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-530
Hybrid tractable CSPs which generalize tree structure
Martin C. Cooper¹ and Peter G. Jeavons² and András Z. Salamon³
Abstract. The constraint satisfaction problem (CSP) is a central generic problem in artificial intelligence. Considerable progress has been made in identifying properties which ensure tractability in such problems, such as the property of being tree-structured. In this paper we introduce the broken-triangle property, which allows us to define a hybrid tractable class for this problem which significantly generalizes the class of problems with tree structure. We show that the broken-triangle property is conservative (i.e., it is preserved under domain reduction and hence under arc consistency operations) and that there is a polynomial-time algorithm to determine an ordering of the variables for which the broken-triangle property holds (or to determine that no such ordering exists). We also present a non-conservative extension of the broken-triangle property which is also sufficient to ensure tractability and can be detected in polynomial time. Keywords: constraint satisfaction, tractability, computational complexity, arc consistency.
1 INTRODUCTION
Constraint satisfaction problems with tree structure have been widely studied, and are known to have efficient algorithms [8]. However, tree structure is quite restricted. It is therefore worthwhile exploring more general problem classes, to identify more widely-applicable properties which still allow efficient solution algorithms. A subclass of the general CSP which can be solved in polynomial time, and also identified in polynomial time, is called a tractable subclass. There has been a considerable research effort in identifying tractable subclasses of the CSP over the past decade. Most of this work has focused on one of two general approaches: either identifying forms of constraint which are sufficiently restrictive to ensure tractability no matter how they are combined [2, 9], or else identifying structural properties of constraint networks which ensure tractability no matter what forms of constraint are imposed [7, 5]. The first approach has had considerable success in characterizing precisely which forms of constraint ensure tractability no matter how they are combined. A set of constraint types with this property is called a tractable constraint language. In general it has been shown that any tractable constraint language must have certain algebraic properties known as polymorphisms [13]. A complete characterization of all possible tractable constraint languages has been established in the following cases: conservative constraint languages
IRIT, University of Toulouse III, 31062 Toulouse, France, email: cooper@irit.fr Computing Laboratory, University of Oxford, Oxford, OX1 3QD, UK, email: Peter.Jeavons@comlab.ox.ac.uk Computing Laboratory, University of Oxford, Oxford, OX1 3QD, UK, and The Oxford-Man Institute of Quantitative Finance, 9 Alfred Street, Oxford, OX1 4EH, UK, email: Andras.Salamon@comlab.ox.ac.uk
(i.e. constraint languages containing all unary constraints) [3], and constraint languages over a 2-element domain [17] or a 3-element domain [4]. The second approach has also had considerable success in characterizing precisely which structures of constraint network ensure tractability no matter what constraints are imposed. For the class of problems where the arity of the constraints is bounded by some fixed constant (such as binary constraint problems) it has been shown that (subject to certain technical assumptions) the only class of structures which ensure tractability are structures of bounded tree-width [12]. However, many constraint satisfaction problems do not possess a sufficiently restricted structure or use a sufficiently restricted constraint language to fall into any of these tractable classes. They may still have properties which ensure they can be solved efficiently, but these properties concern both the structure and the form of the constraints. Such properties have sometimes been called hybrid reasons for tractability [16], and they are much less widely-studied and much less well-understood than the language properties and structural properties described above. In this paper we introduce a new hybrid property which we call the broken-triangle property. We show that this property is sufficient to ensure that a CSP instance is tractable, and also show that checking whether an instance has the broken-triangle property can be done in polynomial time. Moreover, we show that all tree-structured CSP instances have this property, as well as many other instances that are not tree-structured (including some with unbounded tree-width). The broken triangle property can be thought of as a kind of transitivity condition. By processing the variables in an appropriate order, an algorithm akin to those used for solving tree-structured CSP instances can be applied to find a solution. Moreover, a suitable such ordering of variables can be found efficiently. The general technique for finding a suitable ordering, and then exploiting it to generate a solution, is discussed in Section 3. Sections 4 to 6 extend these ideas.
2 THE BROKEN TRIANGLE PROPERTY
In this paper we focus on binary constraint satisfaction problems. A binary relation over domains Di and Dj is a subset of Di × Dj. For a binary relation R, the relation rev(R) is defined as {(v, u) | (u, v) ∈ R}. A binary CSP instance consists of a set of variables (where each variable is denoted by a number i ∈ {1, 2, . . . , n}); for each variable i, a domain Di containing possible values for variable i; and a set of constraints, each of the form ⟨(i, j), R⟩, where i and j are variables and R is a relation such that R ⊆ Di × Dj. To simplify the notation we introduce the notion of a canonical constraint relation which combines all of the specified information about a pair of variables i, j.

Definition 1 Suppose i and j are variables of a CSP instance. Denote by Uij the set of constraint relations specified for the (ordered) pair of variables (i, j). The canonical constraint relation between variables i and j will be denoted Rij and is defined as

Rij = ⋂ ( Uij ∪ {rev(R) | R ∈ Uji} ).
The canonical constraint relation Rij contains precisely the pairs of values that are allowed for the variables i and j by all the constraints on i and j. Note that Rij = rev(Rji). If there are no constraints involving i and j, then Rij is the intersection of an empty set, and is defined to be the complete relation Di × Dj. If relation Rij is neither empty nor the complete relation, we say it is proper.

Definition 2 A binary CSP instance satisfies the broken-triangle property (BTP) with respect to (w.r.t.) the variable ordering <, if, for all triples of variables i, j, k such that i < j < k, if (u, v) ∈ Rij, (u, a) ∈ Rik and (v, b) ∈ Rjk, then either (u, b) ∈ Rik or (v, a) ∈ Rjk.

The broken-triangle property can be understood by the implication shown in Figure 1. In this figure, each oval represents the domain of an associated variable, and each line represents a consistent assignment of values for a pair of variables. A line joins element u ∈ Di and element v ∈ Dj if (u, v) ∈ Rij. The BTP on i, j, k simply says that for any “broken triangle” a − u − v − b, as illustrated in Figure 1, there is always a true triangle u − v − c (where c is either a or b). The BTP is similar to but stronger than directional path consistency [18]. It is important to note that the BTP must be satisfied for all triples i < j < k, even if the description of the instance does not specify a constraint between variables i and j. If there is no specified constraint between i and j, then Rij allows all pairs of values. A set of CSP instances may satisfy the broken-triangle property due to the structure of the constraint graph, due to the language of the constraint relations, or due to a combination of these.

Lemma 3 A binary CSP instance satisfies the broken-triangle property with respect to the variable ordering < if and only if, for all triples of variables i < j < k and for all (u, v) ∈ Rij,

(Rik(u) ⊆ Rjk(v)) ∨ (Rjk(v) ⊆ Rik(u)).    (1)

Proof: The condition that either Rik(u) ⊆ Rjk(v) or Rjk(v) ⊆ Rik(u) is equivalent to stating that there do not exist elements a of Rik(u) and b of Rjk(v) such that a ∉ Rjk(v) and b ∉ Rik(u). By the definition of the image of an element in a relation, this in turn is equivalent to the statement that there do not exist a, b ∈ Dk such that (u, a) ∈ Rik, (v, b) ∈ Rjk, (u, b) ∉ Rik and (v, a) ∉ Rjk. Condition (1) therefore exactly forbids the presence of a configuration that would prevent the instance from satisfying the BTP.
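Definition 2 (equivalently, condition (1)) is directly checkable. The sketch below is our own illustration, with a CSP instance encoded as domain and relation dictionaries; a missing relation is read as the complete relation, as stipulated above.

    from itertools import combinations

    def btp_holds(domains, relations, order):
        """Check the broken-triangle property w.r.t. a variable order.

        domains:   dict mapping variable -> iterable of values
        relations: dict mapping (i, j) -> set of allowed value pairs
        order:     list of variables, earliest first"""
        def R(i, j):
            if (i, j) in relations:
                return relations[(i, j)]
            if (j, i) in relations:
                return {(v, u) for (u, v) in relations[(j, i)]}  # rev(R)
            return {(u, v) for u in domains[i] for v in domains[j]}

        for i, j, k in combinations(order, 3):
            Rij, Rik, Rjk = R(i, j), R(i, k), R(j, k)
            for (u, v) in Rij:
                for a in domains[k]:
                    for b in domains[k]:
                        if ((u, a) in Rik and (v, b) in Rjk
                                and (u, b) not in Rik and (v, a) not in Rjk):
                            return False  # broken triangle on (i, j, k)
        return True

The four nested value loops per triple mirror the O(n³d⁴) cost that the proof of Theorem 8 below attributes to this check.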
Using this result we can obtain the following simple sufficient condition for the broken-triangle property.

Lemma 4 A binary CSP instance satisfies the broken-triangle property with respect to a variable ordering < if, for all triples of variables i < j < k, either Rik or Rjk is a complete relation.

Proof: If Rik is a complete relation, then Rik(u) = Dk, while if Rjk is a complete relation, then Rjk(v) = Dk. In either case, by Lemma 3, the instance satisfies the BTP.

Definition 5 A class of CSP instances is called conservative if it is closed under domain restrictions (i.e., the addition of arbitrary unary constraints).

It is easy to verify from the definition that the broken-triangle property is conservative. This has two important benefits. First, the broken-triangle property is invariant under arc consistency operations: if a binary CSP instance satisfies the broken-triangle property, then so does its arc consistency closure. Second, if the broken-triangle property is satisfied on all triples of variables i, j, k belonging to some subset of variables W, then the CSP instance which results when all of the variables not in W have been assigned will satisfy the broken-triangle property, and hence be efficiently solvable.
Figure 1. The broken-triangle property on variables i, j, k.
For an element a ∈ Di , we write Rij (a) to represent {b ∈ Dj : (a, b) ∈ Rij }, the image of a in relation Rij .
3 TRACTABILITY OF BTP INSTANCES
In this section we show that if a CSP instance has the broken-triangle property with respect to some fixed variable ordering, then finding a solution is tractable. Moreover, the problem of finding a suitable ordering if it exists is also tractable. For a binary CSP instance with n variables, let d = max{|D1|, . . . , |Dn|} and let q be the number of constraints.

Definition 6 An assignment of values (u1, . . . , uk) to the first k variables of a binary CSP instance is called consistent if ui ∈ Di whenever 1 ≤ i ≤ k, and (ui, uj) ∈ Rij whenever 1 ≤ i < j ≤ k.

Theorem 7 For any binary CSP instance which satisfies the BTP with respect to some known variable ordering <, it is possible to find a solution in O(d²q) time (or determine that no solution exists).

Proof: By the discussion above, if an instance has the BTP with respect to <, then establishing arc consistency preserves the BTP. Furthermore, it is known that arc consistency can be established in O(d²q) time [1]. If this results in an empty domain, then the instance has no solutions. Therefore, we assume in the following that the CSP instance is arc consistent and has non-empty domains.
We can assign some value u1 ∈ D1 to the first variable, since D1 ≠ ∅. To prove the result it is sufficient to show, for all k = 2, . . . , n, that any consistent assignment (u1, . . . , uk−1) for the first k − 1 variables can be extended to a consistent assignment (u1, . . . , uk) for the first k variables. The case k = 2 follows from arc consistency. By Lemma 3, if i < j < k then either Rik(ui) ⊆ Rjk(uj) or Rjk(uj) ⊆ Rik(ui). Thus the set {Rik(ui) | i < k} is totally ordered by subset inclusion, and hence has a minimal element

Ri0k(ui0) = ⋂i<k Rik(ui)    (2)

for some i0 < k. Since the instance is arc consistent, Ri0k(ui0) ≠ ∅. By the definition of Rik(ui), it follows that (u1, . . . , uk) is a consistent assignment for the first k variables, for any choice of uk ∈ Ri0k(ui0). The time taken to calculate the intersections in (2) is at most O(d²q) overall, since each pair of values must be checked against each relevant constraint.
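The proof is constructive, and under its assumptions (arc consistency, non-empty domains, and the BTP for the given order) the solution-building step can be sketched directly; the encoding is the same illustrative one as in the previous sketch.

    def solve_btp(domains, relations, order):
        """Extend a partial solution variable by variable (cf. Theorem 7)."""
        def R(i, j):
            if (i, j) in relations:
                return relations[(i, j)]
            if (j, i) in relations:
                return {(v, u) for (u, v) in relations[(j, i)]}
            return {(u, v) for u in domains[i] for v in domains[j]}

        solution = {}
        for pos, k in enumerate(order):
            candidates = set(domains[k])
            for i in order[:pos]:
                # Intersect with R_ik(solution[i]), the values of k
                # compatible with the choice already made at i
                candidates &= {b for b in domains[k]
                               if (solution[i], b) in R(i, k)}
            # Arc consistency plus the BTP guarantee non-emptiness here
            solution[k] = next(iter(candidates))
        return solution

Since, by Lemma 3, the images Rik(ui) form a chain under inclusion, the running intersection is just the smallest image and is guaranteed non-empty.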
Theorem 8 The problem of finding a variable ordering for a binary CSP instance such that it satisfies the broken-triangle property with respect to that ordering (or determining that no such ordering exists) is solvable in polynomial time.

Proof: Given a CSP instance P, we define a new CSP instance P′ that has a solution precisely when there exists a suitable variable ordering for P. To construct P′, let O1, . . . , On be variables taking values in {1, . . . , n} representing positions in the ordering. We impose the ternary constraint

Ok < max{Oi, Oj}    (3)

for all triples of variables i < j < k in P such that the broken-triangle property fails to hold for some u ∈ Di, v ∈ Dj, and a, b ∈ Dk. The instance P′ then has a solution precisely if there is an ordering of the variables 1, . . . , n of P which satisfies the broken-triangle property. Note that if the solution obtained represents a partial order (for instance, if Oi and Oj are assigned the same value for some i ≠ j), then it can be extended to a total order which still satisfies all the constraints by using a linear-time topological sort. For each triple of variables in P, the construction of the corresponding constraints in P′ requires O(d⁴) steps to check which constraints to add. There are O(n³) such triples, so constructing instance P′ takes O(n³d⁴) steps, which is polynomial in the size of P. The constraints in P′ are all of the form (3), and such constraints are max-closed [14] (if p1 < max{q1, r1} and p2 < max{q2, r2} then max(p1, p2) < max{max(q1, q2), max(r1, r2)}). Max-closed constraints are a tractable constraint language [14]: any CSP instance with max-closed constraints can be solved by establishing generalized arc consistency [15] and then choosing the maximum element which remains in each variable domain. Since the size of P′ is polynomial in the size of P, it follows that the instance P′ can be solved in time polynomial in the size of P.

Because the BTP is conservative, any pre-processing operations which only perform domain reductions, such as arc consistency, path-inverse consistency [11], or neighbourhood substitution [10, 6], can be applied before looking for a variable ordering for which the broken-triangle property is satisfied; these reduction operations cannot destroy the broken-triangle property, but they can make it more likely to hold (and easier to check).

4 RELATED CLASSES

In this section we will show that the broken-triangle property generalizes two other known tractable classes: one based on language restrictions and one based on structural restrictions. Throughout this section we suppose that the values in the variable domains are totally ordered.

Definition 9 A binary relation Rij is right monotone if ∀b, c ∈ Dj, (a, b) ∈ Rij ∧ b < c ⇒ (a, c) ∈ Rij.

A commonly-used right monotone constraint is the inequality constraint Xi ≤ Xj. The complete relation is also right monotone.

Lemma 10 If the relations Rik, Rjk are both right monotone, then the broken-triangle property is satisfied on the triple of variables i < j < k, whatever the relation Rij.

Proof: Suppose that Rik, Rjk are both right monotone and that (u, v) ∈ Rij, (u, a) ∈ Rik and (v, b) ∈ Rjk. If a < b, then (u, b) ∈ Rik (since Rik is right monotone), and if a = b then (u, b) = (u, a) ∈ Rik trivially; if a > b, then (v, a) ∈ Rjk (since Rjk is right monotone).

Definition 11 Consider a binary CSP instance P. For a given variable ordering <, denote by parents<(k) the set of variables i < k such that Rik is proper.

Definition 12 A binary CSP instance is renamable right monotone with respect to a variable ordering < if, for each k ∈ {2, . . . , n}, there is an ordering of Dk such that Rik is right monotone for every i ∈ parents<(k).

Lemma 13 If a binary CSP instance is renamable right monotone with respect to a variable ordering <, then it satisfies the broken-triangle property with respect to <.

Proof: Suppose the CSP instance is renamable right monotone with respect to variable ordering <, and let k be any variable. Since the instance is renamable right monotone with respect to <, there is an ordering of Dk such that whenever i ∈ parents<(k) then Rik is right monotone. Now suppose i < j < k are variables in this ordering. Then each of Rik and Rjk is either the complete relation (and hence right monotone), or right monotone in its own right. By Lemma 10, the broken-triangle property is satisfied for i, j, k. Since the choice of k was arbitrary, it follows that the instance satisfies the BTP.

Lemma 14 If a CSP instance has a tree structure, then it satisfies the broken-triangle property with respect to any variable ordering in which each node occurs before its children.

Proof: If a CSP instance has tree structure, then any variable ordering < from any designated root to the leaves is such that |parents<(k)| ≤ 1 for every variable k. Hence, by Lemma 4, it satisfies the BTP with respect to that ordering.

Let TREE be the constraint satisfaction problem consisting of all instances that have tree structure, RRM be the CSP consisting of all instances that are renamable right monotone w.r.t. some variable ordering, and BTP be the CSP consisting of all instances which have the broken-triangle property w.r.t. some variable ordering. Note that the class RRM contains instances of arbitrary tree-width, for instance some CSPs where the constraint structure is a grid.
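Definition 9 is likewise easy to operationalize. Below is a small sketch, again with relations as sets of pairs, assuming the domain list is already sorted in the intended order; the encoding is ours, for illustration only.

    def right_monotone(relation, domain_j):
        """Check Definition 9: (a, b) in R and b < c implies (a, c) in R.

        relation: set of (a, b) pairs; domain_j: values of j in order."""
        position = {b: n for n, b in enumerate(domain_j)}
        for (a, b) in relation:
            for c in domain_j[position[b] + 1:]:
                if (a, c) not in relation:
                    return False
        return True

Combined with Lemma 10 and Lemma 13, showing that all proper relations into each variable are right monotone for some reordering of its domain is one cheap way to certify the BTP.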
Figure 2. An instance in BTP that is not in RRM or TREE.
Theorem 15 TREE ⊊ BTP and RRM ⊊ BTP.

Proof: The inclusions follow from Lemma 14 and Lemma 13; the instance shown in Figure 2 establishes the strict separations.
5 ALTERNATIVE CHARACTERIZATION
In this section we consider properties which are both conservative and preserved by taking subproblems. We show that the broken-triangle property is the only such property which ensures that the following desirable behaviour can be guaranteed simply by achieving a certain level of arc-consistency:

Definition 16 A CSP instance is universally backtrack-free with respect to an ordering < of its n variables if ∀k ∈ {2, . . . , n}, any consistent assignment for the first k − 1 variables can be extended to a consistent assignment for the first k variables.

Definition 17 Given a CSP instance I on variables 1, . . . , n, the subproblem I({i1, . . . , im}), where 1 ≤ i1 < i2 < . . . < im ≤ n, is the m-variable CSP instance with domains Di1, . . . , Dim and exactly those constraints of I whose scopes are subsets of {i1, . . . , im}.

Definition 18 A set Σ of CSP instances is inclusion-closed if ∀I ∈ Σ, all subproblems I(M) on subsets M of the variables of I also belong to Σ.

Definition 19 A binary CSP instance is directional arc-consistent with respect to a variable ordering <, if for all pairs of variables i < j, ∀a ∈ Di, ∃b ∈ Dj such that (a, b) ∈ Rij.

Proposition 20 A conservative inclusion-closed set Σ of CSP instances is such that the directional arc-consistency closure DAC(I) of every I ∈ Σ with respect to a variable ordering < is universally backtrack-free with respect to < if and only if ∀I ∈ Σ, DAC(I) satisfies the broken-triangle property with respect to <.

Proof: The argument used in the proof of Theorem 7 shows that if any binary CSP instance satisfies the broken-triangle property then its directional arc-consistency closure is universally backtrack-free. To prove the converse, suppose that Σ is a conservative inclusion-closed set of CSP instances and consider any I ∈ Σ. Since Σ is conservative, DAC(I) also belongs to Σ, since it is obtained from I by a sequence of domain reductions. In the following, we let Di denote the domain of variable i in DAC(I). Consider three variables i < j < k and four domain values u ∈ Di, v ∈ Dj, a, b ∈ Dk such that (u, v) ∈ Rij, (u, a) ∈ Rik and (v, b) ∈ Rjk. Denote by I′ the subproblem of DAC(I) on variables i, j, k and with reduced domain {a, b} for variable k. Establishing directional arc consistency in I′ may reduce the domains of variables i and j, but cannot delete v from the domain of variable j (since it has a support, namely b, at k) nor can it delete u from the domain of variable i (since it has supports at variables j and k). If DAC(I′) is universally backtrack-free, then the consistent assignment (u, v) for the variables (i, j) can be extended to a consistent assignment for (i, j, k), which must be either (u, v, a) or (u, v, b). This corresponds exactly to the definition of the broken-triangle property, and so DAC(I) satisfies the BTP.
6 GENERALIZING THE BTP
In this section we show that a weaker form of the broken-triangle property also implies backtrack-free search. This leads to a larger, but non-conservative, tractable class of CSP instances. Throughout this section, we assume that domains are totally ordered.

Definition 21 A binary CSP instance is min-of-max extendable with respect to the variable ordering <, if for all triples of variables i, j, k such that i < j < k, if (u, v) ∈ Rij, then (u, v, c) is a consistent assignment for (i, j, k), where c = min(max(Rik(u)), max(Rjk(v))). The symmetrically equivalent property max-of-min extendability is defined similarly, with c = max(min(Rik(u)), min(Rjk(v))).

Lemma 22 A binary CSP instance satisfies the broken-triangle property w.r.t. a variable ordering < if and only if it is min-of-max extendable w.r.t. < for all possible domain orderings.

Proof: Suppose that a CSP instance satisfies the broken-triangle property with respect to <, and consider an arbitrary ordering of each of the domains. To prove min-of-max extendability, it suffices to apply the broken-triangle property to a = max(Rik(u)) and b = max(Rjk(v)). Since a and b are maximal, it must be (u, v, min(a, b)) which is the consistent extension of (u, v). To prove the converse, suppose that a CSP instance is min-of-max extendable for all possible domain orderings. For any a, b ∈ Dk, consider an ordering of Dk for which a, b are the two maximal elements. The broken-triangle property then follows from the definition of min-of-max extendability.
Theorem 23 If a binary CSP instance is min-of-max extendable w.r.t. some known variable ordering < and some (possibly unknown) domain orderings, and is also directional arc-consistent with respect to <, then it is universally backtrack-free w.r.t. <, and hence can be solved in polynomial time.

Proof: Suppose that (u1, . . . , uk−1) is a consistent assignment for the variables (1, . . . , k − 1). By directional arc consistency, ∀i < k, Rik(ui) ≠ ∅. This means that c = min{max(Rik(ui)) : 1 ≤ i ≤ k − 1} is well-defined. Let j ∈ {1, . . . , k − 1} be such that c = max(Rjk(uj)). Let i be any variable in {1, . . . , k − 1} − {j}. Applying the definition of min-of-max extendability to variables i, j, k allows us to deduce that (ui, c) ∈ Rik. It follows that ∃uk ∈ Dk
(namely uk = c) such that (u1, . . . , uk) is a consistent assignment for the variables (1, . . . , k). Note that we used the ordering of domain Dk only to prove the existence of a consistent extension (u1, . . . , uk) of (u1, . . . , uk−1). A backtrack-free search algorithm need not necessarily choose uk = c and hence does not need to know the domain orderings.

Theorem 24 The problem of finding a variable ordering for a binary CSP instance with ordered domains such that it is min-of-max extendable w.r.t. that ordering (or determining that no such ordering exists) is solvable in polynomial time.

Proof: The requirements for the ordering are a subset of the requirements for establishing the broken-triangle property. Hence the result can be proved exactly as in the proof of Theorem 8.

We can use Theorem 24 in the following way: given a CSP instance with ordered domains, compute its arc consistency closure, and then test (in polynomial time) whether this reduced instance is min-of-max extendable for some ordering of its variables. If we find such an ordering, then the instance can be solved in polynomial time, by Theorem 23. However, this approach is not guaranteed to find all possible useful variable orderings achieving min-of-max extendability. Since min-of-max extendability is not a conservative property, it may be that, for some variable orderings, the directional arc-consistency closure is min-of-max extendable but the full arc-consistency closure is not (or vice versa). In fact we conjecture that, for a given binary CSP instance with fixed domain orderings, determining whether there exists some variable ordering such that the directional arc-consistency closure is min-of-max extendable with respect to that ordering is NP-complete. We also conjecture that determining whether a CSP instance is min-of-max extendable for some unknown domain orderings, even for a fixed variable ordering, is NP-complete. Finally, we show that min-of-max extendability is a generalization of a previously-identified hybrid tractable class based on row-convex constraints [18].

Definition 25 A CSP instance is row-convex (w.r.t. a fixed variable ordering and fixed domain orderings) if for all pairs of variables i < j, ∀u ∈ Di, Rij(u) is the interval [a, b] for some a, b ∈ Dj.

It is known that a directional path-consistent row-convex binary CSP instance is universally backtrack-free and hence tractable [18]. (However, it should be noted that establishing directional path consistency may destroy row-convexity.) Our interest in this hybrid tractable class is simply to demonstrate that it is a special case of min-of-max extendability.

Proposition 26 If a binary CSP instance is directional path-consistent and row-convex, then it is min-of-max extendable (and also max-of-min extendable).

Proof: Consider the triple of variables i < j < k and suppose that (u, v) ∈ Rij. By directional path consistency, ∃c ∈ Dk such that (u, c) ∈ Rik and (v, c) ∈ Rjk. By row-convexity, Rik(u) and Rjk(v) are intervals in the ordered domain Dk. The existence of c means that these intervals overlap. Both end-points of this overlap provide extensions of (u, v) to a consistent assignment for the variables (i, j, k). One end-point is given by min(max(Rik(u)), max(Rjk(v))), which ensures min-of-max extendability. (The other ensures max-of-min extendability.)
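The extension step in the proof of Theorem 23 is equally mechanical. Here is a sketch under the assumptions of the theorem (ordered domains, directional arc consistency, min-of-max extendability), with R(i, k) a caller-supplied function returning the canonical relation as a set of pairs; names are illustrative.

    def extend_min_of_max(solution, k, earlier, domains, R):
        """Extend a consistent assignment to variable k (cf. Theorem 23).

        solution: dict of values for the variables in `earlier`."""
        images = []
        for i in earlier:
            # R_ik(u_i): values of k compatible with the choice at i
            images.append({b for b in domains[k]
                           if (solution[i], b) in R(i, k)})
        # c = minimum, over the earlier variables, of the maximal
        # compatible value; directional arc consistency makes each
        # image non-empty, so c is well-defined
        c = min(max(img) for img in images)
        return c

Returning c gives one consistent choice; as the proof observes, a backtrack-free solver need not actually know the domain orderings, since c merely witnesses that some consistent value exists.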
7 CONCLUSION
We have described new hybrid tractable classes of binary CSP instances which significantly generalize tree-structured problems as well as previously-identified language-based and hybrid tractable classes. The new classes are based on local properties of ordered triples of variables. Moreover, we have shown that the problem of determining a variable ordering for which these properties hold is solvable in polynomial time. We see this work as a first step towards a complete characterization of all hybrid tractable classes of constraint satisfaction problems.
REFERENCES
[1] C. Bessière and J.-C. Régin, ‘Refining the basic constraint propagation algorithm’, in Proc. IJCAI’01, Seattle, WA, pp. 309–315, (2001).
[2] Andrei Bulatov, Peter Jeavons, and Andrei Krokhin, ‘Classifying the complexity of constraints using finite algebras’, SIAM Journal on Computing, 34(3), 720–742, (2005).
[3] Andrei A. Bulatov, ‘Tractable conservative constraint satisfaction problems’, in Proceedings of 18th IEEE Symposium on Logic in Computer Science (LICS 2003), 22–25 June 2003, Ottawa, Canada, pp. 321–330. IEEE Computer Society, (2003).
[4] Andrei A. Bulatov, ‘A dichotomy theorem for constraint satisfaction problems on a 3-element set’, Journal of the ACM, 53(1), 66–120, (2006).
[5] David Cohen, Peter Jeavons, and Marc Gyssens, ‘A unified theory of structural tractability for constraint satisfaction problems’, Journal of Computer and System Sciences, 74(5), 721–743, (2008).
[6] Martin C. Cooper, ‘Fundamental properties of neighbourhood substitution in constraint satisfaction problems’, Artificial Intelligence, 90(1–2), 1–24, (1997).
[7] R. Dechter and J. Pearl, ‘Network-based heuristics for constraint satisfaction problems’, Artificial Intelligence, 34(1), 1–38, (1987).
[8] Rina Dechter, ‘Tractable structures for constraint satisfaction problems’, in Handbook of Constraint Programming, eds., Francesca Rossi, Peter van Beek, and Toby Walsh, 209–244, Elsevier, (2006).
[9] Tomás Feder and Moshe Y. Vardi, ‘The computational structure of monotone monadic SNP and constraint satisfaction: A study through Datalog and group theory’, SIAM Journal of Computing, 28(1), 57–104, (1998).
[10] Eugene C. Freuder, ‘Eliminating interchangeable values in constraint satisfaction problems’, in Proc. AAAI-91, Anaheim, CA, pp. 227–233, (1991).
[11] Eugene C. Freuder and Charles D. Elfe, ‘Neighborhood inverse consistency preprocessing’, in Proc. AAAI/IAAI-96, Portland, OR, Vol. 1, pp. 202–208, (1996).
[12] Martin Grohe, ‘The structure of tractable constraint satisfaction problems’, in Proceedings of the 31st Symposium on Mathematical Foundations of Computer Science, volume 4162 of Lecture Notes in Computer Science, pp. 58–72. Springer-Verlag, (2006).
[13] P.G. Jeavons, ‘On the algebraic structure of combinatorial problems’, Theoretical Computer Science, 200, 185–204, (1998).
[14] P.G. Jeavons and M.C. Cooper, ‘Tractable constraints on ordered domains’, Artificial Intelligence, 79(2), 327–339, (1995).
[15] R. Mohr and G. Masini, ‘Good old discrete relaxation’, in Proceedings 8th European Conference on Artificial Intelligence (ECAI’88), ed., Y. Kodratoff, pp. 651–656. Pitman, (1988).
[16] J.K. Pearson and P.G. Jeavons, ‘A survey of tractable constraint satisfaction problems’, Technical Report CSD-TR-97-15, Royal Holloway, University of London, (July 1997).
[17] T.J. Schaefer, ‘The complexity of satisfiability problems’, in Proceedings 10th ACM Symposium on Theory of Computing, STOC’78, pp. 216–226, (1978).
[18] Peter van Beek and Rina Dechter, ‘On the minimality and decomposability of row-convex constraint networks’, Journal of the ACM, 42(3), 543–561, (1995).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-535
Justification-Based Non-Clausal Local Search for SAT
Matti Järvisalo and Tommi Junttila and Ilkka Niemelä¹
Abstract. While stochastic local search (SLS) techniques are very efficient in solving hard randomly generated propositional satisfiability (SAT) problem instances, a major challenge is to improve SLS on structured problems. Motivated by heuristics applied in complete circuit-level SAT solvers in electronic design automation, we develop novel SLS techniques by harnessing the concept of justification frontiers. This leads to SLS heuristics which concentrate the search into relevant parts of instances, exploit observability don’t cares and allow for an early stopping criterion. Experiments with a prototype implementation of the framework presented in this paper show up to a four orders of magnitude decrease in the number of moves on real-world bounded model checking instances when compared to WalkSAT on the standard CNF encodings of the instances.
1 INTRODUCTION

Advances in propositional satisfiability (SAT) testing have established SAT based methods as a competitive way of solving combinatorial problems in various domains. Stochastic local search (SLS) methods, such as [16, 15, 10, 3], are very efficient especially in solving randomly generated SAT instances. However, for structural real-world SAT instances complete DPLL based SAT solvers seem to dominate SLS solvers (see, e.g., results of the latest SAT competitions at http://www.satcompetition.org/). Further work on improving SLS techniques for structural problems is needed and, in particular, developing techniques for handling variable dependencies efficiently has been identified as a major challenge [7]. One problem in developing efficient techniques for handling variable dependencies is that typically the most efficient SLS solvers work on the flat CNF input format. Some techniques for CNF level SLS solvers have been developed to utilize propagation during search [2]. However, there seems to be room for novel structure-based SLS techniques exploiting variable dependencies more directly. Indeed, in SAT based approaches, direct CNF encodings of a problem domain are rarely used: the problem at hand is typically encoded with a structure-preserving general propositional formula φ which can then be translated into an equi-satisfiable CNF formula by introducing additional variables for the subformulas of φ. There are also SAT solvers which—instead of demanding CNF translation before solving—work directly on general formulas. Such solvers use Boolean circuits [11] as the compact representation for a general propositional formula in a DAG-like structure. However, such solvers are typically complete DPLL style non-clausal algorithms [5, 8, 9, 17]. Only a few SLS methods have been proposed for
Helsinki University of Technology, Dept. Information and Comp. Sci., Finland. Emails: {matti.jarvisalo,tommi.junttila,ilkka.niemela}@tkk.fi. Research supported by Academy of Finland (#122399 (MJ,IN) and #112016 (TJ)). MJ additionally acknowledges support from the HeCSE graduate school, Emil Aaltonen Foundation, Jenny and Antti Wihuri Foundation, Nokia Foundation, and Foundation for Technology Promotion TES.
general propositional formulas [14, 6, 12]. Common to these SLS approaches is that they attempt to explicitly exploit variable dependencies through independent (or input) variables, i.e., sets of variables such that a truth value assignment for them uniquely determines the truth values of all other variables, by focusing the search on truth assignments of input variables. In this paper we develop a novel non-clausal SLS method for structural SAT problems from a different starting point. Our aim is to bring structure-exploiting techniques into local search for SAT in order to lift the performance of local search SAT solving especially on structural real-world problem domains. We employ Boolean circuits as the representation of general propositional formulas. Motivated by justification frontier heuristics (see e.g. [9]) applied in complete circuit-level SAT solvers in electronic design automation, our search technique looks for a justification for the Boolean circuit instead of focusing on finding a satisfying truth assignment. The idea is to be able to drive local search more top-down in the overall structure of the circuit rather than in a bottom-up mode as is done in local search techniques focusing on input variables. This is achieved by guiding the search using justification frontiers, which enable exploiting observability don’t cares (see e.g. [13]), drive the search to relevant parts of the circuit, and offer an early stopping criterion which allows the search to be ended as soon as the circuit is de facto satisfied, even if no concrete satisfying truth assignment has been found. Experiments with a prototype implementation of the framework presented in this paper show up to a four orders of magnitude decrease in the number of moves on real-world bounded model checking instances when compared to WalkSAT on the standard CNF encodings of the instances. The rest of this paper is organized as follows. First, Boolean circuits and related central concepts are defined (Sect. 2). The proposed justification-based non-clausal SLS method is then described (Sect. 3) and analyzed w.r.t. both CNF level and previous non-clausal methods (Sect. 4). Initial experiments are presented in Sect. 5.
2 CONSTRAINED BOOLEAN CIRCUITS

Boolean circuits offer a natural non-clausal representation for propositional formulas in a compact DAG-like structure with subformula sharing. Rather than translating circuits to CNF for solving the resulting SAT instance by local search, in this work we will work directly on the Boolean circuit representation. A Boolean circuit over a finite set G of gates is a set C of equations of form g := f(g1, . . . , gn), where g, g1, . . . , gn ∈ G and f : {f, t}ⁿ → {f, t} is a Boolean function, with the additional requirements that (i) each g ∈ G appears at most once as the left hand side in the equations in C, and (ii) the underlying directed graph ⟨G(C), E(C)⟩, where E(C) = {⟨g′, g⟩ ∈ G × G | g := f(. . . , g′, . . .) ∈ C}, is acyclic. The set of gates in a circuit C is denoted by G(C). If ⟨g′, g⟩ ∈ E(C), then g′ is a child of g and g is a parent of g′. The
descendant and ancestor relations are defined in the usual way as the transitive closures of the child and parent relations, respectively. If g := f(g1, . . . , gn) is in C, then g is an f-gate (or of type f), otherwise it is an input gate. The set of input gates in C is denoted by inputs(C). A gate with no parents is an output gate. An assignment for C is a (possibly partial) function τ : G → {f, t}. A total assignment τ is consistent with C if τ(g) = f(τ(g1), . . . , τ(gn)) for each g := f(g1, . . . , gn) in C. A constrained Boolean circuit Cα is a pair ⟨C, α⟩, where C is a circuit and α is an assignment for C. Each ⟨g, v⟩ ∈ α is called a constraint where g is constrained to v (typically used for setting an output gate to a truth value). A total assignment τ for C satisfies Cα if (i) τ is consistent with C, and (ii) respects the constraints: τ ⊇ α. If some total assignment satisfies Cα, then Cα is satisfiable and otherwise unsatisfiable. In this work we consider Boolean circuits in which the following Boolean functions are available as gate types.

• NOT(v) is t iff v is f.
• OR(v1, . . . , vn) is t iff at least one of v1, . . . , vn is t.
• AND(v1, . . . , vn) is t iff all v1, . . . , vn are t.
• XOR(v1, v2) is t iff exactly one of v1, v2 is t.
However, notice that the techniques developed in this paper can be adapted for a wider range of types such as equivalence and cardinality gates. In order to keep the presentation and algorithms simpler, we assume that constraints only appear in the output gates of constrained circuits. Any circuit can be rewritten into such a normal form by using the rules in [5].

Example 1 Figure 1 shows a Boolean circuit for a full-adder with the constraint that the carry-out bit c1 is t. One satisfying total assignment for the circuit is

{⟨c1, t⟩, ⟨t1, t⟩, ⟨o0, f⟩, ⟨t2, f⟩, ⟨t3, t⟩, ⟨a0, t⟩, ⟨b0, f⟩, ⟨c0, t⟩}.    (1)
Figure 1. A constrained Boolean circuit Cα, where C = {c1 := OR(t1, t2), t1 := AND(t3, c0), o0 := XOR(t3, c0), t2 := AND(a0, b0), t3 := XOR(a0, b0)} and α = {⟨c1, t⟩}.
The restriction of an assignment τ to a set G′ ⊆ G of gates is defined as usual: τ|G′ = {⟨g, v⟩ ∈ τ | g ∈ G′}. Given a non-input gate g := f(g1, . . . , gn) and a value v ∈ {f, t}, a justification for the pair ⟨g, v⟩ is a partial assignment σ : {g1, . . . , gn} → {f, t} to the children of g such that f(τ(g1), . . . , τ(gn)) = v holds for all extensions τ ⊃ σ. That is, the values assigned by σ to the children of g are enough to force g to have the value v. A gate g is justified in an assignment τ if it is assigned, i.e. τ(g) is defined, and (i) it is an input gate, or (ii) g := f(g1, . . . , gn) ∈ C and τ|{g1,...,gn} is a justification for ⟨g, τ(g)⟩. For example, consider the gate t1 in Fig. 1. The possible justifications for ⟨t1, f⟩ are {⟨t3, f⟩}, {⟨t3, f⟩, ⟨c0, t⟩}, {⟨t3, f⟩, ⟨c0, f⟩}, {⟨c0, f⟩}, and {⟨t3, t⟩, ⟨c0, f⟩}; the first and fourth are the subset minimal ones. Gate t1 is justified in the assignment (1). Given a constrained circuit Cα and an assignment τ ⊇ α for C, the justification cone of Cα under τ, denoted by jcone(Cα, τ), is the minimal set of gates satisfying the following requirements.
1. All constrained gates belong to the cone. That is, if ⟨g, v⟩ ∈ α, then g ∈ jcone(Cα, τ).
2. If a justified gate belongs to the cone, then all the gates that participate in some subset minimal justification for the gate are also in the cone. Formally, if g ∈ jcone(Cα, τ) and (i) g is a non-input gate, (ii) g is justified in τ, and (iii) ⟨gi, vi⟩ ∈ σ for some subset minimal justification σ for ⟨g, τ(g)⟩, then gi ∈ jcone(Cα, τ).

In principle it would be sufficient to consider only one, arbitrarily chosen subset minimal justification. However, such a formalization would make jcone(Cα, τ) ambiguously defined. The justification frontier of Cα under τ is the “bottom edge” of the justification cone, i.e. those gates in the cone that are not justified: jfront(Cα, τ) = {g ∈ jcone(Cα, τ) | g is not justified in τ}. A gate g is interesting in τ if it belongs to the frontier jfront(Cα, τ) or is a descendant of a gate in it; the set of all gates interesting in τ is denoted by interest(Cα, τ). A gate g is an (observability) don’t care if it is neither interesting nor in the justification cone jcone(Cα, τ). For instance, consider the constrained circuit Cα in Fig. 1. Under the assignment τ = {⟨c1, t⟩, ⟨t1, t⟩, ⟨o0, f⟩, ⟨t2, f⟩, ⟨t3, t⟩, ⟨a0, f⟩, ⟨b0, f⟩, ⟨c0, t⟩}, the justification cone jcone(Cα, τ) is {c1, t1, t3, c0}, the justification frontier jfront(Cα, τ) is {t3}, interest(Cα, τ) = {t3, a0, b0}, and the gates t2 and o0 are don’t cares.

Proposition 1 If the justification frontier jfront(Cα, τ) is empty for some total assignment τ, then the constrained circuit Cα is satisfiable.

When jfront(Cα, τ) is empty, a satisfying assignment can be obtained by (i) restricting τ to the input gates appearing in the justification cone, i.e. to the gate set jcone(Cα, τ) ∩ inputs(C), (ii) assigning other input gates arbitrary values, and (iii) recursively evaluating the values of non-input gates. Thus, whenever jfront(Cα, τ) is empty, we say that τ de facto satisfies Cα. As an example, the assignment τ = {⟨c1, t⟩, ⟨t1, f⟩, ⟨o0, f⟩, ⟨t2, t⟩, ⟨t3, t⟩, ⟨a0, t⟩, ⟨b0, t⟩, ⟨c0, t⟩} de facto satisfies the constrained circuit Cα in Fig. 1. Also note that if a total truth assignment τ satisfies Cα, then jfront(Cα, τ) is empty.

Translating Circuits to CNF. Each constrained Boolean circuit Cα can be translated into an equi-satisfiable CNF formula cnf(Cα) by applying the standard “Tseitin translation”. In order to obtain a small CNF formula, the idea is to introduce a variable g̃ for each gate g in the circuit, and then to describe the functionality of each gate with a set of clauses. For instance, an AND-gate g := AND(g1, . . . , gn) is translated into the clauses (¬g̃ ∨ g̃1), . . . , (¬g̃ ∨ g̃n), and (g̃ ∨ ¬g̃1 ∨ . . . ∨ ¬g̃n). The constraints are translated into unit clauses: ⟨g, t⟩ ∈ α introduces the unit clause (g̃) and ⟨g, f⟩ ∈ α the negated unit clause (¬g̃).

A Note on Negations. As usual in many SAT algorithms, we will implicitly ignore NOT-gates of form g := NOT(g1); g and g1 are always assumed to have the opposite values. Thus NOT-gates are, for instance, (i) “inlined” in the cnf translation by substituting ¬g̃1 for g̃, and (ii) never counted in an interest set interest(Cα, τ).
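The cone and frontier computations are straightforward to prototype. In the sketch below a circuit is a dict mapping each non-input gate to its type and child list; since the assignment is total, a gate is justified exactly when its value agrees with the evaluation of its children, which for the gate types above coincides with the definition. The encoding and names are ours, for illustration only.

    def gate_value(op, vals):
        """Evaluate one gate type on a list of child values."""
        if op == "AND": return all(vals)
        if op == "OR":  return any(vals)
        if op == "XOR": return vals[0] != vals[1]
        if op == "NOT": return not vals[0]

    def justified(g, circuit, tau):
        """Input gates are always justified; under a total assignment a
        non-input gate is justified iff its value matches its evaluation."""
        if g not in circuit:
            return True
        op, children = circuit[g]
        return gate_value(op, [tau[c] for c in children]) == tau[g]

    def minimal_support(op, children, tau, v):
        """Children occurring in some subset-minimal justification of v."""
        if (op == "AND" and v) or (op == "OR" and not v):
            return list(children)                       # all children needed
        if op == "AND":
            return [c for c in children if not tau[c]]  # any false child
        if op == "OR":
            return [c for c in children if tau[c]]      # any true child
        return list(children)                           # XOR, NOT

    def jfront(circuit, alpha, tau):
        """Justification frontier: unjustified gates in the cone."""
        cone, stack = set(), list(alpha)
        while stack:
            g = stack.pop()
            if g in cone:
                continue
            cone.add(g)
            if g in circuit and justified(g, circuit, tau):
                op, children = circuit[g]
                stack.extend(minimal_support(op, children, tau, tau[g]))
        return {g for g in cone if not justified(g, circuit, tau)}

On the full-adder of Fig. 1 with the assignment displayed above, this returns the frontier {t3}, matching the example.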
3 JUSTIFICATION-BASED NON-CLAUSAL SLS

In contrast to typical local search algorithms for SAT, which work on CNF formulas, we develop justification-based non-clausal stochastic
local search techniques. As typical in clausal SLS, a configuration is described by a total truth assignment. However, our method works directly on general propositional formulas represented as Boolean circuits, and hence a configuration is a total assignment on the gates of the Boolean circuit at hand. In contrast to typical local search for SAT, we exploit—motivated by successful implementations of complete circuit SAT solving techniques (see, e.g., [9])—techniques for detecting justification-based don’t cares within our Boolean circuit SAT local search (BC SLS) framework. This is based on justification frontiers, which guide the search heuristics to concentrate on relevant parts of the instance and, moreover, provide an alternative, early stopping criterion for the search. We demonstrate the novel approach by developing a WalkSAT type algorithm [15] that exploits justification frontiers in guiding search. In the clausal WalkSAT, local moves are based on randomly selecting a clause falsified by the current truth assignment. In our algorithm the role of the falsified clauses is played by the gates in the justification front, i.e., the gates in the justification cone not justified by the current assignment. WalkSAT flips one of the variables in the chosen clause in the greedy move to maximize the decrease in the number of falsified clauses. In our case a greedy move selects a justification for the chosen gate to minimize the number of interesting gates. The resulting method is presented as Algorithm 1. Given a constrained circuit Cα and a noise parameter p ∈ [0, 1] (with p = 0 only greedy moves are made), the algorithm performs local search over the assignment space of all the gates in C (inner loop on lines 3–13).

Algorithm 1 BC SLS
Input: constrained Boolean circuit Cα, parameter p ∈ [0, 1]
Output: a de facto satisfying assignment for Cα or “don’t know”
Explanations: τ: current truth assignment on all gates with τ ⊇ α; δ: next move (a partial assignment)
1: for try := 1 to MAXTRIES(Cα) do
2:   τ := pick an assignment over all gates in C s.t. τ ⊇ α
3:   for move := 1 to MAXMOVES(Cα) do
4:     if jfront(Cα, τ) = ∅ then return τ
5:     Select a random gate g ∈ jfront(Cα, τ)
6:     with probability (1 − p) do  %greedy move
7:       δ := a random justification from those justifications for ⟨g, v⟩ ∈ τ that minimize cost(τ, ·)
8:     otherwise  %non-greedy move (with probability p)
9:       if g is unconstrained in α
10:        δ := {⟨g, ¬v⟩} where ⟨g, v⟩ ∈ τ
11:      else
12:        δ := a random justification for ⟨g, v⟩ ∈ τ
13:    τ := (τ \ {⟨g, ¬w⟩ | ⟨g, w⟩ ∈ δ}) ∪ δ
14: return “don’t know”
3.1 Stopping Criterion Similar to typical CNF level SLS methods, one could terminate the search in BC SLS by applying the standard stopping criterion: when all gates are justified in the current configuration τ , then τ is in itself a satisfying truth assignment for the circuit. However, the justification frontier allows for an early stopping criterion by Proposition 1: when the current front jfront(C α , τ ) is empty (line 4), the current
537
configuration τ de facto satisfies C α . Thus we can obtain from τ a satisfying assignment after the search is terminated by simply evaluating the unconstrained gates in C α by using the values for input gates in τ . This is a stronger stopping criterion than the standard one, since the front is empty whenever the standard one holds, but the opposite does not necessarily hold: the front can be empty even if there are gates in the circuit which are not justified in τ .
3.2 Making Moves For each of the M AX T RIES(C α ) runs of the inner loop, M AX M OVES(C α ) moves are made. The moves exploit structural information and semantics of individual gates for finding a justification for the currently assigned value of a chosen gate (lines 6-12). Given the current configuration τ , we concentrate on making moves on gates in jfront(C α , τ ) by randomly picking a gate g from this set. For a gate g and its current value v in τ , the possible greedy moves are induced by the justifications for g, v. The idea is to minimize the size of the interest set. In other words, the value of the cost function for a move (justification) δ is ˛ ˛ cost(τ, δ) = ˛interest(C α , τ )˛, where τ = (τ \ {g, ¬w | g, w ∈ δ}) ∪ δ. That is, the cost of a move δ is given by the size of the interest set in the configuration τ where for the gates mentioned in δ we use the values in δ instead of those in τ . The move is then selected randomly from those justifications δ for g, v for which the value cost(τ, δ) is smallest over all justifications for g, v. During a non-greedy move (lines 9-12, executed with probability p), we invert the value of the gate g itself whenever this is possible, i.e., when g is not constrained in α. The idea here is to try to escape from possible local minima by more radically changing the justification front, most likely upwards in the circuit structure. In the case that we may not invert the value of g (since it is constrained), the move is chosen randomly from the set of all justifications for g, v ∈ τ .
4 ANALYSIS 4.1 Interest Set Size Driven Greedy Moves Considering greedy moves, the objective function under minimization in BC SLS is cost(τ, ·). Alternatively, one could use the objective of minimizing |jfront(C α , τ )|, since (i) flipping is concentrated on gates in jfront(C α , τ ) and (ii) the stopping criterion jfront(C α , τ ) = ∅ is used. The reasoning behind choosing to minimize the number of gates in interest(C α , τ ) is that it gives a better progress measure than minimizing the number of gates in the justification front. First, notice that the justification front cannot become empty before it reaches a subset of the input gates, since only input gates are justified by default. Now, the size of the interest set gives an upper bound on the number of gates that still need to be justified (the descendants of the gates in the front). Following this intuition, by minimizing the size of the interest set the greedy moves drive the search towards the input gates.
4.2 Comparison with Clausal Methods One of the main advantages of the proposed BC SLS method over clausal local search methods is that BC SLS can exploit observability don’t cares. As an example, consider the circuit in Fig. 2(a), where the gate g1 is constrained to true and the other t and f symbols depict
538
M. Järvisalo et al. / Justification-Based Non-Clausal Local Search for SAT
the current configuration τ. All the gates in the complex subcircuit rooted at the gate g2, except g6, are don't cares under τ. Therefore BC SLS can ignore the subcircuit and terminate after flipping the input gate g5, as the justification front becomes empty. However, assume that we translate the circuit into a CNF formula by using the Tseitin translation cnf given in Sect. 2. If we apply a clausal SLS algorithm such as WalkSAT on the CNF formula, observability don't cares are no longer available, in the sense that the algorithm must find a total truth assignment that simultaneously satisfies all the clauses originating from the subcircuit. This can be a very complex task.

Figure 2. Example circuits: (a) exploiting don't cares; (b) a CNF circuit.
We can also analyze how BC SLS behaves on flat clausal input. To do this, we associate a CNF formula F = C1 ∧ . . . ∧ Ck with a constrained CNF circuit ccirc(F) = ⟨C, α⟩ as follows. Take an input gate g_x for each variable x occurring in F. Now

    C = { g_Ci := OR(g_l1, . . . , g_lm) | Ci = (l1 ∨ . . . ∨ lm) } ∪ { g_¬x := NOT(g_x) | ¬x ∈ ∪_{i=1..k} Ci },

and the constraints force each "clause gate" g_Ci to true: α = { ⟨g_Ci, t⟩ | 1 ≤ i ≤ k }. This is illustrated in Fig. 2(b) for F = (x1 ∨ ¬x2) ∧ (¬x2 ∨ x3 ∨ x4). When BC SLS is run on a CNF circuit, it can only flip input variables. If input gates were excluded from the set interest(C^α, τ) of interesting gates, then |interest(C^α, τ)| would equal the number of unjustified clause gates in the configuration τ. Thus the greedy move cost function cost(τ, ·) would equal the one applied in WalkSAT, measuring the number of clauses fixed/broken by a flip. Since input gates are included in interest(C^α, τ), the BC SLS cost function also measures, in CNF terms, the number of variables occurring in unsatisfied clauses.
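This construction is straightforward to express in code. The sketch below (an illustration, not the authors' implementation) builds the gate definitions and constraints from a DIMACS-style clause list:

    def ccirc(cnf):
        """cnf: list of clauses, each a list of ints (DIMACS-style literals)."""
        gates, constraints = {}, {}
        for clause in cnf:                       # g_{~x} := NOT(g_x) per negative literal
            for lit in clause:
                if lit < 0 and ('n', -lit) not in gates:
                    gates[('n', -lit)] = ('NOT', [('x', -lit)])
        for i, clause in enumerate(cnf):         # g_{C_i} := OR(...) per clause
            children = [('n', -l) if l < 0 else ('x', l) for l in clause]
            gates[('c', i)] = ('OR', children)
            constraints[('c', i)] = True         # alpha forces each clause gate to true
        return gates, constraints

    # Example: F = (x1 v ~x2) & (~x2 v x3 v x4), as in Fig. 2(b)
    gates, alpha = ccirc([[1, -2], [-2, 3, 4]])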
4.3 Comparison with Non-Clausal Methods

The SLS techniques working directly on non-clausal problems closest to our work include [14, 6, 12]. They are all based on the idea of limiting flipping to input (independent) variables, whereas we allow flipping all gates (subformulas) of the problem instance. Moreover, in these approaches the greedy part of the search is driven by a cost function which is substantially different from the justification-based cost function that we employ. Sebastiani [14] generalizes the GSAT heuristic to general propositional formulas and defines the cost function by (implicitly) considering the CNF form cnf(φ) of the general formula φ: the cost of a truth assignment is the number of clauses in cnf(φ) falsified by the assignment. The approaches of Kautz and Selman [6] and Pham et al. [12] both use a Boolean circuit representation of the problem and employ a cost function which, given a truth assignment for the input gates, counts the number of constrained output gates falsified by the assignment. This cost function provides limited guidance to greedy moves in cases where there are few constrained output gates or they are far from the input gates. A worst-case scenario occurs when the Boolean circuit given as input has a single output gate, implying that the cost function can only take the values 0 or 1 for
any flip under any configuration. Such a cost function does not offer much direction for the greedy flips towards a satisfying truth assignment. Our cost function appears to be less sensitive to the number of output gates or their distance from the input gates. This is because the search is based on the concept of a justification frontier which is able to distribute the requirements implied by the constrained output gates deeper in the circuit.
5 EXPERIMENTS

In order to evaluate the ideas behind the BC SLS framework, we have implemented a prototype on top of the bc2cnf Boolean circuit simplifier/CNF translator [4]. The computation of the justification cone is implemented directly by the definition. When making greedy and random moves, justifications are selected from the set of subset-minimal justifications for the gate value; for a true OR-gate and a false AND-gate, the value of a single child is inverted, and for a false OR-gate and a true AND-gate the values of all children are inverted.

As structural benchmarks we use a set of Boolean circuits encoding bounded model checking of asynchronous systems for deadlocks [1], available at http://www.tcs.hut.fi/~mjj/benchmarks/. Although rather easy for current DPLL solvers, these benchmarks are challenging for typical SLS methods. Since our implementation is at present a very preliminary non-incremental one, we compare the number of moves made by WalkSAT and our prototype (see footnote 2). We use WalkSAT since the current prototype—as explained also in Sect. 3—can basically be seen as a justification-based variation of WalkSAT. For running WalkSAT, we apply exactly the same Boolean circuit level simplification in bc2cnf to the circuits as in our prototype (including, e.g., circuit level propagation that is equivalent to unit propagation), and then translate the simplified circuit to CNF with the Tseitin-style translation implemented in bc2cnf. We run both WalkSAT and our prototype implementation with the default noise value p = 0.5 (that is, 50%). To make the evaluation fair (not favoring our prototype), we allow WalkSAT 10^8 moves and limit our implementation to a maximum of 10^6 moves. Each instance is run 9 times without restarts.

The number of gates in the simplified circuits (column #gates), and the number of variables (#vars) and clauses (#clauses) resulting from the standard CNF translation, are given in Table 1. Furthermore, the minimum (min), median (med), and maximum (max) number of moves for each instance is presented. The number of runs without a satisfying truth assignment is given in the column max in parentheses. Additionally, we give the ratio of the number of moves made by our prototype and WalkSAT for the minimum, median, and maximum number of moves done by the solvers. For example, the max/max ratio of 533.43 for the instance speed 1.fsa-b10-s means that the maximum number of moves made by WalkSAT over the nine runs was 533.43 times as large as the maximum number of moves done by our implementation on that instance.

To sum up, the experiments demonstrate the potential of our novel approach for solving structural (non-clausal) SAT instances. A promising observation is that our justification frontier based technique seems to keep the search rather focused as the size of the instance grows, as witnessed by the modestly increasing number of moves. In particular, this compares favorably to WalkSAT, which typically exceeds the cutoff of 10^8 moves as the instance sizes grow.
Footnote 2: The prototype computes the justification front and cone repeatedly in a global, non-incremental way. This naive implementation makes around 80–250 times fewer flips per second (fps) than WalkSAT on instances with 1000–2500 gates. By a careful re-implementation that incrementally computes the front and cone, a very substantial increase in the fps rate is expected.
Table 1. Comparison of a prototype implementation of BC SLS with WalkSAT on the speed 1.fsa, dp 12.fsa, elevator and mmgt benchmark families (39 instances). For each instance the table lists the number of gates in the simplified circuit (#gates), the number of variables (#vars) and clauses (#clauses) of the standard CNF translation, the min/med/max number of moves for BC SLS and for WalkSAT (the number of runs without a satisfying truth assignment in parentheses in the max column), and the relative gain in #moves as min/min, med/med and max/max ratios.
Considering the input-flipping SLS methods in the literature (recall Sect. 4.3), we were unfortunately unable at the moment to obtain implementations of these methods for comparison. Comparing input-flipping methods to our current framework thus remains an important aspect of future work. We also investigated the performance of AdaptNovelty+ [3] on the benchmarks; we omit the precise results here due to space reasons. On the whole, although AdaptNovelty+ does find satisfying truth assignments for more instances than WalkSAT using the cutoff of 10^8 moves, our prototype typically shows a one-to-three orders of magnitude reduction in the number of moves compared to AdaptNovelty+—rather similarly as when compared to WalkSAT.
6 CONCLUSIONS

Motivated by techniques applied in circuit-level SAT solvers in electronic design automation, we present a novel approach to solving structural SAT problems with local search on the non-clausal level. By incorporating justification frontiers, we develop SLS heuristics which concentrate the search into relevant parts of instances, exploit observability don't cares and allow for an early stopping criterion. Encouraged by the potential witnessed by the low move counts of a prototype implementation, we see various directions for further work. We plan to replace the prototype with a proper solver implementation with specialized data structures. For achieving self-tuning of the greediness parameter for effectively escaping from local minima, developing adaptive noise mechanisms [3] for non-clausal SLS is a topic for further work. Another aspect is to investigate the effect of adding local consistency checking (on the circuit level, extending studies on adding propagation to CNF-level SLS [2]) into the framework, and possibly even conflict learning.
REFERENCES
[1] K. Heljanko, 'Bounded reachability checking with process semantics', in CONCUR, volume 2154 of LNCS, pp. 218–232. Springer, (2001).
[2] E.A. Hirsch and A. Kojevnikov, 'UnitWalk: A new SAT solver that uses local search guided by unit clause elimination', Annals of Mathematics and Artificial Intelligence, 43(1), 91–111, (2005).
[3] H.H. Hoos, 'An adaptive noise mechanism for WalkSAT', in AAAI, pp. 655–660, (2002).
[4] T. Junttila. The BC package and a file format for constrained Boolean circuits. http://www.tcs.hut.fi/~tjunttil/bcsat/.
[5] T. Junttila and I. Niemelä, 'Towards an efficient tableau method for Boolean circuit satisfiability checking', in CL 2000, volume 1861 of LNAI, pp. 553–567. Springer, (2000).
[6] H. Kautz, D. McAllester, and B. Selman, 'Exploiting variable dependency in local search', in IJCAI poster session, (1997). http://www.cs.rochester.edu/u/kautz/papers/dagsat.ps.
[7] H.A. Kautz and B. Selman, 'The state of SAT', Discrete Applied Mathematics, 155(12), 1514–1524, (2007).
[8] A. Kuehlmann, M.K. Ganai, and V. Paruthi, 'Circuit-based Boolean reasoning', in DAC, pp. 232–237. ACM, (2001).
[9] A. Kuehlmann, V. Paruthi, F. Krohm, and M.K. Ganai, 'Robust Boolean reasoning for equivalence checking and functional property verification', IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 21(12), 1377–1394, (2002).
[10] D. McAllester, B. Selman, and H. Kautz, 'Evidence for invariants in local search', in AAAI, pp. 321–326, (1997).
[11] C. Papadimitriou, Computational Complexity, Addison-Wesley, 1995.
[12] D.N. Pham, J. Thornton, and A. Sattar, 'Building structure into local search for SAT', in IJCAI, pp. 2359–2364, (2007).
[13] S. Safarpour, A. Veneris, R. Drechsler, and J. Lee, 'Managing don't cares in Boolean satisfiability', in DATE'04. IEEE, (2004).
[14] R. Sebastiani, 'Applying GSAT to non-clausal formulas', Journal of Artificial Intelligence Research, 1, 309–314, (1994).
[15] B. Selman, H.A. Kautz, and B. Cohen, 'Noise strategies for improving local search', in AAAI, pp. 337–343, (1994).
[16] B. Selman, H. Levesque, and D. Mitchell, 'A new method for solving hard satisfiability problems', in AAAI, pp. 440–446, (1992).
[17] C. Thiffault, F. Bacchus, and T. Walsh, 'Solving non-clausal formulas with DPLL search', in CP, volume 3258 of LNCS, pp. 663–678. Springer, (2004).
Multi-valued Pattern Databases

Carlos Linares López¹

¹ Planning and Learning Group, Universidad Carlos III de Madrid. Avda. de la Universidad, 30 - 28911 Leganés, Madrid (Spain), email: carlos.linares@uc3m.es

Abstract. Pattern databases were a major breakthrough in heuristic search, solving hard combinatorial problems various orders of magnitude faster than the state-of-the-art techniques at that time. Since then, they have received a lot of attention. Moreover, pattern databases are also researched in conjunction with other domain-independent techniques for solving planning tasks. However, they are not the only technique for improving heuristic estimates. Although more modest, perimeter search can also lead to significant improvements in the number of generated nodes and the overall running time. Therefore, whether they can be combined or not is a natural and interesting issue. While other researchers have recently proven that a joint application of both ideas (termed multiple goal) leads to no progress at all, it is shown here that there are other alternatives for putting both techniques together—denoted here as multi-valued. This paper shows that multi-valued pattern databases can still improve the performance of standard (or single-valued) pattern databases in practice. It also examines how to enhance memory usage when comparing multi-valued pattern databases in contraposition to various single-valued standard pattern databases.
1 Introduction
Heuristics play a central role in problem-solving by guiding search algorithms towards the goal state from an arbitrary state, anywhere in the state space. Before the conception of pattern databases [1], heuristics were usually either handcrafted or directly derived by relaxing the original constraints of the problem at hand. In other words, pattern databases are an automatic means for deriving heuristic functions which are usually far better informed than others, thus leading to large improvements in the number of nodes generated and the overall running time. However, since pattern databases can take large chunks of main memory, various alternatives have been explored to efficiently use the available memory. On the one hand, it has been shown that pattern databases can be successfully compressed, at least in some domains like the Towers of Hanoi [5]. Also, it has been shown that pattern databases can be mapped re-using the same symbol instead of using a one-to-one mapping [9], as originally suggested. Although pattern databases can lead to further improvements by exploiting some domain-specific properties (e.g. reflections in the definition of the state or intrinsic characterizations of permutation state spaces [7]), they have also been used for solving planning tasks in conjunction with other domain-independent techniques [3], with very good results. In contrast to pattern databases, perimeter search [2, 12] aims at improving an existing heuristic function, instead of automatically
generating a new one. Although this technique has usually been employed for solving large sets of instances with respect to the same target, it could be broadly used when solving problems for distinct goal nodes. Altogether, pattern databases can be used for automatically deriving heuristic functions, and perimeter search serves for improving their estimates. Hence, whether they can be combined or not is an interesting issue which has already been addressed [6]. However, the first results in this regard showed that perimeter search leads to no benefit at all. In this paper, a different technique for combining both ideas is discussed.
2 Background
This section succinctly reviews the main concepts underlying both perimeter search and pattern databases. The interested reader should refer to the cited papers for further information.
2.1 Perimeter Search

Perimeter search was independently and simultaneously introduced in the specialized bibliography by Giovanni Manzini [12] and by John F. Dillenburg and Peter C. Nelson [2]. The key observation of these researchers is that the main problems in bidirectional search come from the fact that both searches progress simultaneously. They proposed, instead, to generate a set of nodes (known as perimeter nodes) around the target node whose descendants exceed a given threshold d (known as the perimeter depth), and only after it has been generated, to start a unidirectional search from the source state until a collision with a perimeter node is detected. From this point of view, perimeter search might be seen as a simpler form of bidirectional search. However, the most prominent feature of this contribution is that it provides a means for automatically improving an existing heuristic function h(·), since the unidirectional search from an arbitrary state n uses the following, better informed, heuristic function h_d:

    h_d(n, t) = min_{m ∈ P_d} { h(n, m) + h*(m, t) }        (1)
where Pd is the perimeter set comprising all nodes generated at depth d from the target, t, and h∗ (m, t) is the optimal cost of reaching the goal from the perimeter node m. Although it can be argued that using perimeter search involves “as many heuristic calculations as there are perimeter nodes” [11], the truth is that this number decreases with the depth of the forward search [12].
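A minimal sketch of equation (1), assuming the perimeter set and its exact costs h*(m, t) were precomputed when the perimeter was generated (the helper names are hypothetical):

    def h_d(n, perimeter, h):
        """perimeter: list of (m, h_star) pairs with h_star = h*(m, t);
        h: the base heuristic function h(n, m)."""
        return min(h(n, m) + h_star for m, h_star in perimeter)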
2.2 Pattern Databases
In their original work [1], Joseph C. Culberson and Jonathan Schaeffer defined patterns as abstractions of the original state space where
each constant appearing in the state space gets replaced by either a dedicated symbol or a special "don't care" symbol. The granularity of the abstraction is defined as the number of constants in the original state being replaced by the same symbol [8]. For example, γ = ⟨3, 3, 2, 1⟩ denotes an abstraction where three constants are replaced by one symbol (say x1); another three are replaced by a new symbol x2; another two constants by a third symbol, x3; and the last constant by a unique symbol, x4. Although it has not been mentioned before in the related literature, it can be easily proven that the number of patterns generated with a given granularity γ is:

    ∏_{i=1}^{|γ|} C_{N − Σ_{j=1}^{i−1} γ_j , γ_i}        (2)
where C_{n,m} is the number of combinations of n elements choose m, and N is the total number of constants in the original state space, so that N = Σ_i γ_i. Thus, the previous granularity gives rise to:
    C_{9,3} × C_{6,3} × C_{3,2} × C_{1,1} = 5,040

different entries. Pattern databases are simply hash tables which store, for every pattern (or arrangement of symbols in the abstracted state), the minimum number of moves required to place the symbols of the abstracted state space in their goal location—also known as the goal pattern. This value can be easily computed with a backwards brute-force breadth-first search from the goal pattern. As such, pattern databases are admissible heuristic functions. The index into the pattern database assigned to each pattern results from a ranking function, which converts each item in a collection into a scalar and is (usually by far) the most expensive operation in searching with pattern databases. Originally, all moves were counted in, so that when comparing the values retrieved from different pattern databases (for a collection of different patterns), the only way of getting an admissible heuristic is just to take the MAX of all values. However, when the constants appearing in the original state space can be split into disjoint sets (as in the N-puzzle or the Towers of Hanoi, but not in the Rubik's cube or the TopSpin puzzle), a far better informed heuristic function can be built by computing the summation of all values [10]. This idea is known as disjoint, or just ADD, pattern databases.
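Equation (2) is easy to check mechanically. The following sketch (an illustration in Python, not code from the paper) counts the patterns induced by a granularity and reproduces the ⟨3, 3, 2, 1⟩ example above:

    from math import comb

    def num_patterns(gamma):
        n = sum(gamma)                     # N: total number of constants
        total, remaining = 1, n
        for g in gamma:
            total *= comb(remaining, g)    # C_{N - sum of previous gamma_j, gamma_i}
            remaining -= g
        return total

    assert num_patterns([3, 3, 2, 1]) == 5040
    assert num_patterns([1]*7 + [9]) == 57657600   # 7 pattern tiles + 9 "don't care" constants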
3 Combining Perimeter Search and Pattern Databases

As mentioned in the introduction, the main contribution of this work consists of discussing a different way than that previously proposed in [6] for putting together both perimeter search and pattern databases.

3.1 Multiple Goal Pattern Databases

The first approach consists of addressing the combination as a multiple goal problem, i.e. a special case of heuristic search where the problem consists of hitting any of the perimeter nodes generated at depth d. A simple, yet beautiful, way of solving this sort of problem with the aid of pattern databases consists of seeding the queue used in the backward breadth-first search with all the perimeter nodes [11]. This way, the pattern database will store a unique value per entry: the minimum distance to all perimeter nodes.

The apparent advantage of this approach is that while standard pattern databases explore the abstracted search space around the goal node, the perimeter generation starts by considering the original state space up to a pre-defined perimeter depth. Nevertheless, the same state can be mapped, in different pattern databases, to different entries which contain the minimum distance to different perimeter nodes, so that comparisons become more difficult. In other words, this idea is likely to produce very poor estimates by comparing the minimum distance to different perimeter nodes. (Indeed, though not explicitly mentioned in [6], the termination condition might also become more difficult now, since it is not strictly true that when various pattern databases return zero a collision with a perimeter node has been detected: maybe they are all referring to different nodes!) Besides, Ariel Felner and Nir Ofek [6] experimentally showed, and empirically proved, that this approach leads to no improvement at all, i.e., it generates the same number of nodes. Their explanation can be intuitively depicted as follows: the only expected benefit under this scheme is that patterns occurring within the perimeter are now assigned better heuristic estimates, since patterns appearing beyond the perimeter set still get the same minimum distance. Comparing the number of patterns within the perimeter with all the plausible patterns gives a very small ratio in favour of this approach.

3.2 Multi-valued Pattern Databases

Instead, it is suggested herein to store separately the distance to each perimeter node in the pattern database, as shown in Figure 1. Thereby, comparisons with respect to the same perimeter node become feasible, leading to a better informed heuristic function, as discussed in Section 2.1. This is, indeed, the most natural way to implement perimeter search. Since every entry contains a vector of values instead of a scalar, this technique is denoted as multi-valued pattern databases, in contraposition to standard single-valued pattern databases, which consist of a unique value per entry.
Figure 1. Seeding a different queue with every perimeter node: for the perimeter nodes m_1, m_2, ..., m_j of target t, entry i of the multi-valued pattern database stores the vector of distances ⟨h_i[1], h_i[2], ..., h_i[j]⟩, one component per perimeter node.
At first glance, it might seem that this approach wastes a lot of space in main memory. However, this is not the case at all in the vast majority of cases. Consider, for example, the 15-Puzzle and a single-valued pattern database consisting of 7 different symbols, i.e. ⟨1, 1, 1, 1, 1, 1, 1, 9⟩. According to equation (2), this yields 57,657,600 different entries. What is the next bigger pattern database that can be built?
• One option consists of augmenting the original pattern database with an additional symbol, that is, taking 8 different constants,
whose granularity is ⟨1, 1, 1, 1, 1, 1, 1, 1, 8⟩. This new, bigger pattern database consists of 518,918,400 entries and is 9 times bigger than the original one.
• Another option consists of mapping an additional constant of the original state space to some symbol already in use. This case is represented by the granularity ⟨1, 1, 1, 1, 1, 1, 2, 8⟩ and originates 259,459,200 entries, 4.5 times more than the original pattern database.
However, the number of perimeter nodes generated in the 15-Puzzle at depth d = 1 and 2 is |P_d| = 2 and 4, respectively. This means that the resulting multi-valued pattern databases are smaller than the single-valued pattern databases created in both cases.

Another consideration tightly related to the size of the resulting pattern databases is the number of ranking operations performed in each case. While the number of nodes to consider simultaneously in multi-valued pattern databases imposes an overhead, they are all retrieved in a row, i.e. with a single ranking operation. This is true because the distances to each perimeter node are stored in contiguous memory locations. However, if different single-valued pattern databases are employed (which altogether take the same space as a multi-valued pattern database), each value must be retrieved separately, so that various ranking operations must be performed. Since ranking is the most expensive operation in pattern databases, this overhead must be taken into account as well.
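As a quick check, the hypothetical num_patterns helper sketched in Sect. 2.2 reproduces the entry counts quoted in the two options above:

    assert num_patterns([1]*8 + [8]) == 518918400      # 9 x 57,657,600
    assert num_patterns([1]*6 + [2, 8]) == 259459200   # 4.5 x 57,657,600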
3.3 Results
Although ADD pattern databases are known to provide more accurate heuristic values, it is not always possible to apply them. Therefore, experiments have been conducted with both ADD and MAX pattern databases. In both cases, the perimeter is generated using a brute-force depth-first search algorithm from the goal which generates all nodes whose descendants have a cost that exceeds the specified perimeter depth. Once the perimeter set is generated, a different queue is seeded with each perimeter node and a backward breadth-first search is issued from every one of them for a given pattern specification. As a result, multi-valued pattern databases (which are as many times bigger than a single-valued pattern database as perimeter nodes were found) are generated. For ease of comparison, the same mapping functions have been programmed. Since sparse mapping incurs prohibitive wastes of space in some cases, a compact mapping has been chosen (for a thorough discussion on the topic, see [4], Section 4.2, page 289). Because IDA* explores the state space in a depth-first fashion, an incremental implementation of the Myrvold and Ruskey ranking algorithm [13] has been developed. The current implementation runs about 20%–30% faster and has no additional memory requirements. Unfortunately, due to space constraints, no further details regarding this algorithm are provided. It is worth mentioning that single-valued pattern databases are expected to be more sensitive to this improvement than multi-valued pattern databases. The reason is that multi-valued pattern databases, being more informed than single-valued ones (according to Section 2.1), will generate and rank fewer nodes.
3.3.1 Multi-valued ADD pattern databases

The domains chosen for experimenting with multi-valued ADD pattern databases are the 15-Puzzle and the 24-Puzzle. In all cases, pattern databases are "blank-preserving" [8], i.e., the blank tile is always mapped to a unique symbol, instead of "blank-increasing"—which consists of mapping the blank tile to the same symbol used by other tiles, such as the "don't care" symbol. Also, reflections about the main diagonal are computed for single-valued pattern databases only if the regular lookup did not exceed the current threshold. Since no domain-dependent feature is exploited for multi-valued pattern databases, results with reflections are provided only for the sake of completeness.

Table 1 shows the mean time elapsed (in seconds) and the total number of generated nodes for solving Korf's test suite, which consists of 100 problems, when using single-valued and multi-valued pattern databases. In the experiments, six different arrangements of pattern databases have been used, where each pattern consists of 5 pattern tiles—pattern database #6 is the one suggested in [4]. In all tables, sPDB denotes single-valued pattern databases; mPDBi stands for multi-valued pattern databases generated with perimeter depth d = i; and, finally, rPDB stands for the same pattern databases as in sPDB but taking advantage of the reflections about the main diagonal.

    5-5-5               #1          #2          #3          #4          #5          #6
    sPDB  (0.0014Gb)    1.28/0.851  0.56/0.324  2.17/1.561  0.79/0.456  2.32/1.634  0.47/0.254
    mPDB2 (0.0058Gb)    0.49/0.188  0.56/0.232  0.92/0.417  0.81/0.364  1.18/0.570  0.64/0.253
    rPDB  (0.0014Gb)    1.01/0.372  0.38/0.111  1.48/0.581  0.49/0.140  1.49/0.585  0.44/0.134

    6-6-3               #1          #2          #3          #4          #5          #6
    sPDB  (0.0107Gb)    0.83/0.509  0.98/0.576  0.42/0.187  0.77/0.452  1.77/1.183  0.39/0.181
    mPDB2 (0.0429Gb)    0.37/0.113  0.45/0.144  0.30/0.086  0.33/0.108  1.00/0.434  0.50/0.181
    rPDB  (0.0107Gb)    0.78/0.248  0.63/0.186  0.36/0.092  0.44/0.119  1.79/0.638  0.24/0.056

Table 1. Experimental results in the 15-Puzzle with 5-5-5 and 6-6-3 PDBs: each cell shows the mean run-time in seconds / total number of generated nodes in thousands of millions (10^9).

Table 1 also shows the same statistics for another six different arrangements of 6-6-3 pattern databases. Pattern database #6 is the one suggested in [4]. Next, Table 2 depicts the same statistics for four different arrangements of 7-8 pattern databases. In this case, pattern database #1 is the one widely suggested in the specialized bibliography and also cited in [4].

                        #1             #2             #3             #4
    sPDB  (0.5369Gb)    0.0368/13.721  0.1338/60.347  0.1374/54.323  0.1687/78.370
    mPDB2 (2.1479Gb)    0.0355/10.262  0.0434/15.374  0.0435/15.502  0.0580/18.349
    rPDB  (0.5369Gb)    0.0088/3.832   0.0993/21.646  0.0846/19.379  0.0939/23.594

Table 2. Experimental results in the 15-Puzzle with 7-8 PDBs: mean run-time in seconds / total number of generated nodes in millions (10^6).
Table 3 shows the same statistics in the 24-Puzzle using two different arrangements of 6-6-6-6 pattern databases at depth 3. Pattern database #1 is the usual reference in this domain, as suggested in [4]. The test set employed consists of the 25 easiest instances of the test
suite detailed in [10], with solution lengths ranging from 81 to 106 moves.

                        #1              #2
    sPDB  (0.4750Gb)    10622.39/3.08   24746.44/7.02
    mPDB3 (4.7501Gb)    6162.81/1.00    12227.45/1.93
    rPDB  (0.4750Gb)    2390.09/0.41    10170.26/1.72

Table 3. Experimental results in the 24-Puzzle with 6-6-6-6 PDBs: mean run-time in seconds / total number of generated nodes in millions of millions (10^12).

When comparing the performance of various single-valued pattern databases versus their multi-valued counterparts, it turns out that the latter usually outperform the former, most remarkably in the 7-8 and 6-6-6-6 cases. But this is not always true—see, for example, PDB #6 in the 6-6-3 case. However, when comparing all running times, the pattern database which resulted in the fastest performance is always a multi-valued pattern database (for example, in the 6-6-3 case, the fastest algorithm uses multi-valued pattern databases arranged as in #3), except in the 5-5-5 case. The fact that for some arrangements multi-valued pattern databases do not outperform their single-valued counterparts while others do can be explained as an effect of the diversity induced by the perimeter nodes. It has been observed that for some arrangements of pattern databases, the blank tile only reaches a few patterns when computing the perimeter nodes. The more pattern databases are affected, the better the heuristic. For example, in the 7-8 PDB #1 of the 15-Puzzle (see footnote 5), allowing the blank to move twice affects both pattern databases. Thus, the resulting multi-valued pattern database outperformed its single-valued counterpart, even though the latter is very accurate for solving this problem. Correspondingly, when computing the multi-valued pattern database of 5-5-5 #6, only one PDB out of three gets updated, thus not leading to any improvement in either the number of nodes generated or the running time.

Footnote 5: In this case, the 15-Puzzle is split into two halves, one above the other. The lower half contains 8 pattern tiles whereas the upper one contains 7, because the blank tile is omitted.

3.3.2 Multi-valued MAX pattern databases

The domain chosen for these experiments is the (N, K)-TopSpin. Max'ing is far less efficient than taking the summation of a few values from different disjoint pattern databases; thus, the sizes of the instances considered here are smaller than the ones shown in the previous paragraphs. The number of pattern databases and the number of tiles they contain in each case is clearly identified in the tables. For example, 6-6 stands for two pattern databases with 6 tiles each. Besides, they always consist of contiguous locations arranged in such a way that the pattern databases are all equidistant, thus minimizing the overlap among them. In all the subsequent experiments, the test suites employed consisted of 100 solvable instances generated by the random application of between 100 and 500 operators.

Table 4 shows the results in the (9, 2)-TopSpin. This puzzle can be solved so fast that in most cases the time spent falls below 0.00 seconds. The number of perimeter nodes generated at depth d = 1 and 2 is |P_d| = 3 and 6, respectively, so that mPDB1 and mPDB2 are 3 and 6 times larger than the corresponding sPDB, whose size is shown below every arrangement. As it can be seen, the overhead imposed by perimeter search clearly pays off through the reduction in the number of nodes generated.

             4-4-4 (0.0086Mb)   5-5-5 (0.0432Mb)   6-6-6 (0.1730Mb)
    sPDB     0.01/5.952         ≤0.00/0.458        ≤0.00/0.104
    mPDB1    0.01/3.964         ≤0.00/0.347        ≤0.00/0.077
    mPDB2    ≤0.00/3.044        ≤0.00/0.263        ≤0.00/0.060

Table 4. Experimental results in the (9, 2)-TopSpin: mean run-time in seconds / total number of generated nodes in tenths of millions (10^5).

Table 5 summarizes the results for both the (12, 2)-TopSpin and the (15, 2)-TopSpin. As can be seen, multi-valued pattern databases solved the problems faster and generated fewer nodes in all cases, without exception.

             (12, 2)-TopSpin                      (15, 2)-TopSpin
             6-6 (1.2689Mb)   8-8 (38.0676Mb)     7-7-7-7-7 (154.6497Mb)
    sPDB     9.94/1.813       0.21/0.021          22.48/2.776
    mPDB1    6.54/1.066       0.16/0.013          20.16/2.114
    mPDB2    6.07/0.795       0.11/0.008          16.95/1.572

Table 5. Experimental results in the (12, 2)-TopSpin and the (15, 2)-TopSpin: mean run-time in seconds / total number of generated nodes in thousands of millions (10^9).

4 Compressing Multi-valued Pattern Databases

From equation (2) it becomes clear that the number of patterns grows rapidly for any granularity. Thus, techniques have been developed for efficiently compressing pattern databases in both lossy and lossless ways [5]. In this section, some preliminary ideas for compressing multi-valued pattern databases are discussed. It should be highlighted that the techniques discussed herein are not incompatible with those introduced in [5].

In spite of the discussion in Section 3.2, the truth is that disjoint (or ADD) multi-valued pattern databases take even less space than it might seem. Consider the 7-8 PDB #1 for the 15-Puzzle generated with perimeter depth d = 1—see footnote 5. It is easy to realize that in the two perimeter nodes generated so far, the inferior half (i.e., the pattern database with 8 tiles) looks exactly the same as in the goal state. Since ADD pattern databases do count all moves of the blank tile, the values stored in the inferior multi-valued pattern database are likely to be the same. Thus, it is only necessary to store two values per entry in the superior pattern database, but only one in the inferior database. This way, the resulting multi-valued pattern databases take twice the space of the smaller single-valued database (the one with 7 tiles) but only once the space of the inferior, larger, single-valued database. This stands for a marginal increase in size of 10%. Even considering larger perimeter depths (say d = 2), it is still possible to apply other compression schemes to multi-valued pattern databases, as discussed below.
This is not true, however, for MAX pattern databases, because in this case only moves of the pattern tiles are taken into account. Nevertheless, it is still possible to compress the resulting multi-valued pattern database by statistically relating the distribution of values for each perimeter node to the distance to the first perimeter node. Let δ_i(j) denote the difference h_i(j) − h_i(1), where h_i(j) is the j-th component of the vector in the i-th entry of a multi-valued pattern database. In other words, δ_i(j) is the difference between the distance to the j-th perimeter node and the distance to the first perimeter node from pattern i. This way, it is possible to compute the vector of differences δ_i(·) for every entry i in a given multi-valued pattern database. Also, it is assumed that P perimeter nodes have been generated. Now, there are two different ways to compress data in a lossy way without sacrificing admissibility:

Traversal compression consists of forcing all h_i(j) values from the same entry i to be equal to the minimum of them all, so that each component j takes a new value ĥ_i(j) computed as follows:

    ĥ_i(j) = h_i(1) + min_{2 ≤ k ≤ P} { δ_i(k) },  for all 2 ≤ j ≤ P.

The expected loss in the accuracy of the resulting heuristic values due to the traversal compression, L_t, can be computed as:

    L_t(δ_i(·)) = Σ_{j=2}^{P} ( h_i(j) − ĥ_i(j) ) · p(δ_i),

where p(δ_i) stands for the probability of occurrence of the vector of differences δ_i. Note that the same vector of differences δ_i can occur in an arbitrary number of entries of the multi-valued pattern database other than the i-th entry. Applying this compression scheme repeatedly, the resulting pattern database becomes exactly the same as the one generated under the multiple goal approach of Section 3.1.
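As a rough illustration of traversal compression (a sketch under the paper's definitions, not the author's implementation; entries are modeled as plain Python lists and P ≥ 2 is assumed):

    def traversal_compress(entry):
        """entry: [h_i(1), ..., h_i(P)] -- distances to the P perimeter nodes."""
        base = entry[0]
        m = min(h - base for h in entry[1:])   # min over delta_i(k), 2 <= k <= P
        # components 2..P are lowered to the same value, preserving admissibility
        return [base] + [base + m] * (len(entry) - 1)

    assert traversal_compress([4, 6, 5]) == [4, 5, 5]   # P = 3 perimeter nodes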
Longitudinal compression merges two different entries, u and v, by forcing their difference vectors, δ_u(·) and δ_v(·), to be the same, so that h_u(i) and h_v(i) take new values, ĥ_u(i) and ĥ_v(i), according to:

    ĥ_u(i) = h_u(1) + min{ δ_u(i), δ_v(i) },

and similarly for ĥ_v. As in the previous case, it is possible to compute the expected loss in the accuracy of the heuristic values that results after a longitudinal compression, L_l, as follows:

    L_l(δ_u(·), δ_v(·)) = Σ_{j=2}^{P} [ ( h_u(j) − ĥ_u(j) ) · p(δ_u) + ( h_v(j) − ĥ_v(j) ) · p(δ_v) ].

Since the preceding expressions allow the measurement of the loss in the accuracy of the heuristic function, they serve for compressing any multi-valued pattern database to any desired ratio of compression degree versus loss of accuracy. In particular, for any upper bound U on the average loss of the heuristic function, an algorithm for efficiently compressing a multi-valued pattern database proceeds in the following fashion: while the average loss is still below U, compute the expected loss of all the traversal compressions, and also the expected loss of all the longitudinal compressions for each pair of entries in the pattern database; next, pick the compression with the minimum expected loss and update the pattern database. Proceeding in this manner, the number of distinct difference vectors δ(·) decreases monotonically at each step. If there are n different vectors of differences when the expected loss reaches the upper bound U, code each entry in the pattern database with one of the indexes in the range [1, n], so that ⌈log2 n⌉ bits are used instead of 8, which is the usual choice. In other words, instead of storing a vector of heuristic estimations to each perimeter node in every entry of the multi-valued pattern database, an index into a small number of δ(·) vectors is attached. Then, when solving a problem, one retrieves the index from the pattern database and applies its δ(·) vector to get the heuristic estimations to all the perimeter nodes. Preliminary experiments in the (N, K)-TopSpin suggest that it is feasible to significantly compress multi-valued pattern databases while still running faster than various single-valued pattern databases and generating far fewer nodes.

5 Summary

Although it might be contrary to intuition, storing various values per entry in a pattern database can outperform the standard, single-valued pattern databases, either ADD or MAX. Furthermore, these databases can be compressed with the techniques outlined in the last section, which are not incompatible with existing techniques for compressing single-valued pattern databases.

Acknowledgements

This work has been partially supported by the Spanish MEC project TIN2005-08945-C06-05 and UC3M-CAM project CCG06-UC3M/TIC-0831.
REFERENCES
[1] Joseph C. Culberson and Jonathan Schaeffer, 'Pattern databases', Computational Intelligence, 14(3), 318–334, (1998).
[2] John F. Dillenburg and Peter C. Nelson, 'Perimeter search', Artificial Intelligence, 65, 165–178, (1994).
[3] Stefan Edelkamp, 'External symbolic heuristic search with pattern databases', in Proceedings of the Fifteenth International Conference on Automated Planning and Scheduling (ICAPS-05), pp. 51–60, Monterey, California, United States, (June 2005).
[4] Ariel Felner, Richard E. Korf, and Sarit Hanan, 'Additive pattern database heuristics', Journal of Artificial Intelligence Research, 22, 279–318, (November 2004).
[5] Ariel Felner, Richard E. Korf, Ram Meshulam, and Robert Holte, 'Compressed pattern databases', Journal of Artificial Intelligence Research, 30, 213–247, (October 2007).
[6] Ariel Felner and Nir Ofek, 'Combining perimeter search and pattern database abstractions', in Proceedings of the Seventh Symposium on Abstraction, Reformulation and Approximation (SARA-07), pp. 155–168, Whistler, Canada, (July 2007).
[7] Ariel Felner, Uzi Zahavi, Jonathan Schaeffer, and Robert C. Holte, 'Dual lookups in pattern databases', in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI-05), pp. 103–108, Edinburgh, Scotland, (July 2005).
[8] Robert Holte, Jack Newton, Ariel Felner, Ram Meshulam, and David Furcy, 'Multiple pattern databases', in Proceedings of the Fourteenth International Conference on Automated Planning and Scheduling (ICAPS-04), pp. 122–131, Whistler, British Columbia, Canada, (June 2004).
[9] Robert C. Holte, Ariel Felner, Jack Newton, Ram Meshulam, and David Furcy, 'Maximizing over multiple pattern databases speeds up heuristic search', Artificial Intelligence, 170(16–17), 1123–1136, (November 2006).
[10] Richard E. Korf and Ariel Felner, 'Disjoint pattern database heuristics', Artificial Intelligence, 134(1–2), 9–22, (2002).
[11] Richard E. Korf and Ariel Felner, 'Recent progress in heuristic search: A case study of the four-peg towers of hanoi problem', in Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI-07), pp. 2324–2329, Hyderabad, India, (January 2007).
[12] Giovanni Manzini, 'BIDA*: an improved perimeter search algorithm', Artificial Intelligence, 75, 347–360, (1995).
[13] W. Myrvold and F. Ruskey, 'Ranking and unranking permutations in linear time', Information Processing Letters, 79, 281–284, (2001).
Using Abstraction in Two-Player Games

Mehdi Samadi, Jonathan Schaeffer¹, Fatemeh Torabi Asr, Majid Samar, Zohreh Azimifar²

¹ Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8, email: {msamadi,jonathan}@cs.ualberta.ca
² Department of Computer Science and Engineering, Shiraz University, Shiraz, Iran, email: {torabi,samar,azimifar}@cs.shirazu.ac.ir

Abstract. For most high-performance two-player game programs, a significant amount of time is devoted to developing the evaluation function. An important issue in this regard is how to take advantage of a large memory. For some two-player games, endgame databases have been an effective way of reducing search effort and introducing accurate values into the search. For some one-player games (puzzles), pattern databases have been effective at improving the quality of the heuristic values used in a search. This paper presents a new approach to using endgame and pattern databases to assist in constructing an evaluation function for two-player games. Via abstraction, single-agent pattern databases are applied to two-player games. Positions in endgame databases are viewed as an abstraction of more complicated positions; database lookups are used as evaluation function features. These ideas are illustrated using Chinese checkers and chess. For each domain, even small databases can be used to produce strong game play. This research has relevance to the recent interest in building general game-playing programs. For two-player applications where pattern and/or endgame databases can be built, abstraction can be used to automatically construct an evaluation function.
1 Introduction and Overview

Almost half a century of AI research into developing high-performance game-playing programs has led to impressive successes, including DEEP BLUE (chess), CHINOOK (checkers), TD-GAMMON (backgammon), LOGISTELLO (Othello), and MAVEN (Scrabble). Research into two-player games is one of the most visible accomplishments in artificial intelligence to date. The success of these programs relied heavily on their ability to search and to use application-specific knowledge. The search component is largely well-understood for two-player games (whether perfect or imperfect information; stochastic or not); usually the effort goes into building a high-performance search engine. The knowledge component varies significantly from domain to domain. Various techniques have been used, including linear regression (as in LOGISTELLO) and temporal difference learning (as in TD-GAMMON). All of them required expert input, especially the DEEP BLUE [10] and CHINOOK [16] programs. Developing these high-performance programs required substantial effort over many years. In all cases a major commitment had to be made to developing the program's evaluation function. The standard way to do this is by hand, using domain experts if available. Typically, the developer (in consultation with the experts) designs multiple evaluation function features and then decides on an appropriate
weighting for them. Usually the weighted features are summed to form the assessment. This technique has proven to be effective, albeit labour intensive. However, this method fails in the case of a new game or one for which there is no expert information available (or no experts). The advent of the annual General Game Playing (GGP) competition at AAAI has made the community more aware of the need for general-purpose solutions rather than custom solutions.

Most high-performance game-playing programs are compute intensive and benefit from faster and/or more CPUs. An important issue is how to take advantage of a large memory. Transposition tables have proven effective for improving search efficiency by eliminating redundancy in the search. However, these tables provide diminishing returns as their size increases [3]. For some two-player games, endgame databases (sometimes called tablebases) have been an effective way of reducing search effort and introducing accurate values into the search. These databases enumerate all positions with a few pieces on the board and compute whether each position is a provable win, loss or draw. Each database position, however, is applicable to only one position.

The single-agent (one-player) world has also wrestled with the memory issue. Pattern databases have been effective for improving the performance of programs that solve numerous optimization problems, including the sliding-tile puzzles and Rubik's Cube [8]. They are similar to endgame databases in that they enumerate a subset of possible piece placings and compute a metric for each (e.g., minimum number of moves to a solution). The databases are effective for two reasons. First, they can be used to provide an improved lower bound on the solution quality. Second, using abstraction, multiple states can be mapped to a single database value, increasing the utility of the databases.

The main theme of this paper is to investigate and propose a new approach to using endgame and pattern databases to assist in automating the construction of an evaluation function for two-player games. The research also carries over to multi-player games, but this is not addressed in this paper. The key idea is to extend the benefits of endgame and pattern databases by using abstraction. Evaluation of a position with N pieces on the board is done by looking up a subset of M < N pieces in the appropriate database. The evaluation function is built by combining the results of multiple lookups and by learning an appropriate weighting of the different lookups. The algorithm is simple and produces surprisingly strong results. Of greater importance is that this is a new, general way to use the databases. The contributions of this research are as follows:

1. Abstraction is used to extend pattern databases (even additive pattern databases) for constructing evaluation functions for a class of two-player games.
2. Pattern-database-based evaluation functions are shown to produce state-of-the-art play in Chinese checkers (10 pieces a side). Against a baseline program containing the latest evaluation
function enhancements, the pattern-database-based program scores 68% to 79% of the possible points.
3. Abstraction is used to extend endgame databases for constructing evaluation functions for a class of two-player games.
4. Chess evaluation functions based on four- and five-piece endgame databases are shown to outplay CRAFTY, the strongest freeware chess program available. On seven- and eight-piece chess endgames, the endgame-database program scores 54% to 80% of the possible points.

Abstraction is a key to extending the utility of the endgame and pattern databases. For domains for which these databases can be constructed, they can be used to build an evaluation function automatically. As the experimental results show, even small databases can be used to produce strong game play.
2 Related Work

Endgame databases have been in use for two-player perfect information games for almost thirty years. They are constructed using retrograde analysis [18]. Chess was the original application domain, where databases for all positions with six or fewer pieces have been built. Endgame databases were essential for solving the game of checkers, where all positions with ten or fewer pieces have been computed [6]. The databases are important because they reduce the search tree and introduce accurate values into the search. Instead of using a heuristic to evaluate these positions (with the associated error), a game-playing program can use the database value (perfect information). The limitation, however, is that each position in the database is applicable to a single position in the search space.

Pattern databases also use retrograde analysis to optimally solve simplified versions of a state space [4]. A single-agent state space is abstracted by simplifying the domain (e.g., only considering a subset of the features) and solving that problem. The solutions to the abstract state are used as lower bounds for solutions to a set of positions in the original search space. For some domains, pattern databases can be constructed so that two or more database lookups can be added together while still preserving the optimality of the combined heuristic [13]. Abstraction means that many states in the original space can use a single state in the pattern database. Pattern databases have been used to improve the quality of the heuristic estimate of the distance to the goal, resulting in many orders of magnitude reduction in the effort required to solve the sliding-tile puzzles and Rubik's Cube [8].

The ideas presented in this paper have great potential for General Game Playing (GGP) programs [9]. A GGP program, given only the rules of the game/puzzle, has to learn to play that game/puzzle well. A major bottleneck to producing strong play is the discovery of an effective evaluation function. Although there is an interesting literature on feature discovery applied to games, to date the successes are small [7]. It is still early days for developing GGP programs, but the state of the art is hard coding into the program several well-known heuristics that have been proven to be effective in a variety of games, and then testing them to see if they are applicable to the current domain [15]. It remains an open problem how to automate the discovery and learning of an effective evaluation function for an arbitrary game.
3 Using Abstraction in Two-Player Games

Abstraction is a mapping from a state in the original search space into a simplified representation of that state. The abstraction is often a relaxation of the state space or a subset of the state. In effect, abstraction maps multiple states in the original state space to a single state in the abstract search space. Information about the abstract state (e.g., solution cost) can be used as a heuristic for the original state (e.g., a bound on the solution cost). Here we give the background notation and definitions using chess as the illustrative domain. Let S be the original search space and S′ be the abstract search space.

Figure 1. Original states and edges mapped to an abstract space: states u, v in the original space S map to φ(u), φ(v) in the abstract space S′, and an action sequence a*(u, v) maps to φ(a)*(φ(u), φ(v)).
Definition 1 (Abstraction Transformation): An abstraction transformation φ : S → S′ maps 1) states u ∈ S to states φ(u) ∈ S′, and 2) actions a in S to actions φ(a) in S′. This is illustrated in Figure 1. Consider the chess endgame of white king, rook, and pawn versus black king and rook (KRPKR). The original space S consists of all valid states where these five pieces can be placed on the board. Any valid subset of the original space can be considered as an abstraction. For example, king and rook versus king (KRK) and king, rook, and pawn versus king (KRPK) are abstractions (simplifications) of KRPKR. For any particular abstraction S′, the search space contains all valid states in the abstract domain (all piece location combinations). The new space S′ is much smaller than the original space S, meaning that a large number of states in S are being mapped to a single state in S′. For instance, for every state in the abstracted KRK space, all board positions in S where the white king, white rook and black king are on the same squares as in S′ are mapped onto a single abstract state (i.e., the white pawn and black rook locations are abstracted away). Actions in S′ contain all valid moves for the pieces that are in the abstracted state.

Definition 2 (Homomorphism): An abstraction transformation φ is a homomorphism if, for every series of actions that transforms state u into state v in S, there is a corresponding transformation from φ(u) to φ(v) in S′. This is illustrated in Figure 1, where a* represents zero or more actions. If there is a solution for a state in the original space S, then the homomorphism property guarantees the existence of a solution in the abstracted space S′. Experimental results indicate that this characteristic can be used to improve search performance in S.

Various abstractions can be generated for a given search problem. The set of relaxing functions is defined as φ = {φ1, φ2, . . . , φn}, where each φi is an abstraction. Define the distance between any two states u and v in the relaxed environment φi as h_φi(u, v). For example, for an endgame or pattern database, v is usually set to a goal state, meaning that h_φi(u, v) is the minimal number of moves needed to achieve the goal. Using off-line processing, the distance from each state in φi to the nearest goal can be computed and saved in a database (using retrograde analysis). For a pattern database (one-player search), the minimal distance to the goal is stored. For an endgame database (two-
player search), the minimal number of moves to win (maximal moves to postpone losing) is recorded. This is the standard way that these databases are constructed. Given a problem instance to solve, all values from those lookup tables are retrieved during the search for further processing. To evaluate a position p from the original space, the relaxed state, φi(p), is computed and the corresponding hφi(p) is retrieved from the database. The abstract values are saved in a heuristic vector h = ⟨hφ1, hφ2, . . . , hφn⟩. The evaluation function value for state p is calculated as a function of h. For example, popular techniques used for two-player evaluation functions include temporal difference learning to linearly combine the hφi values [1], and neural nets to achieve non-linear relations [17].

For example, let us evaluate a position p in the KRPKR chess endgame. In this case, the abstracted states could come from the databases KRPK, KRKR, KRK and KPK. First, for each abstraction, the abstract state is computed and the heuristic value hφi(p) is retrieved from the database. In this case, the black rook is removed and the resulting position is looked up in the KRPK database; the white pawn is removed and the position looked up in the KRKR database; etc. The heuristic value for p could be, for example, the sum of the four abstraction scores.
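As a small, hedged illustration of this scheme, the sketch below computes the heuristic vector h by abstraction lookups and combines it linearly; the weights stand in for coefficients that might be obtained by temporal-difference learning [1], and all names are illustrative rather than the authors' implementation.

def evaluate(p, abstractions, databases, weights):
    # abstractions: list of mappings phi_i from original to abstract states
    # databases:    databases[i] maps phi_i(p) to its stored value h_phi_i(p)
    # weights:      assumed linear-combination coefficients
    h = [databases[i][phi(p)] for i, phi in enumerate(abstractions)]
    # A linear combination of h; a neural net could produce a
    # non-linear combination instead [17].
    return sum(w * hv for w, hv in zip(weights, h))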
4 Experimental Results

In this section, we explore using abstraction to apply pattern database technology to two-player Chinese checkers, and chess endgame database technology to playing more complicated chess endgames. Unlike chess, Chinese checkers has the homomorphism property (the proof is simple, but not shown here for reasons of space).
4.1 Chinese Checkers

Chinese checkers is a 2-6 player game played on a star-shaped board with the squares hexagonally connected. The objective is to move all of one's pieces (or marbles, typically 10) from the player's home zone to the opposite side of the board (the opponent's home zone). Each player moves one marble each turn. A marble can move by rolling to an adjacent position (one of six) or by repeatedly jumping over an adjacent marble, of any color, to an adjacent empty location (the same as jumps in 8 × 8 checkers/draughts). In general, to reach the goal in the shortest possible time, the player should jump his pieces towards the opponent's home zone. Here we limit ourselves to two-player results, although the results presented here scale well to more players (not reported here).

Due to the characteristics of Chinese checkers, three different kinds of abstractions might be considered. Given N pieces on each side of the original game:

1. Playing K ≤ N white pieces against L ≤ N black pieces;
2. Playing K ≤ N white pieces to take them to the opponent's home zone (a pattern database including no opponent's marble); and
3. Playing K ≤ N white pieces against L ≤ N black pieces, but with a constraint that the play concentrates on a partition of the board.

For any given search space, the more position characteristics that are exploited by the set of abstractions, the more likely it is that the combination of abstraction heuristics will be useful for the original problem space. The first two abstractions above have the homomorphism property, and the empirical results indicate that they better approximate the original problem space. In the first abstraction, a subset of
pieces for both players (e.g., the three-piece versus two-piece game) is considered and the minimal number of moves to win (most moves to lose) is used. The second abstraction ignores all the opponent's pieces. This abstraction gives the number of moves required to get all of one's pieces into the opponent's zone. This value is just a heuristic estimate (not a bound), since it does not take into account the possibility of jumping over the opponent's pieces (which precludes it from being a lower bound) and does not take into account interference from the opponent's pieces (precluding it from being an upper bound). Clearly, the first abstraction is a better representation of the original problem space. The third abstraction considers only a part of the board to build a pattern database. For example, the goal of the abstraction can be changed so that the pieces only have to enter the goal area (without caring about where they end up).

The state space for the first abstraction is large; the endgame database of three versus two pieces requires roughly 256MB. The second relaxation strategy makes the search space simpler, allowing for pattern databases that include more pieces on the board. The database for five pieces of the same side requires roughly 25MB, 10% of the size of the first abstraction's database. Our experience with Chinese checkers shows that during the game five cooperating pieces will result in more (and longer) jump moves (hence, fewer moves to reach the goal) than five adversarial pieces. Although the first abstraction looks more natural and seems to better reflect the domain, the second abstraction gives better heuristic values. Thus, here we present only the second and third abstractions.

The baseline for comparison is a Chinese checkers program (10 pieces a side) with all the current state-of-the-art enhancements. The evaluation function is based on the Manhattan distance for each side's pieces to reach the goal area. Recent research has improved on this simple heuristic by adding additional evaluation terms: 1) curved board model, incremental evaluation, left-behind marbles [19]; and 2) learning [11]. All of these features have been implemented in our baseline program.

Experiments consisted of the baseline program playing against a program using a PDB- or endgame-based evaluation function. Each experimental data point consists of a pair of games (switching sides) for each of 25 opening positions (after five random moves have been made). Experiments are reported for search depths of three to five ply (other search-depth results are similar). The branching factor in the middlegame of Chinese checkers is roughly 60-80. Move generation can be expensive because of the combination of jumps for each side. This slows the program down, limiting the search depth that can be achieved in a reasonable amount of time. The average response time for a search depth of six in the middlegame is more than thirty seconds per move (1.5 hours per game). Our reported experiments are limited to depths three through five because of the wide range of experiments performed.

In this paper, we report the results for three interesting heuristic evaluation functions. Numerous functions were experimented with and achieved similar performance to those reported here. For the following abstractions, the pieces were labeled 1 to 10 in a right-to-left, bottom-up manner.
The abstractions used were: PDB(4): four-piece pattern database (second abstraction) with the goal defined as the top four squares in the opponent’s home zone. Three abstractions (three lookups) were used to cover all available ten pieces: pieces 1-4, 4-7, and 7-10. We also tested other lookups on this domain. Obviously increasing the number of lookups can increase the total amount of time to evaluate each node. On the other hand, the overlap of using pieces four and seven in the evaluation function does not have a severe effect on the cost of an
evaluation function. PDB(6): six-piece pattern database (second abstraction) with the goal defined as the top six squares in the opponent's home zone. Two abstractions (two lookups) were used to cover all 10 pieces: pieces 1-6 and 5-10. Again, two pieces are counted twice in an evaluation (pieces 5 and 6), as a consequence of minimizing the execution overhead. PDB(6+4): a probe from the six-piece PDB is added to a probe from the four-piece PDB (a combination of the second and third abstractions). Two abstractions (two lookups) were used to cover all 10 pieces: pieces 1-6 from the PDB(6) and 7-10 from the PDB(4), with its goal defined as passing all pieces from the opponent's front line (third abstraction). In other words, for the four-piece abstraction we delete the top six squares of the board such that the new board setup introduces our new goal. The weighting of each probe is a simplistic linear combination of the abstraction heuristic values.

Table 1. Experiments in Chinese checkers.

Abstraction (Pieces)   Search Depth   Win %
PDB (4)                3              79
PDB (6)                3              68
PDB (6+4)              3              74
PDB (4)                4              69
PDB (6)                4              68
PDB (6+4)              4              80
PDB (4)                5              78
PDB (6)                5              70
PDB (6+4)              5              78
Table 1 presents the results achieved using these abstractions. The three rows of results are given for each of search depths three, four and five. The win percent reflects two points for a win, one for a tie and zero for a loss. Evidently, PDB(6+4) has the best performance, winning about 80% of the games against the baseline program. Perhaps surprisingly, PDB(4) performs very well, even better than PDB(6) does. One would expect PDB(6) to perform better since it implicitly contains more knowledge of the pieces' interactions. However, note that the more pieces on the board, the more frequently long jump sequences will occur. The longer the jump sequence, the smaller the probability that it can be realized, given that there are other pieces on the board. Hence, we conclude that a larger PDB may not be as accurate as a smaller PDB.

The additive evaluation function (using PDB(6+4)) gives the best empirical results. Not only is it comparable to the PDB(4), but it achieves its good performance with one fewer database lookup per evaluation. Although the experiments were done to a fixed search depth (to ensure a uniform comparison between program versions), because of the relative simplicity of the evaluation function an extra database lookup represented a significant increase in execution cost. In part this is due to the pseudo-random nature of accessing the PDB, possibly incurring cache overhead. Our implementation takes advantage of obvious optimizations to eliminate redundant database lookups (e.g., reusing a previous lookup if still applicable). By employing these optimizations, we observed that the times for both heuristic functions are very close, which does not change the results.

Several experiments were also performed using the first abstraction, with three-against-two-piece endgame databases. A program based on the second abstraction (pattern databases with five pieces) significantly outperformed the first abstraction. The values obtained
from five cooperating pieces were a better heuristic predictor than those obtained from the adversarial three-versus-two-piece database. The results reported here do not necessarily represent the best possible. There are numerous combinations of various databases that one can try. The point here is that simple abstraction can be used to build an effective evaluation function. In this example, single-agent pattern databases are used in a new way for two-player heuristic scores.
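The additive lookup of PDB(6+4) is straightforward to implement. Below is a minimal sketch, under the assumption that a position exposes its ten piece locations as a tuple; the attribute name pieces and the dictionary-based databases are illustrative only, not the authors' implementation.

def pdb_6_plus_4(position, pdb6, pdb4):
    # Add a six-piece PDB probe (pieces 1-6) to a four-piece PDB probe
    # (pieces 7-10), as in the PDB(6+4) evaluation described above.
    key6 = tuple(position.pieces[i] for i in range(0, 6))   # pieces 1-6
    key4 = tuple(position.pieces[i] for i in range(6, 10))  # pieces 7-10
    return pdb6[key6] + pdb4[key4]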
4.2 Chess

This section presents experimental results for using four- and five-piece chess endgame databases to play seven- and eight-piece chess endgames. The abstracted state space is constructed using a subset of the available pieces. For example, for the KRPKR endgame one can use the KRK, KRPK, and KRKR subsets of pieces as abstractions of the original position. All the abstractions are looked up in their appropriate database. The endgame databases are publicly available at numerous sites on the Internet. For each position, they contain one of the following values: win (the minimum number of moves to mate the opponent), loss (the longest sequence of moves to be mated by the opponent) or draw (zero). The values retrieved from the abstractions are used as evaluation function features. They are linearly combined; no attempt at learning proper weights has been made yet.

In chess, as opposed to Chinese checkers, ignoring all the opponent's pieces does not improve the performance, given the tight mutual influence the pieces have on each other (i.e., piece captures are possible). Hence pattern databases are unlikely to be effective. One could still use pattern databases for chess, although we expect a learning algorithm would discover a weight of zero for such abstractions. The chess abstraction does not have the homomorphism property because of the mutual interactions among the pieces. In other words, it is possible to win in the original position while not achieving this result in the abstract position. For example, there are many winning positions in the KRPKR endgame, but in the abstraction KRKR almost all states lead to a draw.

Our experiments used the four- and five-piece endgame databases. Note that the state of the art in endgame database construction is six pieces [2]. These databases are too large to fit into RAM, making their access cost prohibitively high. Evaluation functions must be fast, otherwise they can dramatically reduce the search speed. Hence we restrict ourselves to databases that can comfortably fit into less than 1GB of RAM. This work will show that even the small databases can be used to improve the quality of play for complex seven- and eight-piece endgames.

In our experiments the proposed engine (a program consisting solely of an endgame-database-based evaluation) played against the baseline program (as the opponent). Each experimental data point consisted of a pair of games (switching sides) for each of 25 endgame positions. The programs searched to depths of seven and nine ply. Results are reported using four- and five-piece abstractions of seven- and eight-piece endgames. Because of the variety of experiments performed, the search depth was limited to nine.

The baseline considered here is CRAFTY, the strongest freeware chess program available [12]. It has competed in numerous World Computer Chess Championships, often placing near the top of the standings.

Table 2 shows the impact of two parameters on performance: the endgame database size and the search depth. The table gives results for three representative seven-piece endgames. The first column gives the endgame, the second gives the win percentage (as stated before, a win is counted as two, a draw as one and a loss as zero), and
Table 2. Experiments in chess (four-piece and five-piece abstractions).

Endgame      Search Depth   Win %   Abstractions Used
KRPP–KBN     7              60      KPPK, KKBN, KRK
KRPP–KNN     7              68      KRK, KRKP, KRPK, KNKP
KRP–KNPP     7              72      KKPP, KBKP, KPKN, KRK
KRPP–KBN     7              68      KPKBN, KRPKB, KRPKN
KRPP–KNN     7              76      KRPKN, KPPKN, KPKNN
KRP–KNPP     7              80      KRPKN, KRKNP, KPKNP, KPKPP
KRPP–KBN     9              54      KPPK, KKBN, KRK
KRPP–KNN     9              64      KRK, KRKP, KRPK, KNKP
KRP–KNPP     9              70      KKPP, KBKP, KPKN, KRK
KRPP–KBN     9              56      KPKBN, KRPKB, KRPKN
KRPP–KNN     9              68      KRPKN, KPPKN, KPKNN
KRP–KNPP     9              76      KRPKN, KRKNP, KPKNP, KPKPP
the last column shows the abstractions used. The first six lines are for a search depth of seven; the remaining six for a search depth of nine. For each depth, the first three lines show the results for using three- and four-piece databases as abstractions; the last three rows show the results when five-piece databases are used.

CRAFTY was used unchanged. It had access to the same endgame databases as our program, but it only used them when the current position was in the database. For all positions with more pieces, it used its standard endgame evaluation function. In contrast, our program, using abstraction, queried the databases every time a node in the search required evaluation. By eliminating redundant database lookups, the cost of an endgame-database evaluation can be made comparable to that of CRAFTY's evaluation.

Not surprisingly, the five-piece databases had superior performance to the four-piece databases (roughly 8% better at depth seven and 4% better at depth nine). Clearly, these databases are closer to the original position (i.e., less abstract) and hence are more likely to contain relevant information. Further, a significant drawback of small-size abstraction models is the large number of draw states in the database (e.g. KRKR), allowing little opportunity to differentiate between states. The five-piece databases contain fewer draw positions, giving the evaluation function greater ability to discriminate between states. As the search depth is increased, the benefits of the superior evaluation function slightly decrease. This is indeed expected, as the deeper search allows more potential errors by both sides to be avoided. This benefits the weaker program.

Table 3. Experiments for chess.

Position      Search Depth   Win %   Abstractions Used
KQP–KRNP      7              64      KQKRP, KQKNP, KPKRN, KQKRN
KRRPP–KQR     7              76      KQKRP, KQKNP, KPKRN
KRPP–KRN      7              60      KRPKN, KPPKR, KPKRN
KQP–KNNPP     7              76      KPKNN, KQKNN, KQKNP
KQP–KRBPP     7              64      KPKNN, KQKNN, KQKNP
KQP–KRNP      9              64      KQKRP, KQKNP, KPKRN, KQKRN
KRRPP–KQR     9              76      KQKRP, KQKNP, KPKRN
KRPP–KRN      9              64      KRPKN, KPPKR, KPKRN
KQP–KNNPP     9              72      KPKNN, KQKNN, KQKNP
KQP–KRBPP     9              62      KQKRB, KQPKR, KQKRP, KQKBP
Table 3 shows the results for some interesting (and complicated) seven- and eight-piece endgames, all using five-piece abstractions. These represent difficult endgames for humans and computers to play. Again, the endgame-database-based evaluation function is superior to CRAFTY, winning 60% to 76% of the games. This performance is achieved using three or four abstraction lookups, in contrast to CRAFTY's hand-designed rule-based system.

Why is the endgame database abstraction effective? The abstrac-
tion used for chess is, in part, adding heuristic knowledge to the evaluation function about exchanging pieces. In effect, the smaller databases are giving information about the result when pieces come off the board. This biases the program towards lines which result in favorable piece exchanges, and avoids unfavorable ones.
5 Conclusion and Future Work

The research presented in this paper is a step towards increasing the advantages of pre-computed lookup tables for the larger class of multi-agent problem domains. The main contribution of this research was to show that the idea of abstraction can be used to extend the benefits of pre-computed databases for use in new ways in building an accurate evaluation function. For domains for which pattern and/or endgame databases can be constructed, the use of this data can be extended beyond its traditional usage and be used to build an evaluation function automatically. As the experimental results show, even small databases can be used to produce strong game play.

Since 2005, there has been interest in the AI community in building a general game-playing (GGP) program. The application-specific research in building high-performance games is being generalized to handle a wide class of games. Research has already been done in identifying GGP domains for which databases can be built [14]. For those domains, abstraction is a promising way to automatically build an evaluation function. An automated system has been developed to build pattern databases for planning domains, using a bin-packing algorithm to select the appropriate symbolic variables for the pattern database [5]. A similar approach could be used to automatically select variables in GGP to build endgame/pattern databases.
REFERENCES

[1] J. Baxter, A. Tridgell, and L. Weaver, 'Learning to play chess using temporal differences', Machine Learning, 40(3), 243–263, (2000).
[2] E. Bleicher, 2008. http://k4it.de/index.php?topic=egtb&lang=en.
[3] D. Breuker, Memory Versus Search in Games, Ph.D. dissertation, University of Maastricht, 1998.
[4] J. Culberson and J. Schaeffer, 'Searching with pattern databases', in Canadian Conference on AI, pp. 402–416, (1996).
[5] S. Edelkamp, 'Planning with pattern databases', in Proceedings of the 6th European Conference on Planning (ECP-01), pp. 13–34, (2001).
[6] J. Schaeffer et al., 'Checkers is solved', Science, 317(5844), 1518–1522, (2007).
[7] T. Fawcett and P. Utgoff, 'Automatic feature generation for problem solving systems', in ICML, pp. 144–153, (1992).
[8] A. Felner, U. Zahavi, J. Schaeffer, and R. Holte, 'Dual lookups in pattern databases', in IJCAI, pp. 103–108, (2005).
[9] M. Genesereth, N. Love, and B. Pell, 'General game playing: Overview of the AAAI competition', AI Magazine, 26, 62–72, (2005).
[10] F.-h. Hsu, Behind Deep Blue, Princeton University Press, 2002.
[11] Alistair Hutton, Developing Computer Opponents for Chinese Checkers, Master's thesis, University of Glasgow, 2001.
[12] R. Hyatt, 2008. http://www.craftychess.com/.
[13] R. Korf and A. Felner, 'Disjoint pattern database heuristics', Artificial Intelligence, 134, 9–22, (2002).
[14] Arsen Kostenko, Calculating End Game Databases for General Game Playing, Master's thesis, Fakultät Informatik, Technische Universität Dresden, 2007.
[15] G. Kuhlmann and P. Stone, 'Automatic heuristic construction for general game playing', in AAAI, pp. 1457–1462, (2006).
[16] J. Schaeffer, One Jump Ahead, Springer-Verlag, 1997.
[17] G. Tesauro, 'Temporal difference learning and TD-Gammon', CACM, 38(3), 58–68, (1995).
[18] K. Thompson, 'Retrograde analysis of certain endgames', Journal of the International Computer Chess Association, 9(3), 131–139, (1986).
[19] Paula Ulfhake, A Chinese Checkers-playing program, Master's thesis, Department of Information Technology, Lund University, 2000.
9. Planning and Scheduling
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-553
A Practical Temporal Constraint Management System for Real-Time Applications

Luke Hunsberger 1

Abstract. A temporal constraint management system (TCMS) is a temporal network together with algorithms for managing the constraints in that network over time. This paper presents a practical TCMS, called MYSYSTEM, that efficiently handles the propagation of the kinds of temporal constraints commonly found in real-time applications, while providing constant-time access to "all-pairs, shortest-path" information that is extremely useful in many applications. The temporal network in MYSYSTEM includes special time-points for dealing with the passage of time and eliminating the need for certain common forms of constraint propagation. The constraint propagation algorithm in MYSYSTEM maintains a restricted set of entries in the associated all-pairs, shortest-path matrix by incrementally propagating changes to the network arising from adding a new constraint or strengthening, weakening or deleting an existing constraint. The paper presents empirical evidence to support the claim that MYSYSTEM is scalable to real-time planning, scheduling and acting applications.

1 Vassar College, Poughkeepsie, NY, USA, hunsberg@cs.vassar.edu
1 Introduction

A Simple Temporal Network (STN) is a pair, (T , C), where T is a set of time-point variables (or time-points) and C is a set of temporal constraints, each having the form tj − ti ≤ δ, for some ti, tj ∈ T and some real number δ [3]. In this paper, we let n = |T | and m = |C|. A solution to an STN is a set of real-valued assignments to the variables in T that satisfy all of the constraints in C. An STN is called consistent if it has at least one solution.

Each STN, (T , C), has a corresponding graph, G = (T , E), where the nodes of the graph are the time-points in T , and the edges of the graph correspond one-to-one with the constraints in C. In particular, for each constraint, tj − ti ≤ δ, in C, there is an edge from ti to tj with weight δ in E. In this paper, we let k be the maximum number of edges incident to any node in the graph. An STN is consistent if and only if its corresponding graph has no negative cycles (i.e., loops with negative path-length) [3].

Most STNs include a special time-point—called the zero time-point (or Z)—whose value is fixed at 0. Temporal constraints involving Z are equivalent to unary constraints. For example, Z − ti ≤ δ1 is equivalent to the lower-bound constraint, −δ1 ≤ ti; and tj − Z ≤ δ2 is equivalent to the upper-bound constraint, tj ≤ δ2.

The distance matrix for an STN is an n-by-n matrix, D, such that D(ti, tj) equals the length of the shortest path from ti to tj in the corresponding graph, G. Thus, D is the all-pairs, shortest-path (APSP) matrix for G. If there is no path from ti to tj, then D(ti, tj) = ∞.

Changing an STN over Time. An STN typically acquires new time-points and constraints over time. Algorithms that incrementally
propagate changes to the STN in response to adding a new constraint or strengthening an existing constraint are called incremental algorithms. Algorithms that propagate changes to the STN in response to weakening or deleting a constraint already in the network are called decremental algorithms. Algorithms that are both incremental and decremental are called fully dynamic. Decremental algorithms have higher time complexity than their incremental counterparts [16, 12].

Executing Time-Points. In most applications, the starting and ending times of tasks are represented by time-points in a temporal network. When the task is begun—say, at time K—its starting time-point, ts, is fixed to the value K, by inserting the constraints, K ≤ ts ≤ K (i.e., Z − ts ≤ −K and ts − Z ≤ K). We say that ts has been executed at time K. Similarly, when the task is completed—say, at time L—its ending point, te, is fixed to the value L.

Cesta and Oddi's Algorithm. Cesta and Oddi [2] presented a fully dynamic algorithm for propagating changes to an STN. The algorithm does not maintain the entire distance matrix; instead, it maintains only enough entries to verify the consistency of the network. In particular, for each time-point t ∈ T , it only maintains entries of the form, D(Z, t) and D(t, Z). Thus, the space requirements are O(n). The incremental portion of the algorithm, which is a variation of the Bellman-Ford algorithm, has time complexity O(nm). The decremental portion of the algorithm first determines which entries might be affected by the change to the network and then runs the incremental portion on that part of the network. Since their algorithm does not maintain the full distance matrix, it can only discover negative cycles during the process of constraint propagation. Furthermore, answering distance matrix queries for entries other than those involving Z requires O(kn) time, instead of the constant look-up time that is afforded by having the full distance matrix.

Maintaining the Full Distance Matrix. Maintaining an up-to-date distance matrix requires O(n²) space and additional constraint propagation; however, it has the following important advantages. First, it provides constant-time lookup for distance-matrix entries, which facilitates the use of multi-agent coordination algorithms (e.g., temporal decoupling algorithms [9]). Second, before adding a new constraint (or strengthening an existing constraint), the consistency of the resulting network can be determined by constant-time lookup—in advance of any constraint propagation [11]. Researchers have developed fully dynamic algorithms for maintaining distance matrices [7, 5, 16, 4, 12]. Although these algorithms have attractive time complexities, they restrict the kinds of constraints that can populate a network and, thus, are inappropriate for many applications. Others have presented algorithms making fewer restrictions, but exhibiting poorer performance [13, 6].

The INCR 2004 Algorithm. The author recently presented a practical incremental algorithm for maintaining the full distance ma-
Figure 1. The PropFwd phase of the incremental algorithm
Figure 2. The PropBkwd phase of the incremental algorithm
trix [8]. For ease of exposition, we shall refer to that algorithm as the INCR 2004 algorithm. That algorithm reduces the size of the network by collapsing all rigid components down to a single time-point.2 The INCR 2004 algorithm also reduces constraint propagation by propagating only along undominated edges.3 The undominated edges are stored in hash tables. In particular, for each time-point t, Precs(t) is a hash table containing the undominated edges coming in to t; and Succs(t) contains the undominated edges going out from t. The high-level structure of the algorithm, which is based on work by several others [12, 13, 6], has two phases, called PropFwd and PropBkwd. The algorithm has time complexity O(kΔ), where Δ is the number of entries of D that actually need to be changed [12].

The PropFwd Phase. Suppose a new (or stronger) constraint, tj − ti ≤ δ, is added to the network. Fig. 1 illustrates the PropFwd phase, in which changes to distance matrix entries of the form, D(ti, t), are propagated by following the successors of tj. In the figure, decreasing the weight of the edge, ti tj, from 5 to 2 requires decreasing D(ti, tk) from 9 to 6, and decreasing D(ti, tm) from 17 to 14. Since D(ti, tp) does not need to be changed, forward propagation stops at that point.4 During the PropFwd phase, each time-point, t, for which D(ti, t) changed is collected in a hash-table, AffectedTPs.

The PropBkwd Phase. Fig. 2 illustrates the PropBkwd phase of the INCR 2004 algorithm. For each tm in AffectedTPs collected during the PropFwd phase, the predecessors of ti are followed, potentially leading to changes in entries of the form, D(t, tm). For example, in the figure, the entry D(ti, tm) had been reduced from 17 to 14 during the first phase. Its new value requires reducing D(th, tm) from 18 to 15. However, since D(tg, tm) does not need to be changed, backward propagation stops at that point.

Augmented STNs. An Augmented STN (ASTN) is an STN that has been augmented to include a special time-point, N, which represents the current time (i.e., "now") [11]. Representing the now time-point enables the network to explicitly handle the passage of time

2 A rigid component is a set of time-points in which the temporal distance between each pair of time-points is constrained to be some fixed value. Other researchers have described collapsing rigid components [17, 7].
3 A constraint is called undominated if removing it from the network would necessarily require updating the distance matrix. In contrast, removing a dominated constraint from the network would leave the distance matrix unchanged. The algorithm takes advantage of the fact that dominated constraints are easy to detect in networks with no rigid components [10].
4 For expositional simplicity, Fig. 1 shows only one branch of the sub-tree rooted at tj. The PropFwd phase normally explores multiple branches of that sub-tree. Similar remarks apply to the PropBkwd phase.
Figure 3. The now time-point in an ASTN

Figure 4. The execution of the time-point t at time 2
and the execution of time-points. The passage of time is handled by including a single edge from N to Z, with weight −d, representing the lower-bound constraint, d ≤ N. This edge, as illustrated in Fig. 3, is the only outgoing edge from the now time-point. As time passes, the value of d increases (i.e., the constraint involving Z and N grows stronger). Since the time-complexity of strengthening a constraint is lower than that of weakening or deleting constraints, this way of dealing with the passage of time is computationally attractive. In an ASTN, each unexecuted time-point, t, is constrained to occur at or after now—represented by an edge from t to N with weight 0. Fig. 3 illustrates these kinds of edges, which are the only incoming edges to the now time-point. When t is executed, the edge from t to N is deleted, and two edges between t and Z are inserted to fix t’s value. Fig. 4 provides “before”, “during” and “after” snapshots of a network in which t is executed at time 2. In the “before” snapshot, the current time is 1, and t is constrained to occur at or after that time. In the middle snapshot, t has been executed at time 2 (i.e., the edge from t to N has been deleted, and a pair of edges between t and Z have been inserted, fixing the value of t to 2). In the bottom snapshot, the current time has advanced to 3, but that has no effect on t. For an ASTN, the distance matrix entry, D(Z, N), can be interpreted as a kind of deadline [11]. In particular, if some time-point is not executed at or before this deadline, then the network is certain to become inconsistent—because the passage of time (i.e., the increased value of d on the edge from N to Z) will eventually generate a negative cycle. The potential inconsistency can be averted by executing one or more time-points, thereby deleting constraints involving N and increasing the value of D(Z, N).
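To summarize the edge bookkeeping of Figs. 3 and 4, here is a minimal Python sketch of an ASTN; it illustrates the constraint encoding only and is not the MYSYSTEM implementation.

INF = float('inf')

class ASTN:
    # self.edges[(u, v)] = w encodes the constraint v - u <= w.
    # 'Z' is the zero time-point; 'N' is the now time-point.
    def __init__(self):
        self.edges = {('N', 'Z'): 0}   # lower bound 0 <= N

    def add(self, u, v, w):
        # Keep only the tightest (smallest) bound for each edge.
        if w < self.edges.get((u, v), INF):
            self.edges[(u, v)] = w

    def new_time_point(self, t):
        self.add(t, 'N', 0)            # t must occur at or after now

    def advance_now(self, d):
        self.add('N', 'Z', -d)         # d <= N; strengthens as time passes

    def execute(self, t, K):
        # As in Fig. 4: delete the edge from t to N, then fix t to K.
        del self.edges[(t, 'N')]
        self.add('Z', t, K)            # t - Z <= K
        self.add(t, 'Z', -K)           # Z - t <= -K, i.e., K <= t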
2 Desiderata

The main goal for the work described in this paper is to provide a temporal constraint management system that can serve as the basis for a temporal reasoning module in real-time planning, scheduling and acting applications, including multi-agent systems involving the coordination of temporally dependent, inter-agent activities. This high-level goal consists of the following subsidiary goals:

• To maintain constant-time access to all distance-matrix entries
• To reduce space requirements for the distance matrix (or any other auxiliary data structures)
• To reduce the need for constraint propagation
• To include a fully dynamic constraint propagation algorithm that is scalable to real-time applications

Constant-time access to distance-matrix entries facilitates multi-agent coordination algorithms (e.g., temporal decoupling [9]). Reducing space requirements for the distance matrix implies not explicitly representing every distance-matrix entry, while maintaining constant-time access. Reducing the need for constraint propagation makes the fully dynamic algorithm computationally palatable. "Scalable" means that the resulting TCMS is practical for applications involving hundreds, or even thousands, of time-points.
3 Approach

This paper presents a TCMS called MYSYSTEM that meets the desiderata listed above. In MYSYSTEM:

• The now time-point, N, is explicitly represented (as in ASTNs).
• The zero time-point, Z, is replaced by a pair of time-points, Zin and Zout, thereby eliminating propagation through Z, and reducing the number of distance-matrix entries needing to be computed.
• Since the portion of the distance matrix that is actually computed is typically quite small, the values are stored in a hash table, instead of a two-dimensional array.
• The incremental algorithm is essentially the same as the INCR 2004 algorithm, except that rigid components and dominated constraints are handled differently.
• A new decremental algorithm is provided that manipulates the same data structures as the incremental algorithm. The algorithm, which draws on ideas from other researchers [4, 13], is not the fastest possible, but requires only minor auxiliary data structures.
• Executed time-points are effectively removed from the network.

Replacing the Zero Time-Point by a Pair of Time-Points. In real-world applications, the starting and ending times of tasks are typically subject to a variety of unary constraints—that is, constraints involving the zero time-point, Z. As a result, while the maximum number of edges incident on any other time-point might be, say, ten, the number of edges incident on Z can be O(n). Thus, a great deal of the constraint propagation needed to fully populate the distance matrix is due to constraints involving Z.

Figure 5. Replacing the zero time-point by a pair of time-points

To eliminate constraint propagation through Z, the temporal network in MYSYSTEM replaces Z by a pair of time-points, Zin and Zout.5 In particular, as illustrated in Fig. 5, Zin is the destination for all edges that would normally point to Z, and Zout is the source of all edges that would normally emanate from Z. Now, adding an edge from Zin to Zout with weight 0 (shown as a dashed arrow in the figure) would make the two networks in Fig. 5 equivalent; however, such an edge is purposely left out of the network in MYSYSTEM.

This seemingly minor change eliminates propagation through Z; thus, it dramatically reduces the amount of computation required to maintain the distance matrix. At the same time,
5 This treatment of the zero time-point is somewhat similar to Cesta and Oddi's treatment of the zero time-point as both a source and a sink [2].
MYSYSTEM retains the property of having constant-time access to all distance-matrix entries. To see this, suppose A is a standard ASTN and A′ is the same as A, except that the zero time-point has been replaced by Zin and Zout, as described above. Because the edge from Zin to Zout is left out of A′, the distance matrices, D and D′, are typically quite different. However, the relationship between their corresponding entries is simple. In particular, for any ti, tj ∈ T \{Z}:6
• D(ti, Z) = D′(ti, Zin)
• D(Z, tj) = D′(Zout, tj)
• D(ti, tj) = min{D′(ti, Zin) + D′(Zout, tj), D′(ti, tj)}

The last equality can be glossed as: "The shortest path from ti to tj either involves the zero time-point or it doesn't." In this way, although D′ typically contains far fewer finite entries than D, it can be used to fetch the value of any entry D(ti, tj) in constant time.

The Distance-Matrix Hash Table. Due to the use of Zin and Zout, the constraint propagation algorithms in MYSYSTEM typically need to compute only a small fraction of the O(n²) entries in the distance matrix, D′. Thus, to save space, a hash table is used to store only those entries that are actually computed. Any entry, D′(ti, tj), that has not been stored in the hash table is taken to be infinity, representing that there is no path from ti to tj. Hash-table keys are integers of the form N·i + j, where N is an upper bound on the number of time-points in the network.7 For example, if N = 2^14 = 16384, then 28-bit values can be used for hash-table keys—which can be quickly computed using left-shift and addition operations.

A Note about Rigid Components and Undominated Edges. In a purely incremental context, constraints are never weakened or deleted. Thus, rigid components, once created, can never become non-rigid, and it is safe to collapse each rigid component down to a single point as soon as it is created. In so doing, the network remains free from rigidities, which simplifies the detection of dominated constraints. In contrast, a fully dynamic algorithm must handle the weakening or deleting of constraints and, thus, cannot afford to collapse all rigid components—because undoing such transformations can be too computationally costly. Hence, the fully dynamic algorithm in MYSYSTEM does not typically collapse rigid components, and the network in MYSYSTEM may contain rigidities, thereby complicating the detection of dominated edges. For this reason, the detection of dominated edges in MYSYSTEM is restricted to cases where a strictly shorter alternative pathway is found.8 In addition, the decremental algorithm can sometimes insert dominated edges into the Precs and Succs hash tables—because avoiding doing so would be too computationally costly. However, when the incremental algorithm detects these dominated edges, they are immediately removed from the Precs and Succs hash tables. Thus, in this sense, the fully dynamic algorithm in MYSYSTEM can be said to propagate along "mostly" undominated edges.

The Decremental Algorithm in MYSYSTEM. The decremental algorithm is used when an existing constraint, tj − ti ≤ δ, is either weakened or deleted. The algorithm has the following three phases:

(1) In a hash-table called Changelings, collect all pairs, (tx, ty), such that D′(tx, ty) might need updating.
(2) For each (tx, ty) in Changelings, check for shorter alternative pathways from tx to ty; collect the shortest alternatives in a hash-table called AltPaths.
6 T \{Z} denotes the set of time-points in A other than Z.
7 Demetrescu and Italiano [4] encode pairs in this way.
8 In contrast, the INCR 2004 algorithm also detects edges that are dominated by a path whose length is the same as that of the edge being dominated.
(3) Incrementally propagate the constraints in AltPaths.

Phase 1. Consider the path from tx to ty shown below, where the wavy arrows represent shortest paths and δ is the original weight of the edge being weakened/deleted.
tx ~~> ti --(δ)--> tj ~~> ty
The pair, (tx, ty), is collected during Phase 1 if and only if:

D′(tx, ty) = D′(tx, ti) + δ + D′(tj, ty)

All such pairs are collected using a two-pass algorithm that has the same structure as the PropFwd and PropBkwd phases of the incremental algorithm. Thus, Phase 1 takes time O(kΔ), where Δ is the number of pairs in Changelings.

After the Changelings hash-table has been populated, the corresponding distance-matrix entries are assigned new values, as follows. If the edge, ti tj, has been deleted, then each D′(tx, ty) is set to ∞, because the deletion of ti tj might mean there no longer is any path from tx to ty. On the other hand, if ti tj was simply weakened—say by an amount α—then each D′(tx, ty) is set to the value D′(tx, ty) + α + 1. Using this value, which is necessarily greater than the eventual updated value, forces D′(tx, ty) to be updated during Phase 2 or 3.

Since MYSYSTEM does not maintain any pointers to first or last steps of shortest paths (e.g., as done by Rohnert [13]), the Changelings hash table may end up containing some pairs whose distance-matrix entries do not need to be updated. Instead of maintaining complex auxiliary data structures to avoid this, the decremental algorithm discovers alternative paths during Phases 2 and 3 to ensure that the corresponding distance-matrix entries are restored.

Phase 2. For each (tx, ty) in Changelings, alternative pathways of the forms given below are collected in a hash-table called AltPaths:9
• a single edge, tx → ty;
• an edge, tx → tk, followed by a shortest path from tk to ty; or
• a shortest path from tx to tv, followed by an edge, tv → ty.
For some (tx, ty) in Changelings, it may be that no alternative paths exist. For other pairs, more than one such path may exist; however, only the shortest such paths are kept in AltPaths. The hash-key for the AltPaths hash table is the pair, (tx, ty); the value is the length of the alternative path. (Interior time-points on the path are not needed.) Notice that the alternative pathways collected during Phase 2 may well have been dominated prior to the weakening (or deleting) of the edge ti tj, as illustrated below in the case of an alternative edge.
tx --(16)--> ty;   tx --(3)--> ti --(5)--> tj --(4)--> ty
Prior to weakening ti tj from 5 to 10, the edge tx ty was not a shortest path; afterward, however, it becomes a shorter (and possibly shortest) path. For this reason, the edges considered during Phase 2 are drawn from the set C—which contains all of the edges in the network—not just those in the Precs and Succs hash tables.
9 Demetrescu and Italiano [4] refer to such pathways as locally shortest.
Phase 3. During Phase 3, the alternative paths found in Phase 2 are incrementally propagated. There are several options for doing this. Each alternative path could, in turn, be completely propagated using the incremental algorithm. However, this sort of depth-first approach might result in a large amount of redundant propagation. Another option, analogous to A* search, would be to sort the alternative paths according to how close their path-lengths were to the original value of D′(tx, ty) and apply the incremental algorithm to those alternative paths in their sorted order. The decremental algorithm in MYSYSTEM takes an iterative, breadth-first approach. In the first iteration, each path in AltPaths is propagated only one step along the predecessors of tx and the successors of ty. Each one-step propagation generates a new update, which is stored in a hash-table called newAltPaths. During the second iteration, each update in newAltPaths is propagated only one step, generating new updates for the third iteration. This iterative process terminates when no more updates are generated. Empirical evidence suggests that this form of incremental propagation is quite practical.

Removing Executed Time-Points. As discussed earlier, the fully dynamic algorithm does not typically collapse rigid components, because undoing such transformations in response to constraint relaxations can be too computationally costly. However, when a time-point, t, is executed, it forms a rigid component with Zin and Zout that is guaranteed to persist. Thus, it is safe to collapse this kind of rigid component. Doing so effectively removes t from the network by reorienting constraints involving t toward Zin and Zout.
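Pulling together the Zin/Zout lookup equations and the N·i + j key encoding from this section, the sketch below shows how a constant-time D(ti, tj) query can be answered from the sparse D′ hash table. The class and method names are illustrative assumptions, not the MYSYSTEM code.

INF = float('inf')

class DistanceMatrix:
    # Sparse D' table.  Time-points are integers below 2**14; an entry
    # D'(i, j) is stored under the key (i << 14) + j, computed with a
    # left-shift and an addition.  Missing entries are infinite (no path).
    def __init__(self, z_in, z_out):
        self.table = {}
        self.z_in, self.z_out = z_in, z_out

    def get_dprime(self, i, j):
        return self.table.get((i << 14) + j, INF)

    def set_dprime(self, i, j, w):
        self.table[(i << 14) + j] = w

    def get_d(self, i, j):
        # Constant-time D(i, j): the shortest path from i to j either
        # involves the zero time-point or it doesn't.
        via_zero = self.get_dprime(i, self.z_in) + self.get_dprime(self.z_out, j)
        return min(via_zero, self.get_dprime(i, j))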
4 Empirical Evaluation

The MYSYSTEM TCMS was tested on a set of thirty 25-agent scheduling problems drawn from the Phase 2 Evaluation for the DARPA Coordinators Project [15]. These kinds of problems are represented in the cTAEMS language, the details of which are described elsewhere [1]. The important characteristics of the test problems are shown in the top plot in Fig. 6. Each problem involved between 1507 and 3273 time-points (plotted on the horizontal axis) and between 803 and 1686 activities (ACTS).10 For each problem, a centralized scheduler [14] was used to generate a set of agent schedules seeking to optimize the cTAEMS quality metric. In the process, the scheduler invoked the incremental algorithm of MYSYSTEM between 3461 and 7353 times (INCRS), and the decremental algorithm between 254 and 1185 times (DECRS). The resulting schedules included a total of between 139 and 243 activities (SCHEDS), and resulted in networks with between 3326 and 6797 edges (EDGES).

The middle plot of Fig. 6 shows the CPU time used by MYSYSTEM to do all of the temporal computations for each scheduling problem. The CPU time ranged from 2 seconds to 2 minutes per problem. In the worst case, the 2 minutes of computation, spread over 8000 invocations of the incremental or decremental algorithms, averaged to about 15 msec per invocation.

The bottom plot of Fig. 6 shows the memory usage of MYSYSTEM. The number of finite distance-matrix cells (i.e., those that were actually stored in a hash table) ranged from about 77,000 to about 850,000 per problem. In contrast, the full distance matrix would have required between 2.2 and 10.7 million cells. Given that typical entries are four bytes, such a matrix could have required over 40 megabytes of memory. In contrast, the total memory used by MYSYSTEM during the course of each scheduling problem, most of which was dynamically allocated and freed, ranged from about 8 to 92 megabytes.
10 Some activities share time-points; hence the number of time-points is somewhat less than double the number of activities.
Figure 6. Results of experiments on 25-agent scheduling problems. (Top: problem characteristics (INCRS, EDGES, ACTS, DECRS, SCHEDS) plotted against the number of time-points. Middle: CPU seconds per problem, on a log scale. Bottom: total memory used (bytes), the potential size of the distance matrix, and the number of finite distance-matrix entries, on a log scale.)

All experiments were run on an IBM Thinkpad laptop with a 2.4GHz Intel processor using Allegro Common Lisp, version 8.1.
5 Conclusion

This paper presented a new temporal constraint management system, called MYSYSTEM, that combines novel STN representations with a fully dynamic propagation algorithm that is practical for real-world, real-time applications. The temporal network in MYSYSTEM includes special time-points to eliminate a common form of constraint propagation and reduce the number of distance-matrix entries that typically need to be computed. The fully dynamic algorithm extends an earlier incremental algorithm. It limits propagation to "mostly" undominated edges. The paper provided empirical results on temporal networks derived from a centralized scheduler applied to a variety of 25-agent scheduling problems involving thousands of time-points.

Acknowledgments

The research presented in this paper was supported in part by subcontract 55-000723 between Vassar College and SRI International as part of the DARPA Coordinators Project (Contract FA8750-05-C-0033). Any opinions, findings and conclusions or recommendations expressed in this paper are those of the author and do not necessarily reflect the views of DARPA. The author thanks Stephen Smith, Zachary Rubinstein, Terry Zimmerman, Laura Barbulescu and Anthony Gallagher from Carnegie Mellon University for providing access to their scheduler.

REFERENCES
[1] M. Boddy, B. Horling, J. Phelps, R. Goldman, R. Vincent, C. Long, and B. Kohout, 'cTAEMS language specification, version 1.06'.
[2] Amedeo Cesta and Angelo Oddi, 'Gaining efficiency and flexibility in the simple temporal problem', in Proceedings of the Third International Workshop on Temporal Representation and Reasoning (TIME-96), pp. 45–50. IEEE, (1996).
[3] Rina Dechter, Itay Meiri, and Judea Pearl, 'Temporal constraint networks', Artificial Intelligence, 49, 61–95, (1991).
[4] C. Demetrescu and G. Italiano, 'A new approach to dynamic all pairs shortest paths', in Proceedings of the 35th STOC, pp. 159–166, (2003).
[5] Camil Demetrescu and Giuseppe F. Italiano, 'Improved bounds and new trade-offs for dynamic all pairs shortest paths', Technical Report ALCOMFT-TR-02-1, ALCOM, (2002).
[6] Shimon Even and Hillel Gazit, 'Updating distances in dynamic graphs', Methods of Operations Research, 49, 371–387, (1985).
[7] Alfonso Gerevini, Anna Perini, and Francesco Ricci, 'Incremental algorithms for managing temporal constraints', Technical Report IRST-9605-07, IRST.
[8] Luke Hunsberger, 'Quantitative temporal reasoning in planning problems'. AAAI-2004 Tutorial MP-2, slides available at: http://www.cs.vassar.edu/~hunsberg.
[9] Luke Hunsberger, 'Algorithms for a temporal decoupling problem in multi-agent planning', in Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-2002), (2002).
[10] Luke Hunsberger, Group Decision Making and Temporal Reasoning, Ph.D. dissertation, Harvard University, 2002. Available as Harvard Technical Report TR-05-02.
[11] Luke Hunsberger, 'Distributing the control of a temporal network among multiple agents', in Proc. of the 2nd Int'l. Joint Conference on Autonomous Agents and MultiAgent Systems (AAMAS-03), (2003).
[12] G. Ramalingam and Thomas Reps, 'On the computational complexity of dynamic graph problems', Theoretical Computer Science, 158, 233–277, (1996).
[13] Hans Rohnert, 'A dynamization of the all pairs least cost path problem', in 2nd Symposium of Theoretical Aspects of Computer Science (STACS 85), ed., Kurt Mehlhorn, volume 182 of Lecture Notes in Computer Science, 279–286, Springer, (1985).
[14] S. Smith, A.T. Gallagher, T.L. Zimmerman, L. Barbulescu, and Z. Rubinstein, 'Distributed management of flexible times schedules', in Intl. Conf. on Autonomous Agents and Multiagent Systems, (2007).
[15] Thomas Wagner, Valerie Guralnik, John Phelps, and Ryan VanRiper, 'COORDINATORS: Coordination managers for first responders', in Proc. of the 3rd Intl. Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-2004). IEEE Computer Society, (2004).
[16] Mikkel Thorup, 'Worst-case update times for fully-dynamic all-pairs shortest paths', in Annual ACM Symposium on Theory of Computing, pp. 112–119, (2005).
[17] Ioannis Tsamardinos, Reformulating Temporal Plans for Efficient Execution, Master's thesis, University of Pittsburgh, 2000.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-558
Towards Efficient Belief Update for Planning-Based Web Service Composition

Jörg Hoffmann 1

Abstract. At the "functional level", Semantic Web Services (SWS) are described akin to planning operators, with preconditions and effects relative to an ontology; the ontology provides the formal vocabulary and an axiomatisation of the underlying domain. Composing such SWS is similar to planning. A key obstacle in doing so effectively is handling the ontology axioms, which act as state constraints. Computing the outcome of an action involves the frame and ramification problems, and corresponds to belief update. The complexity of such updates motivates the search for tractable classes. Herein we investigate a class that is of practical relevance because it deals with many commonly used ontology axioms, in particular with attribute cardinality upper bounds, which are not handled by other known tractable classes. We present an update computation that is exponential only in a comparatively uncritical parameter; we present an approximate update which is polynomial in that parameter as well.

1 SAP Research, CEC Karlsruhe, Germany, joe.hoffmann@sap.com
1 Introduction
Semantic Web Services (SWS) are pieces of software advertised with a formal description of what they do; Web Service Composition (WSC) means to link them together in a way satisfying a complex user requirement. WSC is widely recognized for its economic potential. In the wide-spread OWL-S [3] and WSMO [5] frameworks, at the so-called “functional level” (which abstracts from interaction details and specifies only overall functionality), SWS are described akin to planning operators, with preconditions and effects relative to an ontology. Hence planning – planning under uncertainty, since information in the web context cannot be expected to be complete – is a prime candidate for realizing this form of WSC. In our work, we pursue a kind of conformant planning [17]. The tool we develop performs a forward search as per Figure 1. Each s represents (partial) knowledge about the corresponding belief b, where as usual b is the set of all situations possible at the given point in time. Maintaining the states s is challenging because it involves a belief update problem. Namely, the main difference to most work in conformant planning is that we consider state constraints, e.g. [8, 2, 16]: the domain axiomatization given in the ontology. Such axioms are state constraints in the sense that any state that can be encountered, in the given domain, is known to satisfy them. In the presence of such axioms, computing the outcome of an action involves the frame and ramification problems: How do the axioms affect the previous world, and what are their side effects? Following various authors, e.g. [10, 15], we define action outcomes as belief updates, where the “update” is the action effect conjoined with the axioms. Belief update has been shown to be hard even in tractable logics (e.g. Horn [4]). Since update is a frequently solved sub-problem in 1
s0 := initialise(); open-list := {s0}
while TRUE do
    s := choose(open-list)
    if is-solution(s) then return path leading to s
    for all calls a of SWS applicable in s do
        s′ := update(s, a); insert(open-list, s′)

Figure 1. The main loop of our planner.
planning as per Figure 1, the need for tractable classes is tantalising. In this context, it is of particular interest that practical WSC problems, e.g. the widely used Virtual Travel Agency (VTA) scenario, often come with fairly simple domain axiomatizations. Some of the most typically used axioms are: subsumption relations, which herein we write as clauses of the form ∀x : train(x) ⇒ vehicle(x); attribute range type restrictions, ∀x, y : ticketfor(x, y) ⇒ person(y); mutual exclusion, ∀x : ¬train(x) ∨ ¬car(x); and bounds on the number of distinct attribute values, such as the axiom

∀x, y1, y2, y3 : (ticketfor(x, y1) ∧ ticketfor(x, y2) ∧ ticketfor(x, y3)) ⇒ (y1 = y2 ∨ y1 = y3 ∨ y2 = y3)

which is a cardinality upper bound saying that at most two persons may travel on the same ticket.

The above raises the question of which classes of axioms allow a polynomial-time belief update. To our knowledge, the only existing work exploring this question is DL-Lite [6, 7], a fragment of DL for which belief update can be done efficiently, and the new belief can be represented in terms of a single ABox. The latter is necessary since the updated belief will be visible to the user. DL-Lite does not allow cardinality upper bounds. In this paper, we identify a tractable fragment which includes such bounds. A key difference to DL-Lite is that we don't require beliefs to be understandable to a user: the representation is internal to the planner, and so we are completely free in how to define the search states s. We show that this enables us to deal with cardinality upper bounds, in time exponential only in the maximum bound k imposed by any such bound. The belief update algorithm we present also deals with binary clauses, i.e., clauses of at most two literals, such as subsumption relations, attribute range type restrictions, and mutual exclusion. One would usually expect k to be 1 or 2 (rather than, say, 17). However, in large tasks the complexity of the update can become critical even for small k. We hence also pursue the idea of sacrificing either soundness or completeness for tractability. We present an approximate update algorithm that is polynomial also in k.

A few words are in order regarding our planning formalism. In contrast to DL-Lite, and in line with the usual planning formalisms, we make a closed-world assumption where a finite set of constants is fixed. The motivation for this is simply that it is closer to existing planning tools, and hence is expected to make it easier to eventually build on that work. The other main design decision regards the semantics of belief update. We adopt the possible models approach
(PMA) [18], which addresses the frame and ramification problems via a set-based notion of minimal change. Alternative semantics should be considered in the future: from an application perspective, at the time of writing there is not sufficient material on concrete use cases to tell whether one or the other semantics is more practical. The PMA has been used in many recent works related to formal semantics for WSC, e.g. [15, 1, 6], and is hence somewhat canonical.2 Section 2 introduces our planning formalism. Section 3 establishes some core observations. Sections 4 and 5 present our exact and approximate update algorithms, respectively. Section 6 discusses closely related work and Section 7 concludes. For lack of space, we omit all proofs and many other details, such as notions of, and algorithms for, output constants and a construct for more flexible updates of attribute values. The full paper is available as a TR [12].
2 WSC Formalism
Our formalism follows standard notions from conformant planning, extended by modelling constructs for axioms. Our terminology is as used in the WSC area; it should be obvious how this corresponds to planning terminology. We denote predicates with G, H, I, variables with x, y, and constants with c, d, e. We treat equality as a "built-in" predicate. Literals are possibly negated predicates whose arguments are variables or constants; if all arguments are constants, the literal is ground. Given a set X of variables, we denote by LX the set of all literals which use only variables from X. If l is a literal, we write l[X] to indicate that l uses variables X. If X = {x1, . . . , xk} and C = {c1, . . . , ck}, then by l[c1, . . . , ck/x1, . . . , xk] we denote the substitution, abbreviated l[C]. In the same way, we use substitution for any construct involving variables. By ¬l, we denote the inverse of l. If L is a set of literals, then ¬L := {¬l | l ∈ L} and ⋀L := ⋀_{l∈L} l. An ontology Ω is a pair (P, Φ) where P is a set of predicates and Φ is a conjunction of closed first-order formulas. We call Φ a theory. A clause is a disjunction of literals with universal quantification on the outside, e.g. ∀x.¬G(x) ∨ H(x) ∨ I(x). A clause is binary if it contains at most two literals. Φ is binary if it is a conjunction of binary clauses. The only non-binary clauses we will consider are cardinality upper bounds, taking the form ∀x, y1, . . . , yk+1.(G(x, y1) ∧ . . . ∧ G(x, yk+1)) ⇒ (y1 = y2 ∨ y1 = y3 ∨ · · · ∨ yk = yk+1); to simplify notation, we will refer to such a clause as image(G) ≤ k. A theory is binary with cardinality upper bounds if it consists entirely of binary clauses and cardinality upper bounds. We will consider the special case where every predicate G with a bound image(G) ≤ k does not appear positively in any binary clause; we refer to such Φ as binary with consequence-independent cardinality upper bounds. Note that this includes subsumption relations, attribute range type restrictions, mutual exclusion, and cardinality upper bounds. A web service w is a tuple (Xw, prew, effw), where Xw is a set of variables (the inputs), prew is a conjunction of literals from LXw (the precondition), and effw is a conjunction of literals from LXw (the effect).3 Before a web service can be applied, its inputs must be instantiated with constants, yielding a service; to avoid confusion with the search states s, we refer to services as actions a (which is
2 Notably, one of the main arguments made against the PMA, e.g. by [2, 16, 11], is that it lacks a notion of causality. However, ontology languages such as OWL do not model causality; all we are given is a set of axioms. Hence this criticism does not apply for WSC (unless one proposes an entirely new framework for modelling web services, which is not our focus here).
3 Note that this definition of preconditions and effects (conjunctions of literals) is quite restrictive. This is intended, since we are looking for tractable classes here. It remains to be verified in future work whether and to what extent this restriction can be relaxed without losing our tractability results.
in accordance with the usual planning terminology). Formally, for a web service (X, pre, eff) and a tuple of constants Ca, an action a is given by (prea, effa) = (pre, eff)[Ca/X]. By convention, given an arbitrary action a, we will use Ca to denote a's input instantiation. WSC tasks are tuples (Ω, W, C, U). Ω is an ontology, W is a set of web services, and C is a set of constants. U is the user requirement, a pair (preU, effU) of precondition and effect. For complexity considerations, we will restrict WSC tasks to have fixed arity, meaning a constant upper bound on predicate arity, the number of parameters of any web service, and the depth of quantifier nesting in Φ. Further, we will sometimes assume fixed maximum cardinality, meaning a constant upper bound on k in any axiom image(G) ≤ k. The semantics of our formalism relies on a notion of beliefs, where each belief is a set of models. Each model is an interpretation of all propositions formed from P and C. The initial belief b0 is undefined if Φ ∧ preU is not satisfiable; else, b0 := {m | m |= Φ ∧ preU}. A solved belief is a belief b s.t., for all m ∈ b, m |= effU. It remains to define how actions affect models and beliefs. Say m is a model and a is an action; as stated, we define the outcome Res(m, a) following [18]. We say that a is applicable in m if m |= prea. If a is not applicable in m, then Res(m, a) is undefined. Otherwise, Res(m, a) := {m' | m' ∈ min(m, Φ ∧ effa)}. Here, min(m, φ) is the set of all m' that satisfy φ and that are minimal with respect to the partial order defined by m1 ≤ m2 iff, for all propositions p, if m2(p) = m(p) then m1(p) = m(p). That is, m' differs from m on a set-inclusion minimal set of values. Say b is a belief. Res(b, a) is undefined if there exists m ∈ b so that Res(m, a) is undefined, or so that Res(m, a) = ∅. Else, Res(b, a) := ⋃_{m∈b} Res(m, a). The Res function is extended to sequences a1, . . . , an in the obvious way. A solution is a sequence a1, . . . , an s.t. Res(b0, a1, . . . , an) is a solved belief.

Example 1 Given predicate ticketfor with image(ticketfor) ≤ 2, and constants t, Peter, Bob, Mary. Initially, ticketfor(t, Peter) ∧ ticketfor(t, Bob). Say we apply a1 with effect ticketfor(t, Mary). We get two resulting states, one with ticketfor(t, Peter) ∧ ticketfor(t, Mary) and one with ticketfor(t, Bob) ∧ ticketfor(t, Mary) (but none with only ticketfor(t, Mary), since that would not be a minimal change). Say we now apply a2 with effect ticketfor(t, Peter). We get two states, with ticketfor(t, Peter) ∧ ticketfor(t, Mary) and ticketfor(t, Peter) ∧ ticketfor(t, Bob), respectively.
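For intuition, here is a small Python sketch (our illustration, not part of the paper) that enumerates Res(m, a) by brute force over a tiny propositional encoding and reproduces the first step of Example 1; the proposition names tf_* are made up:

from itertools import product

def models(props, phi):
    # All truth assignments over props satisfying the predicate phi.
    for bits in product([False, True], repeat=len(props)):
        m = dict(zip(props, bits))
        if phi(m):
            yield m

def diff(m1, m):
    # Propositions on which m1 and m differ.
    return {p for p in m if m1[p] != m[p]}

def res(m, phi_and_eff, props):
    # PMA outcome Res(m, a): models of Phi AND eff_a minimally differing from m.
    cands = list(models(props, phi_and_eff))
    return [m2 for m2 in cands
            if not any(diff(m1, m) < diff(m2, m) for m1 in cands)]

# Example 1, first step: image(ticketfor) <= 2, initially {Peter, Bob},
# action a1 with effect ticketfor(t, Mary).
props = ["tf_Peter", "tf_Bob", "tf_Mary"]
m0 = {"tf_Peter": True, "tf_Bob": True, "tf_Mary": False}
phi_eff = lambda m: sum(m.values()) <= 2 and m["tf_Mary"]
for m2 in res(m0, phi_eff, props):
    print(sorted(p for p in m2 if m2[p]))   # the two minimal-change outcomes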
3 Basic Observations
We make a number of basic observations: lemmas used in our update computations, and negative results supporting our design decisions. We first make some general observations about belief intersections, then we consider binary clauses and cardinality upper bounds. Before thinking about how to update beliefs, one needs to think about how to represent beliefs, and, even, which aspects of beliefs to represent. Every belief may contain an exponential number of different models, and hence symbolic representations should be utilized, and/or only partial knowledge should be maintained. Herein, we focus on the latter. Inspired by recent techniques from conformant planning [13] (with no state constraints), we aim at maintaining only belief intersections: the set of literals that are true in all models of a belief b, ⋂_{m∈b} {l | m |= l} =: ⋂b. Based on ⋂b, we can determine whether an action a is applicable to b, namely iff prea ⊆ ⋂b, and whether b is solved, namely iff effU ⊆ ⋂b. So, ideally, we wish to define the search states s from Figure 1 as sets Ls of literals: if b is a belief and s the corresponding search state, then we want to have Ls = ⋂b. The question is, how do we maintain those s?
First, one piece of bad news is that computing ⋂Res(m, a) is very hard in general, and is hard even if Φ is Horn. This follows directly from earlier results in the area of belief update [4]:

Proposition 1 Assume a WSC task (Ω, W, C, U) with fixed arity. Assume a model m, an action a, and a literal l such that m |= l. It is Πp2-complete to decide whether l ∈ ⋂Res(m, a). If Φ is Horn, then the same decision is coNP-complete.

This shows in particular that it is not necessarily enough to restrict ourselves to a tractable logic for Φ – at least in the case of Horn logic, that does not make the update problem tractable. The question arises whether the same is the case for binary clauses. As one might suspect, the answer is "no". The following two technical observations can be used to prove this fact; they are also used further below to prove the correctness of our update computations. First, literals l ∈ ⋂Res(b, a) do not appear "out of thin air":

Lemma 1 Assume a WSC task (Ω, W, C, U). Assume a belief b and an action a. Then ⋂Res(b, a) ⊆ {l | Φ ∧ effa |= l} ∪ ⋂b.

This is due to the PMA which, if l ∉ ⋂b and Φ ∧ effa ⊭ l, generates m' ∈ Res(b, a) so that m' ⊭ l. Lemma 1 means that, in general, ⋂Res(b, a) can be computed in two steps: (A) determine {l | Φ ∧ effa |= l}; (B) determine which l ∈ ⋂b do not disappear, i.e., l ∈ ⋂Res(b, a). Obviously, (A) is just deduction in Φ. The more tricky part is (B). The following observation characterizes exactly when l ∈ ⋂b disappears:

Lemma 2 Assume a WSC task (Ω, W, C, U). Assume a belief b, an action a, and a literal l ∈ ⋂b. Then, l ∉ ⋂Res(b, a) iff there exists a set L0 of literals satisfied by a model m ∈ b, such that Φ ∧ effa ∧ ⋀L0 is satisfiable and Φ ∧ effa ∧ ⋀L0 ∧ l is unsatisfiable.

Intuitively, L0 is the "reason" why l disappears: it is consistent with the effect and hence true in a model of Res(b, a); but it excludes l. We can conclude that, for binary clauses, a literal disappears only if its opposite is necessarily true:

Lemma 3 Assume a WSC task (Ω, W, C, U) where Φ is binary. Assume a belief b, an action a, and a literal l ∈ ⋂b. If l ∉ ⋂Res(b, a), then Φ ∧ effa ∧ l is unsatisfiable.

Namely, by Lemma 2 there exists L0 so that Φ ∧ effa ∧ ⋀L0 is satisfiable, but Φ ∧ effa ∧ ⋀L0 ∧ l is unsatisfiable; with binary Φ, this implies that Φ ∧ effa ∧ l is unsatisfiable. By Lemmas 1 and 3, and since reasoning in grounded binary Φ is polynomial, we get:

Corollary 1 Assume a WSC task (Ω, W, C, U) with fixed arity, where Φ is binary. Assume a belief b, and an action a; let L := {l | Φ ∧ effa |= l}. Then ⋂Res(b, a) = L ∪ (⋂b \ ¬L). Given ⋂b, this can be computed in time polynomial in the size of (Ω, W, C, U).

Corollary 1 is a moderately interesting result since binary clauses are somewhat complementary to DL-Lite. The more important use of Lemmas 1, 2, and 3 will be below, where we consider the combination of binary clauses with cardinality upper bounds. Our first observation regarding that combination is:

Proposition 2 Assume a WSC task (Ω, W, C, U) with fixed arity, where Φ is binary with cardinality upper bounds. Deciding whether Φ is satisfiable is NP-complete.

By a straightforward reduction from VERTEX COVER. We sidestep this source of intractability by restricting ourselves to Φ that are binary with consequence-independent cardinality upper bounds (cf. Section 2): any predicate G with a bound image(G) ≤ k does not appear positively in the binary clauses. Note that G appears only negatively in the clause image(G) ≤ k. This removes the problem:

Lemma 4 Let φ be a propositional CNF, with φ = φ1 ∧ φ2 where there exists no literal l s.t. l appears in φ1 and ¬l appears in φ2. Let l be a literal s.t. φ |= l. Then either φ1 |= l or φ2 |= l.

This is easy to see based on the lack of conflicts between φ1 and φ2. A more subtle point is that even dealing with cardinality upper bounds in isolation is tricky. Namely, it is not possible to compute ⋂Res(b, a) based only on ⋂b:

Proposition 3 There exist a WSC task (Ω, W, C, U) where Φ consists entirely of cardinality upper bounds, an action a, and two reachable beliefs b and b' s.t. ⋂b = ⋂b', but ⋂Res(b, a) ≠ ⋂Res(b', a).

A model m may disappear when applying an action a', and not be re-created when a' is inverted. This leads to beliefs b where b ≠ {m | m |= Φ ∧ ⋀⋂b},4 and further to b, b' s.t. ⋂b = ⋂b' but b ≠ b'. This means that it is not possible to, as envisioned, define the search states s simply as sets Ls – at least not if we want to ensure that Ls is exactly the intersection of the corresponding belief. We need to augment s with additional information. We have experimented for some time with methods augmenting s with the min and max number of attribute values present in any model of the belief. The intuition behind such an approach would be that cardinality upper bounds affect only how many, not which, attribute values there are. However, this is not true, since the cardinality upper bounds are intermingled with action effects; this makes capturing the precise distribution of attribute value tuples a surprisingly tricky task. It remains an open question whether beliefs in the presence of cardinality upper bounds can be represented concisely. Herein, we present two alternative options. The first option, Section 4, takes time and space that is exponential (only) in the maximum k of any upper bound image(G) ≤ k. The second option, Section 5, takes polynomial time also in k, but sacrifices precision and guarantees only one of soundness or completeness (the user may choose which one).
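To illustrate Corollary 1's two-step computation, here is a Python sketch (ours, with an assumed (atom, polarity) literal encoding). It closes the effect literals under the binary clauses by unit propagation, which is sound but, unlike full 2-SAT reasoning, not complete for entailment, and then keeps the uncontradicted old intersection literals:

def neg(l):
    atom, pol = l
    return (atom, not pol)

def closure(lits, clauses2):
    # Literals derivable from lits over clauses of at most two literals.
    L = set(lits)
    changed = True
    while changed:
        changed = False
        for c in clauses2:
            live = [l for l in c if neg(l) not in L]
            if not live:
                forced = set(c)        # clause falsified: surface the conflict
            elif len(live) == 1:
                forced = {live[0]}     # all other literals are falsified
            else:
                forced = set()
            if forced - L:
                L |= forced
                changed = True
    return L

def update_binary(int_b, eff_lits, clauses2):
    # Corollary 1 for purely binary Phi: the new belief intersection.
    L = closure(eff_lits, clauses2)
    if any(neg(l) in L for l in L):
        return None                    # contradiction: update undefined
    return L | {l for l in int_b if neg(l) not in L}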
4 Exact Belief Update
We now specify search states s and associated initialise and update procedures that enable us to maintain precise belief intersections. We need three notations. First, by Φ|2, we denote the subset of binary clauses of Φ. Second, if L is a set of literals, G is a predicate with arity 2, and c is a constant, then we denote L|G,c := {d | G(c, d) ∈ L}. That is, L|G,c selects from L the values of attribute G for c. Similarly, L|¬G,c := {d | ¬G(c, d) ∈ L}. Third, say b is a belief; we introduce a formal notation for the precise distribution, denoted Db, of attribute value tuples. Our search states will explicitly keep track of that distribution, and hence contain sufficient information for precise belief update (this is not possible based only on ⋂b, cf. Proposition 3). Db maps any G where image(G) ≤ k in Φ, and any c ∈ C, onto a set of subsets of C. Namely, for each m ∈ b, Db(G, c) contains the set {d | m |= G(c, d)}. Hence, for every G and c, Db(G, c) specifies which combinations of attribute values occur. Our search states s are pairs (Ls, Ds). Consider Figures 2 and 3. In lines (1) to (3), Figure 2 determines all logical consequences, L, of the initial literals and the binary part of Φ, and checks whether L is contradictory. Thereafter, cardinality upper bounds are handled; note that this can be done separately because of Lemma 4. Line (5) detects any violated upper bounds. Line (6) says that, for any cardinality upper bound where we already have the maximum number
4 This relates to [14], who show that DL updates can often not be represented in terms of a single changed ABox.
procedure initialise()
(1) LpreU := {l | l appears in preU}
(2) L := {l | Φ|2 ∧ ⋀LpreU |= l}
(3) if ex. l s.t. l ∈ L and ¬l ∈ L then return (undefined)
(4) for all image(G) ≤ k in Φ, c ∈ C do
(5)   if |LpreU|G,c| > k then return (undefined)
(6)   if |LpreU|G,c| = k then L := L ∪ {¬G(c, d) | d ∈ C, d ∉ LpreU|G,c}
(7)   D(G, c) := {D | D ⊆ C, LpreU|G,c ⊆ D, D ∩ LpreU|¬G,c = ∅, |D| ≤ k}
(8) return (L, D)

Figure 2. The initialise procedure for exact search states.
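Line (7) can be pictured with a short Python fragment (our sketch; the constant names are made up). It enumerates all value sets consistent with the known positive and negative information:

from itertools import combinations

def init_D(pos, negs, C, k):
    # Line (7) of Figure 2: all D with pos a subset of D,
    # D disjoint from negs, and |D| <= k.
    free = [d for d in C if d not in pos and d not in negs]
    return [frozenset(pos) | frozenset(comb)
            for size in range(k - len(pos) + 1)
            for comb in combinations(free, size)]

# Example 2's s0: the initial literals fix {Peter, Bob} and k = 2,
# so the only combination is {Peter, Bob}.
print(init_D({"Peter", "Bob"}, set(), {"Peter", "Bob", "Mary"}, 2))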
of allowed attribute values, all other values are disallowed. Line (7) sets the D(G, c) value combination sets as appropriate, taking every combination that adheres to all constraints.

procedure update(s, a)
(1) if prea ⊈ Ls then return (undefined)
(2) LA := {l | l appears in effa}
(3) L := {l | Φ|2 ∧ ⋀LA |= l}
(4) if ex. l s.t. l ∈ L and ¬l ∈ L then return (undefined)
(5) for all image(G) ≤ k in Φ, c ∈ C do
(6)   if |LA|G,c| > k then return (undefined)
(7)   if |LA|G,c| = k then L := L ∪ {¬G(c, d) | d ∈ C, d ∉ LA|G,c}
(8)   D(G, c) := ∅
(9) LAT := L; L := L ∪ {l | l ∈ Ls, ¬l ∉ L}
(10) for all image(G) ≤ k in Φ, c ∈ C, D ∈ Ds(G, c) do
(11)   if |(D ∪ LA|G,c) \ LAT|¬G,c| > k then
(12)     L := L \ {G(c, d) | G(c, d) ∈ Ls \ LA}
(13)     D(G, c) := D(G, c) ∪ {D' ∪ LA|G,c | D' ⊆ D \ (LA|G,c ∪ LAT|¬G,c), |D'| = k − |LA|G,c|}
(14)   else D(G, c) := D(G, c) ∪ {(D ∪ LA|G,c) \ LAT|¬G,c}
(15) return (L, D)

Figure 3. The update procedure for exact search states.
The update procedure, Figure 3, is more complicated. Line (1) tests whether a is applicable. Lines (2) to (7) are analogous to lines (1) to (6) of Figure 2. Line (8) initialises the D structures. Line (9) extends L with all literals from Ls, except those that are contradicted by L. By Lemma 1, the resulting L is a superset of ⋂Res(b, a). By Lemma 3, as far as binary clauses are concerned, the resulting L is equal to ⋂Res(b, a). For cardinality upper bounds, Lemma 3 does not apply, which necessitates lines (10) to (12) to check if further "old" belief intersection literals disappear. Namely, applying Lemma 2, an old attribute value (even if it is not contradicted) survives only if there exists no model m ∈ b so that, after the effects and their direct consequences have been applied, m contains too many attribute values. To figure out whether or not the latter is the case, the information given by Ds is exploited, in a straightforward way. (Note that this information is indeed required here. Assume that all we know is the maximum number of attribute values in any model m ∈ b. Then we would not know whether or not these are the same values as set by the action effects, and hence we could not decide whether or not an overflow occurs.) Lines (13) and (14), finally, make sure that D is updated correctly. If an overflow occurs, then all possible ways of minimally repairing the overflow are generated. If no overflow occurs, then D(G, c)
simply changes according to the effect and its implications. We have:

Theorem 1 Assume a WSC task (Ω, W, C, U) where Φ is binary with consequence-independent cardinality upper bounds. Assume b is a reachable belief, and s is the corresponding search state. Then: (1) b is defined iff s is defined; (2) if b is defined, then ⋂b = Ls; (3) if b is defined, then Db ≡ Ds.

The formal proof of Theorem 1 is quite lengthy, and involves various (sometimes rather tedious) case distinctions. The proof essentially spells out the intuitive arguments given above. Our main result here is that, provided a maximum cardinality is fixed, maintaining belief intersections is tractable:

Corollary 2 Assume a WSC task (Ω, W, C, U) with fixed arity and fixed maximum cardinality, where Φ is binary with consequence-independent cardinality upper bounds. Assume b is reached by action sequence a. Then the corresponding search state s is computed in time polynomial in the size of (Ω, W, C, U) and a, and ⋂b = Ls.

Note that it is indeed a non-trivial consequence of our particular setting that the behavior is exponential only in the maximum k of any image(G) ≤ k. The enabling properties are: (1) image(G) ≤ k does not interfere in any way with image(H) ≤ k, if G ≠ H; (2) similarly, the bound on the number of y in G(c, y) does not interfere with the bound on y in G(c', y) if c ≠ c'; (3) due to consequence-independence, no interferences arise from the binary clauses.

Example 2 Re-consider Example 1. Running initialise, we get the state s0 where L = {ticketfor(t, Peter), ticketfor(t, Bob)} and D(ticketfor, t) = {{Peter, Bob}}. Applying a1, we get s1 = update(s0, a1) where L = {ticketfor(t, Mary)} and D(ticketfor, t) = {{Peter, Mary}, {Bob, Mary}}. Applying a2, we get s2 = update(s1, a2) where L = {ticketfor(t, Peter)} and D(ticketfor, t) = {{Peter, Mary}, {Bob, Peter}}.
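The overflow repair of lines (10) to (14) can likewise be sketched in Python for a single pair (G, c) (our illustration; the real procedure interleaves this with the updates of L). It assumes |eff_vals| <= k, which line (6) guarantees:

from itertools import combinations

def update_D_Gc(Ds_Gc, eff_vals, forbidden, k):
    # Ds_Gc: old value combinations; eff_vals: values set by the effect;
    # forbidden: values whose absence follows from the effect's consequences.
    out, overflow = set(), False
    for D in Ds_Gc:
        merged = (D | eff_vals) - forbidden
        if len(merged) > k:                        # line (11)
            overflow = True                        # triggers line (12)
            keep = D - eff_vals - forbidden
            for Dp in combinations(keep, k - len(eff_vals)):
                out.add(frozenset(Dp) | eff_vals)  # line (13): minimal repairs
        else:
            out.add(frozenset(merged))             # line (14)
    return out, overflow

# Example 2, applying a1 (effect ticketfor(t, Mary)) to D(ticketfor, t):
print(update_D_Gc({frozenset({"Peter", "Bob"})}, frozenset({"Mary"}),
                  frozenset(), 2))
# yields {Peter, Mary} and {Bob, Mary}, with an overflow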
5 Approximate Belief Update
Even though it seems likely that k will be small in practice, it is advisable to look for more efficient methods. The size of D(G, c) is bounded only by the binomial coefficient (|C| choose k). If there are many constants, then enumerating D will become critical even for, say, k > 2. We now tackle this complexity by approximation methods. The search states s are pairs (L−s, L+s), where L−s and L+s respectively under-approximate and over-approximate the belief intersection. Both approximations are maintained simultaneously because they are interlinked. Depending on how one tests action applicability and solutions, one obtains a pessimistic/sound (but incomplete) planning procedure, or an optimistic/complete (but unsound) planning procedure. We show here only the former; the latter can be obtained by minor modifications. The initialise procedure changes only slightly because, there, no update is performed. In fact the procedure is exactly as shown in Figure 2, except that the returned s takes the form (L, L) where L – the precise belief intersection – serves both as L− and as L+. Consider Figure 4. Line (1) tests pessimistically whether a is not applicable: the preconditions are tested against L−s. Thereafter, lines (2) and (3) determine the effects and their implications over the binary clauses. Line (4) tests for contradictions in the latter. Similarly, line (6) aborts the algorithm in case of a conflict with a cardinality upper bound (separate treatment of the two kinds of conflicts is justified by Lemma 4). Line (7) adds the consequences of the upper bounds to the implied literals.
procedure update(s, a)
(1) if prea ⊈ L−s then return (undefined)
(2) LA := {l | l appears in effa}
(3) L := {l | Φ|2 ∧ ⋀LA |= l}
(4) if ex. l s.t. l ∈ L and ¬l ∈ L then return (undefined)
(5) for all image(G) ≤ k in Φ, c ∈ C do
(6)   if |LA|G,c| > k then return (undefined)
(7)   if |LA|G,c| = k then L := L ∪ {¬G(c, d) | d ∈ C, d ∉ LA|G,c}
(8) L− := L ∪ {l | l ∈ L−s, ¬l ∉ L}
(9) L+ := L ∪ {l | l ∈ L+s, ¬l ∉ L}
(10) for all image(G) ≤ k in Φ, c ∈ C do
(11)   if |L−|G,c| > k then L+ := L+ \ {G(c, d) | G(c, d) ∈ L+s \ LA}
(12)   if |(C \ L|¬G,c) ∪ L|G,c| > k then L− := L− \ {G(c, d) | G(c, d) ∈ L−s \ LA}
(13) return (L−, L+)

Figure 4. The update procedure for approximate search states.
Lines (8) and (9) initialise the consideration of old intersection literals. All of those which are not contradicted are taken into a respective approximate set (cf. Lemmas 1 and 3). Line (11) says that, if even the under-approximation violates a bound, then certainly the old attribute values get lost unless they are protected by the effect (cf. Lemma 2). Line (12) says that, if the number of constants that could potentially be attribute values violates a bound, then it may happen that the old attribute values get lost, unless they are protected by the effect (cf. Lemma 2). Note that the order of lines (11) and (12) is important because line (12) changes L−. If one executes (12) before (11), then the condition of (11) is always false, and L+ is still an over-approximation but an unnecessarily generous one. We get:

Theorem 2 Assume a WSC task (Ω, W, C, U) where Φ is binary with consequence-independent cardinality upper bounds. Assume b is a reachable belief, and s is the corresponding approximate search state. Then: (1) if b is undefined, then s is undefined; (2) if s is defined, then L−s ⊆ ⋂b ⊆ L+s.

As for Theorem 1, the proof of Theorem 2 is lengthy and involves various case distinctions. Our main result of this section is:

Corollary 3 Assume a WSC task (Ω, W, C, U) with fixed arity, where Φ is binary with consequence-independent cardinality upper bounds. Assume b is reached by action sequence a. If the corresponding approximate search state s is defined, then s is computed in time polynomial in the size of (Ω, W, C, U) and a, and ⋂b ⊇ L−s.

Example 3 Re-consider Example 1. Running initialise, we get the state s0 where L− = L+ = {ticketfor(t, Peter), ticketfor(t, Bob)}. Applying a1, both lines (11) and (12) fire and so we get s1 = update(s0, a1) where L− = L+ = {ticketfor(t, Mary)}. Applying a2, only line (12) fires and so we get L− = {ticketfor(t, Peter)} and L+ = {ticketfor(t, Mary), ticketfor(t, Peter)}.
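Restricted to the values of a single G(c, ·), lines (8) to (12) admit a compact Python sketch (ours; the parameter names are assumptions) that reproduces Example 3's first step:

def approx_update_Gc(old_minus, old_plus, eff_vals, known_absent, C, k):
    # known_absent: values d whose negated literal is among the effect's
    # consequences; C: the candidate attribute values.
    lm = eff_vals | (old_minus - known_absent)          # line (8)
    lp = eff_vals | (old_plus - known_absent)           # line (9)
    if len(lm) > k:                                     # line (11)
        lp -= (old_plus - eff_vals)
    if len((set(C) - known_absent) | eff_vals) > k:     # line (12)
        lm -= (old_minus - eff_vals)
    return lm, lp

# Example 3, applying a1: both tests fire and only Mary survives.
print(approx_update_Gc({"Peter", "Bob"}, {"Peter", "Bob"}, {"Mary"},
                       set(), {"Peter", "Bob", "Mary"}, 2))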
6 Related Work
[6] introduces DL-Lite, where the updated belief can be represented in terms of a new ABox computed in polynomial time. DL-Lite is somewhat complementary to binary clauses. Disjunction is allowed only in the form of subsumption rules in the TBox, and is binary in that sense. However, [6] allow unqualified existential quantification, membership assertions (ABox literals) using variables, and updates
involving general (constructed) DL concepts. On the other hand, DL-Lite does not allow clauses with two positive literals, and DL-Lite (like any DL) does not allow predicates of arity greater than 2. Most importantly, DL-Lite does not allow cardinality upper bounds. [7] considers a variant of DL-Lite where ABox assertions do not allow variables, and hence updates cannot be represented in terms of a new ABox. [7] show that the update from [6] can be re-used to compute the exact set of (restricted) ABox assertions after the update; this approximates the update in the sense that this set of assertions does not suffice to characterize the exact set of models. This is quite different from our approximation techniques as per Section 5, where we use approximation (without exactness guarantees) not to handle a different language, but to obtain efficiency. [9, 10] address planning with belief update semantics (other than the PMA); they do not identify tractable classes.
7 Conclusion
In planning-based WSC, one of the fundamental difficulties is the complexity of computing the outcome of actions. Since practical domain axiomatizations for WSC are often simple, there is hope to tackle this complexity by identifying tractable fragments. We make a first step in this direction, showing how cardinality upper bounds can be handled, in combination with binary clauses. Many questions are left open. For example: Are our algorithms here the best possible ones, or is there an exact update algorithm that is polynomial also in k? Can one efficiently deal with cardinality lower bounds? We hope that some of these questions will be clarified in future work.
REFERENCES
[1] F. Baader, C. Lutz, M. Milicic, U. Sattler, and F. Wolter, 'Integrating description logics and action formalisms: First results', in AAAI, (2005).
[2] G. Brewka and J. Hertzberg, 'How to do things with worlds: On formalizing actions and plans', JLC, 3, 517–532, (1993).
[3] The OWL Services Coalition. OWL-S: Semantic Markup for Web Services, 2003.
[4] T. Eiter and G. Gottlob, 'On the complexity of propositional knowledge base revision, updates, and counterfactuals', AI, 57, 227–270, (1992).
[5] D. Fensel, H. Lausen, A. Polleres, J. de Bruijn, M. Stollberg, D. Roman, and J. Domingue, Enabling Semantic Web Services – The Web Service Modeling Ontology, Springer-Verlag, 2006.
[6] G. De Giacomo, M. Lenzerini, A. Poggi, and R. Rosati, 'On the update of description logic ontologies at the instance level', in AAAI, (2006).
[7] G. De Giacomo, M. Lenzerini, A. Poggi, and R. Rosati, 'On the approximation of instance level update and erasure in DL', in AAAI, (2007).
[8] M. Ginsberg and D. Smith, 'Reasoning about action I: A possible worlds approach', AI, 35, 165–195, (1988).
[9] J. Hertzberg and S. Thiebaux, 'Turning an action formalism into a planner – a case study', JLC, 4, 617–654, (1994).
[10] A. Herzig, J. Lang, P. Marquis, and T. Polacsek, 'Updates, actions, and planning', in IJCAI, (2001).
[11] A. Herzig and O. Rifi, 'Propositional belief base update and minimal change', AI, 115, 107–138, (1999).
[12] J. Hoffmann. Towards efficient belief update for planning-based web service composition, 2008. Available at http://members.deri.at/joergh/papers/tr-ecai08.pdf.
[13] J. Hoffmann and R. Brafman, 'Conformant planning via heuristic forward search: A new approach', AI, 170(6–7), 507–541, (2006).
[14] H. Liu, C. Lutz, M. Milicic, and F. Wolter, 'Updating description logic ABoxes', in KR, (2006).
[15] C. Lutz and U. Sattler, 'A proposal for describing services with DLs', in DL, (2002).
[16] N. McCain and H. Turner, 'A causal theory of ramifications and qualifications', in IJCAI, (1995).
[17] D. E. Smith and D. Weld, 'Conformant Graphplan', in AAAI, (1998).
[18] M. Winslett, 'Reasoning about actions using a possible models approach', in AAAI, (1988).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-563
Genetic Optimization of the Multi-Location Transshipment Problem with Limited Storage Capacity Nabil Belgasmi and Lamjed Ben Saïd and Khaled Ghédira1 Abstract. Lateral Transshipments afford a valuable mechanism for compensating unmet demands using only on-hand inventory. In this paper, we investigate the case where locations have a limited storage capacity. The problem is to determine how much to replenish each period to minimize the expected global cost while satisfying storage capacity constraints. We propose a Real-Coded Genetic Algorithm (RCGA) with a new crossover operator to approximate the optimal solution. We analyze the impact of different structures of storage capacities on the system behaviour. We find that Transshipments are able to correct the discrepancies between the constrained and the unconstrained locations while keeping costs and system-wide inventories low. Our genetic algorithm proves its ability to solve instances of the problem with high accuracy.
1 INTRODUCTION
Practical optimization problems, especially supply chain optimization problems, usually have a complex structure. The same holds in many transport- or production-related fields [1]. Physical pooling of inventories has been widely used in practice to reduce cost and improve customer service [2]. Transshipments are recognized as the monitored movement of material among locations at the same echelon. They afford a valuable mechanism for correcting the discrepancies between the locations' observed demand and their on-hand inventory. Subsequently, Transshipments may reduce costs and improve service without increasing the system-wide inventories. The study of multi-location models with Transshipments is an important contribution for mathematical inventory theory as well as for inventory practice. The idea of lateral Transshipments is not new: the first study dates back to the sixties. The two-location-one-period case with linear cost functions was considered by [3]. [4] studied an N-location-one-period model where the cost parameters are the same for all locations. [5] incorporated non-negligible replenishment lead times and Transshipment lead times among stocking locations into the multi-location model. The effect of lateral Transshipment on the service levels in a two-location-one-period model was studied by [6]. The common problem tackled by these models is the determination of the optimal replenishment decision which minimizes the aggregate cost of the system. Most of the studies lead to optimal solutions since they investigate simple models easily solved by mathematical techniques (see [4], [7]). However, an optimal replenishment decision for a general multi-location inventory system cannot be computed in an analytical way. Few operational research methods were applied to find near-optimal solutions. The gradient-based IPA method was successfully used for both capacitated Transshipment and production problems [8]. The use of IPA to solve real-world problems is not always possible since
1 Ecole Nationale des Sciences de l'Informatique. Email: khaled.ghedira@isg.rnu.tn
many conditions should be satisfied to ensure the unbiasedness of its estimator [9]. Evolutionary optimization may provide a powerful methodology for solving such complex problems without the need for prior knowledge about their analytical properties. The contribution of this paper is twofold. We first incorporate storage capacity constraints into the traditional Transshipment model, which leads to a better modelling of real-world situations. Second, we investigate the applicability of real-coded evolutionary algorithms to the optimization of inventory levels and costs. This provides insights to tackle other extensions of the basic Transshipment problem with evolutionary optimization methods. The remainder of this paper is organized as follows. In Section 2, we formulate the proposed Transshipment model. In Section 3, we present the main concepts of evolutionary optimization; we describe the new crossover operator and our evolutionary modelling of the problem. In Section 4, we show our experimental results. In Section 5, we state our concluding remarks.
2 THE PROBLEM

2.1 Model description
We consider the following real-life problem where we have n stores selling a single product. The stores may differ in their cost and demand parameters. The system inventory is reviewed periodically. At the beginning of the period and long before the demands realization, replenishments take place in store i to increase the stock level up to Si. The storage capacity of each location is limited to Smax,i. In other words, the replenishment quantities should not exceed Smax,i inventory units. This may be due to expensive fixed holding costs, or to the limited physical space of the stores. Thus, the inventory level of store i will always be less than or equal to min(Si, Smax,i). After the replenishment, the observed demands Di, which represent the only uncertain event in the period, are totally or partially satisfied depending on the on-hand inventory of local stores. However, some stores may run out of stock while others still have unsold goods. In such a situation, it is possible to move these goods from stores with surplus inventory to stores with still unmet demands. This is called lateral Transshipment within the same echelon level. It means that stores in some sense share the stocks. The set of stores holding inventory, I+, can be considered as temporary suppliers since they may provide other stores at the same echelon level with stock units. Let tij be the Transshipment cost of each unit sent by store i to satisfy a one-unit unmet demand at store j. In this paper, the Transshipment lead time is considered negligible. After the end of the Transshipment process, if store i still has a surplus inventory, it will be penalized by a per-unit holding cost of hi. If store j still has unmet demands, it will be penalized by a per-unit
shortage cost of pj. Fixed Transshipment costs are assumed to be negligible in our model. [2] proved that, in the absence of fixed costs, if Transshipments are made to compensate for an actual shortage and not to build up inventory at another store, there exists an optimal base stock policy S* among all possible stationary policies. To see the effect of fixed costs on a two-location model formulation, see [10]. The following notation is used in our model formulation:

n       Number of stores
Si      Order quantity for store i
S       Vector of order quantities, S = (S1, S2, ..., Sn) (decision variable)
Smax,i  Maximum storage capacity of store i
Smax    Vector of storage capacities, Smax = (Smax,1, Smax,2, ..., Smax,n)
Di      Demand realized at i
D       Vector of demands, D = (D1, D2, ..., Dn)
hi      Unit inventory holding cost at i
pj      Unit penalty cost for shortage at j
tij     Unit cost of Transshipment from i to j
Tij     Amount transshipped from i to j
I+      Set of stores with surplus inventory (before Transshipment)
I−      Set of stores with unmet demands (before Transshipment)

2.2 Modelling assumptions

Several assumptions are made in this study to avoid pathological cases.

Assumption 1 (Transshipment policy): The Transshipment policy is stationary, that is, the Transshipment quantities are independent of the period in which they are made; they depend only on the available inventory after demand observation. In this study, we will employ a Transshipment policy known as complete pooling. This Transshipment policy is described as follows [11]: "the amount transshipped from one location to another will be the minimum between (a) the surplus inventory of the sending location and (b) the shortage inventory at the receiving location". The optimality of the complete pooling policy is ensured under some reasonable assumptions [6].

Assumption 2 (Lead time): Transshipment lead times are negligible. At the end of every period, optimal Transshipment quantities are computed. We assume that they are immediately shipped to their destination without making customers wait for a long time.

Assumption 3 (Replenishment policy): At the beginning of every period, replenishments take place to increase the inventory position of store i up to min(Si, Smax,i), taking into account the remaining inventory of the previous period. The optimality of the order-up-to policy in the absence of fixed costs is proven in [2].

2.3 Model formulation

Cost function: Since inventory choices in each store are centrally coordinated, it would be a common interest among the stores to minimize the aggregate cost. At the end of the period, the system cost is given by:

C(S, D) = Σ_{i∈I+} hi (Si − Di) + Σ_{j∈I−} pj (Dj − Sj) − K(S, D)    (1)

The first and the second term on the right hand side of (1) can be respectively recognized as the total holding cost and shortage cost before the Transshipment. The third term is recognized as the aggregate Transshipment profit, since every unit shipped from i to j decreases the holding cost at i by hi and the shortage cost at j by pj, while the total cost is increased by tij because of the Transshipment cost. Due to the complete pooling policy, the optimal Transshipment quantities Tij can be determined by solving the following linear programming problem:

K(S, D) = max_{Tij} Σ_{i∈I+} Σ_{j∈I−} (hi + pj − tij) Tij    (2)

Subject to

Σ_{j∈I−} Tij ≤ Si − Di,  ∀i ∈ I+    (3)
Σ_{i∈I+} Tij ≤ Dj − Sj,  ∀j ∈ I−    (4)
Tij ≥ 0    (5)
In (2), problem K can be recognized as the maximum aggregate income due to the Transshipment. Tij denotes the optimal quantity that should be shipped from i to fill unmet demands at j. Constraints (3) and (4) say that the shipped quantities cannot exceed the available quantities at store i and the unmet demand at store j. Since demand is stochastic, the aggregate cost function is built as a stochastic programming model, which is formulated in (6). The objective is to minimize the expected aggregate cost per period.

min_S E[C(S, D)] = min_S E[ Σ_{i∈I+} hi (Si − Di) + Σ_{j∈I−} pj (Dj − Sj) − K(S, D) ]    (6)

Subject to

Si ≤ Smax,i,  i = 1...n    (7)
where the first two terms denote the expected cost before the Transshipment, called the Newsvendor2 cost, and the third term denotes the expected aggregate income due to the Transshipment. This proves the important relationship between the newsvendor and the Transshipment problem. By setting very high Transshipment costs, i.e. tij > hi + pj, no Transshipments will occur; problem K will then return zero. Thus, our model can deal with both the Transshipment and the newsvendor case.

2 The newsvendor model is the basis of most existing Transshipment literature. It addresses the case where Transshipments are not allowed.

Cost function properties: The cost function is stochastic because of the demand randomness modelled by the continuous random variables Di with known joint distributions. Thus we must compute the expected value of the cost function. An analytically tractable expression for problem K given in (2) exists only in the case of a generalized two-location problem or N locations with identical cost structures [4]. In both cases, the open linear programming problem K has an analytical solution. But in the general case (many locations with different cost structures), we can use any linear programming method to solve problem K. In this study, we used the Simplex Method. The mentioned properties of our problem are sufficient to conclude that it is not possible to compute the exact expected values of the stochastic function given in (6). The most common method to deal with noise or randomness is re-sampling, or re-evaluation of objective values [12]. With the re-sampling method, if we evaluate a solution S for N times, the estimated objective value is obtained as in equation (8) and the noise is reduced by a factor of √N. For this purpose, draw N random scenarios D1, ..., DN independently from each other (in our problem, a scenario Dk is equivalent to a demand vector Dk = (Dk1, ..., Dkn)). A sample estimate of f(S), noted f̄(S) ≈ E(f(S, D)), is given by

f̄(S) = (1/N) Σ_{k=1}^{N} f(S, Dk),  with  Var[f̄(S)] = Var[f(S, D)] / N    (8)
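To make (1)-(8) concrete, here is a minimal Python sketch (our own, not the authors' implementation) that estimates the expected cost of a decision S by sampling demand scenarios and solving problem K with scipy's linprog; any LP solver can stand in for the Simplex Method here:

import numpy as np
from scipy.optimize import linprog

def K(surplus, shortage, h, p, t):
    # Problem K, (2)-(5), for one scenario; surplus[i] = max(Si - Di, 0).
    I_plus = [i for i, v in enumerate(surplus) if v > 0]
    I_minus = [j for j, v in enumerate(shortage) if v > 0]
    if not I_plus or not I_minus:
        return 0.0
    pairs = [(i, j) for i in I_plus for j in I_minus]   # variables T_ij
    c = [-(h[i] + p[j] - t[i][j]) for i, j in pairs]    # maximize => negate
    A_ub, b_ub = [], []
    for i in I_plus:                                    # constraint (3)
        A_ub.append([1.0 if ii == i else 0.0 for ii, jj in pairs])
        b_ub.append(surplus[i])
    for j in I_minus:                                   # constraint (4)
        A_ub.append([1.0 if jj == j else 0.0 for ii, jj in pairs])
        b_ub.append(shortage[j])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))  # (5): T_ij >= 0
    return -res.fun

def sampled_cost(S, h, p, t, draw_D, N):
    # Sample estimate (8) of the expected aggregate cost (6).
    total = 0.0
    for _ in range(N):
        D = draw_D()
        surplus = [max(s - d, 0.0) for s, d in zip(S, D)]
        shortage = [max(d - s, 0.0) for s, d in zip(S, D)]
        total += (sum(hi * v for hi, v in zip(h, surplus))
                  + sum(pj * v for pj, v in zip(p, shortage))
                  - K(surplus, shortage, h, p, t))
    return total / N

# Two locations with h = $1, p = $4, t = $0.5 and D ~ N(100, 20):
rng = np.random.default_rng(0)
print(sampled_cost([110, 110], [1, 1], [4, 4], [[0, 0.5], [0.5, 0]],
                   lambda: rng.normal(100, 20, size=2), N=200))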
3 EVOLUTIONARY OPTIMIZATION

3.1 Main concepts

We refer to evolutionary algorithms as methods that handle a population of solutions, iteratively evolve the population by applying phases of self-adaptation and co-operation, and employ a coded representation of the solutions. The evolutionary algorithms most suitable for solving optimization problems in continuous domains are Evolution Strategies (ES) [13], Genetic Algorithms (GA) [14] with real coding, and evolutionary programming [15]. GAs are a search methodology invented by Holland [15], inspired by natural genetic theory. They are regarded as methods that are suited for exploring large solution spaces, and they are a very effective method for solving real-world problems; the reason for this success is their simplicity and performance. The main idea of this technique is to generate diverse chromosomes and select the most appropriate ones to continue. We have an initial population of chromosomes which are produced randomly or by a particular scheme. Then, iteratively, we generate new generations of the population out of the previous ones using mutation, crossover and selection. Mutation is designed to generate a new chromosome out of an existing one by randomly changing it. In crossover, two existing chromosomes are combined to generate new chromosomes. Selection ensures the formation of the new population from the previous population. By applying the mentioned operations, the average fitness of the population will tend to increase over the algorithm lifetime. In many practical problems, chromosomes are coded as real numbers. We call a GA working with real parameters in its chromosome an RCGA (Real-Coded Genetic Algorithm). The general structure of a GA is:

Genetic algorithm
Begin
  t := 0
  Initialize P(t)
  Evaluate P(t)
  while (not Stop-criterion) do
    t := t + 1
    Select P(t) from P(t-1)
    Crossover P(t)
    Mutate P(t)
    Evaluate P(t)
  End-While
End.

where t is the current generation and P(t) is the current population.

3.2 Solution methodology

In our study, a real-coded GA is used to search for optimal replenishment decisions S*, with respect to the storage capacity constraints. In this section, we describe our evolutionary modelling of the constrained multi-location Transshipment problem.

Structure of the individual and population size: Each individual consists of a vector of n genes. It encodes a replenishment decision S. A gene is a positive real parameter representing an order quantity Si. It is easy to see that a population represents a set of replenishment decisions that moves toward regions of the search space that have better fitness values (lower costs). The population size is less than 30 individuals.
Fitness evaluation: With respect to the re-sampling method given in (8), we should evaluate each individual N times in order to compute its fitness value. However, this may lead to individuals with different variances, which makes the selection of good individuals inaccurate. Thus, in order to get a population with a common estimation error rate ER, we repeat the evaluation of each individual until its error estimation rate is less than ER. We define the error estimation rate as the ratio of the estimated standard deviation to the estimated mean of the sampled function at the given design S:
ER(S) = σ̂(S) / f̄(S)    (9)
Recall that ER(S) is null when the estimated standard deviation is null, which is the case when the sample size is very large (cf. (9)). Using the ER measure facilitates the supervision of the accuracy of explored regions of the search space, since neither the standard deviation nor the expected cost is known in advance. We will use ER values varying between 0.01 and 2.

Initialization: In most search algorithms, the initialization method is very important. We have opted for two initialization procedures. The first consists of generating uniformly distributed values for each gene within the domain [0, min(Si, Smax,i)]. The second consists of analytically solving the newsvendor version of our problem; we then initialize each gene with a random value close to the computed optimal solution, with respect to the storage capacities.
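A minimal sketch of the ER-controlled evaluation follows (our reading of (9), taking the standard deviation of the mean estimate; the batch size and evaluation cap are assumptions):

def evaluate_with_er(S, sample_cost, er_target=0.01, batch=50, max_evals=500000):
    # Re-evaluate S until the estimated std. dev. of the mean estimate,
    # divided by the mean itself, drops below er_target (cf. (9)).
    vals = []
    while True:
        vals += [sample_cost(S) for _ in range(batch)]
        n = len(vals)
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / (n - 1)
        se = (var / n) ** 0.5
        if (mean != 0 and se / abs(mean) <= er_target) or n >= max_evals:
            return mean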
Crossover: Mating is performed using crossover to combine genes from different parents to produce new children. We have chosen binary tournament selection to pick out parents for reproduction: tournament selection runs a tournament between two randomly chosen individuals and selects the winner (the individual with the best fitness value). Many crossover techniques have been studied in evolutionary optimization. We tested 3 existing crossover operators. Let A and B be two selected parents, and a a real number uniformly generated between 0 and 1.
Single-point crossover: the chromosomes of the parents are cut at a randomly chosen point and the resulting fragments are swapped.
Uniform crossover: each gene of the offspring X is selected randomly from the corresponding genes of the parents.
Convex crossover: offspring X = a.A + (1−a).B.
Moreover, we propose a new crossover operator called Gradient-descent crossover (GRD-Crossover), since it creates an offspring following a quasi-descent direction. The first new offspring X is obtained by applying a convex crossover (X is inside the segment [AB]). The second offspring Y depends on the fitness values of the parents. Let CA and CB be the fitness values of A and B, and assume that CB ≥ CA. We can suppose that if Y lies in the direction of the path linking solution B to A, then it may be better than its parents. More properly, X and Y are created as below:

X = a.A + (1 − a).B
Y = B − β.(B − A)

where a is a real number uniformly generated between 0 and 1, and β is a uniform random variable that has the same sign as (CB − CA). We implemented all these crossovers and show that the GRD-Crossover performs well in terms of convergence and accuracy.

Mutation: Mutation is realized by adding to each gene Si a normally distributed random number centred on 0. This operator alters genes of the selected individuals with a given mutation probability. Because we are dealing with real-valued definition domains (e.g. [0, min(Si, Smax,i)]), all offspring genes that are out of their domains are scaled down as follows: Si := min(Si, Smax,i).

Selection: After evaluating the fitness of each individual, we must select the fittest ones to reproduce and form the population of the next generation. In our case, the best individuals represent the set of replenishment decisions {S*} that ensure low aggregate costs. Many selection methods have been studied and used for solving problems. We have chosen a deterministic selection procedure which consists of sorting the individuals and copying the best 10% of them to the mating pool. This protects the best individuals and lets them survive until the birth of stronger offspring.
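A compact Python sketch of the GRD-Crossover and the mutation operator as we read them (β's magnitude distribution and the mutation spread sigma are assumptions):

import random

def grd_crossover(A, B, cA, cB):
    # X is a convex combination of the parents; Y steps from B toward, and
    # possibly past, the better parent, i.e. along a quasi-descent direction.
    a = random.random()
    X = [a * x + (1 - a) * y for x, y in zip(A, B)]
    beta = random.random()                  # magnitude: assumed uniform in (0, 1)
    if cB < cA:                             # sign of beta follows (CB - CA)
        beta = -beta
    Y = [y - beta * (y - x) for x, y in zip(A, B)]
    return X, Y

def mutate(S, S_max, sigma=5.0, p_mut=0.15):
    # Gaussian mutation; genes leaving [0, Smax_i] are clamped as in the text.
    out = []
    for s, smx in zip(S, S_max):
        if random.random() < p_mut:
            s = s + random.gauss(0.0, sigma)
        out.append(min(max(s, 0.0), smx))
    return out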
4 OPTIMIZATION RESULTS
In this section, we report on our numerical study. We first analyze the shape of the constrained cost function for a given system setting, and illustrate the spread of the individuals in the first and the tenth generations of the GA. We compare our GRD-Crossover with other crossovers and show its ability to perform well and to provide near-optimal solutions. Finally, we analyze the impact of the incorporation of storage capacity in the basic Transshipment model.
4.1 Case study
Our first exemplary inventory model consists of 2 locations with the following parameters: hi = $1, pi = $4, tij = $0.5 and Di ~ N(100, 20). Location (2) has no storage capacity constraint (Smax,2 = ∞). However, location (1)'s storage capacity is limited to Smax,1 = 80. We generated 30,000 samples of the cost function with a fixed error rate ER = 1%. The average number of evaluations is 450,000. Obviously, an individual consists of 2 genes only, one per location. The evolutionary optimization process was started with the following parameters:
Population size = 30
Number of generations = 40
Crossover rate = 85%
Mutation rate = 15%
Error rate = 1%
4.2 Experimental design
To show the flexibility of our model, we have studied a 4-location Transshipment system with 7 storage capacity settings. In all designs, holding costs are equal to $1, shortage costs are equal to $4, Transshipment costs are equal to $0.5 and demands are normally distributed: N(100, 20). Table 1 summarizes the designs' characteristics.

Table 1. Designs characteristics (storage capacity of location 1)
Sys     C-0  C-1  C-2  C-3  C-4  C-5  C-6  C-7
Smax,1  ∞    ∞    100  80   60   40   20   0
In system C-0, no material movement is allowed among locations. It represents 4 independent newsvendor problems. System C-1 refers to the basic Transshipment problem with no storage limits. In systems C-2 to C-7, only location (1) faces different storage constraints; all the other locations have no such storage constraints. System-wide inventories considerably decrease in comparison to the independent newsvendors system. Figure 1 also reveals an important property of multi-location systems with storage capacity constraints, namely the ability of the locations to face heavy storage constraints (Smax,1 = 0). Solidarity and cooperation of some system locations significantly fix the aggregate cost. When analyzing the optimal costs of all settings, we remark that whatever the hardness of the storage capacity (varying from ∞ to 0), costs and system-wide inventories in the systems where Transshipments are allowed (C-1 to C-7) are lower than in the newsvendor case.

Figure 1. Cost under different systems (total cost and total inventory for settings C-0 through C-7, with Transshipments (TR) and for the newsvendor case (NB)).
4.3 Validation with a benchmark
We validate our RCGA using an illustrative example from [4] where optimal solutions are available. Recall that the system consists of 4 locations having identical cost structures, with a holding cost of $1 per unit, a shortage cost of $4 per unit, and a Transshipment cost of $0.10 per unit. There are no storage capacity constraints. Thus, our purpose is to compare the solution given by our RCGA using different crossovers to the optimal solution computed analytically. This can be done by setting infinite storage capacity limits (Smax,i = ∞). In Figure 2, we find that GRD-Crossover is better than all the other experimented crossovers. It has an important role in fine-tuning the individuals in the last generations. It performs better than the Convex crossover even though it is partially based on a convex exploitation of the selected parents. Figure 3 shows that the best solutions given by the RCGA have a large variance (94 < S1 < 130, 209 < S2 < 274, 155 < S3 < 198 and 148 < S4 < 218), whereas the resulting costs are approximately equal (C = {113.51, 113.90, 113.80, 115.42}). Recall that the optimal solution is S* = (109, 222.5, 163.5, 192.5) with a minimal cost of C* = 113.49. This leads to the conclusion that the approximation of the optimal cost value with our RCGA is satisfactory even though the approximation of the optimal order quantities has a great variance.
Figure 2. Best fitness of the last-generation individuals under multiple crossovers (x-axis: individuals 1-15; y-axis: fitness; series: GRD-X, CVX-X, UNIFORM-X, 1-POINT-X, OPT-SOL).
Figure 3. Optimal and near-optimal solutions under multiple crossovers (x-axis: S1-S4; y-axis: inventory level; series: GRD-X, CVX-X, UNIFORM-X, 1-POINT-X, OPT.SOL).
5 CONCLUSION
In this paper, we considered a multi-location Transshipment model with limited storage capacity. The objective is to minimize the aggregate cost function where decision variables are the
constrained order-up-to quantities. We modelled the optimal redistribution of inventory in an arbitrary period as a linear programming problem based on the complete pooling policy. We employed a real-coded GA to solve the problem. A new crossover operator based on a simple approximation of the gradient descent was proposed and tested on multiple problem instances. Experiments showed that it outperforms many existing crossovers. An interesting conclusion is that Transshipments offer an important flexibility to systems that face restrictive storage capacity limits. The observed results confirm the success of evolutionary algorithms in solving inventory problems. Future studies will concentrate on two directions: the multi-objective optimization of multi-location systems with storage capacity, where costs, lead times and service level should be optimized; and the improvement of real-coded evolutionary algorithms by incorporating effective search and sensitivity estimation techniques in crossover or mutation operators.
REFERENCES
[1] J. Arnold and P. Köchel, Evolutionary Optimization of the Multi-location Inventory Model with Lateral Transshipments, 1997.
[2] Y. T. Herer, M. Tzur, and E. Yücesan, 'The multi-location Transshipment problem', forthcoming in IIE Transactions, 2005.
[3] S. P. Aggarwal, 'Inventory control aspect in warehouses', Symposium on Operations Research, Indian National Science Academy, New Delhi, 1967.
[4] K. S. Krishnan and V. R. K. Rao, 'Inventory control in N warehouses', Journal of Industrial Engineering, Vol. 16, No. 3, pp. 212–215, 1965.
[5] H. Jonsson and E. A. Silver, 'Analysis of a Two-Echelon Inventory Control System With Complete Redistribution', Management Science 33, 215–227, 1987.
[6] G. Tagaras, 'Effects of pooling on the optimization and service levels of two-location inventory systems', IIE Trans., Vol. 21, No. 3, pp. 250–257, 1989.
[7] N. Rudi, S. Kapur, and D. Pyke, 'A Two-Location Inventory Model with Transshipment and Local Decision Making', 1998.
[8] D. Özdemir, E. Yücesan, and Y. T. Herer, 'Multi-Location Transshipment Problem with Capacitated Transportation', Technology Management Area, INSEAD, Proceedings of the 2003 Winter Simulation Conference, 2003.
[9] P. Glasserman, Gradient Estimation via Perturbation Analysis, Kluwer Academic Publishers, Hingham, 1991.
[10] Y. Herer and A. Rashit, 'Lateral Stock Transshipments in a Two-location Inventory System with Fixed Replenishment Costs', Department of Industrial Engineering, Tel Aviv University, 1999a.
[11] Y. Herer and A. Rashit, 'Policies in a general two-location infinite horizon inventory system with lateral stock Transshipments', Department of Industrial Engineering, Tel Aviv University, 1999b.
[12] H.-G. Beyer, 'Evolutionary algorithms in noisy environments: Theoretical issues and guidelines for practice', Computer Methods in Applied Mechanics and Engineering, 186(2–4), 239–267, 2000.
[13] I. Rechenberg, 'Evolution Strategy', in Zurada et al., 1994, pp. 147–159.
[14] D. B. Fogel, Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, IEEE Press, Piscataway, New Jersey, 1995.
[15] J. H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, 1975.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-568
Regression for Classical and Nondeterministic Planning Jussi Rintanen NICTA & the Australian National University Canberra, Australia Abstract. Many forms of reasoning about actions and planning can be reduced to regression, the computation of the weakest precondition a state has to satisfy to guarantee the satisfaction of another condition in the successor state. In this work we formalize a general syntactic regression operation for ground PDDL operators, show its correctness, and define a composition operation based on regression. As applications we present a very simple yet powerful algorithm for computing invariants, as well as a generalization of the hn heuristic of Haslum and Geffner to PDDL.
1 Introduction
Although it is well known that the expressivity of PDDL [13] is required for the efficient modeling of many planning problems [14], most planner implementations still restrict themselves to the STRIPS language, in which action preconditions are conjunctions of (positive) literals and all effects are unconditional. Anecdotal evidence suggests that this is due to the difficulty of reasoning about actions more general than STRIPS. PDDL can often be efficiently reduced to STRIPS, but certain classes of operators that have disjunctive preconditions or several conditional effects with logically independent antecedents lead to an exponential number of STRIPS operators. Furthermore, reduction to STRIPS is impossible for many generalizations of classical planning: in the presence of partial observability, splitting one operator into several is in general incorrect, because at execution time it may not be possible to choose which operator to execute. This provides a strong motivation for the generalization of STRIPS-based algorithms and other planning techniques to more general languages such as PDDL. Our work defines a regression operation for ground PDDL operators and demonstrates its applications to planning. Pednault [15] defines regression for his ADL class of operators, but his definition skips over the concrete syntax of what is today known as ADL/PDDL. The key component of the regression operation we define for ground PDDL is Definition 3, which maps a PDDL operator and a state variable to formulae describing the conditions under which the variable becomes true and false. The basis of regression operations is the substitution of a variable by an expression that describes its new value. This was used in the assignment axioms of the Hoare calculus [9] and later by Dijkstra for computing weakest preconditions [4]. The structure of the paper is as follows. Section 2 defines the classical planning problem for ground PDDL, the regression operation and the composition operation, and discusses their formal properties. Section 3 gives applications to invariants and heuristics. Section 4 defines regression for nondeterministic operators, Section 5 discusses related work, and Section 6 concludes the paper.
2 Definitions
Definition 1 Let A be a set of state variables. An operator is a pair ⟨p, e⟩ where p is a propositional formula over A describing the precondition, and e is an effect, defined recursively as follows.
1. a and ¬a for state variables a ∈ A are effects.
2. e₁ ∧ ··· ∧ eₙ is an effect if e₁, . . . , eₙ are effects.
3. c ▷ e is an effect if c is a formula and e is an effect.
The meaning of a conditional effect c ▷ e is that the effect e takes place if the condition c is true.

Definition 2 (Execution) Let ⟨p, e⟩ be an operator over A. Let s : A → {0, 1} be a state. The operator is executable in s if s |= p and the set [e]_s is consistent. This set is recursively defined as follows.
1. [a]_s = {a} and [¬a]_s = {¬a} for a ∈ A.
2. [e₁ ∧ ··· ∧ eₙ]_s = ⋃_{i=1}^{n} [eᵢ]_s.
3. [c ▷ e]_s = [e]_s if s |= c and [c ▷ e]_s = ∅ otherwise.
An operator ⟨p, e⟩ induces a partial function R(⟨p, e⟩) on states: states s and s′ are related by R(⟨p, e⟩) if s |= p and s′ is obtained from s by making the literals in [e]_s true and retaining the truth-values of state variables not occurring in [e]_s. Define exc_o(s) = s′ by s R(o) s′, and exc_{o₁;...;oₙ}(s) = exc_{oₙ}(. . . exc_{o₁}(s) . . .).

The main application of regression is in backward search, in which the basic step, computing a formula that represents the predecessor states (the new subgoal), is regression. The key component of regression for PDDL-style operators is given next.

Definition 3 We recursively define the condition E_l(e) of literal l being made true by an operator with the effect e as follows.
E_l(l) = ⊤
E_l(l′) = ⊥ when l ≠ l′ (for literals l′)
E_l(e₁ ∧ ··· ∧ eₙ) = E_l(e₁) ∨ ··· ∨ E_l(eₙ)
E_l(c ▷ e) = c ∧ E_l(e)
The symbols ⊤ and ⊥ denote true and false, respectively. The case E_l(e₁ ∧ ··· ∧ eₙ) = E_l(e₁) ∨ ··· ∨ E_l(eₙ) is defined as a disjunction because it is sufficient that at least one effect makes l true.

Definition 4 Let A be the set of state variables. We define the condition E_l(o) of operator o = ⟨p, e⟩ being executable so that literal l is made true as p ∧ E_l(e) ∧ ⋀_{a∈A} ¬(E_a(e) ∧ E_¬a(e)). The third conjunct in the formula requires that no state variable is made both true and false.

The formula E_l(e) indicates in which states the literal l is made true by e. It is closely related to [e]_s.
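To make Definitions 2 and 3 concrete, here is a minimal executable sketch (our illustration, not code from the paper). The tuple encoding of formulas and effects is an assumption made for the example; by Lemma 5 below, the set [e]_s can then be read off by evaluating E_l(e) in s.

```python
# Formulas: ("true",), ("false",), ("atom", x), ("not", f), ("and", f, g), ("or", f, g)
# Effects:  ("lit", x, sign) for the literals x / not-x,
#           ("conj", [e1, ..., en]) for e1 and ... and en,
#           ("when", c, e) for a conditional effect c |> e
TRUE, FALSE = ("true",), ("false",)

def E(lit, eff):
    """Definition 3: a formula true in s iff effect eff makes literal lit true."""
    if eff[0] == "lit":
        return TRUE if (eff[1], eff[2]) == lit else FALSE   # E_l(l)=T, E_l(l')=F
    if eff[0] == "conj":                                    # disjunction over conjuncts
        out = FALSE
        for e in eff[1]:
            out = ("or", out, E(lit, e))
        return out
    if eff[0] == "when":                                    # E_l(c |> e) = c and E_l(e)
        return ("and", eff[1], E(lit, eff[2]))
    raise ValueError(eff[0])

def holds(f, s):
    """Evaluate a formula in state s, given as the set of true atoms."""
    k = f[0]
    if k == "true":  return True
    if k == "false": return False
    if k == "atom":  return f[1] in s
    if k == "not":   return not holds(f[1], s)
    if k == "and":   return holds(f[1], s) and holds(f[2], s)
    return holds(f[1], s) or holds(f[2], s)                 # "or"

def active_literals(eff, s, atoms):
    """[e]_s via Lemma 5: literal l is in [e]_s iff s |= E_l(e)."""
    return {(a, v) for a in atoms for v in (True, False) if holds(E((a, v), eff), s)}
```

For example, with eff = ("when", ("atom", "b"), ("lit", "a", True)) and s = {"b"}, active_literals(eff, s, {"a", "b"}) yields {("a", True)}, matching [b ▷ a]_s = {a}.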
Lemma 5 Let A be the set of state variables, s a state on A, l a literal on A, and o an operator with effect e. Then
1. l ∈ [e]_s if and only if s |= E_l(e), and
2. exc_o(s) is defined and l ∈ [e]_s if and only if s |= E_l(o).

The formula E_a(e) ∨ (a ∧ ¬E_¬a(e)) expresses the truth of a ∈ A after the execution of e in terms of truth-values of state variables before the execution: either a becomes true, or a is true before and does not become false.

Lemma 6 Let a ∈ A be a state variable, o = ⟨p, e⟩ ∈ O an operator, and s and s′ = exc_o(s) states. Then s |= E_a(e) ∨ (a ∧ ¬E_¬a(e)) if and only if s′ |= a.

Definition 7 (Regression) Let φ be a propositional formula and o = ⟨p, e⟩ an operator. The regression of φ with respect to o is rg_o(φ) = φ_r ∧ p ∧ χ, where χ = ⋀_{a∈A} ¬(E_a(e) ∧ E_¬a(e)) and φ_r is obtained from φ by replacing every a ∈ A by E_a(e) ∨ (a ∧ ¬E_¬a(e)). Define rg_e(φ) = φ_r ∧ χ and rg_{o₁;...;oₙ}(φ) = rg_{o₁}(··· rg_{oₙ}(φ) ···).

The formula χ corresponds to the requirement that [e]_s is consistent for an operator to be executable. The reason why regression is useful is that it allows computing the predecessor states by simple formula manipulation. Next we formalize this important property of regression.

Theorem 8 Let φ be a formula over A, o an operator over A, and S the set of all states, i.e. valuations of A. Then {s ∈ S | s |= rg_o(φ)} = {s ∈ S | exc_o(s) |= φ}.

Proof: We show that for any state s, s |= rg_o(φ) if and only if exc_o(s) is defined and exc_o(s) |= φ. By definition rg_o(φ) = φ_r ∧ p ∧ χ for o = ⟨p, e⟩, where φ_r is obtained from φ by replacing each a ∈ A by E_a(e) ∨ (a ∧ ¬E_¬a(e)) and χ = ⋀_{a∈A} ¬(E_a(e) ∧ E_¬a(e)).
First we show that s |= p ∧ χ if and only if exc_o(s) is defined.
s |= p ∧ χ
iff s |= p and {a, ¬a} ⊈ [e]_s for all a ∈ A
iff exc_o(s) is defined
The two equivalences are respectively by Lemma 5 and Definition 2.
Then we show that s |= φ_r if and only if exc_o(s) |= φ. This is by structural induction over subformulae φ′ of φ and formulae φ′_r obtained from φ′ by replacing a ∈ A by E_a(e) ∨ (a ∧ ¬E_¬a(e)).
Induction hypothesis: s |= φ′_r if and only if exc_o(s) |= φ′.
Base case 1, φ′ = ⊤: Now φ′_r = ⊤ and both are true in the respective states.
Base case 2, φ′ = ⊥: Now φ′_r = ⊥ and both are false in the respective states.
Base case 3, φ′ = a for some a ∈ A: Now φ′_r = E_a(e) ∨ (a ∧ ¬E_¬a(e)). By Lemma 6, s |= φ′_r if and only if exc_o(s) |= φ′.
Inductive case 1, φ′ = ¬θ: By the induction hypothesis s |= θ_r iff exc_o(s) |= θ. Hence s |= φ′_r iff exc_o(s) |= φ′ by the truth-definition of ¬.
Inductive case 2, φ′ = θ ∨ θ′: By the induction hypothesis s |= θ_r iff exc_o(s) |= θ, and s |= θ′_r iff exc_o(s) |= θ′. Hence s |= φ′_r iff exc_o(s) |= φ′ by the truth-definition of ∨.
Inductive case 3 for φ′ = θ ∧ θ′ goes like the previous case.
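Regression per Definition 7 is then a straightforward formula substitution. The sketch below builds on the tuple encoding and the function E from the previous snippet and is again only our illustration, not the paper's code:

```python
def substitute(f, repl):
    """Replace every ("atom", a) occurring in f by the formula repl[a]."""
    k = f[0]
    if k == "atom":
        return repl.get(f[1], f)
    if k in ("true", "false"):
        return f
    if k == "not":
        return ("not", substitute(f[1], repl))
    return (k, substitute(f[1], repl), substitute(f[2], repl))

def regress(phi, op, atoms):
    """rg_o(phi) = phi_r and p and chi (Definition 7) for o = (p, e)."""
    p, eff = op
    # each atom a is replaced by E_a(e) or (a and not E_not_a(e))
    repl = {a: ("or", E((a, True), eff),
                ("and", ("atom", a), ("not", E((a, False), eff))))
            for a in atoms}
    chi = TRUE
    for a in atoms:  # chi: no variable is made both true and false
        chi = ("and", chi, ("not", ("and", E((a, True), eff), E((a, False), eff))))
    return ("and", ("and", substitute(phi, repl), p), chi)
```

By Theorem 8, holds(regress(phi, op, atoms), s) is true exactly for the states s whose successor under op satisfies phi.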
It may appear that for n consecutive regression steps the size of the formula grows exponentially, as each variable occurrence may be replaced by a bigger formula containing several variables. However, if the formula is represented in circuit form instead of as a tree-like formula, each variable occurs at most once. Hence a sequence of regression steps only leads to a worst-case polynomial increase in size. The circuits can often be simplified to keep them small, and in special cases, like STRIPS operators, there is a constant upper bound on the size of the formulae/circuits.

In addition to being the basis of backward search, regression has many other applications in reasoning about sequences of actions. Central questions concern the relation between a given action and a given sequence of actions: whether they are executable in exactly the same states and whether they have the same effects. This is the basis of computing macro-actions [10] and the elimination of redundant actions [8]. Answering this question requires the composition of a sequence of two or more operators. The composition o₁ ∘ o₂ of o₁ = ⟨p₁, e₁⟩ and o₂ = ⟨p₂, e₂⟩ is an operator that behaves like applying o₁ followed by o₂. For a to be true after o₂ we can regress a with respect to o₂, obtaining E_a(e₂) ∨ (a ∧ ¬E_¬a(e₂)). The condition for this formula to be true after o₁ is obtained by regressing with e₁, leading to

rg_{e₁}(E_a(e₂) ∨ (a ∧ ¬E_¬a(e₂)))
= rg_{e₁}(E_a(e₂)) ∨ (rg_{e₁}(a) ∧ ¬rg_{e₁}(E_¬a(e₂)))
= rg_{e₁}(E_a(e₂)) ∨ ((E_a(e₁) ∨ (a ∧ ¬E_¬a(e₁))) ∧ ¬rg_{e₁}(E_¬a(e₂))).

Since we want to define an effect φ ▷ a of o₁ ∘ o₂ so that a becomes true whenever o₁ followed by o₂ would make it true, the formula φ does not have to represent the case in which a is true already before the execution of o₁ ∘ o₂. Hence we can simplify the above formula to rg_{e₁}(E_a(e₂)) ∨ (E_a(e₁) ∧ ¬rg_{e₁}(E_¬a(e₂))). An analogous formula is needed for making ¬a true. This leads to the following definition.

Definition 9 (Composition) Let o₁ = ⟨p₁, e₁⟩ and o₂ = ⟨p₂, e₂⟩ be two operators on A. Then their composition o₁ ∘ o₂ is defined as

⟨p, ⋀_{a∈A} (((rg_{e₁}(E_a(e₂)) ∨ (E_a(e₁) ∧ ¬rg_{e₁}(E_¬a(e₂)))) ▷ a) ∧ ((rg_{e₁}(E_¬a(e₂)) ∨ (E_¬a(e₁) ∧ ¬rg_{e₁}(E_a(e₂)))) ▷ ¬a))⟩

where p = rg_{o₁}(p₂) ∧ ⋀_{a∈A} ¬(E_a(e₁) ∧ E_¬a(e₁)).
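Definition 9 can be transcribed almost literally on top of the earlier sketches; this is again our illustration under the same assumed encoding. Note that rg_{o₁}(p₂) already conjoins p₁ and the consistency condition for e₁, so the composed precondition needs no extra conjunct in the code.

```python
def rg_e(phi, eff, atoms):
    """rg_e(phi) = phi_r and chi: regression through an effect alone
    (a trivially true precondition folds Definition 7 down to rg_e)."""
    return regress(phi, (TRUE, eff), atoms)

def compose(o1, o2, atoms):
    """Definition 9: an operator behaving like o1 followed by o2."""
    (p1, e1), (p2, e2) = o1, o2
    effects = []
    for a in atoms:
        for v in (True, False):
            # the literal (a, v) is made true iff o1;o2 would make it true
            made = ("or", rg_e(E((a, v), e2), e1, atoms),
                    ("and", E((a, v), e1),
                     ("not", rg_e(E((a, not v), e2), e1, atoms))))
            effects.append(("when", made, ("lit", a, v)))
    # regress(p2, o1, ...) = rg_o1(p2), which already includes p1 and chi(e1)
    return (regress(p2, o1, atoms), ("conj", effects))
```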
Example 10 Consider
o = ⟨⊤, (¬b₀ ▷ b₀) ∧ ((¬b₁ ∧ b₀) ▷ (b₁ ∧ ¬b₀)) ∧ ((¬b₂ ∧ b₁ ∧ b₀) ▷ (b₂ ∧ ¬b₁ ∧ ¬b₀))⟩,
which increments a 3-bit binary number by 1.¹ The composition of o with itself, representing increment by 2, is (after applying the De Morgan laws)
⟨⊤, ((((¬b₂ ∨ ¬b₁) ∧ b₀) ∨ (¬b₀ ∧ b₂ ∧ b₁)) ▷ b₀) ∧
(((¬b₀ ∧ b₁ ∧ ¬b₂) ∨ (((b₂ ∧ b₁) ∨ ¬b₀) ∧ ((¬b₁ ∨ (b₀ ∧ ¬b₂)) ∧ (¬b₀ ∨ b₁)))) ▷ ¬b₀) ∧
(((((b₂ ∧ b₁) ∨ ¬b₀) ∧ ((¬b₁ ∨ (b₀ ∧ ¬b₂)) ∧ (¬b₀ ∨ b₁))) ∨ (b₀ ∧ ¬b₁)) ▷ b₁) ∧
(((¬b₀ ∧ b₁ ∧ ¬b₂) ∨ (b₀ ∧ ¬b₂ ∧ b₁)) ▷ ¬b₁) ∧
(((¬b₀ ∧ b₁ ∧ ¬b₂) ∨ (b₀ ∧ ¬b₂ ∧ b₁)) ▷ b₂)⟩.
Further logical simplification, elimination of redundant conditional effects, and simplification of unnecessary conditions yields
⟨⊤, ((¬b₀ ∧ b₂ ∧ b₁) ▷ b₀) ∧ (¬b₁ ▷ b₁) ∧ ((¬b₂ ∧ b₁) ▷ ¬b₁) ∧ ((¬b₂ ∧ b₁) ▷ b₂)⟩.
Theorem 11 Let o₁ and o₂ be operators and s a state. Then exc_{o₁∘o₂}(s) is defined if and only if exc_{o₁;o₂}(s) is defined, and exc_{o₁∘o₂}(s) = exc_{o₁;o₂}(s).

¹ Notice that 111 is not incremented further.
3 Applications
3.1 Invariants
Very interestingly, the regression operation can be used as the main component of a powerful and intuitive algorithm for computing invariants. An invariant property of a planning problem is satisfied by every state that is reachable from the initial state(s). An equivalent inductive definition states that a property is invariant if the initial states satisfy it and every action preserves it. The main applications of invariants are planning by SAT and CSPs [11], in which invariants help to prune the search space, the validation of domain models, in which invariants give information about dependencies between state variables, inexpensive incomplete tests for unreachability, and the computation of heuristics.

We generalize the inductive algorithm [16] to general operators. The novelty is the extremely simple structure of the algorithm given the generality of the operator definition. The algorithm invariants(A, I, O, n) in Figure 1 computes invariants with at most n literals for operators O and an initial state I over state variables A. The runtimes increase quickly as n is increased, and in practice one can use n = 2 or n = 3. We define lits(l₁ ∨ ··· ∨ lₙ) = {l₁, . . . , lₙ}. The loop on line 5 is repeated until there are no o ∈ O and clauses c ∈ C such that C ∪ {rg_o(¬c)} is satisfiable.

Lemma 12 Let C be a set of clauses, φ a formula, and o an operator. If C ∪ {rg_o(¬φ)} is unsatisfiable, then exc_o(s) |= φ for all states s such that s |= C and o is executable in s.
Proof: Easy corollary of Theorem 8.

1: procedure invariants(A, I, O, n);
2: C := {a ∈ A | I |= a} ∪ {¬a | a ∈ A, I ⊭ a};
3: repeat
4:   C′ := C;
5:   for each o ∈ O and c ∈ C s.t. C ∪ {rg_o(¬c)} ∈ SAT do
6:     C := C\{c};
7:     if |lits(c)| < n then
8:     begin (* Add weaker clauses. *)
9:       C := C ∪ {c ∨ a | a ∈ A} ∪ {c ∨ ¬a | a ∈ A};
10:    end
11:  end do
12: until C = C′;
13: return C;

Figure 1. Algorithm for computing a set of invariant clauses
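Below is a compact sketch of Figure 1 (our transcription, not the authors' code). Clauses are frozensets of literals (atom, sign); the tests sat_with(C, f), deciding whether clause set C plus formula f is satisfiable, and rg_not(o, c), building rg_o(¬c), are assumed to be supplied, e.g. by a SAT solver or by the incomplete unit-resolution test discussed below. For simplicity the sketch tests satisfiability against the snapshot taken at the start of each sweep.

```python
def invariants(atoms, init, ops, n, sat_with, rg_not):
    # line 2: exactly the literals true/false in the initial state
    C = {frozenset([(a, a in init)]) for a in atoms}
    while True:
        C_prev = set(C)                                   # line 4: snapshot
        for o in ops:                                     # line 5
            for c in list(C_prev):
                if c in C and sat_with(C_prev, rg_not(o, c)):
                    C.discard(c)                          # line 6: c not preserved
                    if len(c) < n:                        # lines 7-10: weaken c
                        # skip literals on variables already mentioned in c
                        # (they would only recreate c or build a tautology)
                        C |= {frozenset(c | {(a, v)})
                              for a in atoms for v in (True, False)
                              if (a, True) not in c and (a, False) not in c}
        if C == C_prev:                                   # line 12: fixpoint
            return C                                      # line 13
```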
On lines 7 and 9, when a clause c is not guaranteed to hold, weaker clauses c ∨ l may be, so we replace c by all clauses that are weaker by having one more literal. If these clauses do not hold either, they will be similarly removed and replaced by weaker ones.

Theorem 13 Let A be a set of state variables, I a state, O a set of operators, and n ≥ 1 an integer. Then the procedure invariants(A, I, O, n) returns a set C of clauses with at most n literals so that exc_{o₁;...;oₘ}(I) |= C for any sequence o₁; . . . ; oₘ of operators from O.
Proof: Let C₀ be the value assigned to the variable C on line 2 of the procedure and C₁, C₂, . . . be the values of the variable at the end of each iteration of the outermost repeat loop.
Induction hypothesis: for every {o₁, . . . , oᵢ} ⊆ O and c ∈ Cᵢ, exc_{o₁;...;oᵢ}(I) |= c.
Base case i = 0: The result of executing the empty sequence in I is by definition I itself, and by construction C₀ consists of only formulae that are true in the initial state.
Inductive case i ≥ 1: Take any {o₁, . . . , oᵢ} ⊆ O and c ∈ Cᵢ. We analyze two cases.
1. If c ∈ Cᵢ₋₁, then by the induction hypothesis exc_{o₁;...;oᵢ₋₁}(I) |= c. Since c ∈ Cᵢ it must be that Cᵢ₋₁ ∪ {rg_{oᵢ}(¬c)} is unsatisfiable. Hence by Lemma 12 exc_{o₁;...;oᵢ}(I) |= c.
2. If c ∉ Cᵢ₋₁, it must be because Cᵢ₋₁ ∪ {rg_o(¬c′)} is satisfiable for some o ∈ O and c′ ∈ Cᵢ₋₁ such that c is obtained from c′ by adding some literals to it, and hence c′ |= c. Since c′ ∈ Cᵢ₋₁, by the induction hypothesis exc_{o₁;...;oᵢ₋₁}(I) |= c′. Since c′ |= c, also exc_{o₁;...;oᵢ₋₁}(I) |= c. Since Cᵢ₋₁ ∪ {rg_{oᵢ}(¬c)} is unsatisfiable, exc_{o₁;...;oᵢ}(I) |= c by Lemma 12.
This finishes the induction proof. The iteration of the procedure stops when Cᵢ = Cᵢ₋₁, meaning that the claim of the theorem holds for arbitrarily long sequences o₁; . . . ; oₘ.

To make the algorithm run in polynomial time, the satisfiability and logical consequence tests should be performed by algorithms that approximate these tests in polynomial time. If restricted to STRIPS operators, the inductive invariant computation [16] is obtained by implementing the satisfiability test on C ∪ {rg_o(¬c)} as an incomplete test by unit resolution. More generally, it may be useful to have a stronger tractable satisfiability test. The proof of Theorem 13 remains valid as long as the incomplete satisfiability test does not falsely indicate unsatisfiability for a satisfiable set.

Inference of facts that hold at given time points was first considered in the GraphPlan algorithm of Blum and Furst in the form of mutexes [1]. This planning graph construction, similarly to early algorithms for computing invariants [5, 16], restricts to STRIPS operators. Later works have considered more general classes of operators [6, 12], adopting the inductive definition of invariants first used in [1, 16]. Gerevini and Schubert [6] consider conditional effects but no disjunctions. Lin [12] tries to find invariants for a class of problems by looking at problem instances with a small state space and eliminating candidate invariants if they are falsified by the chosen problem instances.
3.2 Haslum and Geffner's hⁿ
Our invariant algorithm computes a generalization of Haslum and Geffner's hⁿ heuristic [7], which is defined for STRIPS only. An estimate for the distance of any formula φ (precondition or goal) is k if φ is satisfiable with Cₖ but not with Cₖ₋₁ (an incomplete satisfiability test can be used without sacrificing the admissibility of the heuristic). Haslum and Geffner's estimate Gⁿ(V) for the distance of a set V of variables from the initial state can be expressed in terms of our sets Cᵢ when our parameter n equals m: for V = {a₁, . . . , aₘ}, Gⁿ(V) = k iff there is ¬b₁ ∨ ··· ∨ ¬bⱼ ∈ Cₖ₋₁ such that {b₁, . . . , bⱼ} ⊆ V and there is no such clause in Cₖ, and Gⁿ(V) = 0 if ¬a ∉ C₀ for all a ∈ V. Haslum and Geffner define states as subsets of the set A of all state variables. We will call states of this kind h-states to distinguish them from our definition of states. Haslum and Geffner define R(V) as the set of pairs (B, o) such that the operator o reaches an h-state V from an h-state B. This is essentially a simple regression operation for STRIPS. We ignore the operator o (because we do not need it for costs,
unlike Haslum and Geffner, who consider non-unitary costs) and define R(V) simply as all the minimal sets of variables that have to be true for the variables V to be true after executing one of the operators. Now R(V) has the following property.

Lemma 14 For all B ∈ R(V), ⋀_{a∈B} a |= rg_o(⋀_{a∈V} a).

The definition of the heuristic is as follows. For V ⊆ A let
Gⁿ(V) = 0 if V ⊆ I
Gⁿ(V) = min_{B∈R(V)} (1 + Gⁿ(B)) if |V| ≤ n and V ⊈ I
Gⁿ(V) = max_{B⊂V, |B|=n} Gⁿ(B) if |V| > n.

Theorem 15 For a STRIPS problem, let Cᵢ be the sets computed by the algorithm in Figure 1 as explained in the proof of Theorem 13. Let V ⊆ A be a set of variables. If Gⁿ(V) = k for any k ≥ 1, then Cₖ₋₁ ∪ V is unsatisfiable and Cₖ ∪ V is satisfiable, and Gⁿ(V) = 0 iff C₀ ∪ V is satisfiable.
Proof: We give a proof sketch. Induction hypothesis: for every i ≥ 0, for any V ⊆ A,
1. if Gⁿ(V) = i then Cᵢ ∪ V is satisfiable,
2. if Gⁿ(V) = i then Cⱼ ∪ V is unsatisfiable for j ∈ {0, . . . , i − 1}.
Base case i = 0: Let V ⊆ A be any set of variables.
1. If Gⁿ(V) = 0 then V ⊆ C₀. Since C₀ is satisfiable, also C₀ ∪ V is satisfiable.
2. Holds trivially because {0, . . . , i − 1} = ∅.
Inductive case i ≥ 1:
Remark A. If Cᵢ |= ¬a₁ ∨ ··· ∨ ¬aₖ, then ¬b₁ ∨ ··· ∨ ¬bₘ ∈ Cᵢ for some {b₁, . . . , bₘ} ⊆ {a₁, . . . , aₖ}.
1. Assume Gⁿ(V) = i. Then there is an operator o that reaches the h-state V from an h-state B such that Gⁿ(B) = i − 1. Since Gⁿ(B) = i − 1, by the induction hypothesis Cᵢ₋₁ ∪ B is satisfiable. By Lemma 14, ⋀_{a∈B} a |= rg_o(⋀_{a∈V} a). Hence also Cᵢ₋₁ ∪ {rg_o(⋀_{a∈V} a)} is satisfiable. Hence when constructing Cᵢ the algorithm removes all clauses ¬b₁ ∨ ··· ∨ ¬bⱼ such that {b₁, . . . , bⱼ} ⊆ V. Hence by Remark A, Cᵢ ∪ V is satisfiable.
2. Assume Gⁿ(V) = i ≥ 1. Then Gⁿ(B) ≥ i − 1 for all h-states B and operators that reach V from B. If i > 1, then by the induction hypothesis Cᵢ₋₂ ∪ B is unsatisfiable for any such B, and there is a clause ¬b₁ ∨ ··· ∨ ¬bⱼ ∈ Cᵢ₋₂ such that {b₁, . . . , bⱼ} ⊆ B. Hence Cᵢ₋₂ ∪ {rg_o(⋀_{a∈V} a)} is unsatisfiable for every o ∈ O. Therefore the clauses in Cᵢ₋₁ that contradict V are not removed, and hence Cᵢ₋₁ ∪ V is unsatisfiable. If i = 1, then Gⁿ(V) > 0 because V ⊈ I, and hence C₀ ∪ V is unsatisfiable. Hence Cᵢ₋₁ ∪ V is unsatisfiable in both cases.
4 Regression for Non-Deterministic Operators
Based on the regression operation for deterministic operators in Definition 7, regression for a class of nondeterministic operators can be defined. The operators' effects have a nondeterministic choice e₁ | ··· | eₙ between two or more deterministic effects e₁, . . . , eₙ.

Definition 16 Let φ be a formula and o = ⟨p, e₁ | ··· | eₙ⟩ an operator where e₁, . . . , eₙ are deterministic. Define
rg^nd_o(φ) = rg_{⟨p,e₁⟩}(φ) ∧ ··· ∧ rg_{⟨p,eₙ⟩}(φ).
Theorem 17 Let φ be a formula over A, o a nondeterministic operator over A, and S the set of all states over A. Then for all s ∈ S, s |= rg^nd_o(φ) if and only if all possible successor states s′ of s satisfy φ.
Proof: This follows from the fact that each ⟨p, eᵢ⟩ represents one possible outcome the nondeterministic action may have, rg_{⟨p,eᵢ⟩}(φ) represents all the states from which φ is reached by ⟨p, eᵢ⟩, and the intersection of these sets is exactly the set of states from which φ is reached no matter which outcome is the actual one.
Example 18 Let o = ⟨d, b | ¬c⟩. Then
rg^nd_o(b ↔ c) = rg_{⟨d,b⟩}(b ↔ c) ∧ rg_{⟨d,¬c⟩}(b ↔ c) = (d ∧ (⊤ ↔ c)) ∧ (d ∧ (b ↔ ⊥)) ≡ d ∧ c ∧ ¬b.

Applications of the nondeterministic regression operation are similar to those of the deterministic one. Most notably, backward-search algorithms for planning with partial observability can be based on it.
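Continuing the earlier sketches, Definition 16 amounts to conjoining the deterministic regressions through each outcome (our illustration, not the paper's code):

```python
def regress_nd(phi, p, outcomes, atoms):
    """rg^nd for o = (p, e1 | ... | en): the conjunction over all outcomes,
    so the result holds exactly where every successor state satisfies phi."""
    out = TRUE
    for e in outcomes:
        out = ("and", out, regress(phi, (p, e), atoms))
    return out
```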
5 Related Work
Regression is closely related to other forms of manipulation of formulae for computing the images or preimages of sets of states. We discuss some of the most closely related work, including some very recent work, and contrast it with regression.
5.1 Symbolic Pre-Images
General forms of reasoning about actions by the computation of images and preimages, leading to logic-based algorithms for computing sets of reachable states, have many applications; this approach was originally introduced in the context of computer-aided verification as a technique for model-checking [3, 2]. Preimage computation is essentially regression, whereas images are successors of sets of states.

Let A = {a₁, . . . , aₙ} and A′ = {a₁′, . . . , aₙ′}. The variables in A refer to the values of state variables in a state and the variables in A′ to the values in a successor state. Formulae φ over A ∪ A′ can represent arbitrary binary relations on the set of all states. The translation of a deterministic operator o = ⟨p, e⟩ into a formula is
τ_A(o) = p ∧ ⋀_{a∈A} ¬(E_a(e) ∧ E_¬a(e)) ∧ ⋀_{a∈A} (a′ ↔ (E_a(e) ∨ (a ∧ ¬E_¬a(e)))).
The first two conjuncts express the conditions for the executability of the operator (the truth of the precondition and the consistency of the effects) and the third conjunct expresses the new value of each state variable in terms of the old values of the state variables. With respect to an operator o, the successor or predecessor states of a set of states, represented as a formula φ, can be computed by syntactic manipulation of φ and τ_A(o). The basic logical step in this computation is that of existential abstraction, which eliminates the occurrences of one variable in a formula. It is defined by ∃x.φ = φ[⊤/x] ∨ φ[⊥/x], where φ[θ/x] means replacing all occurrences of x in φ by θ.

Definition 19 Let o be an operator and φ a formula. Define
img_o(φ) = (∃A.(φ ∧ τ_A(o)))[A/A′]
preimg_o(φ) = ∃A′.(τ_A(o) ∧ φ[A′/A])
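In the tuple encoding of the earlier sketches, existential abstraction is a two-way substitution (our illustration only); eliminating every primed successor-state variable of τ_A(o) ∧ φ[A′/A] in this way yields preimg_o(φ).

```python
def subst_atom(f, x, val):
    """phi[val/x]: replace every occurrence of atom x in f by formula val."""
    k = f[0]
    if k == "atom":
        return val if f[1] == x else f
    if k in ("true", "false"):
        return f
    if k == "not":
        return ("not", subst_atom(f[1], x, val))
    return (k, subst_atom(f[1], x, val), subst_atom(f[2], x, val))

def exists(x, f):
    """Existential abstraction of one variable: phi[T/x] or phi[F/x]."""
    return ("or", subst_atom(f, x, TRUE), subst_atom(f, x, FALSE))

def exists_all(variables, f):
    """Eliminate a set of variables, e.g. all primed successor variables."""
    for x in variables:
        f = exists(x, f)
    return f
```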
In Definition 19, φ[A′/A] denotes the substitution of each a ∈ A in φ by the corresponding variable a′ ∈ A′. Not surprisingly, there is a close connection between preimages and regression.
Theorem 20 rg_o(φ) ≡ preimg_o(φ).
Example 21 Let A = {a, b, c}. Let o = ⟨c, a ∧ (a ▷ b)⟩. Then rg_o(a ∧ b) = c ∧ (⊤ ∧ (b ∨ a)) ≡ c ∧ (b ∨ a). The formula corresponding to o is
τ_A(o) = c ∧ a′ ∧ ((b ∨ a) ↔ b′) ∧ (c ↔ c′).
The preimage of a ∧ b with respect to o is represented by
∃a′b′c′.(τ_A(o) ∧ (a′ ∧ b′))
≡ ∃a′b′c′.(c ∧ a′ ∧ ((b ∨ a) ↔ b′) ∧ (c ↔ c′) ∧ a′ ∧ b′)
≡ ∃a′b′c′.(a′ ∧ b′ ∧ c ∧ (b ∨ a) ∧ c′)
≡ ∃b′c′.(b′ ∧ c ∧ (b ∨ a) ∧ c′)
≡ ∃c′.(c ∧ (b ∨ a) ∧ c′)
≡ c ∧ (b ∨ a)

This connection between preimages and regression is best understood based on the equivalence a′ ↔ (E_a(e) ∨ (a ∧ ¬E_¬a(e))) in the definition of τ_A(o): it corresponds to the substitution in the definition of regression. The advantage of regression is that no existential abstraction is needed; the disadvantage is that it is restricted to operators/relations that can be represented as a conjunction of equivalences ⋀_{a∈A} (a′ ↔ φ_a).
5.2 C-Filter of Shahaf and Amir
Shahaf and Amir [17] present C-Filtering for computing (an implicit representation of) the image of a set of states with respect to a sequence of actions. Shahaf and Amir hint at a connection between C-Filtering and regression but do not clarify it. The C-Filter is simply the use of regression to test facts about a belief state B reached from an initial belief state I by a sequence of actions o₁, . . . , oₙ. Instead of explicitly constructing B by image computation, facts relating to B are queried by regressing them to queries about the initial state. For example, to test whether B ∩ B′ ≠ ∅ for some belief state B′ expressed as a formula φ, one tests the non-emptiness of the intersection by a satisfiability test on I ∧ rg_{o₁;...;oₙ}(φ). Shahaf and Amir claim as the novelty of C-Filtering the incremental construction of the substitutions rg_{o₁;...;oₙ}(a)/a as the action sequence o₁, . . . , oₙ, . . . progresses, as well as the representation of the required formulae as Boolean circuits.
6 Conclusions
We have defined regression and composition operations for PDDL operators and a regression operation for nondeterministic actions. We have also discussed applications of general regression operations in connection with macro-actions, the elimination of redundant operators, invariants and heuristics. In particular, we gave an algorithm for computing invariants for a general definition of actions that includes disjunctive preconditions and conditional effects. The algorithm is powerful yet conceptually extremely simple, and its power can be traded for efficiency by controlling the accuracy and asymptotic runtime of approximate satisfiability tests. The algorithm also yields a generalization of the hⁿ heuristic [7].

Acknowledgements. The research was funded by the Australian Government's Department of Broadband, Communications and the Digital Economy and by the Australian Research Council through NICTA.
REFERENCES
[1] Avrim L. Blum and Merrick L. Furst, 'Fast planning through planning graph analysis', Artificial Intelligence, 90(1-2), 281–300, (1997). [2] J. R. Burch, E. M. Clarke, D. E. Long, K. L. McMillan, and D. L. Dill, 'Symbolic model checking for sequential circuit verification', IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(4), 401–424, (1994). [3] Olivier Coudert, Christian Berthet, and Jean Christophe Madre, 'Verification of synchronous sequential machines based on symbolic execution', in Automatic Verification Methods for Finite State Systems, International Workshop, Grenoble, France, June 12-14, 1989, Proceedings, ed., Joseph Sifakis, volume 407 of Lecture Notes in Computer Science, pp. 365–373. Springer-Verlag, (1990). [4] Edsger W. Dijkstra, 'Guarded commands, nondeterminacy and formal derivation of programs', Communications of the ACM, 18(8), 453–457, (1975). [5] Alfonso Gerevini and Lenhart Schubert, 'Inferring state constraints for domain-independent planning', in Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98) and the 10th Conference on Innovative Applications of Artificial Intelligence (IAAI-98), pp. 905–912. AAAI Press, (1998). [6] Alfonso Gerevini and Lenhart K. Schubert, 'Discovering state constraints in DISCOPLAN: Some new results', in Proceedings of the 17th National Conference on Artificial Intelligence (AAAI-2000) and the 12th Conference on Innovative Applications of Artificial Intelligence (IAAI-2000), pp. 761–767. AAAI Press, (2000). [7] Patrik Haslum and Héctor Geffner, 'Admissible heuristics for optimal planning', in Proceedings of the Fifth International Conference on Artificial Intelligence Planning Systems, eds., Steve Chien, Subbarao Kambhampati, and Craig A. Knoblock, pp. 140–149. AAAI Press, (2000). [8] Patrik Haslum and Peter Jonsson, 'Planning with reduced operator sets', in Proceedings of the Fifth International Conference on Artificial Intelligence Planning Systems, eds., Steve Chien, Subbarao Kambhampati, and Craig A. Knoblock, pp. 150–158. AAAI Press, (2000). [9] C. A. R. Hoare, 'An axiomatic basis for computer programming', Communications of the ACM, 12(10), 576–580, (1969). [10] Glenn A. Iba, 'A heuristic approach to the discovery of macro-operators', Machine Learning, 3(4), 285–317, (1989). [11] Henry Kautz and Bart Selman, 'Planning as satisfiability', in Proceedings of the 10th European Conference on Artificial Intelligence, ed., Bernd Neumann, pp. 359–363. John Wiley & Sons, (1992). [12] Fangzhen Lin, 'Discovering state invariants', in Principles of Knowledge Representation and Reasoning: Proceedings of the Ninth International Conference (KR 2004), eds., Didier Dubois, Christopher A. Welty, and Mary-Anne Williams, pp. 536–544. AAAI Press, (2004). [13] Drew McDermott, 'The Planning Domain Definition Language', Technical Report CVC TR-98-003/DCS TR-1165, Yale Center for Computational Vision and Control, Yale University, (October 1998). [14] Bernhard Nebel, 'On the compilability and expressive power of propositional planning formalisms', Journal of Artificial Intelligence Research, 12, 271–315, (2000). [15] Edwin P. D. Pednault, 'ADL and the state-transition model of action', Journal of Logic and Computation, 4(5), 467–512, (1994). [16] Jussi Rintanen, 'A planning algorithm not based on directional search', in Principles of Knowledge Representation and Reasoning: Proceedings of the Sixth International Conference (KR '98), eds., A. G. Cohn, L. K. Schubert, and S. C. Shapiro, pp. 617–624. Morgan Kaufmann Publishers, (June 1998). [17] Dafna Shahaf and Eyal Amir, 'Logical circuit filtering', in Proceedings of the 20th International Joint Conference on Artificial Intelligence, ed., Manuela Veloso, pp. 2611–2618. AAAI Press, (2007).
Combining Domain-Independent Planning and HTN Planning: The Duet Planner Alfonso Gerevini† and Ugur Kuter‡ and Dana Nau‡ and Alessandro Saetti† and Nathaniel Waisbrot‡∗ Abstract. Despite the recent advances in planning for classical domains, the question of how to use domain knowledge in planning is yet to be completely and clearly answered. Some of the existing planners use domain-independent search heuristics, and others depend on intensively-engineered domain-specific knowledge to guide the planning process. In this paper, we describe an approach that combines ideas from both of the above schools of thought. We present Duet, our planning system that incorporates the ability to use hierarchical domain knowledge in the form of Hierarchical Task Networks (HTNs), as in SHOP2 [14], and to use domain-independent local search techniques, as in LPG [8]. In our experiments, Duet was able to solve much larger problems than LPG could solve, with only minimal domain knowledge encoded in HTNs (much less domain knowledge than SHOP2 needed to solve those problems by itself).
1 Introduction
Most classical planners fall into one of two categories: planners that use domain-independent knowledge, i.e., that work in any classical planning domain, and planners that can exploit domain-specific knowledge. It has been shown, both theoretically and experimentally, that each approach has its own advantages and disadvantages:

• A planner that can exploit domain-specific knowledge in order to guide its planning can solve much larger planning problems, and can generally solve them much faster, than planners that don't use such knowledge. The biggest downside of such planning systems, however, is that they require an expert human to give them extensive knowledge about how to solve planning problems in the planning domain at hand. Usually this knowledge is expressed using either temporal logic (e.g., TLPlan [1] and TALplanner [13]) or task decomposition (e.g., SHOP2 [14], SIPE-2 [17], and O-Plan [6]), and might not be easy for the general user to specify.

• A planner that uses domain-independent heuristic information (e.g., FF [11], AltAlt [15], SGPlan [5], HSP [3], Fast Downward [10], and LPG [8]) usually does not need expert-provided domain knowledge, since the planner itself computes a heuristic for each domain. This makes the domain formalization simpler and the planner easier to use; but the planner may often perform much worse than a planner that exploits specific domain knowledge.

† Dipartimento di Elettronica per l'Automazione, Università degli Studi di Brescia, Via Branze 38, I-25123 Brescia, Italy.
‡ Department of Computer Science and Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland 20742, USA.
∗ Corresponding author (email: waisbrot@cs.umd.edu)

In this paper, we describe Duet, a new planning system that combines the advantages of using domain-independent heuristics
and domain-specific knowledge, while avoiding their drawbacks.¹ To accomplish this, Duet incorporates adaptations of two well-known planners: LPG, which uses domain-independent heuristics in a stochastic local search engine [8], and SHOP2, which uses domain-specific Hierarchical Task Networks (HTNs) to organize its search space [14]. We extended the SHOP2 and LPG formalisms to allow the planners to communicate in Duet by generating subgoals of a planning problem for each other. Duet organizes the planning process by passing these subgoals to the individual planners until no subgoals are left to achieve. We present our experiments with Duet on a new planning domain, called Museums. The Museums domain was inspired by the real-world operations of acquiring and relocating art objects among a set of museums around the world. The domain combines aspects of the well-known Logistics and Tower of Hanoi (ToH) problems. The objective is to use trucks to move various art objects from museums to other museums. When a truck comes to a museum to load or unload objects, there are three places to put the objects: the truck, and two pallets at the museum's loading dock. An object's placement depends on its fragility: fragile art objects must be placed on less-fragile ones. Thus, loading and unloading correspond to solving ToH problems. The rationale for using the Museums domain in the evaluation of Duet was our observation that it is challenging for state-of-the-art planners and that it includes two kinds of subproblems: domain-specific knowledge isn't needed to plan the truck movements, but is needed to plan the loading and unloading operations, since the ToH problem is hard for many domain-independent planners, including LPG. In our experiments, we varied the amount of HTN-based domain-specific knowledge available to Duet and compared its performance with LPG's and SHOP2's performance as stand-alone planners. Even with just a small amount of domain-specific knowledge (e.g., "choose the least-fragile object and move it to the target museum"), Duet usually generated solutions faster than LPG. With more domain-specific problem-solving knowledge (e.g., how to properly stack art objects on top of each other), Duet ran faster and solved more problems than both LPG and SHOP2. Although SHOP2's performance could have been improved, this would have required much more time for hand-crafting its knowledge base.
2 Preliminaries
Our definitions of classical states, planning operators, planning domains and problems are based on those in [9]. Below we'll summarize the definitions at the semantic level; for syntactic details see [9].

¹ In that sense, it is closely related to the recently-proposed "Model-Lite Planning" approach [12, 16], which aims to develop techniques that do not require intensive domain knowledge but still are practical.
In addition to classical planning operators and actions (i.e., ground instances of planning operators), we define an abstract planning operator as a triple (t, Pre, Eff), where Pre and Eff are the preconditions and the effects of the abstract operator (described as logical formulas over literals), and t is an expression (name, arg₁, . . . , argₙ), where name is the abstract operator's name and arg₁, . . . , argₙ are the arguments (variables and/or constant symbols). An abstract action is a ground instance of an abstract planning operator. A plan is a sequence of actions that are either classical or abstract. A planning domain is a triple Σ = (S, A, γ) where S and A are the sets of states and actions (classical and abstract), and γ : S × A → S is the state-transition function, with γ(s, a) defined iff a is applicable to s. Γ(s, π) = γ(γ(. . . γ(s, a₁), a₂), . . . , aₙ) is the state generated by applying the plan π = a₁, . . . , aₙ in the state s. If some action aᵢ is inapplicable in Γ(s, a₁, . . . , aᵢ₋₁), then π is inapplicable in s and Γ(s, π) is not defined. A planning problem is a pair P = (s₀, g) in the planning domain Σ = (S, A, γ), where s₀ ∈ S is the initial state and g is the goal, represented as a conjunction of logical atoms (i.e., g represents a set of goal states G ⊆ S). A solution for a classical planning problem P is a plan π = a₁, . . . , aₖ such that each aᵢ in π is a classical action and the state s = Γ(s₀, π) satisfies the goals g. (A minimal executable sketch of this state-transition semantics is given at the end of this section.)

LPG's plan representation is based on linear action graphs [8], which are variants of the well-known planning graphs [2]. A linear action graph [8] is a directed acyclic leveled graph alternating between a proposition level, i.e., a set of domain propositions, and an action level, i.e., one ground domain action and a set of special dummy actions, called "no-ops", each of which propagates a proposition of the previous level to the next one. If an action is in the graph, then its preconditions and positive effects appear in the corresponding proposition levels of the graph. Moreover, a pair of propositions or actions can be marked as mutually exclusive at every graph level where the pair appears (for a detailed description, see [8]). While in the original definition action levels contain only classical actions [8], here we use an extended representation where an action level contains either a classical action or an abstract action. An (extended) action graph can have two types of flaws: unsatisfied action preconditions and abstract actions. LPG uses a stochastic local search process that iteratively modifies the current graph until there is no flaw or a certain search limit is exceeded [8]. LPG deals with an unsatisfied precondition by inserting into or removing from the graph a new or existing action, respectively. We modified LPG in order to recognize abstract actions as flaws resolvable by running an HTN planner, as described below. An action graph with no flaws represents a solution for the input planning problem.

An HTN planner formulates a plan by decomposing tasks (i.e., symbolic representations of problem-solving activities to be performed) into smaller and smaller subtasks until tasks are reached that can be performed directly. An HTN is a pair (T, C), where T is a set of tasks and C is a set of partial ordering constraints on the tasks. The empty HTN is the pair (T, C) such that T = ∅ and C = ∅. An HTN planner uses an HTN domain description that contains three kinds of knowledge artifacts: axioms, operators, and methods.
The axioms are similar to logical Horn-clause statements; the planner uses them to infer conditions about the current state. The operators are like the planning operators used in any classical planner. The names of these operators are designated as primitive tasks. Each method in an HTN domain description is a prescription for how to accomplish a nonprimitive task by decomposing it into subtasks (which may be either primitive or nonprimitive tasks). A method consists of (1) the task that the method can be used to accomplish, (2) the set of preconditions which must be satisfied for the method to be applicable, and (3) the subtasks to accomplish, along with some constraints over those tasks that must be satisfied. For example, consider the task of moving a collection of items from one location to another. One method might be to move them by truck. For such a method, the preconditions might be that the truck is in working order and is present at the first location. The subtasks might be to open the door, put the items onto the truck, drive the truck to the other location, and unload the items.

We assume that each abstract action in a planning domain corresponds to a nonprimitive task, which must be decomposed into smaller tasks using HTN methods (if available).² In addition to primitive and nonprimitive tasks, we also define a class of special-purpose tasks called achieve-goals tasks. An achieve-goals task specifies a set of goals, as in a classical planning problem, that need to be achieved in the world before the task-decomposition process can progress during HTN planning. An HTN planner would not have any methods to decompose an achieve-goals task t. Instead, an achieve-goals task triggers the invocation of a classical planner to generate a plan π such that the state Γ(s, π) satisfies the specified goals of t, which we denote as GoalsOf(s, t), given the input set of actions. The achieve-goals task is an important component of our planning system Duet, which incorporates LPG and SHOP2 in a unified planning process, as described in the next section.
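The state-transition semantics of Section 2 is defined only semantically above; as a minimal illustration (our sketch, not Duet's code), here it is with classical actions simplified to STRIPS-style (precondition, add, delete) triples rather than the full operator language of [9]:

```python
def gamma(s, a):
    """One step of the state-transition function: None when a is inapplicable in s."""
    pre, add, delete = a                 # a classical action as sets of atoms
    if not pre <= s:                     # precondition not satisfied
        return None
    return (s - delete) | add

def Gamma(s, plan):
    """Gamma(s, pi): fold gamma over the plan; undefined (None) if any step fails."""
    for a in plan:
        s = gamma(s, a)
        if s is None:
            return None
    return s

# Toy usage with a single hypothetical action:
drive = ({"at-home"}, {"at-work"}, {"at-home"})
print(Gamma({"at-home"}, [drive]))       # {'at-work'}
```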
3 Duet = LPG + SHOP2
This section describes our planning procedure, called Duet, which incorporates local-search planning as in LPG [8] and HTN planning as in SHOP2 [14]. The LPG and SHOP2 planning procedures that we use in Duet are slightly modified versions of the originals reported in [8] and [14], respectively. Below, we first describe the Duet planning procedure, and subsequently we briefly describe our modifications to LPG and SHOP2 to adapt them to work within Duet.

Figure 1 shows a high-level description of the Duet planning procedure. Duet's input includes the initial state s₀ and the goal condition g of a classical planning problem, as well as a possibly empty initial task network specified for achieving the goals g and a possibly empty set M of HTN methods. Duet first initializes the current state s to s₀ and the current partial plan to the empty plan. At Line 1, n is a counter for the number of search steps performed by the planner; that is, n is the total number of graph modifications performed by LPG to fix flaws plus the number of task decompositions done by SHOP2. Duet also uses a tabu list, τ, that keeps the abstract actions that cannot be decomposed into smaller tasks given the HTN methods in M and that therefore must be avoided during local search in LPG. The tabu list τ is initialized to the empty list at Line 1.

Duet successively generates and resolves subgoals for the input planning problem until it generates a solution plan. A subgoal of the planning problem is either a goal to achieve using domain-independent search heuristics via LPG, or an abstract action (i.e., a task) that needs to be decomposed into smaller tasks via SHOP2. Duet performs this iterative procedure for a maximum predefined number of search steps. If a solution cannot be found during these iterations, the procedure returns failure.

² Note that a macro-action [4, 7] is a special case of an abstract action: a macro-action decomposes directly into a sequence of primitive actions, whereas an abstract action may be decomposed into a combination of both primitive actions and other nonprimitive tasks that need to be decomposed further. This allows us, for example, to write HTNs that perform the standard recursive decomposition of a Towers of Hanoi task in the Museum domain.
Procedure Duet(s₀, g, w₀, M)
Input: the problem initial state s₀, the set of problem goals g, the initial task network w₀ and a set M of HTN methods.
Output: a solution plan or failure.
1. n ← 0; s ← s₀; w ← w₀; π ← τ ← gSHOP2 ← gLPG ← ∅;
2. while n does not exceed a predefined number of steps
3.   if π is a solution (all subgoals satisfied) then return π;
4.   else if there exists an abstract action gSHOP2 then
5.     ⟨π′, s′, gLPG, w′, n⟩ ← SHOP2(s, gSHOP2, πnil, n, M);
6.     if π′ = failure
7.     then τ ← τ ∪ {⟨gSHOP2, n⟩}; w ← (w − gSHOP2);
8.     else π ← π + π′; w ← w′ + (w − gSHOP2); s ← s′;
9.     gSHOP2 ← ∅;
10.  else if there exists an achieve-goals task gLPG then
11.    ⟨π′, gSHOP2, n⟩ ← LPG(s, gLPG, πnil, n, τ);
12.    π ← π + prefix of π′ up to the first abstract action;
13.    w ← the rest of π′ + (w − gLPG);
14.    s ← Γ(s₀, π);
15.    gLPG ← ∅;
16.  else if w ≠ ∅ then
17.    ⟨π, s, gLPG, w, n⟩ ← SHOP2(s, w, π, n, M);
18.    if π = failure then return failure;
19.  else
20.    ⟨π′, gSHOP2, n⟩ ← LPG(s₀, g, π, n, τ);
21.    π ← prefix of π′ up to the first abstract action;
22.    w ← the rest of π′;
23.    s ← Γ(s₀, π);
24. return failure.

Figure 1. Pseudocode of the Duet planning algorithm. "+" is the operator concatenating two plans, πnil is the empty plan, s is the world state, w is the task network, τ is the tabu list, gLPG represents the goals specified in an achieve-goals task, and gSHOP2 is an abstract action.
If Duet returns failure, we restart it from the beginning with the same input for a predefined number of times, in order to search for possible solutions again. The rationale behind these restarts is that, since LPG, and therefore Duet, is a randomized search algorithm, different restarts of the planner may produce different search paths in the search space, so the planner may generate a solution plan.

At each iteration of the while loop (Lines 2–23), Duet first checks whether the current partial plan π is a solution for the input planning problem. If so, Duet returns this plan and terminates successfully. Otherwise, if there is an abstract action (or an HTN of abstract actions) to be accomplished, Duet invokes SHOP2 on this HTN, which is called gSHOP2 in Line 5. Using the input HTN methods M, SHOP2 attempts to generate a solution plan for the HTN gSHOP2. Figure 2 shows the modified version of SHOP2 [14] that Duet uses. The planning procedure is the same as in [14], except for Lines 10–12. In Line 10, if the current task to be decomposed is an achieve-goals task, then our adaptation of SHOP2 returns the goals GoalsOf(s, t) in the current state s. As described above, Duet then invokes LPG on these goals to achieve them and updates the current partial plan. When SHOP2 returns, there are three cases:

• SHOP2 generates a plan π′ for gSHOP2 successfully using the methods in M. In this case, the returned successor HTN w′ is the empty HTN and there are no successor goals for LPG (i.e., gLPG is the empty set in Line 5).

• SHOP2 generates an achieve-goals task tLPG for Duet to invoke LPG in the next iteration. In this case, π′ is the partial plan that SHOP2 generated up to the task tLPG in the decomposition process,
Procedure SHOP2(s, w, π, n, M)
Input: a world state s, a task network w, a (partial) plan π, a number of search steps n and a set M of HTN methods.
Output: a plan, its final state, a task that has no method, a task network and a number of search steps.
1. while w is not empty do
2.   nondeterministically choose a task t from w that has no predecessors and remove it;
3.   n ← n + 1;
4.   if t is primitive then
5.     π ← π + t; s ← γ(s, t);
6.   else if t is nonprimitive then
7.     choose an applicable method m for t (or if there is
8.       no such method then return failure);
9.     add m's decomposition of t to the front of w;
10.  else if t is an achieve-goals task then
11.    return ⟨π, s, GoalsOf(s, t), w, n⟩;
12. return ⟨π, s, nil, nil, n⟩;

Procedure LPG(s, g, π, ninit, τ)
Input: an initial world state s, a set of goals g, a (partial) plan π, a number of search steps ninit and a tabu list τ.
Output: a plan, the first abstract action in the plan and a number of search steps.
1. A ← an action graph with the first fact level defined by s, the action levels by π and the last fact level by g;
2. for n = ninit to a predefined number of steps do
3.   π ← the plan represented by A;
4.   if A is a solution graph then return ⟨π, nil, n⟩;
5.   σ ← the flaw at the lowest level of A;
6.   if σ is an abstract action then return ⟨π, σ, n⟩;
7.   else
8.     N ← set of actions that are not in τ and whose insertion into/removal from A fixes σ;
9.     select an element from N and modify A with it;
10. return ⟨nil, nil, n⟩.

Figure 2. Pseudocode of Duet's modified SHOP2 and LPG procedures.
s′ is the state in which LPG must be called, gLPG is the goals for LPG specified by tLPG, w′ is the HTN that still needs to be accomplished once Duet generates a plan that achieves the goals gLPG, and n is the updated number of search steps.

• SHOP2 returns failure. SHOP2's failure on gSHOP2 means that there are no possible ways to decompose gSHOP2 given the current domain knowledge and the input initial state, and therefore LPG should not consider the particular abstract action gSHOP2 in its later planning invocations. In this case, Duet inserts gSHOP2, along with the number of search steps generated so far, into the tabu list, and removes gSHOP2 from the current task network (Line 7).

If SHOP2 returns a plan π′, Duet inserts it into the current plan π and updates the HTN w that still needs to be accomplished. Note that at Line 8, if SHOP2 could successfully accomplish gSHOP2 without returning any goals to LPG, the returned HTN w′ would be the empty HTN, and there would be no update to the HTN w. If there is a goal gLPG for LPG (see Lines 10–15), Duet invokes LPG with this goal, the current state, the empty plan, and the current values of the tabu list and number of search steps. The modified LPG procedure (Figure 2) is essentially the same stochastic local search procedure of [8], with the following differences: the action graph is initialized using a (possibly non-empty) plan; the initial number of
search steps is an input number instead of zero; the action graphs can contain a new type of flaw (an abstract action), which is handled by just returning it to Duet together with the current plan and number of search steps (Line 6); and the search neighborhood is restricted to forbid the insertion of any abstract action in the input tabu list (Line 8). Note that at Line 5 the unsupported preconditions of an abstract action are selected before the action, and that, as in [8], the neighborhood selection at Line 9 is randomized and uses a heuristic function. There are three possible cases when LPG terminates:

• LPG tries to fix a flaw corresponding to an abstract action during its search and needs SHOP2 to decompose this abstract action into smaller tasks. In this case, LPG returns the current partial plan it has (π′), the abstract action for SHOP2 (gSHOP2), and the updated number n of performed search steps.

• LPG generates a solution plan with no abstract actions for the input goals gLPG. In this case, LPG's gSHOP2 output is empty.

• LPG fails because the search increases the input number of search steps n above the predefined maximum. In this case, Duet will return failure and can be restarted.

After the run of LPG, Duet updates the current plan π, the current task network w and the current world state s (Lines 12–14). If there are no immediate goals for SHOP2 or LPG (i.e., if both gSHOP2 = ∅ and gLPG = ∅), then Duet checks whether there are more tasks that need to be decomposed by SHOP2 (Lines 16–18) or any remaining flaws in the current plan that need to be fixed by LPG (Lines 19–23). In the former case, Duet invokes SHOP2 to plan for the HTN w that still needs to be accomplished. Note that, in this case, Duet gives SHOP2 the current partial plan as input (instead of the empty plan as in the above case). This is because if SHOP2 generates a plan for the input abstract action, then that plan must be a part of the solution. If the task network becomes empty and the current plan contains a flaw, Duet invokes LPG in its next iteration (see Line 20) with the initial planning problem, except that this time LPG starts with the current partial plan and attempts to generate a solution based on it, rather than starting from the empty plan.

The following theorem establishes Duet's soundness (we omit the proof due to space limitations).

Theorem 1 Let P = (s₀, g) be a classical planning problem, w₀ be a (possibly empty) HTN to accomplish the goals g, and M be a set of HTN methods. Suppose Duet(s₀, g, w₀, M) returns a plan π. Then π is a solution for the planning problem P.

Duet is not a complete planner (i.e., it may not find a solution to an input planning problem although one exists) for two reasons: (1) LPG, as a stochastic local search procedure, may return failure without finding any solution given the number of restarts and the bound parameter on the number of search steps; and (2) the HTNs provided as input for SHOP2 may not be complete, and even if they are, they may prune the solution away.
4 Experimental Evaluation
We compared LPG and SHOP2 with two versions of Duet, one supplied with extremely sparse domain knowledge, and the other with more detailed knowledge of one facet of the Museums domain. The planning operators for LPG in this domain are DRIVE-TRUCK, MOVE-TO-TRUCK, MOVE-FROM-TRUCK, and MOVE. The three move operators define a ToH subdomain where the pegs are the truck area and the two museum pallets.
[Figure 3: two plots over the number of objects (4–9) for lpg-solo, duet-simple, duet-specialist, and shop2-solo: average running time in seconds, and the number of problems left unsolved.]

Figure 3. In the first graph, each data point is the average running time on 50 randomly generated problems. The second graph shows how many times the planners failed to return plans within our 500-second deadline; each such failure was scored at 500 seconds in the first graph.
Duet with sparse domain knowledge, denoted DuetSimple, used SHOP2 to choose the order in which to relocate the objects, and LPG to plan how to move each object. Duet with rich domain knowledge of object-stacking, denoted DuetSpecialist, provided LPG with abstract actions LOAD and UNLOAD in place of the three primitive move operators. In this version, LPG controls the trucks and chooses which objects to pick up and drop off, where each pick-up/drop-off request is an abstract action handled by SHOP2.

Table 1. Sizes of the human-generated Museum domain descriptions for LPG, DuetSimple, and DuetSpecialist, and a SHOP2 HTN.

Planner          Total lines   Total characters   Total no. of tokens
LPG                       34               1658                   426
DuetSimple                70               2893                   694
DuetSpecialist           157               6573                  1534
SHOP2                    238               9549                  2254
To measure the complexity of the domain knowledge needed by the various planners, Table 1 gives several different measures of the sizes of the domain descriptions used by them. LPG requires only a description of the operators, while SHOP2 requires the operators and HTN methods to solve the Museum planning problems. DuetSimple and DuetSpecialist use a partial set of HTN methods: these methods can be used to generate plans for parts of a Museums planning problem, but they cannot solve the problem entirely.

There are three parameters affecting problem difficulty in the Museums domain: the number of museums, the connectivity of the museums, and the number of art objects to transport. We performed experiments for each case in which we fixed two of the parameters above and varied the other. In the cases where we varied the first two parameters, we did not observe a significant change in the relative performance of the planners, since these two cases emphasized the truck-movement subproblems in the Museums domain and all of our planners were able to solve truck-movement subproblems easily. All of our operator and HTN descriptions and other input files regarding our experimental setup are available online.³

³ See http://www.cs.umd.edu/~waisbrot/Duet

Figure 3 shows the results of our experiments with a varying number of objects, where we fixed the number of museums at 3 and generated complete graphs of museums. Each data point in this figure is the average of 50 randomly-generated planning problems. We set a time limit of 500 seconds for the planners, and we scored those runs that did not return a plan within the limit at 500 seconds. With increasing numbers of objects, LPG's local search frequently became trapped in local minima and was unable to produce any
http://www.cs.umd.edu/∼waisbrot/Duet
A. Gerevini et al. / Combining Domain-Independent Planning and HTN Planning: The Duet Planner
plan within the given CPU-time limit. For example, LPG began to struggle when the number of objects at any one museum went beyond 4, and out of the 50 9-object problems, it failed on 37. DuetSimple outperformed LPG slightly when they both solved a problem, but generally failed on most of the same problems as LPG, for the same reasons. One advantage of DuetSimple over LPG was an increase in reliability. Some of the plans produced by LPG included repetition of actions: picking an object up and then putting it back in the same place multiple times. LPG can be configured to do more planning iterations and produce an improved plan, but DuetSimple was able to produce a more directed plan in a single pass, saving time. DuetSpecialist dramatically outperformed both DuetSimple and LPG because it used domain-specific HTNs to solve the parts of the problem that involve object-stacking. While the object-stacking HTNs required human authoring, we did not give DuetSpecialist any HTNs for navigating between museums, choosing when objects should be picked up, or choosing where to place objects. DuetSpecialist solved all of the problems, and in most cases solved them faster than LPG. To run SHOP2 by itself, we needed to give it HTN methods both for stacking art objects and navigating the truck. It suffered from two major failings, due to the inexperience of the domain writer. First, the HTN methods focused on moving one art object at a time, rather than loading multiple objects onto the truck before attempting delivery. Second, the HTN methods were deeply recursive, so large problems caused the stack to overflow. Although the SHOP2 methods could be improved with additional time and experience, Duet produces good results with less effort on the part of the domain writer. One exception to Duet’s performance was that LPG outperformed it in the easiest problems. This is because of Duet’s loose coupling between SHOP2 and LPG, which made Duet easy to implement but made the communication from SHOP2 to LPG very expensive. Duet and SHOP2 are both written in LISP, so calls to SHOP2 to decompose a task were inexpensive, but calls to LPG, which is written in C, required spawning and later destroying a separate shell and process. Because of this expense, the easiest problems were completely solved by LPG before Duet was able to complete the necessary calls between planners. If both planners were packaged as libraries, this inter-planner communication cost would be significantly decreased.
5 Conclusions
We have described Duet, a new planner that incorporates adaptations of two well-known planners, LPG [8] and SHOP2 [14]. Duet combines LPG's domain-independent local search techniques with hierarchical domain knowledge in the form of SHOP2's Hierarchical Task Networks (HTNs). Duet starts with a planning problem consisting of an initial state, a goal condition, and a possibly empty set of tasks. During planning, Duet uses SHOP2 to decompose tasks into smaller subtasks, and LPG to satisfy goal conditions.
Our experiments with Duet in the Museums domain showed that even when Duet had only a small amount of domain-specific knowledge (e.g., "choose the least-fragile object and move it to the target museum first"), it still solved planning problems faster, on average, than LPG. With more problem-solving knowledge (e.g., how to properly manipulate stacks of art objects), Duet outperformed both LPG and SHOP2, in terms of both speed and the number of successfully solved problems. Getting SHOP2 to perform better would have required significantly more human effort to improve its knowledge base.
We are currently carrying out a further experimental evaluation of Duet. So far, we have run experiments using the Storage domain from the 2006 International Planning Competition and obtained similar results to those shown here. Although the Duet planning procedure described in this paper is based on SHOP2 and LPG, the ideas could easily be generalized to combine any planner that uses domain-specific knowledge with any domain-independent classical planner. Thus, a possible future direction is to extend Duet to work with planners such as FF [11], Fast Downward [10], and SGPlan [5]. Another direction is a tighter integration of SHOP2 and LPG, which would probably yield more efficient planning in Duet. Not only would this reduce the communication overhead between the planners, it would also allow Duet to provide a richer form of "knowledge transfer": the decisions that one planner makes during its planning time would depend more closely on the domain knowledge that the other one could provide.
Acknowledgments. This work was supported in part by DARPA's Transfer Learning and Integrated Learning programs and NSF grant IIS0412812. The opinions in this paper are those of the authors and do not necessarily reflect the opinions of the funders.
REFERENCES
[1] F. Bacchus and F. Kabanza, 'Using temporal logics to express search control knowledge for planning', Artificial Intelligence, 116(1-2), 123–191, (2000).
[2] A. L. Blum and M. L. Furst, 'Fast planning through planning graph analysis', Artificial Intelligence, 90(1-2), 281–300, (1997).
[3] B. Bonet and H. Geffner, 'Planning as heuristic search: New results', in ECP, Durham, UK, (1999).
[4] Adi Botea, Markus Enzenberger, Martin Müller, and Jonathan Schaeffer, 'Macro-FF: Improving AI planning with automatically learned macro-operators', JAIR, 24, 581–621, (2005).
[5] Y. Chen, C. Hsu, and B. Wah, 'Temporal planning using subgoal partitioning and resolution in SGPlan', JAIR, 26, 323–369, (2006).
[6] K. Currie and A. Tate, 'O-Plan: The open planning architecture', Artificial Intelligence, 52(1), 49–86, (1991).
[7] R. E. Fikes and N. Nilsson, 'STRIPS: A new approach to the application of theorem proving to problem solving', Artificial Intelligence, 2(3-4), 189–208, (1971).
[8] A. Gerevini, A. Saetti, and I. Serina, 'Planning through Stochastic Local Search and Temporal Action Graphs', JAIR, 20, 239–290, (2003).
[9] M. Ghallab, D. Nau, and P. Traverso, Automated Planning: Theory and Practice, Morgan Kaufmann, 2004.
[10] M. Helmert, 'The Fast Downward planning system', JAIR, 26, 191–246, (2006).
[11] J. Hoffmann and B. Nebel, 'The FF planning system: Fast plan generation through heuristic search', JAIR, 14, 253–302, (2001).
[12] S. Kambhampati, 'Model-lite planning for the web age masses: The challenges of planning with incomplete and evolving domain theories', in AAAI, Vancouver, Canada, (2007).
[13] J. Kvarnström and P. Doherty, 'TALplanner: A temporal logic based forward chaining planner', Annals of Mathematics and Artificial Intelligence, 30, 119–169, (2001).
[14] D. Nau, T. Au, O. Ilghami, U. Kuter, W. Murdock, D. Wu, and F. Yaman, 'SHOP2: An HTN planning system', JAIR, 20, 379–404, (2003).
[15] N. Nguyen, S. Kambhampati, and R. Nigenda, 'Planning graph as the basis for deriving heuristics for plan synthesis by state space and CSP search', Artificial Intelligence, 135(1-2), 73–124, (2002).
[16] S. Yoon and S. Kambhampati, 'Towards Model-lite Planning: A Proposal For Learning & Planning with Incomplete Domain Models', in Proc. ICAPS-07 Workshop on AI Planning and Learning, Providence, RI, (2007).
[17] D. E. Wilkins, Practical Planning: Extending the Classical AI Planning Paradigm, Morgan Kaufmann, San Mateo, CA, 1988.
Learning in Planning with Temporally Extended Goals and Uncontrollable Events

André A. Ciré1 and Adi Botea2

Abstract. Recent contributions to advancing planning from the classical model to more realistic problems include using temporal logic such as LTL to express desired properties of a solution plan. This paper introduces a planning model that combines temporally extended goals and uncontrollable events. The planning task is to reach a state such that all event sequences generated from that state satisfy the problem's temporally extended goal. A real-life application that motivates this work is to use planning to configure a system in such a way that its subsequent, non-deterministic internal evolution (nominal behavior) is guaranteed to satisfy a condition expressed in temporal logic. A solving architecture is presented that combines planning, model checking and learning. An online learning process incrementally discovers information about the problem instance at hand. The learned information is useful both to guide the search in planning and to safely avoid unnecessary calls to the model checking module. A detailed experimental analysis of the approach presented in this paper is included. The new method for online learning is shown to greatly improve the system performance.
1 Introduction
Recent years have seen an increased interest in advancing planning from the classical model to extensions such as using temporal logic to express desired features of a correct plan. Search in a classical planning problem can be guided with control rules expressed in temporal logic [1]. The international planning competition IPC-5 [6] has introduced hard and soft constraints, expressed in temporal logic, that finite plans should satisfy. Computing cyclic solutions to problems with temporally extended goals is presented in [10]. Previous contributions to planning such as these apply temporal logic reasoning along a (candidate) solution plan that is either a finite or a cyclic sequence of actions. In contrast, this paper addresses a problem where temporal logic is applied to the future behavior of a system after a goal state is reached. Specifically, the temporal goal of a problem must be satisfied by all sequences of events that originate in a goal state. Events are transitions in the problem state space that are not under the control of the planning agent.
A real-life application that motivates this research is the automated configuration of a composite system such as a power grid or a network of water pipes. A composite system is a collection of interacting components. Assume it has a nominal behavior, a non-deterministic evolution in the state space where all transitions are uncontrollable events. Even though planning cannot control the events directly, it can impact the nominal behavior by configuring elements of the system structure such as the connections between components. (1 Institute of Computing, University of Campinas, Brazil. 2 NICTA and Australian National University, Canberra, ACT.) Configuring the system in a specific way does not necessarily imply that the subsequent nominal behavior is fully determined. Generally, many event trajectories can originate from a given configuration. The planning task is to configure the system in such a way that its subsequent nominal behavior satisfies the goal condition on every possible event sequence. The configuration step is useful in a number of scenarios, such as the initial configuration of a system, a reconfiguration to recover from a failure, a reconfiguration to grow or reduce the size of a system, and a reconfiguration to adapt to a new goal condition. As soon as a solution is found, the planning agent no longer interferes with the system unless a reconfiguration becomes necessary at some point in the future.
Contributions. This paper introduces a new planning model that combines temporally extended goals and uncontrollable events. A solving approach is presented that incrementally learns new information about a problem instance and uses it to improve performance. The architecture contains a planning component, a model checking component and an online learning component. Planning explores the problem space where transitions are actions and enumerates candidate goal states. A model checking round tests whether all event sequences that originate in a candidate goal state satisfy the temporally extended goal. If the test succeeds, a solution has been found. Otherwise, at least one event sequence exists for which the goal formula does not hold. The learning step analyzes such event sequences. New information is extracted, which is used both to guide the planning and to avoid unnecessary model checking rounds. The performance of a system that implements the ideas presented in this paper is analyzed empirically in detail. The new method for incrementally learning information about a problem instance is shown to greatly improve both the planning effort and the total number of model checking rounds.
2 Related Work
Planning systems such as TLPlan [1] and TALplanner [13] are capable of handling a large problem space by using search control rules formulated in temporal logic. MIPS [16], SGPlan [9] and HPlan-P [2] are examples of systems that can handle hard and soft constraints (preferences) related to a planning goal. This research direction was mainly encouraged by a track added to the 2006 International Planning Competition (IPC-5), in conjunction with PDDL3 [6]. A method able to generate cyclic plans that satisfy a temporally extended goal can be found in [10]. In path planning, temporal logic can encode constraints that a trajectory computed for a mobile unit (e.g., a robot) should satisfy [5]. As in previous work
such as [7, 10, 16], we convert LTL formulas into Büchi automata. Two major features distinguish our work from all the contributions mentioned earlier: (1) our system is capable of learning from trajectories where an extended goal does not hold; and (2) we apply our ideas to a new planning problem, where a deterministic planning component is followed by a non-deterministic evolution generated by uncontrollable events. In particular, we reason about LTL goals in the presence of events, whereas the IPC-5 domains with extended goals and preferences are deterministic.
In reactive planning, actions are executed in response to event occurrences. Reactive planning in problems with extended goals expressed in Metric Temporal Logic (MTL) is the topic of [3, 4]. There is an important distinction between the problem that we address and fields such as reactive planning and controller synthesis. In the latter cases no goal state is defined, whereas we need to reach a goal state where the planning (configuration) is completed and the subsequent system evolution (nominal behavior) respects the temporal goal. Generating a control strategy consistent with an LTL formula in a non-deterministic environment is the topic of [12]. The value of that contribution seems to be mainly theoretical. It provides a translation of the original problem into an LTL game but indicates no heuristics or other enhancements that would be necessary to scale up the performance of a solver. It reports neither experiments nor an actual implementation of the theoretical ideas.
A high-level theme that our learning approach shares with explanation-based learning (EBL) is learning from counterexamples. Our work differs significantly from previous work on EBL in the planning problem addressed and in the ways that new information is extracted and subsequently used. For example, the topic of [11] is learning from Graphplan dead-ends in classical planning, whilst we focus on learning from bad event sequences in planning with temporal goals and uncontrollable events. Model-based self-configuration, a problem related to our work, is addressed in [17]. That work does not consider temporally extended goals. It can be seen as a form of EBL, since it attempts to make a search more informed as more conditions conflicting with goal states are discovered.
3 Problem Definition and Background
The planning model addressed in this work is a structure ⟨S, s0, ϕ, γ, A, E⟩ with S a finite state space, s0 ∈ S an initial state, and ϕ a temporal logic formula that describes the goal. The function γ : S × (A ∪ E) → S models deterministic transitions in the state space. The transitions are partitioned into a set of actions A (i.e., transitions under the control of the planner), and a set of uncontrollable events E that define the nominal behavior of a system. The search space that has the initial problem state as a root node and uses only actions as transitions is called the problem planning space. The space that is rooted in a given state s and uses only events as transitions is called the event space of state s.
The state space associated with a problem is defined using a fixed collection of boolean variables called atoms. Each state is a complete assignment to the atoms defined for that problem. Equivalently, a state s can be defined as the set of all atoms that are true in s (closed world assumption). Following the STRIPS representation, each action (event) a has a set of preconditions pre(a), a set of positive effects add(a) and a set of negative effects del(a). An action (or event) a is applicable in a state s if s |= pre(a). In such a case, γ(s, a) = (s \ del(a)) ∪ add(a). Otherwise, γ(s, a) is undefined. A sequence of actions (events) a1, a2, ..., ak is applicable in a state s if a1 is applicable in s, a2 is applicable in γ(s, a1) and so on. For a sequence of actions
(events) π = a1, ..., ak that is applicable in a state, the precondition of the entire sequence pre(π) is the union of all atoms p such that (∃i ∈ {1 ... k}) : (p ∈ pre(ai) ∧ (∀j < i) p ∉ add(aj)). The planning task is to find a finite sequence of actions that can be applied in s0 and that reaches a goal state. A state s ∈ S is a goal if every event sequence applicable in s satisfies the temporal goal ϕ. A sequence that does not satisfy ϕ is called a bad event sequence.
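To make the STRIPS-style definitions above concrete, here is a minimal Python sketch of γ, applicability, and pre(π); the class and function names are our own illustration, not code from the paper.

from dataclasses import dataclass

# Hypothetical encoding: a state is a frozenset of atoms, and the same
# structure serves for both actions and events.
@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset    # pre(a)
    add: frozenset    # add(a)
    dele: frozenset   # del(a)

def applicable(state, a):
    # s |= pre(a): every precondition atom is true in s
    return a.pre <= state

def apply_action(state, a):
    # gamma(s, a) = (s \ del(a)) U add(a); undefined otherwise
    assert applicable(state, a)
    return (state - a.dele) | a.add

def sequence_precondition(seq):
    # pre(pi): atoms p with p in pre(a_i) for some i and p not in add(a_j) for all j < i
    pre, added = set(), set()
    for a in seq:
        pre |= (a.pre - added)
        added |= a.add
    return pre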
4 Solving Approach
The architecture outlined in Algorithm 1 contains three main modules. Planning explores the planning space and enumerates candidate goal states. Model checking explores the event space of a candidate goal state s to check whether it satisfies the temporally extended goal of the problem ϕ. If the test returns a positive answer, a solution has been found. Otherwise, the online learning component attempts to extract a sufficient condition that explains the negative result of the most recent model checking round. The system incrementally learns information about a problem instance that is used to speed up the solving process.
The learned information I is represented as an atemporal boolean formula. A state s with the property s |= I is guaranteed not to satisfy the goal formula ϕ. The boolean formula I is used in two parts of the algorithm, each with a great contribution to the system performance. Firstly, no model checking rounds need to be performed in states s with s |= I. Secondly, ¬I can be used as a reachability goal in the planning component, allowing the computation of relaxed plans that steer the search away from states that are guaranteed not to be goals. As a problem definition contains no explicit reachability goals, no other information besides ¬I is used as a goal when building relaxed plans. Standard algorithms that compute relaxed plans, such as the one implemented in the FF planning system [8], work only with conjunctive reachability goals. As in Rintanen's work [15], FF's method is extended to handle goals such as ¬I, which can be an arbitrary boolean formula.
In general, a relaxed plan could be used to compute a heuristic distance from a current state to a goal state, and to partition the successors of a node into helpful nodes (i.e., nodes obtained from applicable actions that are also part of the parent's relaxed plan) and rescue nodes (all other valid successors). In this paper, two open queues are used, one for helpful and another for rescue nodes; a rescue node is expanded only when the helpful open queue is empty (see the sketch below). No heuristic values are associated with nodes. The reason is that, in this problem, the reachability goal ¬I varies in time. Nodes evaluated early might have better heuristic values just because they were computed when the reachability goal was more relaxed.
When ¬I is used as a reachability goal in planning, lines 6 and 7 in Algorithm 1 are redundant, since sg |= ¬I holds for every candidate goal state sg. The lines are added to the pseudocode to emphasize more clearly that model checking is triggered only for a small fraction of the states visited in planning.
The next discussion assumes that Linear Temporal Logic (LTL) goals are used. Model checking is implemented as a breadth-first search in order to discover bad event sequences of minimal length. Shorter bad event sequences can allow the system to learn information that has fewer conjunctive conditions and hence is more generally applicable (see the details about learning later in this section). For the sake of clarity, assume that each application of an event in the model checking search is performed together with both a normal (usual) progression of ϕ and a progression in the Büchi automaton corresponding to ϕ. Büchi progression is a standard approach also adopted, for example, in [10]. Other model checking methods (e.g., SAT-based [14]) can be used, but the actual choice is not a major point of this research.
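The sketch below illustrates the two-queue expansion strategy just described; it is our own minimal rendering, and assumes the planner supplies successors and an is_helpful test (membership of the generating action in the parent's relaxed plan).

from collections import deque

def select_and_expand(helpful_q, rescue_q, successors, is_helpful):
    # A rescue node is expanded only when the helpful queue is empty.
    node = helpful_q.popleft() if helpful_q else rescue_q.popleft()
    for child in successors(node):
        # Helpful: obtained from an applicable action that is also part of
        # the parent's relaxed plan; everything else is a rescue node.
        (helpful_q if is_helpful(node, child) else rescue_q).append(child)
    return node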
Algorithm 1 Architecture overview.
1: I ← false {initialize learned info}
2: while true do
3:   (sg, π) ← SearchForNextCandidateGoalState() {planning; π is the action sequence from s0 to sg}
4:   if no state sg is found then
5:     return no solution
6:   if sg |= I then
7:     continue {no need for a costly model checking round}
8:   ModelChecking(sg) {run a model checking round}
9:   if model checking succeeds then
10:    return π
11:  else
12:    I ← I ∨ ExtractInfo() {learning}
As explained in this section and demonstrated empirically in the next section, we improve the model checking component of the algorithm by dramatically reducing the total number of model checking rounds, not the effort spent in one individual round.
In the model checking component, the event sequences that originate in a candidate goal state sg are split into four categories, one corresponding to paths that satisfy ϕ and three corresponding to bad event sequences. Bad event sequences are: L-paths, sequences that end with a leaf node (i.e., a node where no events can be applied) before the normal progression reduces ϕ to either true or false; F-paths, sequences along which the normal progression reduces ϕ to false; and C-paths, where a cycle is created and ϕ is never satisfied. As soon as one bad event sequence is discovered, the corresponding round of model checking returns. If desired, the procedure could attempt to discover several bad event sequences, allowing more information to be learned from one round.
The rest of this section focuses on the learning method. Learning is triggered each time model checking discovers an event sequence πe that is either an F-path or a C-path. No information is extracted from L-paths: information extracted from an L-path might be too specific to sg, since it would have to explain why none out of potentially many events is applicable in the leaf node. The information extraction aims at detecting a boolean formula c such that sg |= c and c is sufficient to explain the failure of ϕ along the sequence πe. More specifically, c should imply both of the following conditions: (1) πe is applicable in sg; and (2) ϕ does not hold along the sequence πe. As indicated in Algorithm 2, the formula c is initialized to pre(πe) to ensure that c implies condition (1). To imply condition (2), c is extended with zero or more conjunctive literals l. It is desirable to minimize the number of added literals, as a smaller formula c is more generally applicable and thus more model checking rounds can be avoided in the future.
To compute a set of literals to be added to c, a variation of progression called event-specific progression is introduced. Consider a state si obtained after applying the first i ≥ 1 steps of πe. The event-specific progression to si from the previous step is equivalent to the normal progression, except that it postpones the instantiation of certain atoms, as explained next. The normal progression can be defined recursively, starting from atoms and moving to more and more complicated formulas. For the complete set of rules, see for example [1]. Only the case of atomic formulas needs to be discussed here. At the atomic level,
prog(p, si) = true if si |= p and prog(p, si) = false if si |= ¬p. In other words, all occurrences of atoms in the progressed formula that are not inside a temporal operator are replaced by their actual truth values in the corresponding state.
The event-specific formula progression applies different rules at the atomic level. For each atom p in the initial problem definition, define a new variable p0. Define a set of atoms Zi as pre(πe) ∪ eff(e1) ∪ del(e1) ∪ ... ∪ eff(ei) ∪ del(ei). Being independent from the first i steps of πe, atoms q ∉ Zi preserve their value all the way from sg to si. For an atomic formula p, the event-specific progression is defined as eprog(p, si) = p0 if p ∉ Zi, and eprog(p, si) = prog(p, si) ∈ {true, false} if p ∈ Zi. The progression rules for more complicated, non-atomic formulas are the same as in normal progression. Usual simplifications such as true ∨ α = true are useful to eliminate irrelevant occurrences of new variables p0 that might exist in α.
Event-specific progression of ϕ along πe is performed step by step for t times, the same number of steps that normal progression was performed before detecting that πe was a bad event sequence. The resulting formula is denoted by eprog(ϕ, πe, t). Consider that P is the set of all new boolean variables p0 added during event-specific progression. Each element p0 ∈ P generates one literal to be added to c as a new conjunction. If p is true in sg, then p is added to c. Otherwise, ¬p is the newly created literal.
It can be shown that the condition c computed as before implies both conditions (1) and (2). Implying condition (1) is obvious from the way c is initialized. A formal proof for condition (2) is skipped to save space. The intuition is that the only atoms that could possibly impact the normal formula progression of ϕ along πe are those determined by pre(πe) (i.e., atoms in Zt) and atoms p with p0 ∈ P. The condition c is the assignment in sg of the atoms in pre(πe) ∪ {p | p0 ∈ P}.
Before creating the literals to be added to c, P can be reduced with a greedy procedure that is linear in the size of P. The correctness of the extracted information c is preserved in the sense that it still implies conditions (1) and (2). A formula β is initialized to eprog(ϕ, πe, t). The procedure iteratively selects one variable p0 from P and instantiates it in β with the value of p in sg. This is repeated until β becomes equivalent to prog(sg, πe, ϕ, t), the formula obtained by normal progression from sg along πe for t steps. The variables in P that were not instantiated in this loop can safely be skipped when the literals are generated. The condition on line 7 of Algorithm 2 is easy to check for F-paths, since prog(sg, πe, ϕ, t) = false. The implemented system skips the greedy reduction of P for C-paths. It would be possible to address this, but the experiments reported next did not indicate a performance bottleneck caused by this choice.
As a simple example, if eprog(ϕ, πe, t) is false, then no additional information is added to c besides the existing part pre(πe). In such a case, regardless of the values of other variables in sg, the preconditions and the effects of the event sequence alone are enough to progress ϕ to false.
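As an illustration of the atomic-level rules, the following sketch contrasts normal and event-specific progression of an atom; the representation (atoms as strings, a ('p0', p) pair for a postponed variable) is our assumption, not the paper's.

def prog_atom(p, s_i):
    # normal progression: replace the atom by its truth value in s_i
    return p in s_i

def eprog_atom(p, s_i, Z_i):
    # event-specific progression: atoms outside Z_i are postponed as a
    # fresh variable p0; atoms in Z_i are instantiated as usual
    return ('p0', p) if p not in Z_i else prog_atom(p, s_i)

def Z(prefix, pre_pi):
    # Z_i = pre(pi) U eff(e1) U del(e1) U ... U eff(ei) U del(ei)
    z = set(pre_pi)
    for e in prefix:          # the first i events of pi_e
        z |= e.add | e.dele   # positive and negative effects
    return z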
5 Experimental Results
The first part of this section introduces a new benchmark domain. Our experiments are described next. The last part of the section contains the results and their analysis.
Benchmark and Setup. Among the many available planning benchmarks, we are not aware of the existence of an encoding that is suitable for the model presented in Section 3, which includes
a deterministic planning stage (configuration) followed by a non-deterministic evolution in the event space (nominal behavior). A new domain has been designed to carry out the experimental evaluation presented in this section. Because of lack of space, only a brief description is included here. The website http://abotea.rsise.anu.edu.au/factory-benchmark/ contains a detailed presentation and the source code of a problem generator.
Each problem instance contains a collection of components split into two categories: machines and repositories. At most two repositories can be connected to a machine at a time. A repository cannot simultaneously be connected to more than one machine. Each repository stores raw material of a certain type and can transfer batches of it to a connected machine. A machine can combine two types of raw material to generate a final product. Planning actions consist of both changing connections between repositories and machines, and component-specific operations such as cleaning a machine. The nominal behavior of a system includes transferring raw products from a repository to a machine, and creating final products from combinations of raw materials. Furthermore, certain combinations of raw products can break a machine that is not clean. In experiments, a temporally extended goal, expressed in LTL, is a conjunction of conditions such as never break a machine and eventually generate certain products.
The code is implemented in Java 1.6. Büchi automata are built using the LTL2BA package, available at http://www-i2.informatik.rwth-aachen.de/Research/RV/ltl2ba4j/index.html. The experiments are carried out on a 3.4 GHz machine, with 1.8 GB allocated to the heap memory and 1.8 GB assigned to the stack memory. The time limit is 15 minutes per problem. We are not aware of any existing system designed for the problem addressed in this paper. In the current experiments, the new solver is compared against a basic version where the learning component is switched off.
A set of 350 problem instances is created as follows. The number of repositories r is fixed to 4 and the number of machines m varies from 4 to 10. For each combination (r, m), 50 problems are generated. The LTL goal formulae range in size from 5 to 15 conjunctive conditions. The parameters r and m are chosen in such a way that the problems gradually scale up until the basic solver reaches its limits within the given time and memory constraints. The problem collection contains both instances with solutions and instances that can be proven unsolvable within the allocated resource limits. The latter category is useful to evaluate the impact of learning on reducing the number of model checking rounds. When no goal state exists,
both system versions have to visit all states in the planning space and the difference in overall performance is mostly explained by the number of model checking rounds.
Results. Figure 1 shows the total running time for instances that are proven unsolvable. Each data point in a curve corresponds to one problem instance. The problems are ordered to obtain a monotonically increasing curve for the basic solver. Learning improves the number of model checking rounds. As explained before, the number of nodes in planning search is not affected in such problems. Processing one node in informed planning (i.e., in the system with learning enabled) is more expensive, since a relaxed plan has to be computed. The overall improvement achieved by learning in this subset appears to be almost constant across the problem range.
In instances where a solution is found (Figure 2), learning improves not only the number of model checking rounds but also the number of nodes expanded in planning. As compared to Figure 1, the speed-up factor increases as the problems get larger. The largest improvement in this set reaches two orders of magnitude.
Given a problem instance, assume that (P, M, L) gives the percentage that each system module (i.e., planning, model checking, learning) contributes to the total running time. (P(m), M(m), L(m)) is the average over the problems with m machines. When m varies from 4 to 10, L(m) is stable around a value of 3 to 4%. P(m) slightly increases from 70% to 80%. When learning is switched off, the only modules that contribute to the total running time are planning and model checking. The average weight of the planning time slightly increases from 55% when m = 4 to 60% when m = 10.
Algorithm 2 Learning step in pseudocode.
1: c ← pre(πe)
2: P ← all new variables p0 in eprog(ϕ, πe, t)
3: if perform greedy reduction of P (optional) then
4:   β ← eprog(ϕ, πe, t)
5:   PN ← P
6:   P ← ∅
7:   while not (β ≡ prog(sg, πe, ϕ, t)) do
8:     select p0 ∈ PN
9:     instantiate p0 in β with p's value in sg
10:    remove p0 from PN and add it to P
11: for each p0 ∈ P do
12:   l ← (sg |= p) ? (p) : (¬p)
13:   c ← c ∧ l
14: return c
Figure 1. Time for instances with no solution. Note the logarithmic scale. (Time in seconds per instance; curves: Basic and Learning.)
Learning keeps the number of model checking rounds at very small values, whereas the basic system faces an exponential growth as problems increase in difficulty. Figure 3 illustrates this for problems with solutions. The situation is very similar for problems with no solution; the corresponding chart is skipped to save space.
When learning is switched off, planning search is equivalent to breadth-first search, which is guaranteed to find solutions of optimal length. Figure 4 presents the quality of solutions computed by the system with learning enabled. The problems with solutions solved by both systems are included in this summary. The sub-optimality of a solution is measured as ((l − o)/o) × 100, where l is the actual length and o is the optimal length found with breadth-first search. In Figure 4, each bar counts how many problems fit into the corresponding sub-optimality range. The data indicate that a majority of the solutions found by the learning system are optimal.
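As a quick illustration of the measure (ours, not an example from the paper): a 12-step plan against a 10-step optimum is 20% sub-optimal.

def suboptimality(l, o):
    # ((l - o) / o) * 100, with o the optimal length from breadth-first search
    return (l - o) / o * 100.0

assert suboptimality(12, 10) == 20.0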
Figure 2. Time for instances with solutions on a logarithmic scale. (Time in seconds per instance; curves: Basic and Learning.)
Figure 3. Model checking rounds for instances with solution. (Rounds per instance; curves: Basic and Learning.)

Figure 4. Solution quality when learning is used.

6 Conclusion and Future Work

Advancing recent contributions that extend classical planning with temporal logic, this paper focuses on a planning model that combines temporally extended goals with uncontrollable events. The model is a generic encoding of a real-life application where a system should automatically be configured such that its future nominal behavior respects a given condition expressed in temporal logic. A solving architecture that combines elements of planning, model checking and learning is presented and analyzed in detail. An online learning procedure builds up information that is used both as a reachability goal in planning search and as a condition to safely skip unnecessary model checking rounds. In experiments, the incrementally learned information makes a great contribution to speeding up the solving process.
Future work includes integrating the planning method presented in this paper with monitoring and diagnosis algorithms. The latter monitor a system to decide whether the nominal behavior is the desired one. When faults are detected, the planning method changes the system into a correct configuration.

7 Acknowledgment

NICTA is funded by the Australian Government's Department of Communications, Information Technology, and the Arts and the Australian Research Council through Backing Australia's Ability and the ICT Research Centre of Excellence programs. This work was initiated when the first author was a visiting student at NICTA. We thank Patrik Haslum, Sophie Pinchinat, Jussi Rintanen and Sylvie Thiébaux for useful discussions on this topic.
REFERENCES
[1] F. Bacchus and F. Kabanza, 'Using Temporal Logics to Express Search Control Knowledge for Planning', Artificial Intelligence, 116(1-2), 123–191, (2000).
[2] J. Baier, F. Bacchus, and S. McIlraith, 'A Heuristic Search Approach to Planning with Temporally Extended Preferences', in Proceedings of IJCAI-07, pp. 1808–1815, (2007).
[3] M. Barbeau, F. Kabanza, and R. St-Denis, 'Synthesizing Plant Controllers Using Real-Time Goals', in IJCAI-95, pp. 791–798, (1995).
[4] M. Barbeau, F. Kabanza, and R. St-Denis, 'A Method for the Synthesis of Controllers to Handle Safety, Liveness, and Real-Time Constraints', IEEE Transactions on Automatic Control, 43(11), 1453–1559, (1998).
[5] G. E. Fainekos, H. Kress-Gazit, and G. J. Pappas, 'Hybrid Controllers for Path Planning: A Temporal Logic Approach', in Decision and Control, and European Control Conference CDC-ECC-05, 4885–4890, (2005).
[6] A. Gerevini and D. Long, 'Plan Constraints and Preferences for PDDL3', Technical report, University of Brescia, (2005).
[7] G. De Giacomo and M. Y. Vardi, 'Automata-Theoretic Approach to Planning for Temporally Extended Goals', in Proceedings of ECP-99, pp. 226–238, (1999).
[8] J. Hoffmann and B. Nebel, 'The FF Planning System: Fast Plan Generation Through Heuristic Search', JAIR, 14, 253–302, (2001).
[9] C. W. Hsu, B. W. Wah, R. Huang, and Y. X. Chen, 'Handling Soft Constraints and Preferences in SGPlan', in ICAPS Workshop on Preferences and Soft Constraints in Planning, pp. 54–57, (2006).
[10] F. Kabanza and S. Thiébaux, 'Search Control in Planning for Temporally Extended Goals', in Proceedings of ICAPS-05, pp. 130–139, (2005).
[11] S. Kambhampati, 'Improving Graphplan's Search with EBL and DDB Techniques', in Proceedings of IJCAI, pp. 982–987, (1999).
[12] M. Kloetzer and C. Belta, 'Managing non-determinism in symbolic robot motion planning and control', in Robotics and Automation-07, pp. 3110–3115, (2007).
[13] J. Kvarnström and M. Magnusson, 'TALplanner in IPC-2002: Extensions and Control Rules', JAIR, 20, 343–377, (2002).
[14] T. Latvala, A. Biere, K. Heljanko, and T. Junttila, 'Simple Bounded LTL Model Checking', in Proceedings of Formal Methods in Computer-Aided Design (FMCAD'2004), pp. 186–200, (2004).
[15] J. Rintanen, 'Unified Definition of Heuristics for Classical Planning', in Proceedings of ECAI-06, pp. 600–604, (2006).
[16] S. Edelkamp, S. Jabbar, and M. Nazih, 'Large-Scale Optimal PDDL3 Planning with MIPS-XXL', in Proceedings of the International Planning Competition IPC-05, (2006).
[17] B. C. Williams and P. P. Nayak, 'A Model-based Approach to Reactive Self-Configuring Systems', in Proceedings of AAAI-96, pp. 971–978, (1996).
A Simulation-based Approach for Solving Generalized Semi-Markov Decision Processes

Emmanuel Rachelson1 and Gauthier Quesnel and Frédérick Garcia and Patrick Fabiani

1 ONERA, France, email: emmanuel.rachelson@onera.fr

Abstract. Time is a crucial variable in planning and often requires special attention since it introduces a specific structure along with additional complexity, especially in the case of decision under uncertainty. In this paper, after reviewing and comparing MDP frameworks designed to deal with temporal problems, we focus on Generalized Semi-Markov Decision Processes (GSMDP) with observable time. We highlight the inherent structure and complexity of these problems and present the differences with classical reinforcement learning problems. Finally, we introduce a new simulation-based reinforcement learning method for solving GSMDP, bringing together results from simulation-based policy iteration, regression techniques and simulation theory. We illustrate our approach on a subway network control example.
1 Introduction

Many problems in planning present both the features of decision under uncertainty and time-dependency. Imagine, for instance, having to plan the exploitation of a subway network, where available actions only consist in introducing or removing trains from service. In this problem, the goal is to maximize the number of passengers going through the network while minimizing the exploitation cost of the subway. Passenger arrival times, movements going in and out of the trains and possible delays in the system make the outcome of every action uncertain with regard to the next state and the date of the next decision epoch. On top of that, the flow of passengers and their destinations depend greatly on the time of day. All this defines the kind of problems we try to capture as Temporal Markov Problems. These problems cover a wide variety of other applications, such as onboard UAV coordination or airport taxiway management.
Problems of decision under uncertainty are commonly modelled as Markov Decision Processes (MDP). Recent work on solving large state-space MDP includes, for example, factored MDP methods, approximate linear programming, hierarchical approaches, reinforcement learning, etc. Temporal Markov Problems, however, have received little attention from the planning and machine learning communities, even though simulation seems a promising approach to tackling these problems. This paper presents formalisation and algorithmic issues about Temporal Markov Problems and proposes a simulation-based algorithm designed to solve them. In section 2, we review the models adapted from Markov Processes and designed to include time-dependency and decision making. Building on this first section's conclusions, we focus on controlling Generalized Semi-MDP (GSMDP). Section 4 presents our algorithm and discusses the issues and interests of simulation-based approaches for GSMDP. We illustrate our approach on the subway control example in section 4.3 and conclude in section 5.
2 Temporal Markov Problems

MDP have become a popular model for describing problems of planning under uncertainty. Formally, an MDP is composed of a 4-tuple ⟨S, A, P, r⟩, where S is a countable set of states for the system, A is the countable set of possible actions, P(s′|s, a) is a probability distribution function providing the transition model between states (as in a Markov Process, but conditioned on the action a) and r(s, a) is a reward value associated with the (s, a) transition, used to build criteria and to evaluate actions and policies. Solutions to MDP problems are often given as Markovian policies π, namely functions that map current states to actions. One can introduce criteria to evaluate these policies, such as the discounted reward criterion given in equation 1. Criteria permit the definition of the value function V^π associated with a policy. An important result concerning MDP is that for any history-dependent policy, there exists a Markovian policy which is at least as good with regard to a given criterion. Consequently, one can safely search for optimal control policies in the restricted space of Markovian policies without loss in optimality. Finally, algorithms such as value iteration or policy iteration are based on the fact that the optimal policy's value function V* obeys Bellman's optimality equation 2 [1].

V_γ^π(s) = E[ Σ_{δ=0}^{∞} γ^δ r(s_δ, π(s_δ)) ]    (1)

V*(s) = max_{a∈A} [ r(s, a) + γ Σ_{s′∈S} P(s′|s, a) V*(s′) ]    (2)
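For illustration, here is a minimal value-iteration sketch built directly on equation (2); the dictionary-based encoding of P and r is our assumption, not a representation from the paper.

def value_iteration(S, A, P, r, gamma=0.95, eps=1e-6):
    # P[s, a] maps successor states s2 to P(s2|s, a); r[s, a] is the reward
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # Bellman backup: max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
            v = max(r[s, a] + gamma * sum(p * V[s2] for s2, p in P[s, a].items())
                    for a in A)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < eps:
            return V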
2.1 Including continuous time in the MDP framework

Introducing time in Markov Process (MP) models — and in their decisional counterparts, MDP — can be done by defining stochastic durations between decision epochs. In a standard MP or MDP, the sojourn time in a given state is one and decision epochs occur at integer time values (thus yielding the γ^δ in the discounted criterion). Allowing the sojourn time in a given state to be continuous and stochastic defines the Semi-MP, or Semi-MDP, formalism. In an SMDP [11], state sojourn time is described through a distribution F(τ|s, a) indicating the time before transition, provided that we undertake action a in state s. Therefore, an SMDP is a 5-tuple ⟨S, A, P, F, r⟩ which corresponds to a Markov Process but with stochastic state sojourn times. Policies for the control of SMDP can be computed using standard MDP algorithms since solving a discounted reward SMDP turns
out to be equivalent to performing an integration over expected transition durations and solving a total reward MDP. This is mainly due to the independence between the state sojourn time τ and the arrival state s′. This very strong assumption was lifted in the Time-dependent MDP (TMDP) model of [2] and generalized recently in the XMDP model of [13]. Formally, an XMDP is described by a 4-tuple ⟨S, A, p, r⟩ where the state space S can be composed of discrete and continuous variables and may include the process' time, A is a continuous or discrete parametric action space and p and r correspond to transition and reward models for states of S and actions of A. [13] proved that XMDP obey an optimality equation similar to equation 2, thus proving that standard algorithms such as value iteration can safely be used to solve XMDP. Using the XMDP representation, one can model any stochastic decision process with continuous observable time and hybrid state and action spaces. This seems to suit our Temporal Markov Problems well, and some recent techniques for solving hybrid state space MDP ([6, 4]) could be applied here. However, writing transition and duration functions for Temporal Markov Problems is often a very complex task and requires a lot of engineering. For instance, the effect of a RemoveTrain action on the global state of the subway problem is the result of several concurrent processes: the passenger arrivals, the train movements, the removal of one train, etc. All of these compete to change the system's state, and it is a complex task to summarize all these processes' concurrent stochastic influence in the transition and duration functions.
2.2 Concurrency and MDP

In the stochastic processes literature, concurrent Markov processes are modelled as Generalized Semi-Markov Processes (GSMP) [5]. A GSMP is a natural representation of several concurrent SMP affecting the same state space. [16] introduced Generalized Semi-Markov Decision Processes (GSMDP) in order to model the problem of decision under uncertainty where actions compete with concurrent uncontrollable stochastic events. A GSMDP describes a problem by factoring the global transition function of the process into the different stochastic contributions of concurrent events. This makes GSMDP an elegant and efficient way of describing the complexity of Temporal Markov Problems. We will therefore focus on solving time-dependent GSMDP from now on and will give a more formal definition of GSMDP in section 3.

Figure 1. From MP to GSMDP: adding continuous sojourn time takes MP to SMP (and MDP to SMDP); adding concurrency takes SMP to GSMP (and SMDP to GSMDP); adding actions takes MP, SMP and GSMP to MDP, SMDP and GSMDP respectively.
3 GSMP and GSMDP

The previous section illustrated how Temporal Markov Problems need both continuous observable time models and an efficient representation of concurrency in order to represent the complexity of the phenomena at stake. In this section, we focus on the GSMDP formalism with observable time. We define control policies and the associated state variable issues, and present resolution methods.

3.1 Concurrent processes

We start from the stochastic process point of view, with no decision making. Formally, a GSMP [5] is described by a set S of states and a set E of events. At any time, the process is in a state s and there exists a subset Es of events that are called active or enabled. These events represent the different concurrent processes that compete for the next transition. To each active event e, we associate a clock ce representing the duration before this event triggers a transition. This duration would be the sojourn time in state s if event e were the only active event. The event e* with the smallest clock ce* (the first to trigger) is the one that takes the process to a new state. The transition is then described by the transition model of the triggering event: the next state s′ is picked according to the probability distribution Pe*(s′|s). In the new state s′, events that are not in Es′ are disabled (which actually implies setting their clocks to +∞). For the events of Es′, clocks are updated the following way (see the sketch at the end of this subsection):

• If e ∈ Es \ {e*}, then ce ← ce − ce*
• If e ∉ Es or if e = e*, pick ce according to Fe(τ|s′)

The first active event to trigger then takes the process to a new state, where the above operations are repeated. A first important remark concerning GSMP is that the overall process no longer retains Markov's property: knowing the current state s is not sufficient to predict the distribution over the next state of the process. [9] showed that by augmenting the state space with the events' clocks, one can retain the semi-Markov behaviour of a GSMP; we discuss this issue in the next section.
Introducing action choice in a GSMP yields a GSMDP as defined by [16]. In a GSMDP, we identify a subset A of controllable events or actions; the remaining ones are called uncontrollable or exogenous events. Actions can be enabled or disabled at will, and the subset As = A ∩ Es of activable actions is never empty since it always contains at least the "idle" action a∞ (whose clock is always set to +∞) which, in fact, does nothing and lets the first exogenous event take the process to a new state. As in the MDP case, searching for control strategies on GSMDP implies defining rewards r(s, e) or r(s, e, s′) associated with transitions and introducing policies and criteria.
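The clock bookkeeping above can be sketched as follows; this is our own rendering, and active_events, sample_clock (for Fe) and sample_next (for Pe) are assumed to be supplied by the model.

def gsmp_step(s, clocks, active_events, sample_clock, sample_next):
    # the active event with the smallest clock triggers the transition
    e_star = min(clocks, key=clocks.get)
    t = clocks[e_star]
    s_next = sample_next(e_star, s)      # s' ~ P_e*(.|s)
    new_clocks = {}
    for e in active_events(s_next):      # events enabled in s'
        if e in clocks and e != e_star:
            new_clocks[e] = clocks[e] - t            # surviving event: residual time
        else:
            new_clocks[e] = sample_clock(e, s_next)  # new or triggering: c_e ~ F_e(.|s')
    # events not active in s' are dropped (clock set to +infinity)
    return s_next, t, new_clocks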
3.2 Controlling GSMDP
As mentioned before, the transition function of the global semi-Markov process does not retain the Markov property without augmenting the state space. In the classical MDP framework, one can make use of the Markov property of the transition function to prove that there exists a Markovian policy (which only depends on the current state) which is at least as good as any history-dependent policy [11]. In the GSMDP case, however, this is no longer possible and, in order to define criteria and to find optimal policies, we need, in the general case, to allow the policy to depend on the whole execution path of the process. An execution path [16] of length n from state s0 to state sn is a sequence σ = (s0, t0, e0, s1, ..., sn−1, tn−1, en−1, sn) where ti is the sojourn time in state si before event ei triggers. As in [16], we define the discounted value of an execution path by:

V_γ^π(σ) = Σ_{i=0}^{n−1} γ^{T_i} ( γ^{t_i} k(s_i, e_i, s_{i+1}) + ∫_0^{t_i} γ^t c(s_i, e_i) dt )    (3)
where k and c are the traditional SMDP lump sum reward and reward rate functions, and T_i = Σ_{j=0}^{i−1} t_j. One can then define the expected value of policy π in state s as the expectation over all execution paths starting in s: V_γ^π(s) = E_s^π[V_γ^π(σ)]. This provides a criterion for evaluating policies; the goal is now to find policies that maximize this criterion.
The main problem here is that it is hard to search the space of history-dependent policies. On the other hand, the supplementary variable technique is often used to transform non-Markovian processes into Markovian ones. It consists in augmenting the state space with just enough variables so that the distribution over future states only depends on the current value of these variables. In [9], Nielsen augments the natural state s of the process with all the clock readings and shows that this operation brings Markov behavior back to the GSMP process. We will note this augmented state space (s, c) for convenience. Unfortunately, it is unrealistic to define policies over this augmented state space since clock readings contain information about the future of the system. From here, several options are possible:

• One could decide to sacrifice optimality and search for "good" policies among a restricted set of policies, say the policies defined on the current natural state only.
• One could also search for representation hypotheses that simplify the GSMDP model and make the natural state Markovian again.
• One could compute optimal policies on the augmented state space (s, c) and then derive a policy on observable variables only.
• Finally, one could search for a set of observable variables which retains the Markov property for the process, for example the set composed of the natural state of the process s, the duration τi for which each active event ei has been active, and its activation state si. We will note this augmented state (s, τ, sa).

[16] is based on the second option listed above. In the next paragraph, we briefly present this approach and introduce our reinforcement learning method, designed to deal with very large state spaces for GSMDP with continuous observable time, which can be adapted to the three other options.
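The path-value criterion in equation (3) can be transcribed directly, using the closed form ∫_0^t γ^u du = (γ^t − 1)/ln γ; the encoding of an execution path as a list of (s_i, t_i, e_i) steps is our choice, not the paper's.

import math

def path_value(sigma, s_n, k, c, gamma):
    # sigma: [(s_0, t_0, e_0), ..., (s_{n-1}, t_{n-1}, e_{n-1})], final state s_n
    V, T = 0.0, 0.0                        # T accumulates T_i = t_0 + ... + t_{i-1}
    states = [s for s, _, _ in sigma] + [s_n]
    for i, (s, t, e) in enumerate(sigma):
        lump = gamma**t * k(s, e, states[i + 1])
        rate = t if gamma == 1.0 else (gamma**t - 1.0) / math.log(gamma)
        V += gamma**T * (lump + rate * c(s, e))
        T += t
    return V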
3.3 Resolution methods

The resolution method for GSMDP proposed by [16] relies on the memoryless property of the exponential distribution. If one approximates all duration functions F by phase-type distributions (which are combinations of exponential distributions), then augmenting the state space with the distribution phases brings the overall behaviour of the GSMDP back to a Continuous Time MDP, which can, in turn, be transformed into a standard discrete time MDP by the method of uniformization [11]. We refer the reader to [16] for more details.
We do not wish to make hypotheses on the distributions that describe the dynamics of our system. On top of that, many problems we want to consider present other characteristics, such as very large, and sometimes continuous, state spaces. Therefore, we need to consider methods for policy search that can cope with large hybrid state spaces (yielding large hybrid trajectory spaces) and observable time. Finally, for some aspects of the problems, the stochastic behaviour might still be very complex to model formally while simulators might be readily available (for instance, in the airport taxiway management problem, the weather model is not given as probability distribution functions but as a simulator). In order to deal with such problems we turn towards reinforcement learning methods. More specifically, in order to avoid complete state space exploration, we introduce a version of approximate policy iteration where policies are defined and evaluated
on a subset of states and then generalized by regression to the whole state space. The choice of the subset of states used for evaluation is guided by the simulation of the current policy. We present our algorithm in section 4.1 and then illustrate why simulation-based policy iteration is particularly adapted to temporal problems in section 4.2.
4 Simulation-based approaches

4.1 Algorithm

Our algorithm belongs to the Approximate Policy Iteration (API) family of algorithms. Policy Iteration is an algorithm for solving MDP which searches the policy space in a two-step fashion, as illustrated in figure 2. Given a policy πn at step n, the first step consists in computing the value of πn. The second step then performs a Bellman backup in every state of the state space, thus improving the policy. An important property of policy iteration is its good anytime behaviour: at any step n, policy πn will be at least as good as any previous policy. Policy Iteration usually converges in fewer iterations than the standard Value Iteration algorithm but takes longer, since the evaluation step is very time consuming. To deal with real problems, one needs to allow for approximate policy evaluation (as in [7]) since exact computation is often infeasible. There are few theoretical guarantees on the convergence and optimality of API, as explained in [8].
Figure 2. Policy Iteration: alternating policy evaluation (compute V^πn) and one-step improvement (compute πn+1).
The version of simulation-based policy iteration we use performs simulations of the current policy πn starting from the current state of the process and stores the triplets of states, times and rewards (sδ, tδ, rδ) obtained. Thus, one execution path yields a value function over the discrete set of states explored during simulation (equation 3). All the value functions issued from simulation form a training set {(s, v)}, s ∈ S, v ∈ R, from which we wish to generalize a value function Ṽ over all states. The average value of state s in the training set tends to V^πn(s) as the number of simulations tends to +∞. One major advantage of policy-driven simulation is that the policy guides the exploration of the state space towards the states most likely to be visited, thus refining the training set over the states that have the largest probability of being reached by the policy. A second advantage is that this technique is adapted to large dimension state spaces.
Once simulation has provided the set of samples in the space of trajectories, we want to use it as a training set for a regression method that will generalize it to the entire state space. Several approaches to regression-based reinforcement learning have been proposed in the machine learning community (methods based on trees [3], evolutionary functions [15], kernel methods [10], etc.) but few have been coupled with policy simulation. We chose to focus on support vector machines (SVM) because of their ability to handle the large dimension spaces over which our samples are defined. SVM belong to the family of kernel methods and can be used for both regression and classification. Training a standard SVM over a given training set corresponds to looking for a hyperplane interpolating the samples in a
higher dimensional space called the feature space. Practically, SVM take advantage of the kernel trick to avoid expressing the feature space explicitly. For more details on SVM, we refer the reader to [14]. In our case, we call Ṽn(s) the interpolated value function of policy πn.
Finally, while simulation-based exploration and SVM generalization of the value function are techniques dedicated to improving the evaluation step of approximate policy iteration, the third specificity of our algorithm deals with improving the optimization step. For large and possibly continuous state spaces, it might be very long or impracticable to compute the one-step improvement of the policy. Indeed, most of the time, computing a complete policy is irrelevant since most of this policy will never be used for the simulation-based evaluation step. Instead, it might be easier to compute online the one-step lookahead best action in the current state with respect to the stored value function. More precisely, in a standard MDP, the optimization step consists in solving equation 4 in every state:

π_{n+1}(s) ← arg max_{a∈A} Q̃_{n+1}(s, a)
with: Q̃_{n+1}(s, a) = r(s, a) + γ Σ_{s′∈S} P(s′|s, a) Ṽ_n(s′)    (4)
For continuous state spaces, computing πn+1 implies being able to compute integrals over P and Ṽn. We do not wish to make hypotheses on the model used, and will therefore perform a discretization for the evaluation of the integral. Finally, since the model of P is not necessarily known to the decision maker, and since we have a simulator of our system, we will make a second use of this simulator for the purpose of evaluating the expected reward Q̃n+1(s, a) associated with performing action a in state s with respect to the value function Ṽn (equation 5). At the end of the evaluation phase, the value function Ṽn is stored and no policy is computed from it. Instead, we immediately enter a new simulation phase, but whenever the policy πn+1 is asked for the action to perform in the current state s, it performs online the estimation of all Q-values for state s and then chooses the best action to perform. The speed-up in the execution of the policy iteration algorithm is easy to illustrate for discrete state space problems, since we replace |S| evaluations of the Q-values for the policy update by the number of states visited during one simulation. This is especially interesting in the case of Temporal Markov Problems since (as we will explain in section 4.2) a state is never visited twice. Consequently, Q̃n+1(s, a) is calculated by simply simulating N times the application of a in s and observing the set {(ri, s′i)} as in equation 5. Then the policy returns the action which corresponds to the largest Q-value. We call this online instantiation of the policy "online approximate policy iteration".

Q̃n+1(s, a) = (1/N) Σ_{i=1}^{N} [ ri + Ṽn(s′i) ]    (5)
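To make the online lookahead concrete, the following is a minimal Python sketch of the estimator of equation 5. The simulator interface gsmdp_step(state, action), returning a successor state, a reward, and the elapsed time, is an assumption of ours and not part of the paper.

def compute_policy(state, actions, v_tilde, gsmdp_step, n_samples=15, gamma=1.0):
    """One-step lookahead of equation 5: estimate each Q-value by
    n_samples simulator calls, then return the greedy action."""
    best_action, best_q = None, float("-inf")
    for a in actions:
        q = 0.0
        for _ in range(n_samples):
            # gsmdp_step is a hypothetical simulator interface:
            # it returns (next_state, reward, elapsed_time).
            s_next, reward, dt = gsmdp_step(state, a)
            q += reward + (gamma ** dt) * v_tilde(s_next)
        q /= n_samples
        if q > best_q:
            best_action, best_q = a, q
    return best_action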
Our algorithm, called online Approximate Temporal Policy Iteration (online-ATPI), is summarized in algorithm 1. Note that in algorithm 1, s actually denotes the part of the state that is observable to the policy. This makes online-ATPI adaptable to any of the sets of policy variables presented in section 3.2. We tested a version of online-ATPI on the natural state of the process.
Algorithm 1 Online-ATPI

main:
  Input: π0 or Ṽ0, s0
  loop
    TrainingSet ← ∅
    for i = 1 to Nsim do
      {(s, v)} ← simulate(Ṽ, s0)
      TrainingSet ← TrainingSet ∪ {(s, v)}
    end for
    Ṽ ← TrainApproximator(TrainingSet)
  end loop

simulate(Ṽ, s0):
  ExecutionPath ← ∅
  s ← s0
  while horizon not reached do
    action ← ComputePolicy(s, Ṽ)
    (s′, r) ← GSMDPstep(s, action)
    ExecutionPath ← ExecutionPath ∪ (s′, r)
  end while
  convert execution path to value function {(s, v)} (eqn 3)
  return {(s, v)}

ComputePolicy(s, Ṽ):
  for a ∈ A do
    Q̃(s, a) ← 0
    for j = 1 to Nsamples do
      (s′, r) ← GSMDPstep(s, a)
      Q̃(s, a) ← Q̃(s, a) + r + γ^(t′−t) Ṽ(s′)
    end for
    Q̃(s, a) ← Q̃(s, a) / Nsamples
  end for
  action ← argmax_{a∈A} Q̃(s, a)
  return action
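The paper leaves TrainApproximator abstract beyond naming SVM regression. The sketch below shows one plausible realization using scikit-learn's SVR; this is our own illustrative assumption, not the authors' implementation, and the hyperparameter values are arbitrary.

from sklearn.svm import SVR

def train_approximator(training_set):
    """Fit a kernel regressor to the (state, value) samples gathered by
    simulation; the returned callable plays the role of V-tilde."""
    states = [list(s) for s, _ in training_set]   # states as feature vectors
    values = [v for _, v in training_set]
    model = SVR(kernel="rbf", C=100.0, epsilon=0.1)
    model.fit(states, values)
    return lambda s: float(model.predict([list(s)])[0])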
4.2 Simulating GSMDP and learning

Simulation is a key aspect of ATPI. The Discrete EVents Simulation theory (DEVS) of [17] provides a general framework for specifying discrete event dynamic systems. We implemented GSMP and GSMDP extensions in the VLE multi-modeling platform [12] based on the DEVS specification; by doing so, we take advantage of the DEVS framework's properties, which fit our simulation requirements, namely:
• Event-driven simulation and time-oriented output.
• The simulation engine deals with simultaneity issues and with simulation consistency and reproducibility.
• Simulation engines such as the VLE platform [12] are readily available and built on the same discrete events simulation theory.
• Multi-modelling possibilities open the algorithm to formalisms other than MPs.
On top of that, the DEVS formalism allows for the definition of experimental frames, which would permit integration of the whole simulation and planning loop in a DEVS specification. We have not used experimental frames yet but plan to do so in future versions. Finally, we have claimed that Temporal Markov Problems present a specific structure that makes the problem both hard to deal with for classical reinforcement learning algorithms and particularly adapted to online approximate policy iteration. More specifically:
• Most reinforcement learning algorithms deal with discrete state spaces. Some approaches have been proposed ([10, 3, 6]) for dealing with continuous or hybrid states, but the topic is still very new. Often, continuous state resolution methods depend strongly on the
representation used and on the ability to calculate integrals over the probability functions. Simulation-based sampling approaches offer a different way of addressing this issue.
• When time is observable, the causality principle ensures that the process never goes back in time. This avoids loops and ensures that online policy instantiation performs fewer operations than a complete offline policy improvement step.
4.3 Example

Table 1 presents optimization results for the first four iterations of online-ATPI on the subway problem, initialized with a policy π0 that sets trains to run all day long.2 Nsim was set to 20 and Nsamples to 15, with γ = 1 (finite horizon). This simple instance of the subway problem involved 4 trains and 6 stations. The problem's specification took time-dependency and stochastic behaviour into account; for example, passenger arrival periods were represented using Gaussian distributions with means and standard deviations depending on the time of day. The state space for this problem included 22 discrete, boolean or continuous variables (including time), thus yielding a sample space of dimension 22 for the training set. In Table 1, tsim is the training set building time (which corresponds to performing the Nsim simulations) while tlearn is the SVM training time (in seconds). Ṽstat(s0) is the statistical evaluation of Ṽ(s0), while ṼSVM(s0) is the value provided by the trained SVM. Lastly, #SV is the number of support vectors in the SVM. The expected value of the initial state increases with iterations; this confirms that policy quality improves with each iteration. This increase is not necessarily linear and depends on the problem's structure. If the policy takes the simulation to states that are "far" from explored states (states for which the interpolated value might be erroneous) and that provide very bad rewards, it can happen that the initial state's expected value drops for one iteration. This is the drawback of partial exploration of the state space and interpolation: very good or very bad regions of the state space might be discovered late in the iterations. One can notice that simulation time increases with iterations. This is mainly due to the number of support vectors in the SVM. Depending on the iteration step, the SVM can be much simpler and simulation time can drop again. On the other hand, online-ATPI is still very sensitive to the initial policy, and we are currently working on other possibilities to improve solution quality (such as roll-out techniques and estimator refinement during optimization by simulation-optimization interweaving).

Table 1. Subway control policy

             π0        π1        π2        π3        π4
tsim         47.1      203.43    206.45    446.15    1504.41
tlearn       2.28      2.7       12.18     56.08     229.45
Ṽstat(s0)   -3261.31  -3188.11  -2074.74  -1850.12  -887.076
ṼSVM(s0)    -2980.29  -2962.46  -2020.22  -1837.41  -875.417
#SV          55        61        439       3588      13596
Since Nsim = 20 simulations per iteration always provide a training set of around 45,000 points for the SVM in the subway example, the number of support vectors for the SVM (and therefore the iteration duration) is bounded. Longer runs on the subway problem show that the number of support vectors and the learning time in column π4 are a good estimate of the worst-case values.

2. Experiments were run on a 1.7 GHz single-core processor with 1 GB of RAM.
5 Conclusion

This paper introduces a new reinforcement learning method for solving Generalized Semi-Markov Decision Processes. These processes are a natural and elegant way of representing the complexity of concurrent stochastic processes. In the framework of time-dependent GSMDPs with explicit time, simulation seems to be an efficient way of exploring the state space and evaluating strategies. Drawing on this idea, we introduced a simulation-based version of Approximate Policy Iteration (API), which we called online-ATPI. This algorithm incrementally improves the quality of an initial policy by making use of simulation-based evaluation, SVM regression and online policy instantiation. Although there are few theoretical results concerning the convergence and optimality of API, online-ATPI seems to perform well on an example of subway network control. Future work will deal with making online-ATPI more robust to initialization; in fact, if the initial policy does not guide the simulation towards relevant areas of the state space, the error in policy evaluation can greatly penalize the algorithm. To avoid this drawback, we plan to use incremental refining methods for simulation initialization. This could result in building a denser training set, thereby minimizing the risk of not exploring relevant parts of the state space.
REFERENCES
[1] R. E. Bellman, Dynamic Programming, Princeton University Press, Princeton, New Jersey, 1957.
[2] J. Boyan and M. Littman, 'Exact solutions to time dependent MDPs', Advances in Neural Information Processing Systems, 13, 1026-1032, (2001).
[3] D. Ernst, P. Geurts, and L. Wehenkel, 'Tree-based batch mode reinforcement learning', JMLR, 6, 503-556, (2005).
[4] Z. Feng, R. Dearden, N. Meuleau, and R. Washington, 'Dynamic programming for structured continuous Markov decision problems', in 20th Conference on Uncertainty in AI, pp. 154-161, (2004).
[5] P. Glynn, 'A GSMP formalism for discrete event systems', Proc. of the IEEE, 77, (1989).
[6] M. Hauskrecht and B. Kveton, 'Approximate linear programming for solving hybrid factored MDPs', in 9th Int. Symp. on AI and Math., (2006).
[7] M. Lagoudakis and R. Parr, 'Least-squares policy iteration', JMLR, 4, 1107-1149, (2003).
[8] R. Munos, 'Error bounds for approximate policy iteration', in Int. Conf. on Machine Learning, (2003).
[9] F. Nielsen, 'GMSim: a tool for compositional GSMP modeling', in Winter Simulation Conference, (1998).
[10] D. Ormoneit and S. Sen, 'Kernel-based reinforcement learning', Machine Learning, 49, 161-178, (2002).
[11] M. Puterman, Markov Decision Processes, John Wiley & Sons, Inc., 1994.
[12] G. Quesnel, R. Duboz, É. Ramat, and M.K. Traore, 'VLE - A Multi-Modeling and Simulation Environment', in Moving Towards the Unified Simulation Approach, Proc. of the 2007 Summer Simulation Conf., pp. 367-374, (2007).
[13] E. Rachelson, F. Garcia, and P. Fabiani, 'Extending the Bellman equation for MDP to continuous actions and continuous time in the discounted case', in 10th Int. Symp. on AI and Math., (2008).
[14] V. Vapnik, S. Golowich, and A. Smola, 'Support vector method for function approximation, regression estimation and signal processing', Advances in Neural Information Processing Systems, 9, 281-287, (1996).
[15] S. Whiteson and P. Stone, 'Evolutionary function approximation for reinforcement learning', JMLR, 7, 877-917, (2006).
[16] H. Younes and R. Simmons, 'Solving Generalized Semi-Markov Decision Processes using Continuous Phase-Type Distributions', in AAAI, (2004).
[17] B. P. Zeigler, D. Kim, and H. Praehofer, Theory of Modeling and Simulation: Integrating Discrete Event and Continuous Complex Dynamic Systems, Academic Press, 2000.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-588
Heuristics for Planning with Action Costs Revisited

Emil Keyder1 and Héctor Geffner2

Abstract. We introduce a simple variation of the additive heuristic used in the HSP planner that combines the benefits of the original additive heuristic, namely its mathematical formulation and its ability to handle non-uniform action costs, with the benefits of the relaxed planning graph heuristic used in FF, namely its compatibility with the highly effective enforced hill climbing search along with its ability to identify helpful actions. We implement a planner similar to FF except that it uses relaxed plans obtained from the additive heuristic rather than those obtained from the relaxed planning graph. We then evaluate the resulting planner in problems where action costs are not uniform and plans with smaller overall cost (as opposed to length) are preferred, where it is shown to compare well with cost-sensitive planners such as SGPlan, Sapa, and LPG. We also consider a further variation of the additive heuristic, where symbolic labels representing action sets are propagated rather than numbers, and show that this scheme can be further developed to construct heuristics that can take delete-information into account.
1 PLANNING MODEL AND HEURISTICS
We consider planning problems P = ⟨F, I, O, G⟩ expressed in Strips, where F is the set of relevant atoms or fluents, I ⊆ F and G ⊆ F are the initial and goal situations, and O is a set of (grounded) actions a with precondition, add, and delete lists Pre(a), Add(a), and Del(a) respectively, all of which are subsets of F. For each action a ∈ O, we assume that there is a non-negative cost(a), so that the cost of a plan π = a1, . . . , an is

cost(π) = Σ_{i=1}^{n} cost(ai)    (1)
This cost model is a generalization of the classical model where the cost of a plan is given by its length. Two of the heuristics used to guide the search for plans in the classical setting are the additive heuristic ha used in HSP [2], and the relaxed plan heuristic hFF used in FF [11]. Both are based on the delete relaxation P+ of the problem, and both attempt to approximate the optimal delete-relaxation heuristic h+, which is well-informed but intractable. We review these heuristics below. In order to simplify the definition of some of the heuristics, we introduce a new dummy End action with zero cost, whose preconditions G1, . . . , Gn are the goals of the problem, and whose effect is a dummy atom G. The heuristics h(s) then estimate the cost of achieving this 'dummy' goal G from s.

1. Universitat Pompeu Fabra, Passeig de Circumval·lació 8, 08003 Barcelona, Spain. email: emil.keyder@upf.edu
2. ICREA & Universitat Pompeu Fabra, Passeig de Circumval·lació 8, 08003 Barcelona, Spain. email: hector.geffner@upf.edu
1.1 The Additive Heuristic
Since the computation of the optimal delete-free heuristic h+ is intractable, HSP introduces a polynomial approximation in which subgoals are assumed to be independent in the sense that they are achieved with no 'side effects' [2]. This assumption is normally false, but results in a simple heuristic function

ha(s) =def h(G; s)    (2)

that can be computed quite efficiently in every state s visited in the search from the recursive equation:

h(p; s) =def 0 if p ∈ s, and h(ap; s) otherwise    (3)

where h(p; s) stands for an estimate of the cost of achieving the atom p from s, h(a; s) stands for an estimate of the cost of applying action a in s, and ap is a best support of fluent p in s. These two expressions are defined in turn as

h(a; s) =def cost(a) + Σ_{q ∈ Pre(a)} h(q; s)    (4)

and

ap = argmin_{a ∈ O(p)} h(a; s)    (5)
where O(p) stands for the actions in the problem that add p. Versions of the additive heuristic appear also in [6, 16, 17], where the cost of joint conditions in action preconditions or goals is set to the sum of the costs of each condition in isolation. When the ’sum’ in (4) is replaced by ’max’, the heuristic hmax is obtained [2]. The heuristic hmax , unlike the additive heuristic ha , is admissible, but less informed. The heuristics coincide and are equivalent to the optimal delete-relaxation heuristic h+ when all the actions involve a single precondition and the goal involves a single atom.
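As a concrete illustration, the fixpoint defined by equations (2)-(5) can be computed by repeated relaxation over atoms. The Python sketch below is our own minimal rendering under an assumed Strips encoding (actions as tuples of name, preconditions, add list, and cost); it is not the HSP or FF implementation.

def additive_heuristic(state, goal, actions):
    """Compute h_a and the best supporter a_p for every atom.
    `actions` is an assumed encoding: a list of (name, pre, add, cost).
    Returns (h_add, h, best_support), where h maps atoms to cost estimates."""
    INF = float("inf")
    h = {p: 0.0 for p in state}          # equation (3), base case
    best_support = {}
    changed = True
    while changed:                        # Bellman-Ford-style fixpoint
        changed = False
        for name, pre, add, cost in actions:
            h_a = cost + sum(h.get(q, INF) for q in pre)   # equation (4)
            for p in add:
                if h_a < h.get(p, INF):                    # equation (5)
                    h[p], best_support[p] = h_a, (name, pre)
                    changed = True
    h_add = sum(h.get(g, INF) for g in goal)   # dummy End action with zero cost
    return h_add, h, best_support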
1.2 The Relaxed Planning Graph Heuristic
The planner FF modifies HSP along two dimensions: the heuristic and the search algorithm. Unlike ha, the heuristic hFF used in FF makes no independence assumption for approximating h+, computing instead one plan for P+ which is not guaranteed to be optimal. This is done by a Graphplan-like procedure [1], which due to the absence of deletes constructs a planning graph with no mutexes, from which a plan πFF(s) is extracted backtrack-free [11]. The heuristic hFF(s) is then set to |πFF(s)|. The basic search procedure in FF is not best-first as in HSP but (enforced) hill-climbing (EHC), in which the search moves from a state s to a neighboring state s′ with smaller heuristic value by performing a breadth-first search. This breadth-first search is carried out with a reduced branching factor, ignoring actions a that are not found to be 'helpful'. The 'helpful actions' in
a state s are the actions applicable in s that add the precondition p of an action in πFF(s) for p ∉ s. The use of EHC search, along with the pruning of non-helpful actions, are the key factors that make FF scale up better than HSP in general [11], but due to its construction, the heuristic hFF cannot be extended easily to take action costs into account (yet see [7]).
1.3 Relaxed Plans without Planning Graphs
A simple variation of the additive heuristic can be defined that is cost sensitive and results in relaxed plans compatible with helpful action pruning and EHC search. For this, the best support ap of each atom p in the state s, calculated as part of the computation of the heuristic ha(s) in Equation 5, is stored.3 The definition of the set of actions πa(s) that make up a relaxed plan then simply collects these best supports backwards from the goal:

πa(s) =def π(G; s)
π(p; s) =def {} if p ∈ s, and {ap} ∪ ⋃_{q ∈ pre(ap)} π(q; s) otherwise
Intuitively, the relaxed plan π(p; s) is empty if p ∈ s, and is the union of the best supporter ap for p with the relaxed plans for each of its preconditions q ∈ pre(ap) otherwise. Note that πa(s), being a set of actions, can contain an action at most once. The same construction, captured by Equation 6, underlies the construction of the relaxed plan πFF(s) computed by FF from the relaxed planning graph. For this, however, the best supports ap that encode the 'best' actions for achieving the atom p in the relaxation must be obtained from the hmax heuristic and not from ha; a modification that just involves replacing the sum operator in (4) with the max operator. The hmax heuristic is known to encode the first level of the relaxed planning graph that contains a given action or fact. It is simple to prove that the collection of actions in πa(s) represents a plan from s in the delete relaxation P+. This relaxed plan, unlike the relaxed plan πFF(s), is sensitive to action costs, and can be used in FF in place of πFF(s). We call the resulting planner FF(ha).
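Continuing the sketch started above, the backward collection of best supporters can be written directly from the recursive definition. This is again our own illustrative rendering, reusing the hypothetical best_support table from the previous snippet and assuming the relaxation is solvable for every goal atom.

def extract_relaxed_plan(state, goal, best_support):
    """Collect best supporters backwards from the goal; the result is a
    set of action names forming a relaxed plan (each action at most once)."""
    plan, seen = set(), set()

    def collect(p):
        if p in state or p in seen:
            return
        seen.add(p)
        name, pre = best_support[p]   # best supporter a_p of atom p
        plan.add(name)
        for q in pre:                 # recurse on the preconditions of a_p
            collect(q)

    for g in goal:
        collect(g)
    return plan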
2 THE FF(ha) PLANNER
In FF(ha), the relaxed plans πa(s) are produced by computing the additive heuristic using a Bellman-Ford algorithm while keeping track of the chosen lowest-cost supporter for each atom, and then recursively collecting the best supporters starting from the goal. The heuristic h(s) used for measuring progress in FF(ha) is defined as the relaxed plan cost Σ_{a ∈ πa(s)} cost(a) and not as its length |πa(s)|. This heuristic, which is obtained from the computation of the additive heuristic ha, is almost equivalent to ha(s) but does not count the cost of an action more than once. The EHC search used in FF(ha) is a slightly modified version of that used in FF. While a single step of EHC in FF ends as soon as a state s′ is found by breadth-first search from s such that h(s′) < h(s), in FF(ha), all states s′ resulting from applying a helpful action a in s are evaluated, and among those for which h(s′) < h(s) holds, the action minimizing the expression cost(a) + h(s′) is selected. Like in FF, the helpful actions in s are the actions applicable in s that add the precondition p of an action in πa(s) such that p ∉ s.
3. We assume that ties in the selection of the best supports ap are broken arbitrarily. The way ties are broken does not affect the value of the additive heuristic ha(s) in a state s but may affect the value of the heuristic defined below. The same is true for FF's heuristic.
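The modified EHC step just described can be summarized in a few lines. The sketch below is our reading of it, with hypothetical helpers helpful_actions, apply, h, and cost standing in for the planner's internals.

def ehc_step(s, helpful_actions, apply, h, cost):
    """One enforced hill-climbing step of FF(h_a): evaluate all helpful
    successors and pick the improving action minimizing cost(a) + h(s')."""
    best, best_score = None, float("inf")
    for a in helpful_actions(s):
        s_next = apply(s, a)
        if h(s_next) < h(s):                  # improving successors only
            score = cost(a) + h(s_next)
            if score < best_score:
                best, best_score = (a, s_next), score
    return best   # None triggers a deeper breadth-first search, as in FF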
FF(ha) is implemented on top of the Metric-FF planner [10] because of its ability to handle numeric fluents, through which non-uniform action costs are currently expressed in PDDL. FF(ha) does not make use of numeric fluents for any other purpose besides representing action costs.
3 EXPERIMENTAL RESULTS
We evaluated the performance of FF(ha) in comparison with other cost-sensitive planners, namely SGPlan [5], LPG-quality [8] and Sapa [6]4 on 11 domains.5 For reference, the curves also show the plan times and costs obtained by running FF, which ignores cost information, and FF-quality, an option in Metric-FF that optimizes a given plan metric by using an FF-like heuristic in a Weighted A* search [10]. Experiments were performed with eleven domains, five of them taken from the numeric track of the Third International Planning Competition (IPC3). Of these 5 domains, the Depots, Rovers, Satellite, and Zenotravel domains were modified by removing all occurrences of numeric variables from action preconditions and goals, once the action cost information was extracted from the PDDL. Also, as a reference, all planners except LPG were evaluated on the STRIPS (uniform cost) versions of these domains, and all planners were evaluated on 6 new domains introduced here, which were constructed with the aim that the length of solutions not correlate with their cost. Indeed, in two of these domains, the Minimum Spanning Tree and Assignment domains, all valid solutions contain the same number of actions. The other domains are: Shortest Path (shortest-path problems), Colored Blocksworld (blocks have colors and colors must be stacked in certain ways in the goal, with costs associated with the different blocks), Delivery (a variation of the IPC5 domain TPP), and Simplified Rovers (a domain adapted from [17], in which a robot must collect samples from rocks in a grid). Moreover, for S. Rovers, both hard goal and soft goal versions were used, with the soft goals being compiled away into action costs, following the procedure described in [13].6 The experiments were run on a CPU running at 2.33 GHz with 8 GB of RAM. Execution time was limited to 1,800 seconds. The results, including plan costs and planning times for the various planners, are reported in the figures. Some observations about the results follow.

Quality of Plans: In almost all of the domains, FF(ha) produces the best plans, with the exception of the hard-goal version of S. Rovers (Fig. 3c), where it does particularly badly, and of the soft-goal version (Fig. 3b). In both cases, LPG does better, although the opposite occurs in several domains, such as Delivery (Fig. 1a), Satellite (Fig. 2a), and the Assignment Problem (Fig. 2c). Sapa produces plans that are close to the best quality plans in all the domains for which it can be executed, yet is usually able to solve only the smallest instances in each domain. FF-quality suffers from a similar problem, solving a significant proportion of the instances in a few domains only.7 Overall, SGPlan does not appear to produce better plans than FF, even if FF ignores costs completely, and both produce plans that are often much worse than FF(ha).

4. Sapa was compiled from Java to native machine code with the GNU compiler. We were later informed by the authors that this results in a slowdown of approximately 50% compared to the version running on the Java virtual machine.
5. LPG and Sapa could not be run on some of the domains due to bugs.
6. We cannot provide further details on these domains due to lack of space, but the PDDL files are available from the authors.
7. For clarity, FF-quality's results are shown only for domains in which it was able to solve a significant number of instances.
Figure 1. (a) Plan costs - Delivery domain. (b) Plan costs - Shortest Path domain. (c) Plan costs - Minimum Spanning Tree domain. (d) Planning times - Minimum Spanning Tree domain.

Figure 2. (a) Plan costs - Satellites domain. (b) Length of the plans above in Satellites. (c) Plan costs - Assignment Problem domain. (d) Plan costs - Zenotravel domain.
In the STRIPS versions of the five IPC3 domains (unit costs), all planners produce plans of roughly equal quality.

Planning Times: FF(ha) is somewhat slower than FF on most problems, though the difference is usually a constant factor (Fig. 1d).8 There are two main reasons for this. The first is that computing ha and extracting the associated relaxed plan πa is somewhat more costly than the equivalent operation on the relaxed planning graph, so FF(ha) takes longer to perform the same number of heuristic evaluations as FF. In general, hFF evaluates states 2-10 times faster than ha. The second is that while FF minimizes the number of actions in the plan, FF(ha) minimizes the cost of the plan, which in some cases leads to longer plans, requiring more search nodes and more heuristic evaluations (Fig. 2b). SGPlan takes roughly the same amount of time as FF on almost all domains considered, while LPG is roughly an order of magnitude slower than the other planners except Sapa, but appears to have better scaling behaviour. Sapa is slower than LPG by roughly one order of magnitude.

8. We omit further data on planning time due to space considerations.
4 FURTHER VARIATIONS OF THE ADDITIVE HEURISTIC
We consider briefly two further variations of the additive heuristic: the set-additive heuristic and the TSP heuristic, both analyzed in more detail in [12, 13].
4.1 The Set Additive Heuristic

In the additive heuristic, the value h(ap; s) of the best supporter ap of p in s is propagated to obtain the heuristic value h(p; s) of p. In contrast, in the set-additive heuristic, the best supporter ap of p is itself propagated, with supports combined by set-union rather than by sum, resulting in a recursive function π(p; s) that represents the set of actions in a relaxed plan for p in s, which can be defined similarly to h(p; s) as:

π(p; s) = {} if p ∈ s, and π(ap; s) otherwise    (6)

where

ap = argmin_{a ∈ O(p)} Cost(π(a; s))    (7)
π(a; s) = {a} ∪ ⋃_{q ∈ Pre(a)} π(q; s)    (8)
Cost(π(a; s)) = Σ_{a′ ∈ π(a;s)} cost(a′)    (9)

The set-additive heuristic hsa(s) for a state s is then defined as

hsa(s) = Cost(π(G; s))    (10)

It is easy to show that the collections of actions π(p; s) represent, for each atom p, a plan for achieving p in the delete-relaxation P+; in the set-additive heuristic these plans are computed recursively, starting with the trivial (empty) plan for the atoms p ∈ s. From a practical point of view, this recursive computation does not appear to be cost-effective in general, as the relaxed plans πa(p; s) obtained from the normal additive heuristic are normally as good and can be computed faster. Yet the planner FF(hsa), obtained from FF by replacing the relaxed plans πFF(s) by π(G; s) above, compares well with existing cost-sensitive planners (see [12]), and the formulation of the set-additive heuristic opens the door to the formulation of a broader family of heuristics.

4.2 The TSP Heuristic

The set-additive heuristic can be generalized by replacing the plans π(p; s) with more generic labels L(p; s) that can be numeric, symbolic, or a suitable combination, provided that there is a function Cost(L(p; s)) mapping labels L(p; s) to numbers. Here we consider labels L(p; s) that result from treating one designated multivalued variable X in the problem in a special way. A multivalued variable X is a set of atoms x1, . . . , xn such that exactly one xi holds in every reachable state. For example, in a task where there are n rocks r1, . . . , rn to be picked up at locations l1, . . . , ln, the set of atoms at(l0), at(l1), . . . , at(ln), where at(l0) is the initial agent location, represents one such variable, encoding the possible locations of the agent. If the cost of going from location li to location lk is c(li, lk), then the cost of picking up all the rocks is the cost of the best (min cost) path that visits all the locations, added to the costs of the pickups. This problem is a TSP and therefore intractable, but its cost can be approximated by various fast suboptimal TSP algorithms.9 By comparison, the delete relaxation approximates the cost of the problem as the cost of the best tree rooted at l0 that spans all of the locations. The modification of the labels π(p; s) in the set-additive heuristic allows us to move from the approximate model captured by the delete relaxation to approximate TSP algorithms over a more accurate model (see [15] for other uses of OR models in planning heuristics). For this, we assume that the actions that affect the selected multivalued variable X do not affect other variables in the problem, and maintain in each label π(p; s) two disjoint sets: a set of actions that do not affect X, and the set of X-atoms required as preconditions by these actions. The heuristic hX(s) is then defined as

hX(s) = CostX(π(G; s))    (11)

where CostX(π) is the sum of the action costs for the actions in π that do not affect X, plus the estimated cost of the 'local plan' [4] that generates all the X-atoms in π, expressed as

CostX(π) = Cost(π ∩ X̄) + CostTSP(π ∩ X)    (12)

where

π(p; s) = {} if p ∈ s; {p} if p ∈ X; π(ap; s) otherwise
ap = argmin_{a ∈ O(p)} CostX(π(a; s))
π(a; s) = {a} ∪ ⋃_{q ∈ Pre(a)} π(q; s)

and CostTSP(R) is the cost of the best path spanning the set of atoms R, starting from the value of X in s, in a directed graph whose nodes stand for the different values x of X, and whose edges (x, x′) have costs that encode approximations of the cost of achieving x′ from x in s (see [13] for details). We have implemented the planner FF(hX), in which hX, rather than ha, is used to derive the relaxed plan, with the X variables being automatically chosen as the root variables of the causal graph [3, 9]. This planner produces plans of much lower cost than any other planner tested in the soft goals version of the Simplified Rovers domain (Fig. 3b), and plans of much lower cost than any other planner except LPG in the hard goals version (Fig. 3c), where LPG produces plans of only slightly worse quality.

9. In our planner we have implemented the 2-opt algorithm discussed in [14].
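CostTSP must be estimated by a fast suboptimal method; the paper mentions a 2-opt algorithm [14]. The following sketch is our own simplified stand-in, not the authors' implementation: it estimates an open path cost with a nearest-neighbor tour improved by 2-opt moves over a given distance matrix.

def tsp_path_cost(dist, start, targets):
    """Approximate the cost of the cheapest path from `start` through all
    `targets`: nearest-neighbor construction followed by 2-opt improvement."""
    # Greedy nearest-neighbor tour
    tour, current, remaining = [start], start, set(targets)
    while remaining:
        nxt = min(remaining, key=lambda x: dist[current][x])
        tour.append(nxt)
        remaining.remove(nxt)
        current = nxt

    def cost(t):
        return sum(dist[a][b] for a, b in zip(t, t[1:]))

    # 2-opt: reverse segments while doing so shortens the path
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if cost(candidate) < cost(tour):
                    tour, improved = candidate, True
    return cost(tour)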
Figure 3. (a) Plan costs - Colored Blocksworld domain. (b) Plan costs - soft goals version of S. Rovers domain. (c) Plan costs - hard goals version of S. Rovers domain. (d) Planning times - hard goals version of S. Rovers domain.

5 DISCUSSION
We have shown that relaxed plans, and therefore helpful actions, can be computed without the use of a relaxed planning graph, meaning that other heuristics can be used in conjunction with FF's powerful EHC search. Our method of relaxed plan extraction using the additive heuristic is cost-sensitive and does not impose a large overhead over that of FF. Furthermore, a simple planner that combines the relaxed plan extracted in this way with the EHC search algorithm compares favourably to the state of the art in planning with action costs. Two other variations of the additive heuristic were also presented: the set-additive heuristic, in which the relaxed plans are computed recursively, and the TSP heuristic, which takes delete-information into account. In both cases, labels are propagated rather than numbers in the equation characterizing the additive heuristic. Used together with EHC search, the TSP heuristic produces plans of much lower cost than any other planner tested in navigation problems where finding good paths going through a set of locations is critical. Our implementation of the TSP heuristic, however, is preliminary, and is suited only for problems where these locations correspond to the values of a single root variable in the causal graph.
ACKNOWLEDGEMENTS
We thank the reviewers for useful comments and J. Hoffmann for making the sources of Metric-FF available. H. Geffner is partially supported by grant TIN2006-15387-C03-03 from MEC/Spain.
REFERENCES
[1] A. Blum and M. Furst, 'Fast planning through planning graph analysis', in Proc. IJCAI-95, pp. 1636-1642, (1995).
[2] B. Bonet and H. Geffner, 'Planning as heuristic search', Artificial Intelligence, 129(1-2), 5-33, (2001).
[3] R. Brafman and C. Domshlak, 'Structure and complexity of planning with unary operators', JAIR, 18, 315-349, (2003).
[4] R. Brafman and C. Domshlak, 'Factored planning: How, when, and when not', in Proc. AAAI-06, (2006).
[5] Y. Chen, B. W. Wah, and C. Hsu, 'Temporal planning using subgoal partitioning and resolution in SGPlan', JAIR, 26, 323-369, (2006).
[6] M. Do and S. Kambhampati, 'Sapa: A domain-independent heuristic metric temporal planner', in Proc. ECP 2001, pp. 82-91, (2001).
[7] R. Fuentetaja, D. Borrajo, and C. Linares, 'Improving relaxed planning graph heuristics for metric optimization', in Proc. 2006 AAAI Workshop on Heuristic Search, (2006).
[8] A. Gerevini, A. Saetti, and I. Serina, 'Planning through stochastic local search and temporal action graphs in LPG', JAIR, 20, 239-290, (2003).
[9] M. Helmert, 'A planning heuristic based on causal graph analysis', in Proc. ICAPS-04, pp. 161-170, (2004).
[10] J. Hoffmann, 'The Metric-FF planning system: Translating "ignoring delete lists" to numeric state variables', JAIR, 20, 291-341, (2003).
[11] J. Hoffmann and B. Nebel, 'The FF planning system: Fast plan generation through heuristic search', JAIR, 14, 253-302, (2001).
[12] E. Keyder and H. Geffner, 'Heuristics for planning with action costs', in Proc. Spanish AI Conference (CAEPIA 2007), volume 4788 of Lecture Notes in Computer Science, pp. 140-149. Springer, (2007).
[13] E. Keyder and H. Geffner, 'Set-additive and TSP heuristics for planning with action costs and soft goals', in ICAPS-07 Workshop on Heuristics for Domain Independent Planning, (2007).
[14] S. Lin and B. W. Kernighan, 'An effective heuristic algorithm for the TSP', Operations Research, 21, 498-516, (1973).
[15] D. Long and M. Fox, 'Automatic synthesis and use of generic types in planning', in Proc. AIPS-2000, pp. 196-205, (2000).
[16] O. Sapena and E. Onaindia, 'Handling numeric criteria in relaxed planning graphs', in Advances in Artificial Intelligence: Proc. IBERAMIA 2004, LNAI 3315, pp. 114-123. Springer, (2004).
[17] D. E. Smith, 'Choosing objectives in over-subscription planning', in Proc. ICAPS-04, pp. 393-401, (2004).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-593
Diagnosis of Simple Temporal Networks

Nico Roos1 and Cees Witteveen2

Abstract. In many domains successful execution of plans requires careful monitoring and repair. Diagnosis of plan execution supports this process by identifying causes of plan failure. Most plans have to satisfy temporal constraints, and an important and commonly occurring problem during plan execution is the violation of temporal plan constraints. This paper addresses diagnosis of such temporal constraint violations by modeling the temporal aspects of a plan as a Simple Temporal Network (STN). We investigate the computational properties of standard diagnostic concepts, but we also argue that traditional notions of preferred diagnoses, such as minimum diagnosis, are not adequate. A new notion of maximum confirmation diagnosis is introduced.
1 Introduction
A Simple Temporal Network (STN) [8] provides a way to describe (i) a plan, (ii) temporal aspects of plan steps, and (iii) temporal relations between plan steps. It also enables the description of schedule constraints, and of observations about the temporal execution of the plan, using the same formalism. The observations may violate the temporal constraints of the plan or its schedule, giving rise to a Simple Temporal Diagnosis (STD) problem. Diagnosis should identify the plan and scheduling constraints that have been violated during plan execution. A Simple Temporal Diagnosis problem is related to a Simple Temporal Problem (STP) [8]. An STP addresses the identification of an allowable schedule for an STN. The STD problem extends this by identifying where the actual execution schedule starts to deviate from the allowable schedule. Note that STD may also be used prior to plan execution if an STP does not have an allowable schedule.
2 Running example

To illustrate the ideas presented in the following sections, we will use a problem from the domain of Air Traffic Control as a running example. Flight KL 123 has a delayed departure: a delayed takeoff at 16:30 instead of the scheduled takeoff at 15:55-16:00. The taxiing time of 15-20 minutes incurred no delays; in fact, flight KL 123 had a delayed off-block time. At the gate, flight KL 123 incurred a delay because of the catering. Catering was scheduled to start the delivery of food between 15:15 and 15:30, and the food must arrive at the airplane 10 to 30 minutes before the off-block. The actual delivery time was 16:00. Flight KL 123 was also delayed because the flight had to wait for transfer passengers from flight NW 456. At least 30 minutes are required for the transfer of passengers between flights. Flight NW 456 arrived at the gate at 16:05, while it was scheduled to arrive at the gate between 14:55 and 15:00. The cause of its delay was a delayed departure of 15 minutes at JFK and an additional delay during the flight caused by unexpected head-winds.

Figure 1 shows the schedule of the plan and the actual execution of the plan. The figure shows the time lines for the flights KL 123 and NW 456, and the time line for the catering. The blocks drawn on the time lines represent the plan steps; the length of a block roughly indicates the duration of the plan step. The uncertainty about the start or finish of a plan step is indicated by the time intervals below each time line. Also note the white blocks that are placed above instead of on the time line. These blocks indicate (i) the 'waiting time' between the on-block time of flight NW 456 and the off-block time of flight KL 123, in which passengers are transferred between the two flights, and (ii) the 'waiting time' between the finish of the catering service and the off-block time of flight KL 123.

Figure 1. The schedule and the execution of two flights and the catering.

1. Maastricht ICT Competence Center, Universiteit Maastricht, P.O. Box 616, NL-6200 MD Maastricht, email: roos@micc.unimaas.nl
2. Faculty EEMCS, Delft University of Technology, P.O. Box 5031, NL-2600 GA Delft, email: C.Witteveen@tudelft.nl
The goal of diagnosis is to determine, using partial observations of the plan execution, to what extent the plan constraints describing plan step durations and time restrictions on successive plan steps, as well as the schedule, are satisfied.
3 Preliminaries
Simple Temporal Networks An STN (E, C) describes a plan and its schedule by a set of events E and a set of constraints C over the events. Events denote such things as the start start(s) of a plan step
s and the finish finish(s) of s. The constraints are used to specify the durations of plan steps, the precedence relations between plan steps, and the plan's schedule. It is also possible to specify requirements such as the requirement that a plan step must start within δ minutes after the finish of its preceding plan step. To describe a constraint, we associate a variable t_e with each event e ∈ E. These variables take values in some dense time domain Time. We assume Time to be a total order with element 0 and maximum element ∞. A constraint c ∈ C specifies the allowed temporal difference between two events: lb ≤ t_e′ − t_e ≤ ub, where e and e′ are events in E, lb, ub ∈ Time and 0 ≤ lb ≤ ub. Constraints define a strict precedence relation ≺ on the set E. We say that e directly precedes e′ iff lb ≤ t_e′ − t_e ≤ ub ∈ C and ub > 0. The transitive closure of the direct precedences defines the precedence relation; i.e., e precedes e′ iff e ≺+ e′. Relating an STN to a traditional plan description P = (S, ≺), the duration of a plan step s is described by 0 < lb ≤ t_finish(s) − t_start(s) ≤ ub. A precedence constraint s ≺ s′ is described by lb ≤ t_start(s′) − t_finish(s) ≤ ub. Note that in the standard interpretation of a precedence constraint, lb = 0 and ub = ∞. A schedule is a placement of events on the timeline. To describe a schedule we need a special event '0' marking the start of the timeline; i.e., t_0 = 0. This enables us to schedule the period in which an event e ∈ E should take place: lb ≤ t_e − t_0 ≤ ub; i.e., lb ≤ t_e ≤ ub. Figures 2 and 3 show the plan of our running example and the corresponding STN, respectively. Since there are no gaps between plan steps such as 'flight', 'landing', 'taxiing', and so on (i.e., precedence constraints of the form 0 ≤ t_start(s′) − t_finish(s) ≤ 0 hold between successive plan steps), and since these constraints cannot be violated, in Figure 3 we have chosen to represent the finish and start of successive plan steps of a flight by a single event.
Figure 2. The plan.
Figure 3. The Simple Temporal Network.
Semantics The constraints of an STN place restrictions on the way a plan may be executed: the execution schedule. An execution schedule for the set of events E of an STN (E, C) is a function σ : E → Time. We say that an execution schedule σ satisfies the constraints C, denoted by σ |= C, iff lb ≤ σ(e′) − σ(e) ≤ ub holds for every constraint lb ≤ t_e′ − t_e ≤ ub ∈ C. An execution schedule satisfying every constraint in C is called an allowable schedule. The identification of an allowable schedule for an STN is called a Simple Temporal Problem (STP) [8]. It is well known that an STN has an allowable execution schedule iff its underlying labeled graph contains no negative cycles.3

We say that a constraint c : a ≤ t_e′ − t_e ≤ b is entailed by a set of constraints C, denoted by C |= c, iff every allowable schedule for C satisfies c. Given a constraint c : a ≤ t_e′ − t_e ≤ b, we say that c′ : a′ ≤ t_e′ − t_e ≤ b′ is a tightening of c, denoted by c′ |= c, iff a ≤ a′ ≤ b′ ≤ b. There is a sound and complete derivation procedure (|−) for determining the most tightened constraint c : a ≤ t_e′ − t_e ≤ b entailed by a set of constraints C: C |− c iff C |= c.

Observations During the execution of a plan, observations can be made. These observations may pertain to the time difference observed between two events e and e′ as specified in the plan, or may pertain to the time at which a certain event e ∈ E takes place. We assume that the first type of observation is described by some constraint a ≤ t_e − t_e′ ≤ b indicating that we have observed that event e occurred at least a time steps, but within b time steps, after e′. The second type of observation is given by a constraint a ≤ t_e − t_0 ≤ b indicating that e occurred after a time units but before b time units have passed (after the occurrence of the time reference event '0'). The set of observations containing these constraints is denoted by Obs. In the running example, we have the following observations: the delayed takeoff of flight KL 123 at 16:30, the catering starting at 16:00, and the delayed arrival at the gate of flight NW 456 at 16:05. These observations are described by constraints of the form 16:30 ≤ t_e − t_0 ≤ 16:30, 16:00 ≤ t_e − t_0 ≤ 16:00, and 16:05 ≤ t_e − t_0 ≤ 16:05 on the corresponding events e, respectively.

Compatibility An important notion is the compatibility between the STN specification (E, C) and the set of observations Obs. We say that the set of observations is compatible with the plan specification if we can find an execution schedule σ that satisfies the original set of constraints C as well as the set Obs; i.e., the STN (E, C ∪ Obs) has an allowable schedule.

Qualifications If an STN (E, C) is not compatible with a set Obs of observations, some constraints in C must have been violated directly or indirectly by some of the observations. To restore the compatibility between plan and observations we need to indicate which constraints have been violated. Clearly, if a plan constraint c is violated, some part of the plan is executed in an abnormal way. To indicate such an abnormal execution we introduce a qualification Q of constraints. Given an STN (E, C), a qualification Q is a function Q : C → H assigning a health mode to every constraint in C. We distinguish the following health modes:
1. We use the mode Q(c) = nor to denote the normal execution of a constraint c ∈ C; i.e., c has not been violated;
2. we use Q(c) = ab to denote the abnormal execution of a constraint c without exactly specifying how it is violated; and
3. we use a real number δ ∈ R to denote the degree to which a constraint is violated: Q(c) = δ.
Note that the last health mode describes how much shorter or longer the temporal difference between two events is with respect to what is specified by the constraint.
3. A negative cycle refers to the fact that, first of all, a constraint c : lb ≤ t_e′ − t_e ≤ ub is equivalent to the following two inequalities: t_e − t_e′ ≤ −lb and t_e′ − t_e ≤ ub. Next, such inequalities can be composed: if t_e′ − t_e ≤ ub and t_e′′ − t_e′ ≤ ub′, then t_e′′ − t_e = (t_e′′ − t_e′) + (t_e′ − t_e) ≤ ub + ub′. If we can derive an inequality t_e − t_e < 0, a clear inconsistency (a negative cycle) has been detected [8].
Qualifications will be used to restore the compatibility between observations and plan executions as follows:
1. For any constraint c : lb ≤ t_e′ − t_e ≤ ub, if Q(c) = ab, we assume that the constraint is not respected anymore. So c will be replaced by its weakest implicate −∞ ≤ t_e′ − t_e ≤ ∞, which in fact comes down to removing c from C.
2. If Q(c) = δ ∈ R, then c : lb ≤ t_e′ − t_e ≤ ub will be replaced by the constraint lb + δ ≤ t_e′ − t_e ≤ ub + δ. Since the durations of plan steps and the waiting times between successive plan steps cannot be negative, we require that Q(c) ≥ −1 · lb.
We will use the update function upd(C, Q) to denote the modification of the constraints C using qualification Q:

upd(C, Q) = {c ∈ C | Q(c) = nor} ∪ {lb + δ ≤ t_e′ − t_e ≤ ub + δ | c : lb ≤ t_e′ − t_e ≤ ub ∈ C, Q(c) = δ}

Note that qualifying a constraint with the health mode 'ab' increases the uncertainty expressed by the constraint, i.e., the difference between its upper and lower bound. Qualifying with a health mode δ ∈ R does not change the expressed uncertainty, since (ub + δ) − (lb + δ) = ub − lb.
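To make the negative-cycle test of footnote 3 concrete: compatibility of an STN can be checked by running Bellman-Ford on the distance graph of its constraints. The sketch below is a minimal illustration we added, not the authors' code; constraints are assumed to be encoded as tuples (e, e2, lb, ub) meaning lb ≤ t_e2 − t_e ≤ ub.

def compatible(events, constraints):
    """Check whether an STN has an allowable schedule, i.e. whether its
    distance graph (edge e->e2 of weight ub, edge e2->e of weight -lb)
    contains no negative cycle (Bellman-Ford)."""
    edges = []
    for e, e2, lb, ub in constraints:
        edges.append((e, e2, ub))    # t_e2 - t_e <= ub
        edges.append((e2, e, -lb))   # t_e - t_e2 <= -lb
    dist = {e: 0.0 for e in events}  # simultaneous relaxation from all nodes
    for _ in range(len(events) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    # one more pass: any further relaxation implies a negative cycle
    return all(dist[u] + w >= dist[v] for u, v, w in edges)

In these terms, a qualification Q restores compatibility exactly when compatible succeeds on the events together with upd(C, Q) ∪ Obs.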
4 Diagnosis
Classical Model-Based Diagnosis (MBD) addresses the identification of failing components in some system. In MBD, two types of diagnosis are distinguished: abductive and consistency based diagnosis. Abductive diagnosis can be viewed as a special case of consistency based diagnosis in which we have complete knowledge of both the way components may fail and how failing components behave.4 Since, in abstraction, diagnosis of constraint violations in an STN is most closely related to MBD, we will use the terminology of classical MBD. Note, however, that unlike in MBD, we do not have components to be diagnosed. Instead we diagnose temporal constraints. We distinguish two types of diagnosis: diagnosis without fault models, where only the health modes nor and ab are used, and diagnosis with fault models.
4.1 Diagnosis without fault models
We consider consistency based diagnosis without fault models. That is, we try to make the STN compatible with the observations by identifying the constraints that could have been violated, without considering how the constraints are violated. We therefore restrict ourselves to qualifications Q that map constraints to nor or ab. As we remarked before, constraints qualified as abnormal will be removed from the set of constraints C defined by the STN (E, C).

Definition 1 Let (E, C) be an STN and let Obs be the constraints describing the observations made. Moreover, let Q : C → H be a qualification such that for every c ∈ C, Q(c) ∈ {nor, ab}. The qualification Q is a consistency based diagnosis without fault models iff the STN (E, {c ∈ C | Q(c) = nor} ∪ Obs) has an allowable schedule.4
4. Diagnosis of Discrete Event Systems (DESs) is another form of model-based diagnosis that addresses the identification of failure events that change the states of components in dynamic systems.
In general, there may not be a unique diagnosis given the observations made. In fact, the number of diagnoses can be quite large. For instance, in the absence of fault models, if Q is a diagnosis, then every Q′ such that {c ∈ C | Q′(c) = nor} ⊆ {c ∈ C | Q(c) = nor} is also a diagnosis.

Dependencies between constraint violations Among the set of diagnoses, some diagnoses are considered to be better than others. To select the most likely diagnosis, preference orders are defined in MBD on the set of diagnoses. These preference orders are all based on the underlying assumption that fault probabilities are independent of each other. This assumption does not hold for the constraints of an STN. In particular, the schedule constraints are not independent of other schedule constraints, plan duration constraints and precedence constraints. For instance, a delay in the boarding of passengers may imply a violation of the scheduled takeoff time. To illustrate the problem of dependencies more clearly, consider the plan depicted in Figure 4. Suppose that we make the observations 11:00 ≤ t5 − t0 ≤ 11:05 and 10:55 ≤ t6 − t0 ≤ 11:00. Clearly, the schedule constraints c0-5 : 10:40 ≤ t5 − t0 ≤ 10:50 and c0-6 : 10:45 ≤ t6 − t0 ≤ 10:50 are violated, and a minimum diagnosis qualifies these constraints as abnormal (ab) while qualifying all other constraints as normal (nor). A diagnosis in which the schedule constraints c0-5, c0-6 and the plan constraint c1-2 : 14 ≤ t2 − t1 ≤ 23 are qualified as abnormal (ab) is not a minimum diagnosis. Since a violation of the plan constraint c1-2 (e.g., its execution taking at least 15 minutes longer) implies the violations of the schedule constraints c0-5 and c0-6, we should only count c1-2 when determining a minimum diagnosis. The violations of c0-5 and c0-6 are not independent of the violation of c1-2.
Figure 4. Dependencies between constraints.
The above example shows us that notions such as minimum diagnosis cannot be defined considering all the violated constraints. Instead, we should consider an "independent core" of a diagnosis Q. To identify the independent core, we first define a causal dependency between a constraint c and a set of constraints D. The idea is that the upper and lower bound of c cannot be chosen independently of the constraints in D. Moreover, for the constraint c : lb ≤ t_e′ − t_e ≤ ub to causally depend on D, no event of a constraint c′ in D may occur after the event e′.

Definition 2 A constraint c : lb ≤ t_e′ − t_e ≤ ub ∈ C depends on a set of constraints D not containing c iff
• D is a minimal subset of C such that for some choices of lb and ub, D ∪ {c} has no allowable schedule;
• for no event e′′ specified in a constraint in D, e′ precedes e′′.

The independent core of a diagnosis Q can now be determined by identifying the constraints in Q that (i) are qualified with the health mode ab, and (ii) do not depend on other constraints that are qualified with the health mode ab in Q.
Definition 3 Let Q be a diagnosis and let C′ ⊆ C be a set of constraints. C′ is an independent core of Q iff C′ consists of all constraints c ∈ C such that Q(c) = ab and, for all sets of constraints D ⊆ C on which c depends, there is no c′ ∈ D such that Q(c′) = ab.

Minimum diagnoses In MBD, one usually prefers minimum diagnoses. The rationale behind this preference is that the probability of n faults is usually much smaller than the probability of m faults for n > m, provided that fault probabilities are independent. In an STN the independence requirement does not hold. Therefore, minimum diagnoses must be defined with respect to the independent core of a diagnosis Q. In the running example, we observe a delayed takeoff of flight KL 123, a delayed on-block of flight NW 456 and a delay in the finish of the catering service. One possible diagnosis Q qualifies as abnormal (ab) the scheduled takeoff time of flight KL 123, the flying time of flight NW 456 and its scheduled on-block time, and the scheduled starting time of the catering service. All other constraints are qualified as normal (nor). The independent causal core of Q consists of the constraints specifying the flying time of flight NW 456 and the scheduled starting time of the catering service. Since the number of constraints in the independent causal core is minimal, Q is a minimum diagnosis.

Theorem 1 Finding a diagnosis with a minimum independent core for an STD problem is an NP-hard problem.

We prove NP-hardness by reducing the well-known NP-complete Feedback Arc Set problem [9] to the problem of finding a diagnosis with a minimum independent core. Consider an instance I = (G(V, A), K) of the Feedback Arc Set problem. We construct an instance f(I) = (P(E, C), Obs) of the temporal diagnosis problem by specifying the plan P(E, C) as follows:
• For every node v ∈ V we create two events e1_v and e2_v in E;
• For every arc (v, w) ∈ A we add a temporal constraint 1 ≤ t_{e2_w} − t_{e1_v} ≤ ∞ to the temporal network.
Note that the source of an arc (v, w) in the graph G(V, A) is always an e1_v-event in the temporal network, while the target is always an e2_w-event; see Figure 5.
Figure 5. Reduction of a Feedback Arc Set problem to STD.
It is easy to see that this plan has an allowable execution schedule: for every event ei_v ∈ E, let σ(ei_v) = i. This assignment satisfies all constraints. Moreover, since there is at most one path of constraints between each pair of events in the STN, the independent core consists of all constraints that are qualified as abnormal in a diagnosis. The set of observations Obs restores the structure of the graph G by containing, for every node v ∈ V, the constraint 0 ≤ t_{e1_v} − t_{e2_v} ≤ ∞. It is not hard to see that the observations are incompatible with the STN (i.e., the STN contains a negative cycle) iff the graph G contains a cycle. Moreover, a diagnosis in which K constraints are qualified as abnormal (ab) corresponds one-to-one with a directed feedback arc set of size K of the graph G(V, A).
4.2 Diagnosis with fault models
An important difference with diagnosis in other domains is that in diagnosis of STNs fault models are always available. In an STN, a fault model of a temporal constraint describes the degree to which the constraint is violated. In the qualification Q we denote this by a shift in the bounds of the temporal constraint. So, if Q(c) = δ ∈ R, then c : lb ≤ t_e′ − t_e ≤ ub will be replaced by the constraint lb + δ ≤ t_e′ − t_e ≤ ub + δ. Hence, diagnosis with fault models is defined as:

Definition 4 Let (E, C) be an STN and let Obs be the constraints describing the observations made. Moreover, let Q : C → H be a qualification. The qualification Q is a consistency based diagnosis with fault models iff the STN (E, upd(C, Q) ∪ Obs) has an allowable schedule.

Preferred diagnoses Definition 4 does not give us a unique diagnosis given the observations. Some diagnoses may be better than others. Generalizing the preference for minimum diagnoses in the absence of fault models, we could prefer minimum-fault diagnoses that minimize Σ_{c∈C} |Q(c)|, where Q(c) = nor and Q(c) = ab are interpreted as Q(c) = 0 and Q(c) = ω, respectively. Clearly, minimum diagnoses are a special case of minimum-fault diagnoses. A minimum-fault diagnosis, however, minimizes the number of execution schedules σ that satisfy the updated STN (E, upd(C, Q)) and the observations Obs. To give an illustration, consider a plan with two events e and e′ and one constraint c : lb ≤ t_e′ − t_e ≤ ub. If we observe a ≤ t_e′ − t_e ≤ b with a > ub, then Q(c) = a − ub is a minimum-fault diagnosis. Since there is only one execution schedule satisfying the updated plan and the observation, the probability that the diagnosis is correct is minimal. Therefore, a different notion of preference is desirable. We should prefer diagnoses that have a high probability of being correct, i.e., that maximize the number of execution schedules. The number of execution schedules is maximal if we can predict the observations made, i.e., abductive diagnosis. To illustrate this point, consider the plan in Figure 6 together with the observations 10:40 ≤ t5 − t0 ≤ 11:15 and 10:30 ≤ t6 − t0 ≤ ∞. If all constraints are qualified as normal (nor), then the constraints entail 10:35 ≤ t5 − t0 ≤ 11:03 and 10:27 ≤ t6 − t0 ≤ ∞, which do not explain our observations. A diagnosis Q qualifying all plan constraints as normal (nor) except c1−2 : 14 ≤ t2 − t1 ≤ 23, which is qualified as Q(c1−2) = 5, does explain the observations. This diagnosis enables us to predict 10:40 ≤ t5 − t0 ≤ 11:08 and 10:32 ≤ t6 − t0 ≤ ∞. Since these predictions are a tightening of the observations, the diagnosis Q is an abductive diagnosis.
Figure 6. Abduction versus confirmation.
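To make the minimum-fault preference tangible, the following fragment (our own illustration, not the authors' code) computes the minimal shift δ for a single constraint lb ≤ t_e′ − t_e ≤ ub against an observed interval [a, b]; for a > ub it returns a − ub, the value discussed in the example above.

```python
# A small illustration of the minimum-fault shift for one constraint
# lb <= t_e' - t_e <= ub and an observation a <= t_e' - t_e <= b:
# the smallest |delta| with [lb+delta, ub+delta] intersecting [a, b].

def minimum_fault_shift(lb, ub, a, b):
    if a > ub:        # observed difference is larger than allowed
        return a - ub
    if b < lb:        # observed difference is smaller than allowed
        return b - lb
    return 0.0        # intervals already overlap: constraint not violated

# With a > ub, the minimum-fault diagnosis shifts the constraint just far
# enough that only the single schedule t_e' - t_e = a remains.
print(minimum_fault_shift(lb=10, ub=20, a=25, b=30))  # 5
```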
Maximum confirmation diagnosis In the above example the two observations are not very accurate. A more accurate observation, such as 10:40 ≤ t5 − t0 ≤ 10:48, cannot be explained by the normal execution of the plan, Q(c) = nor for all constraints in C. Nevertheless, the most tightened constraint 10:35 ≤ t5 − t0 ≤ 11:03 entailed by the plan constraints is confirmed by the observation. This also indicates the absence of violations of the constraints that are used to
make the prediction for the pair of events '0' and e5. Therefore, we propose a new notion of diagnosis, namely maximum-confirmation diagnoses. The idea of maximum-confirmation diagnosis is to identify the qualification Q for which the number of execution schedules is maximal. To measure the number of execution schedules, we introduce a confirmation percentage.

Definition 5 Let Q be a qualification, let $o : lb \le t_{e'} - t_e \le ub$ be an observation and let $a \le t_{e'} - t_e \le b$ be the most tightened constraint implied by a qualified plan (E, upd(C, Q)). The confirmation percentage of the observation o, denoted by $cp_Q(o)$, is defined as:

$$cp_Q(o) = \begin{cases} \dfrac{\min(ub, b) - \max(lb, a)}{ub - lb} & \text{if } \min(ub, b) - \max(lb, a) \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

The sum of the confirmation percentages gives us a measure for comparing diagnoses.

Definition 6 Let (E, C) be an STN, and let Obs be the constraints describing the observations made. A diagnosis Q of the STN and the observations Obs is a maximum-confirmation diagnosis iff $\sum_{o \in Obs} cp_Q(o)$ is maximal.

Note that a maximum-confirmation diagnosis need not be unique. From the set of maximum-confirmation diagnoses we can derive intervals of violation degrees for the constraints. In our running example, the maximum-confirmation diagnoses assign delays of 15 to 35 minutes to the catering process given the observed finish at 16:00. An important question concerns the worst-case time complexity of determining a maximum-confirmation diagnosis.

Theorem 2 A maximum-confirmation diagnosis can be determined in polynomial time.

To see why, note that each observation $o : lb \le t_{e'} - t_e \le ub \in Obs$ has one or more causal chains of events between the two events e and e′ of the observation constraint. Starting from the earliest observation, we qualify the last plan constraint of each causal chain of events between the two events e and e′ as being violated. The qualification is chosen such that it maximizes the confirmation percentage of the observation o. Before continuing with the next constraint, we have to propagate the effect of the qualifications made. All steps can be carried out in polynomial time.
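Definitions 5 and 6 translate directly into code. The sketch below is our reading of those definitions; the variable names are ours, not the paper's.

```python
# A direct transcription (under our reading of Definition 5) of the
# confirmation percentage and the maximum-confirmation objective.

def confirmation_percentage(lb, ub, a, b):
    """Observation o: lb <= t_e' - t_e <= ub; tightest entailed bound [a, b]."""
    overlap = min(ub, b) - max(lb, a)
    return overlap / (ub - lb) if overlap >= 0 else 0.0

def total_confirmation(observations, entailed):
    """Sum of cp_Q(o) over all observations: the quantity a
    maximum-confirmation diagnosis maximizes (Definition 6)."""
    return sum(confirmation_percentage(lb, ub, a, b)
               for (lb, ub), (a, b) in zip(observations, entailed))

# Example: entailed bound [635, 663] (10:35-11:03 in minutes) against the
# observation [640, 675] (10:40-11:15) is confirmed to (663-640)/(675-640).
print(confirmation_percentage(640, 675, 635, 663))  # ~0.657
```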
5 Related work
Several authors have addressed aspects of plan diagnosis.
• Diagnosis of an agent's planning assumptions: Birnbaum et al. [1].
• Diagnosis of a single task execution: Lesser et al. [2, 10].
• Social diagnosis of behavior selection in teams: Kalech and Kaminka [11, 13].
• Diagnosis of the abnormal effects of a plan execution: Roos et al. [19, 16, 7, 18].
• Diagnosis of coordination errors of agents executing a plan: Kalech and Kaminka [12] and Roos and Witteveen [17].
• Diagnosis of multi-agent plan interactions: de Jonge et al. [3, 6].
• Diagnosis and repair of plan execution where agents share resources and provide services: Micalizio and Torasso [15, 14].
• Diagnosis of temporal constraint violations: de Jonge et al. [4, 5].
None of these approaches addresses diagnosis of Simple Temporal Networks. The approach of de Jonge et al. [4, 5] comes closest to ours. However, they can only deal with abstract states, such as delayed or early, for plan steps.
6 Conclusion
Identifying the causes of violations of a plan's temporal constraints is an important issue in plan execution. To enable such diagnosis, the temporal aspects of a plan are described by a Simple Temporal Network (STN). Based on observations of the plan's execution, diagnosis has to identify the temporal constraints that are violated. The notion of classical Model-Based Diagnosis (MBD) has been adapted to STNs. Two important issues had to be dealt with: (i) we cannot assume that temporal constraints are violated independently, and (ii) the notions of consistency-based and abductive diagnosis are not adequate for STNs. A new notion of a maximum-confirmation diagnosis has been proposed. In future work we will investigate whether the model for diagnosis of STNs presented here can be combined with models for diagnosing other aspects of plan-execution failures.
REFERENCES
[1] L. Birnbaum, G. Collins, M. Freed, and B. Krulwich. Model-based diagnosis of planning failures. In AAAI 90, pages 318–323, 1990.
[2] N. Carver and V. Lesser. Domain monotonicity and the performance of local solutions strategies for CDPS-based distributed sensor interpretation and distributed diagnosis. Autonomous Agents and Multi-Agent Systems, 6(1):35–76, 2003.
[3] F. de Jonge and N. Roos. Plan-execution health repair in a multi-agent system. In PlanSIG 2004, 2004.
[4] F. de Jonge, N. Roos, and H. Aldewereld. Multiagent system technologies. In Multiagent System Technologies, 2007.
[5] F. de Jonge, N. Roos, and H. Aldewereld. Temporal diagnosis of multiagent plan execution without an explicit representation of time. In BNAIC-07, 2007.
[6] F. de Jonge, N. Roos, and H. van den Herik. Keeping plan execution healthy. In Multi-Agent Systems and Applications IV: CEEMAS 2005, LNCS 3690, pages 377–387, 2005.
[7] F. de Jonge, N. Roos, and C. Witteveen. Diagnosis of multi-agent plan execution. In Multiagent System Technologies: MATES 2006, LNCS 4196, pages 86–97, 2006.
[8] R. Dechter, I. Meiri, and J. Pearl. Temporal constraint networks. Artificial Intelligence, 49:61–95, 1991.
[9] P. Festa, P. Pardalos, and M. Resende. Feedback set problems. In Handbook of Combinatorial Optimization, volume 4. Kluwer Academic Publishers, 1999.
[10] B. Horling, B. Benyo, and V. Lesser. Using self-diagnosis to adapt organizational structures. In Proceedings of the 5th International Conference on Autonomous Agents, pages 529–536. ACM Press, 2001.
[11] M. Kalech and G. A. Kaminka. Diagnosing a team of agents: Scaling-up. In AAMAS 2005, pages 249–255, 2005.
[12] M. Kalech and G. A. Kaminka. Towards model-based diagnosis of coordination failures. In AAAI 2005, pages 102–107, 2005.
[13] M. Kalech and G. A. Kaminka. On the design of coordination diagnosis algorithms for teams of situated agents. Artificial Intelligence, 171:491–513, 2007.
[14] R. Micalizio and P. Torasso. On-line monitoring of plan execution: A distributed approach. Knowledge-Based Systems, 20:134–142.
[15] R. Micalizio and P. Torasso. Team cooperation for plan recovery in multi-agent systems. In Multiagent System Technologies, LNCS 4687, pages 170–181, 2007.
[16] N. Roos and C. Witteveen. Diagnosis of plans and agents. In Multi-Agent Systems and Applications IV: CEEMAS 2005, LNCS 3690, pages 357–366, 2005.
[17] N. Roos and C. Witteveen. Diagnosis of plan structure violations. In Multiagent System Technologies, 2007.
[18] N. Roos and C. Witteveen. Models and methods for plan diagnosis. Journal of Autonomous Agents and Multi-Agent Systems, DOI: 10.1007/s10458-007-9017-6, 2008.
[19] C. Witteveen, N. Roos, R. van der Krogt, and M. de Weerdt. Diagnosis of single and multi-agent plans. In AAMAS 2005, pages 805–812, 2005.
10. Perception, Sensing and Cognitive Robotics
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-601
An Attentive Machine Interface Using Geo-Contextual Awareness for Mobile Vision Tasks

Katrin Amlacher and Lucas Paletta 1

Abstract. The presented work situates attention in the architecture of ambient intelligence, in particular for the application of mobile vision tasks in multimodal interfaces. A major issue for the performance of these services is uncertainty in the visual information, which is rooted in the requirement to index into a huge amount of reference images. We propose a system implementation – the Attentive Machine Interface (AMI) – that enables contextual processing of multi-sensor information in a probabilistic framework, for example to exploit contextual information from geo-services with the purpose of cutting down the visual search space to a subset of relevant object hypotheses. We present a proof of concept with results from bottom-up information processing on experimental tracks and image captures in an urban scenario, extracting object hypotheses in the local context from both (i) mobile image based appearance and (ii) GPS based positioning, and verify an improvement in recognition accuracy (> 10%) using Bayesian decision fusion. Finally, we demonstrate that top-down information processing – geo-information priming the recognition method in feature space – can yield even better results (> 13%) and more economic computing, verifying the advantage of multi-sensor attentive processing in multimodal interfaces.
1 INTRODUCTION
Attention as a methodology of selecting detail of relevance is ubiquitous in biological systems and has increasingly received consideration in the design of artificial cognitive systems. Mobile multimodal interfaces – devices that receive a multitude and diversity of data with the purpose of assisting the user with relevant detail at a suitable level of abstraction – are an obvious place to investigate how concepts for the appropriate selection of information might contribute to solving a task. In this paper we approach attention from the viewpoint of a nomadic urban user who is equipped with a camera phone and interested in receiving appropriate information about objects of interest within a local environment. We describe the embedding of the problem in a general system implementation of an Attentive Machine Interface (AMI) that enables contextual processing of multi-sensor information in a probabilistic framework. The system is prepared to support, in general, bottom-up information processing in terms of selecting and processing information within specific modalities and according to a pre-defined – be it learned or heuristically determined – methodology. A particularly novel functionality presented in this work is to enable top-down information processing by cross-modal
1
JOANNEUM RESEARCH Forschungsgesellschaft mbH, Institute of Digital Image Processing, Wastiangasse 6, 8010 Graz, Austria, email: {katrin.amlacher,lucas.paletta}@joanneum.at
priming of early processing in the manner of a multi-sensor framework for attentive – and finally superior – performance. Mobile object recognition and visual positioning have recently been proposed in terms of mobile vision services for the support of urban nomadic users. A major issue for the performance of these services is uncertainty in the visual information; covering large urban areas with naive approaches would require referring to a huge amount of reference images and consequently to highly ambiguous features. We propose to exploit contextual information from geo-services with the purpose of cutting down the visual search space to a subset of all available object hypotheses in the large urban area. Geo-information in association with visual features makes it possible to restrict the search to a local context. We extract object hypotheses in the local context from (i) mobile image based appearance and (ii) GPS based positioning, and investigate the performance of Bayesian information fusion with respect to a reference database (TSG-20). The results from experimental tracks and image captures in an urban scenario prove a significant increase in recognition accuracy (Sec. 4) and a more economic use of computational resources when using, in contrast to omitting, geo-contextual information. Finally, we demonstrate that cross-modal top-down information processing – geo-information priming the recognition method in visual feature space – can yield even better results and more economic computing, verifying the advantage of using attentive processing in multimodal interfaces.
2 THE ATTENTIVE MACHINE INTERFACE

2.1 Related Work
In ubiquitous computing, several frameworks have been proposed in the area of attentive interfaces and context awareness. [14, 1] proposed Attentive User Interfaces (AUI) that capture the attention of the user, e.g. from eye gaze estimation, and consequently adapt interaction systems for better communication with the user. [3] proposed that context is a description of a real-world situation on an abstract level that is derived from available cues. [2] described the role of perceptual components in a context-aware system for interaction. Finally, [11] proposed a context processing system with blackboard functionality, where components can subscribe to receive messages matching specific patterns and various cues are integrated into a multimodal description of a situation. While the concept of the AMI is directly inspired by [11], it performs processing in a probabilistic framework and enables top-down, i.e., attentive cross-modal information processing. Previous work on mobile vision services primarily advanced the state of the art in computer vision methodology for application in urban scenarios. [13] provided a first innovative attempt at building identification, proposing local affine features for object matching.
[15] introduced image retrieval methodology for the indexing of visually relevant information from the web for mobile location recognition. Subsequent attempts [8, 10, 4] advanced the methodology further towards highly robust building recognition; however, the contribution of geo-information to the performance of the vision service has so far not been investigated.
2.2 Concept and Architecture
The context framework used in the AMI defines a cue as an abstraction of logical and physical sensors which may represent a context itself, yielding a recursive definition of context. Sensor data, cues and context descriptions are defined in a framework of uncertainty. Attention is the act of selecting and enabling detail – in response to situation-specific data – within a choice of given information sources, with the purpose of operating exclusively on it. Attention enabled by the AMI is therefore focusing operations on a specific detail of a situation that is described by the context. The architecture of the AMI reflects the enabling of both bottom-up and top-down information processing and supports snapshot-based (e.g., image) as well as continuous operation on a stream of input data (e.g., video). Fig. 1 outlines the embedding of the AMI within a client-server system architecture for mobile vision services with support from multi-sensor information. A user interface generates task information (mobile vision service) that is fed into the AMI. The user request for context information is handled by a Master Control (MC) component that schedules the processing (multiple users can start several tasks) and associates with each task corresponding system monitoring (SM) procedures. A concrete task is then performed by the Task Processor (TP), which, firstly, requests a hierarchical description of services, i.e. context-generating modules (a context subgraph), and, secondly, executes the services in the order given by the subgraph description. Since such a subgraph can provide several ways of processing, the appropriate part can be selected by means of, e.g., time constraints, the confidence of the expected result and the quality of the context-generating services. If a service goes offline, the TP can switch to a similar service or to another processing chain in which already processed data is reused. The Context Graph Manager (CGM) maintains and manages context-generating modules in a graph structure (Context Graph). These context-generating modules are services that receive input cues (an image, a GPS signal, etc.) from the Data Control (DC) module and generate a specific context abstraction from an integration of the input cues. The CGM assembles the subgraph according to several constraints, such as task information and the availability of context-generating modules and data, and ensures that the subgraph is processable. The AMI functionality ensures the possibility of arbitrarily combining services and implements a process-flow regulation mechanism, e.g. switching to another service when one goes offline. It is also possible to invoke an additional processing chain if the confidence of the result is too low. Multiple users can concurrently request context information, and the services are targeted towards fast and accurate (robust) responses.
2.3 Context Processing
For high-level context generation various services are required to combine information, services may exist only temporarily, and outputs may be combined in an arbitrary manner. The Context Graph – a directed acyclic graph with nodes representing individual context processing units and edges describing the information flow – is a flexible and extensible data structure that correspondingly connects
Figure 1. Concept of a client-server system architecture with attentive machine interface.
individual functionalities. Each context node provides a context-generating service that derives context information from its input data; context nodes are linked together depending on the input and output data of the wrapped services; and context nodes represent context information at different levels, with high-level context information being what the user demands. For the generation of high-level context information only parts of the Context Graph need to be processed, namely those that contribute to the corresponding (top-level) context node. Depending on the available input data and services, a subgraph of the Context Graph is derived, which ensures smooth processing by the Task Processor. The subgraph is processed starting with those leaf context nodes that take data only from the Data Control. The calculated results are passed to the next context nodes along the outgoing edges until the top-level context node is reached. The resulting high-level context information is presented to the user via a visualization component and is stored in the Data Control or Diary Manager.
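As a purely hypothetical illustration of this evaluation order, the following sketch processes a context subgraph in topological order, with leaf nodes reading cues from the Data Control; all function and structure names here are ours, not the AMI's actual API.

```python
# A hypothetical sketch of how a Task Processor might evaluate a context
# subgraph: nodes are context-generating services, edges carry data flow,
# and evaluation proceeds in topological order from the leaves (which read
# from the Data Control) up to the top-level context node.
from graphlib import TopologicalSorter

def process_subgraph(nodes, inputs, data_control):
    """nodes: {name: service_fn}; inputs: {name: [predecessor names]};
    data_control: {leaf name: raw cue (image, GPS signal, ...)}."""
    results = {}
    for name in TopologicalSorter(inputs).static_order():
        service = nodes[name]
        if not inputs.get(name):           # leaf: raw cue from the Data Control
            results[name] = service(data_control[name])
        else:                              # inner node: fuse predecessor outputs
            results[name] = service(*(results[p] for p in inputs[name]))
    return results
```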
2.4 Bottom-Up and Top-Down Processing
The AMI supports two different modes of information processing, i.e., bottom-up and top-down processing. The choice of mode can be made by the Task Processor according to demands on computational resources, quality of service (e.g., recognition accuracy) and availability of data. Figure 2 provides a schematic sketch of the two modes for the service of geo-indexed object recognition (Sec. 3). In bottom-up processing mode (a), services for the computation of (i) visual objects (object recognition) and (ii) geo-features (positioning) determine hypotheses about the occurrence of objects (i) in the image and (ii) within a local environment. In top-down processing mode (b), there is a cross-modal dependency of (i) object recognition on the object hypotheses provided by (ii) the geo-service. While individually processed distributions over object hypotheses can simply be integrated in (a) using Bayesian decision fusion, (b) actually models an impact of geo-information on visual feature extraction and integration, as outlined in more detail in Sec. 3.
Figure 2. Two different processing modes visualised by their associated context subgraphs for "Geo-Indexed Building Recognition" (Sec. 3). (a) Bottom-up information processing of visual object recognition and geo-features. (b) Top-down information processing by using geo-features to prime visual object recognition (Sec. 2.4).
3 GEO-INDEXED OBJECT RECOGNITION
Urban image based recognition provides the technology for both object awareness and positioning. Outdoor geo-referencing still mainly relies on satellite based signals, and problems arise when the user enters urban canyons, where the availability of satellite signals dramatically decreases due to various shadowing effects [5]. Cell identification is not treated here due to its large positioning error. Alternative concepts for localization, such as INS or markers that would need to be massively distributed across the urban area, are economically not affordable. For image based urban object recognition, we briefly describe how we make use of the methodology presented in [12, 4]. The user captures an image of an object of interest in the field of view, and a software client initiates wireless data submission to the server. Assuming that a GPS receiver is available, the mobile device reads the actual position estimate and sends this together with the image to the server. In the second stage, the web service reads the message and analyzes the geo-referenced image. Based on the current quality of service and the given decision for object detection and identification, the server prepares the associated annotation information from the content database and sends it back to the client for visualization.

Attentive Object Recognition Research on visual object detection has recently focused on the development of local interest operators [9, 6] and the integration of local information into object recognition. The SIFT (Scale Invariant Feature Transform) descriptor [6] is widely used for its capability for robust matching despite viewpoint, illumination and scale changes in the object image captures, which is mandatory for mobile vision services. The Informative Features Approach (i-SIFT [4]), applied to mobile imagery in our experiments, uses local density estimations to determine the posterior entropy, making local information content explicit with respect to object discrimination. The information content of a posterior distribution is determined with respect to given task-specific hypotheses. In contrast to costly global optimization, one expects that it is sufficiently accurate to estimate a local information content from the posterior distribution within a sample test point's local neighborhood in descriptor space. One is primarily interested in the information content of any sample local descriptor $d_i$ in descriptor space $D$, $d_i \in \mathbb{R}^{|D|}$, with respect to the task of object recognition, where $o_i$ denotes an object hypothesis from a given object set $S_O$. For this,
Figure 3. Concept for recognition from informative local descriptors. (I) SIFT descriptors are extracted within the test image. (II) Decision making analyzes the descriptor voting for MAP decision. (III) In i-SIFT attentive processing, a decision tree estimates the SIFT specific entropy; informative descriptors are then attended for decision making (II).
one needs to estimate the entropy $H(O|d_i)$ of the posterior distribution $P(o_k|d_i)$, $k = 1 \ldots \Omega$, where Ω is the number of instantiations of the object class variable O. The Shannon conditional entropy is $H(O|d_i) \equiv -\sum_k P(o_k|d_i) \log P(o_k|d_i)$. One approximates the posteriors at $d_i$ using only samples $g_j$ inside a Parzen window of a local neighborhood $\epsilon$, $\|d_i - d_j\| \le \epsilon$, $j = 1 \ldots J$. Fig. 3 depicts discriminative descriptors in an entropy-coded representation of local SIFT features $d_i$. From discriminative local descriptors one proceeds to entropy-thresholded object representations, providing increasingly sparse representations with increasing recognition accuracy, in terms of storing only the selected descriptor information that is relevant for classification purposes, i.e., those $d_i$ with $\hat{H}(O|d_i) \le H_\Theta$. For the rejection of images whenever they do not contain any objects of interest, one estimates the entropy of the posterior distribution – obtained from a normalized histogram of the object votes – and rejects images with posterior entropies above a predefined threshold. The proposed recognition process is thus characterized by an entropy-driven selection of image regions for classification, and a voting operation.
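A minimal sketch of this entropy-based descriptor selection is given below. It is our paraphrase of the i-SIFT idea, not the authors' implementation: posteriors are approximated from labelled sample descriptors inside an ε-neighborhood, and only descriptors with posterior entropy below H_Θ are kept.

```python
# A minimal sketch of entropy-thresholded descriptor selection: approximate
# P(o_k | d_i) from labelled samples inside an epsilon-neighborhood of d_i,
# compute the Shannon entropy H(O | d_i), and keep descriptors below H_theta.
import numpy as np

def posterior_entropy(d_i, samples, labels, n_objects, eps):
    """samples: (J, |D|) array of sample descriptors; labels: (J,) int array."""
    dists = np.linalg.norm(samples - d_i, axis=1)
    near = labels[dists <= eps]
    if near.size == 0:
        return np.log(n_objects)            # uninformative: maximal entropy
    p = np.bincount(near, minlength=n_objects) / near.size
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def informative_descriptors(descs, samples, labels, n_objects, eps, h_theta):
    """Return only those descriptors whose posterior entropy is <= h_theta."""
    return [d for d in descs
            if posterior_entropy(d, samples, labels, n_objects, eps) <= h_theta]
```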
Figure 4. Extraction of object hypotheses from geo-services. (Left to right) Within a local spatial neighborhood (geo-focus), distances to the points of interest are determined, weighted by an exponential function and normalised to result in a distribution on object hypotheses.
Geo-Contextual Computing of Object Recognition Geo-services provide access to information about a local context that is stored in a digital city map. Map information in terms of map features is indexed via a current estimate of the user position, which can be derived from satellite based signals (GPS), dead-reckoning devices and so on. The map features can provide geo-contextual information in terms of, e.g., the location of points of interest. In previous work [7], the general relevance of geo-services for the application of mobile object recognition was already emphasised; however, the contribution of the geo-services to the performance of geo-indexed object recognition was not quantitatively assessed, and top-down processing was not considered. Fig. 4 depicts a novel methodology to introduce geo-service based object hypotheses. (i) A geo-focus is first defined with respect to a radius of expected position accuracy relative to the city map. (ii) Distances between the user position and points of interest (e.g., tourist sight buildings) that are within the geo-focus are estimated. (iii) The distances are then weighted according to a normal density function $p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\}$. By investigating different values for σ, assuming $(\Sigma_{ij}) = \delta_{ij} \sigma_j^2$, one can tune the impact of distances on the weighting of object hypotheses. (iv) Finally, the weighted distances are normalised and determine the confidence values of the individual object hypotheses.

Bottom-Up Geo-Indexed Object Recognition Distributions over object hypotheses from vision and geo-services are then integrated via Bayesian decision fusion. Although an analytic investigation of both visual and position-signal based information should prove statistical dependency between the corresponding random variables, we assume that it is here sufficient to pursue a naive Bayes approach for the integration of the hypotheses (in order to get a rapid estimate of the contribution of geo-services to mobile vision services) by $P(o_k | y_{i,v}, x_{i,g}) = p(o_k | y_{i,v}) \, p(o_k | x_{i,g})$, where the indices v and g mark information from the image (y) and positioning (x), respectively.

Top-Down Geo-Indexed Object Recognition Here, we first process the geo-service in order to receive a distribution over object hypotheses that is input to attentive object recognition. The recognition method is then primed to reject from consideration all those local (i-SIFT; see above) descriptors that are labelled with hypotheses of negligible confidence in the output of the geo-service. Hence the feature space underlying the nearest-neighbor voting procedure contains only pre-selected prototypes, which are preferred unless they lie outside a pre-determined distance threshold in feature space. The resulting distribution over object hypotheses can again be fused with the distribution from geo-services in order to receive a distance-based weighting of object hypotheses.
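The geo-hypothesis extraction (steps i–iv) and the naive-Bayes fusion can be summarised in a few lines. The sketch below is our own; σ and the isotropic covariance are the assumptions to tune, and the Gaussian normalization constant is dropped since it cancels in the final normalization.

```python
# A sketch of the geo-hypothesis extraction and naive-Bayes decision fusion
# described above, assuming an isotropic covariance Sigma = sigma^2 * I.
import numpy as np

def geo_hypotheses(user_pos, poi_positions, sigma):
    """Normal-density weighting of the distances to the points of interest
    inside the geo-focus, normalised to a distribution over hypotheses."""
    d2 = ((poi_positions - user_pos) ** 2).sum(axis=1)   # squared distances
    w = np.exp(-0.5 * d2 / sigma ** 2)
    return w / w.sum()

def fuse(p_vision, p_geo):
    """Naive-Bayes fusion: P(o_k | y, x) proportional to p(o_k|y) p(o_k|x)."""
    joint = p_vision * p_geo
    return joint / joint.sum()

# Example: three candidate buildings, vision slightly prefers the wrong one,
# the geo-service corrects the decision.
p_geo = geo_hypotheses(np.array([0.0, 0.0]),
                       np.array([[10.0, 0.0], [200.0, 50.0], [30.0, 40.0]]),
                       sigma=100.0)
print(fuse(np.array([0.35, 0.40, 0.25]), p_geo))
```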
4 EXPERIMENTS
The overall goal of the experiments was to determine and quantify the contribution of geo-services to object recognition in urban environments and to compare the bottom-up and top-down approaches in the AMI. The performance in the detection and recognition of objects of interest in the query images with respect to a given reference image database and a given methodology (TSG-20 [4]) was compared to the identical processing but using geo-information and information fusion for the integration of object hypotheses.

User Scenario and Constraints In the application scenario, we imagine a tourist equipped with a mobile device with built-in GPS. He can send image based queries to a server using UMTS or WLAN based connectivity. The server performs geo-indexed object recognition and is expected to respond with tourist-relevant annotation if a point of interest is identified. In the experiments we used an ultra-mobile PC (Sony Vaio UMPC VGN-UX1XN) with 1.3 Mpixel image captures. Reference imagery [4] with 640 × 480 resolution of the building objects of the TSG-20 database2 was captured with a camera-equipped mobile phone (Nokia 6230), containing changes in 3D viewpoint, partial occlusions, scale changes by varying distances of exposure, and various illumination changes. For each object we selected 2 training images, taken with a viewpoint change of ≈ ±30° and at a similar distance to the object, to determine the i-SIFT based object representation. Two additional views
2
http://dib.joanneum.at/cape/TSG-20/
Figure 5. Comparison between bottom-up (blue/dark bars) and top-down approach (green/light bars) from (a) sample input images. Integration of object hypotheses from (b) vision and (c) geo-services into a (d) fused distribution demonstrates clear increases in the confidences of the correct object hypothesis and therefore a significant improvement in the performance of the mobile vision service (Fig. 6).
were taken for test purposes, giving 40 test images in total. For the evaluation of background detection we used a dataset of 120 query images containing only buildings and street sides without TSG-20 objects. Another dataset was acquired with the UMPC, consisting of seven images per TSG-20 object from different viewpoints; these images were captured on different days under different weather conditions.

Attentive Object Recognition In the first evaluation stage, each individual image query was evaluated for vision based object detection and recognition, then regarding the extraction of geo-service based object hypotheses, and finally with respect to Bayesian decision fusion of the individual probability distributions (Sec. 3). Detection is an important pre-processing step for recognition, e.g., to prevent geo-services from supporting confidences for objects that are not in the query image. Experiments on imagery including background data resulted in a TP rate of 89.2% and an FP rate of 20.1%, probably due to the bad sensor quality. However, once a query image is attributed to the object category, geo-indexed object recognition boosts the performance in finding correct hypotheses compared to using vision alone. Fig. 5 depicts sample query images associated with the corresponding distributions over object hypotheses from vision, geo-services, and information fusion. The results demonstrate significant increases in the confidences of the correct object hypotheses. The evaluation of the complete database of image queries about TSG-20 objects (Fig. 6) proves a decisive advantage of taking geo-service based information into account over purely vision based object recognition, in particular using the top-down approach. While vision based
recognition is on a low level (≈ 84%), an exponentially weighted spatial enlargement of the scope of object hypotheses with geo-services increased the recognition accuracy up to ≈ 96%. With increasing σ an increasing number of object hypotheses are taken into account for information fusion, and the performance finally drops to the vision based recognition performance (uniform distribution of the geo-service based object hypotheses).

Figure 6. (a) Performance comparison between geo-service based hypotheses (Geo), purely vision based recognition (OR), bottom-up processing with information fusion (OR+GEO), top-down processing of attentive recognition without (R+OR) and with post-processing using Bayesian decision fusion (R+OR+GEO). (b) Geo-indexed object recognition involves only a fraction of hypotheses and reduces computing time.

5 CONCLUSION

In this work we propose the AMI, which enables bottom-up and top-down cross-modal information processing. We take advantage of geo-contextual information for the improvement of mobile vision services in urban scenarios, such as visual object recognition of tourist sights. We argued that geo-information provides a focus on the local object context that enables a meaningful selection of expected object hypotheses and therefore improves the overall performance of urban object recognition. We proposed a methodology based on Bayesian decision fusion that integrates distributions over object hypotheses from both cues, i.e., visual information and position estimate. We performed experiments on a representative image data set and proved a significant improvement in performance when using geo-services. In future work we will further exploit the concept of the AMI by integrating different context information, such as visual context or semantic segmentation, in a probabilistic framework.

ACKNOWLEDGEMENTS

This work is supported in part by the European Commission funded project MOBVIS under grant number FP6-511051 and by the FWF Austrian National Research Network on Cognitive Vision under subproject S9104-N04.

REFERENCES
[1] Leonardo Bonanni, Chia-Hsun Lee, and Ted Selker, 'Attention-based design of augmented reality interfaces', in CHI '05: CHI '05 Extended Abstracts on Human Factors in Computing Systems, pp. 1228–1231, New York, NY, USA, (2005). ACM.
[2] James L. Crowley, Joëlle Coutaz, Gaëtan Rey, and Patrick Reignier, 'Perceptual Components for Context Aware Computing', in UBICOMP 2002, International Conference on Ubiquitous Computing, Göteborg, Sweden, (September 2002).
[3] Anind K. Dey and Gregory D. Abowd, 'Towards a Better Understanding of Context and Context-Awareness', in Proceedings of the CHI 2000 Workshop on "The What, Who, Where, When, Why and How of Context-Awareness", (2000).
[4] Gerald Fritz, Christin Seifert, and Lucas Paletta, 'A Mobile Vision System for Urban Object Detection with Informative Local Descriptors', in Proc. IEEE 4th International Conference on Computer Vision Systems, ICVS, New York, NY, (January 2006).
[5] B. Hofmann-Wellenhof, H. Lichtenegger, and J. Collins, Global Positioning System: Theory and Practice, Springer-Verlag, Vienna, Austria, 2001.
[6] D. Lowe, 'Distinctive image features from scale-invariant keypoints', International Journal of Computer Vision, 60(2), 91–110, (2004).
[7] P. Luley, L. Paletta, A. Almer, M. Schardt, and J. Ringert, 'Geo-services and computer vision for object awareness in mobile system applications', in Proc. 3rd Symposium on LBS and Cartography, pp. 61–64. Springer, (2005).
[8] Raphaël Marée, Pierre Geurts, Justus Piater, and Louis Wehenkel, 'Decision trees and random subwindows for object recognition', in ICML Workshop on Machine Learning Techniques for Processing Multimedia Content (MLMM 2005), (2005).
[9] K. Mikolajczyk and C. Schmid, 'A performance evaluation of local descriptors', in Proc. Computer Vision and Pattern Recognition, CVPR 2003, Madison, WI, (2003).
[10] Stepan Obdrzalek and Jiri Matas, 'Sub-linear indexing for large scale object recognition', in Proceedings of the British Machine Vision Conference, volume 1, pp. 1–10, (2005).
[11] Albrecht Schmidt and Kristof Van Laerhoven, 'How to build smart appliances', IEEE Personal Communications, 66–71, (2001).
[12] C. Seifert, G. Fritz, L. Paletta, and H. Bischof, 'Learning to focus attention on discriminative regions for object detection', in Proc. European Conference on Artificial Intelligence, ECAI 2004, pp. 932–936, (2004).
[13] H. Shao, T. Svoboda, and L. van Gool, 'HPAT indexing for fast object/scene recognition based on local appearance', in Proc. International Conference on Image and Video Retrieval, CIVR 2003, pp. 71–80, Chicago, IL, (2003).
[14] Roel Vertegaal, 'Attentive User Interfaces', Communications of the ACM, 46(3), 30–33, (2003).
[15] T. Yeh, K. Tollmar, and T. Darrell, 'Searching the web with mobile images for location recognition', in Proc. IEEE Computer Vision and Pattern Recognition, CVPR 2004, pp. 76–81, Washington, DC, (2004).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-606
Learning Functional Object-Categories from a Relational Spatio-Temporal Representation

Muralikrishna Sridhar and Anthony G Cohn and David C Hogg 1

Abstract. We propose a framework that learns functional object-categories from spatio-temporal data sets such as those abstracted from video. The data is represented as one activity graph that encodes qualitative spatio-temporal patterns of interaction between objects. Event classes are induced by statistical generalization, the instances of which encode similar patterns of spatio-temporal relationships between objects. Equivalence classes of objects are discovered on the basis of their similar role in multiple event instantiations. Objects are represented in a multidimensional space that captures their role in all the events. Unsupervised learning in this space results in functional object-categories. Experiments in the domain of food preparation suggest that our techniques represent a significant step in unsupervised learning of functional object categories from spatio-temporal patterns of object interaction.
1 Introduction

Children learn about the world around them by observing and participating in activities that engage them in the course of everyday life. One aspect of learning activity models involves acquiring notions of what objects mean based on the function they fulfill in activities. Functional categories and taxonomies of objects are naturally acquired by humans in the process of observing object behaviour and using objects accordingly. An important step toward unsupervised learning of activity models is to learn an analogous model of functional object categories purely by observing behaviour. In this work, we represent the behaviour of objects involved in an activity in terms of an activity graph, which captures qualitative spatio-temporal patterns of interaction between these objects. We search for frequent similar subgraph instances and generalize these by variablizing. These are our event classes, the instances of each event class encoding a similar pattern of spatio-temporal relationships between their respective object instances. We then learn object categories by clustering in an object space, where the similarity between objects is measured by whether they play a similar role across the event instances of each event class; e.g., a set of objects, even though different in appearance, may tend to play a similar role in events such as washing, cutting and cooking, as opposed to others that do not play such a role in these events. By observing multiple instances of such event classes that have the same event role for this set of objects, it is natural to form a category corresponding to what we refer to as vegetables. Through our experiments we demonstrate that, using our framework, it is possible to learn semantically meaningful functional object categories and a taxonomy purely by observing object behaviour.
1
School of Computing, University of Leeds, Leeds, UK, email:{krishna,agc,dch}@comp.leeds.ac.uk. This work was funded under EPSRC grant EP/D061334/1.
In section 3 we show how functional object categories can be learned from event classes. The rest of the paper describes a novel procedure for inducing event classes from video input.
2 Related Work

Much previous work has focused on supervised learning of object classes, either based on the appearance of the object itself [9] or by recognizing contextual cues such as activities associated with objects [8] in order to locate and recognize objects. By contrast, unsupervised learning of objects can be divided into two stages, the first being object discovery, e.g. the discovery of blobs that are candidates for objects in video. The second stage is object class learning, which involves automatically categorizing these blobs into object classes. Early work on object discovery [6] formed candidate objects by grouping pixels with similar temporal signatures, constructed by recording colour (RGB) values for stable intervals during which objects arrive, stay and depart from a region. In [7], candidate objects are obtained by first over-segmenting the images in a video and, after extracting image features for these segments, grouping rigidly moving features into potential objects. Both object discovery and class learning are performed simultaneously from a collection of static images in [5] in two steps. First, multiple segmentations of each image are produced by varying the parameters of the normalized cut technique, under the assumption that each object instance is correctly segmented by at least one segmentation. Then object classes – groups of correctly segmented objects that are coherent in a large set of candidate segments – are learned. Another approach [1] obtains a hierarchy of object classes for static scenes by grouping image features which spatially co-occur across images of the same scene under the same leaf of the hierarchy. In this manner, the technique learns to identify candidate objects such as keyboards, while also learning higher-level object classes such as a desk area (consisting of a computer, desk, etc.). In this work we perform object discovery by first over-segmenting the video in terms of colour patches and then grouping spatially cohesive and continuous coloured blobs to discover a candidate set of objects. We perform object class learning by clustering in an object space, where the similarity between objects is based on similar spatio-temporal behaviour (specifically object interactions) in scenes. Recent work on event learning [3, 4] aims at learning activity/event classes given a sequence of primitive events, where the primitive events are defined and recognized a priori. In [2] a relational representation language is introduced for defining temporal events, and algorithms for learning these definitions from video output are described. In this work, we introduce a generic definition of events, in terms of graphs, that captures changing spatio-temporal
Figure 1. Lattice for general to specific object learning.
relationships between discovered objects. We show how this representation enables event mining and object learning.
3 Object Learning

Assume the existence of a set of event classes $F(\bar{X})$, where $\bar{X}$ is a sequence of object variables in some canonical ordering, between which some set of spatio-temporal relationships hold and which, when instantiated, yields a set of event instances. The event classes $F_i(\bar{X}) = F_i(X_1, ..., X_k, ..., X_m)$ in general have multiple event instances in the corpus, so that all these instances encode the same set (or, more generally, a similar set) of spatio-temporal relationships between their objects. This induces a natural mapping between objects corresponding to each object variable $X_k$ for the event instances of an event class. Given a corpus of such instances, we show, using an example, how to induce functional object categories for the set of objects present in these instances. The event classes could be handcrafted manually through knowledge engineering techniques, or, as we describe in later sections, could be induced from a video by an event learning procedure. Let $F(X_1, X_2, X_3)$ be an example event class that represents events such as "$X_2$ being lifted away from $X_3$ by $X_1$". The example in fig. 2(c) is one such event instance ($F(h_1, b_1, p_1)$) of the event class F, with object instances $h_1, b_1, p_1$ having IDs 3, 4 and 6 respectively. Let us suppose that two other instances $F(h_1, b_2, p_1)$, $F(h_1, b_3, p_2)$ of the same class F had been observed in the scene. A lattice, as shown in fig. 1, is grown from event instances at the bottom level (level 3) by generalizing exactly one argument position to a variable at each successive level. We then search for equivalence classes of objects from general to specific by traversing down this lattice, using the following procedure. For every node of each level l in the lattice, the procedure involves searching for sets of nodes at level l + 1, where each set is formed by substituting more than one object instance for the same variable $X_k$ of that node at level l. Applying this procedure at level 0 of the lattice, we get two such sets at level 1 (shaded with two colours): $\{F(X_1, b_1, X_3), F(X_1, b_2, X_3), F(X_1, b_3, X_3)\}$, obtained by substituting for $X_2$ with $b_1, b_2, b_3$, and
$\{F(X_1, X_2, p_1), F(X_1, X_2, p_2)\}$, obtained by substituting for $X_3$ with $p_1, p_2$ respectively. As the substituted constants $\{b_1, b_2, b_3\}$ and $\{p_1, p_2\}$ play the same roles (as the variables $X_2$ and $X_3$ respectively) for the event class F, we say that F has induced event roles for the instances of the variables $X_2$ and $X_3$, resulting in the equivalence classes $\{b_1, b_2, b_3\}$ and $\{p_1, p_2\}$ respectively. We now show that, by applying the same procedure one level below (level 1) in the lattice, we obtain a more specific event role for the specific event of objects being placed on a certain plate ($p_1$). The procedure applied at level 1 results in a set of nodes $\{F(X_1, b_1, p_1), F(X_1, b_2, p_1)\}$ at level 2 (as shaded in fig. 1), obtained by substituting for $X_2$ in $F(X_1, X_2, p_1)$ with $b_1, b_2$ respectively. We say that the more specific event class $F(X_1, X_2, p_1)$ has induced a more specific event role for the variable $X_2$, resulting in an equivalence class of objects $\{b_1, b_2\}$, i.e. objects being put on plate $p_1$. By progressively traversing down the lattice using this procedure, it becomes possible to create event roles and corresponding equivalence classes $C_1 ... C_n$, from general to specific. Applying this idea, we produce a matrix of objects by equivalence classes, O, in which $O_{i,j}$ equals 1 if object i occurs in the equivalence class $C_j$ and 0 otherwise. As each equivalence class corresponds to an event role, the row vectors of this matrix summarize each object in terms of the role it plays in all the event-roles and thus induce a multidimensional object space. In this space, objects that have a similar role with respect to similar sets of events are expected to have a high similarity measure. We therefore perform k-means clustering, using a cluster partition index to determine k. Hierarchical clustering on these categories then yields an object taxonomy. In the next section, we show how event classes can be learned from video input; in section 6 the results of applying our object learning procedure are discussed.
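A small sketch of this categorisation step is given below; it builds the binary object-by-equivalence-class matrix O and clusters its rows. We use scikit-learn's KMeans for brevity (the paper determines k with a cluster partition index, which is omitted here), and the example roles are the ones induced above.

```python
# A sketch of the object-categorisation step: build the binary
# object-by-equivalence-class matrix O and cluster its row vectors.
import numpy as np
from sklearn.cluster import KMeans

def object_categories(objects, equivalence_classes, k):
    # O[i, j] = 1 iff object i occurs in equivalence class C_j.
    O = np.array([[1 if obj in cls else 0 for cls in equivalence_classes]
                  for obj in objects])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(O)
    return {obj: int(lab) for obj, lab in zip(objects, labels)}

# Example with the roles induced above: {b1, b2, b3}, {p1, p2} and the more
# specific role {b1, b2} (objects put on plate p1).
cats = object_categories(
    ['h1', 'b1', 'b2', 'b3', 'p1', 'p2'],
    [{'b1', 'b2', 'b3'}, {'p1', 'p2'}, {'b1', 'b2'}],
    k=3)
print(cats)  # the b's, the p's, and h1 fall into separate clusters
```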
4 Activity Graphs from Video

Object discovery is performed by first over-segmenting the video in terms of colour patches and then grouping these into spatially continuous and cohesive blobs, which are a mix of noisy patches and potential objects. These blobs are given IDs, and their position and extent are recorded from the video. The spatio-temporal patterns in the entire video are represented using an activity graph. The spatial relationships between the bounding boxes of each pair of objects in every frame are mapped to a set of spatial primitives $\mathcal{S}$ = {D, S, T}: two objects are either spatially Disconnected (D) or connected through the Surrounds (S) or Touches (T) relationship2, illustrated in fig. 2(b). For each pair of objects, these spatial relationships hold during a time interval. In general, if $\{o_1, o_2, ..., o_n\}$ is the set of all the objects observed in the video, then for each pair $o_i, o_j$ a particular spatial relationship $r \in \mathcal{S}$ holds in each frame f, i.e. $holds(r(o_i, o_j), f)$. We are interested in maximal one-piece time intervals during which r holds between $o_i$ and $o_j$, which we refer to as episodes. We represent such episodes by a quadruple $E = \langle o_i, o_j, \tau, r \rangle$, where $|\{r : \exists f \in \tau \,.\, holds(r(o_i, o_j), f)\}| = 1$ and τ is a consecutive sequence of frames such that $\forall \tau' \, (\tau \subset \tau' \rightarrow |\{r : \exists f \in \tau' \,.\, holds(r(o_i, o_j), f)\}| > 1)$. We thus obtain the set of all episodes $\Delta = \{E_1, E_2, ..., E_m\}$ for all pairs of objects. The episodes labelled E1 − E20 in fig. 2(a) correspond to this set for the activity considered in this example.
This approach could clearly be applied with any set of spatial relations. Our simplified approach to video analysis is 2D; thus, using this set of spatial relations means, e.g., that an object o1 placed on an object o2 is represented as S(o1, o2) – these 3 relations have sufficed for our experiments.
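Episode extraction itself is a simple scan over the per-frame relations. The fragment below is our own illustration of the definition above: for one object pair it emits the maximal runs during which a single relation from {D, S, T} holds.

```python
# A sketch of episode extraction: scan the per-frame relation in {D, S, T}
# for one object pair and emit maximal runs of a constant relation.

def episodes(oi, oj, rel_per_frame):
    """rel_per_frame: list of 'D'/'S'/'T', indexed by frame number."""
    out, start = [], 0
    for f in range(1, len(rel_per_frame) + 1):
        # Close the current episode when the relation changes or frames end.
        if f == len(rel_per_frame) or rel_per_frame[f] != rel_per_frame[start]:
            out.append((oi, oj, (start, f - 1), rel_per_frame[start]))
            start = f
    return out

print(episodes('knife', 'butter', ['D', 'D', 'T', 'T', 'T', 'D']))
# [('knife', 'butter', (0, 1), 'D'), ('knife', 'butter', (2, 4), 'T'),
#  ('knife', 'butter', (5, 5), 'D')]
```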
Figure 2. (a) An activity. (b) Spatial and temporal primitives. (c) A subactivity of the activity in (a). (d) Level-0 activity graph for episodes E5 − E12 in (c). (e) Level-1 activity graph for episodes E1 − E20 in (a).
Having obtained all the episodes, we obtain a complete graph – which we call an activity graph – whose vertices represent the episodes and whose edges relate the time intervals corresponding to their respective episodes using Allen's temporal primitives, a set we denote $\mathcal{A}$. We call the complete graph encoding all temporal relationships between the intervals of E1 − E20 a level-0 activity graph for the activity in fig. 2(a). More formally, we have the activity graph $(V, E, \eta, \rho, \Delta, \mathcal{A})$, where the function $\eta : V \rightarrow \Delta$ maps the vertices $V = \{v_i\}$ to episodes in Δ and $\rho : E \rightarrow \mathcal{A}$ maps the directed edges between all pairs of vertices, $E : e_{ij} = \langle v_i, v_j \rangle$, to temporal relationships in $\mathcal{A}$. We require that η is a bijective mapping from the vertices to the set of episodes in the activity graph. The complete activity graph is too large to display here, and a typical activity graph is too complex to search in order to find event
classes3. Fig. 2(d) shows a subgraph of the level-0 activity graph for the episodes E5 − E12 depicted in fig. 2(c). Therefore, prior to searching for event classes, we use an attention mechanism to structure and simplify the level-0 activity graph, producing a level-1 activity graph. This is achieved by using a foreground attention mechanism (described below) to cluster episodes and forming a new graph structure over these clusters. Each cluster represents an atomic event, and we call the clusters of episodes together with their Allen relationships a unary event graph (unary EG). The graph whose nodes are unary event graphs and whose edges are Allen's temporal relationships between these nodes is the level-1 activity graph.
If we consider n = 10 objects and k as the average number of episodes per object pair in the video, which is usually $10^2$ even for scenes that last a minute, the activity graph results in a search space of $O(k^2 n^4)$, i.e. $O(10^8)$.
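The edge labels of the activity graph are Allen's interval relations. The following compact sketch (standard definitions, not code from the paper) computes the relation between two frame intervals:

```python
# Allen's temporal relations between two intervals, used here to label
# activity-graph edges; the standard 13 relations, exhaustively cased.

def allen(a, b):
    """a, b: (start, end) intervals with start < end."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:  return 'before'
    if e2 < s1:  return 'after'
    if e1 == s2: return 'meets'
    if e2 == s1: return 'met-by'
    if s1 == s2 and e1 == e2: return 'equal'
    if s1 == s2: return 'starts' if e1 < e2 else 'started-by'
    if e1 == e2: return 'finishes' if s1 > s2 else 'finished-by'
    if s2 < s1 and e1 < e2: return 'during'
    if s1 < s2 and e2 < e1: return 'contains'
    # Remaining cases: proper interior overlap with distinct endpoints.
    return 'overlaps' if s1 < s2 else 'overlapped-by'

print(allen((26, 49), (54, 75)))  # 'before'
print(allen((26, 60), (54, 75)))  # 'overlaps'
```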
Foreground Attention Mechanism: We hypothesize that many activities can be conceived in terms of different foreground events, each of which involves interactions only between a subset of objects – the foreground objects – during different time periods. This idea can be intuitively explained using fig. 2(a), where the entire activity shown can be conceived in terms of three foreground events: (1) the left hand scooping some butter with a knife, (2) the right hand removing the bread from the plate, and (3) the left hand spreading butter on the bread with a knife, while the right hand holds the bread. As long as {left hand, knife, butter} and {right hand, plate, bread} are disconnected, we have two sets of foreground objects, {1, 2, 5} and {3, 4, 6}, between frames 26 and 49. When the knife and the bread start to interact, the foreground set changes to the set of IDs {1, 2, 3, 4}, in which the butter and plate with IDs 5 and 6 are not included (frames 54-75). Three periods and their corresponding sets of episodes {E1 − E4}, {E5 − E12}, {E13 − E20} (as shown in the parallel lines below the frames) for the three foreground events are thus obtained. The next two paragraphs describe how foreground events are detected in general and may be omitted on a first reading.

We look for spatial changes between a pair of objects. For each such pair of primary foreground objects o1, o2 at some frame f, we find the set Ω of all moving objects which are connected (i.e. T or S) to o1 or o2, or which are connected to o1 or o2 indirectly via another moving object which is connected to o1 or o2 (directly or indirectly). The set Ω is propagated forwards to some frame f2 and backwards to some frame f1 from f until such time that one of the objects in Ω − {o1, o2} (the secondary foreground objects) changes its spatial relation to some other object in Ω to D (unless o1 and o2 are connected at that time). The entire time from f1 to f2 is termed a period, during which a foreground event involving o1 and o2 occurs, involving all the foreground objects Ω. The intuition behind this definition is that a spatial change focuses attention on a pair of objects (at least one of which must be moving, since a change has occurred) and on all the objects which are intimately connected to the two, and groups all the interactions involving the primary objects together until such time as one of the secondary objects becomes fully disconnected from the group of objects (which then terminates this particular set of foreground objects). Note that it is possible, depending on the choice of primary objects o1 and o2, for there to be multiple temporally overlapping foreground events involving shared objects (though this has not occurred in the videos we have analysed so far).

For each foreground event, we create a unary event graph (unary EG) restricted to the foreground objects of the foreground event and to the temporal extent of the foreground event. Each unary EG endures for a period P and can be represented by the unary EG $(V, E, \eta, \rho, \Delta_P, \mathcal{A})$ over the episodes of the time period P. The three unary EGs for the activity in fig. 2(a) are shown as the nodes of the level-1 activity graph in fig. 2(e). Unary EGs (which are single nodes of the level-1 activity graph) typically capture simple events, such as removing a slice of bread from a plate. In the next section we show how to generalize unary events to unary event classes, and then how to form n-ary event classes, which are compound event classes composed of unary event classes.
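The transitive grouping of connected moving objects described above can be sketched as a graph traversal; the fragment below is our simplified illustration for a single frame, with all names our own.

```python
# A sketch of forming the foreground set Omega for one frame: starting from
# the primary pair, collect all moving objects connected to it, directly or
# transitively, through T/S relations.

def foreground_set(o1, o2, moving, connected):
    """connected: dict object -> set of objects it Touches or Surrounds."""
    omega, frontier = {o1, o2}, [o1, o2]
    while frontier:
        o = frontier.pop()
        for n in connected.get(o, ()):
            if n in moving and n not in omega:
                omega.add(n)
                frontier.append(n)
    return omega

# Example (object IDs as in fig. 2(a)): left hand 1, knife 2, butter 5.
print(foreground_set(1, 2, moving={1, 2, 5}, connected={2: {5}}))  # {1, 2, 5}
```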
Instances of n-ary event classes are n-ary events which are composed of n unary EGs of the level-1 activity graph and which represent complex events such as the entire activity depicted in fig. 2(a,c).
5 Event Learning

The activity graph consists of many individual events; these can be similar in that they have similar spatio-temporal relationships between their constituent objects. In order to formalize the idea of an event class that captures these regularities, independent of the actual objects involved, we first introduce a generalized version of a unary event graph. We then show how n-ary event classes can be formed, consisting of individual unary event classes. To generalize events to event classes, we first consider a unary EG $\phi = (V, E, \eta, \rho, \Delta_P, \mathcal{A})$ for a time period P. Instead of object instances $o_i \in \Omega$ and intervals $\tau \in \Lambda$, consider sets of object and interval variables $X = \langle X_O, X_T \rangle$, so that $O_i \in X_O$ and $T \in X_T$.4 We can now generalize the set of episodes $E \in \Delta_P$ to $E_X \in \Delta_X$, where $\Delta_X$ is a set such that $E_X \in \Delta_X$ if and only if $E_X = \langle O_1, O_2, T, r \rangle$ where $O_1 \in X_O$, $O_2 \in X_O$, $T \in X_T$ and $r \in \mathcal{S}$. We use the generalized set of episodes to formalize event classes by first defining a unary event class graph (unary ECG), which captures a common pattern of spatio-temporal relationships amongst a set of similar unary EGs (instances) in a generic form.

Definition Let $\phi = (V, E, \eta, \rho, \Delta_P, \mathcal{A})$ be a unary EG of the transformed activity graph. Then $\gamma = (V', E', \eta', \rho', \Delta_X, \mathcal{A})$ is a unary event class graph (unary ECG) of φ, or we say that γ θ-generalizes φ, if $\exists \theta = \theta_O \cdot \theta_T$, where $\theta_O : X_O \rightarrow \Omega$ and $\theta_T : X_T \rightarrow \Lambda$, such that γ is isomorphic to φ under the substitution θ, i.e.
1. $\{\eta'(v')\theta : v' \in V'\} = \{\eta(v) : v \in V\}$.
2. $\{\rho'(e'_{ij}) : e'_{ij} = (v'_i, v'_j) \in E'\} = \{\rho(e_{ij}) : e_{ij} = (v'_i\theta, v'_j\theta) \in E\}$.

We require that a unary ECG generalizes at least λ unary EGs, i.e. instances must occur frequently. We now extend the idea of a unary event class graph to an n-ary event class graph (n-ary ECG) composed of unary ECGs. An n-ary ECG is simply a graph with unary ECGs $\gamma_1, ..., \gamma_n$, n ≥ 2, as its vertices, whose edges relate the time periods $P_i$ and $P_j$ corresponding to $\gamma_i$ and $\gamma_j$ by Allen's temporal primitives $\mathcal{A}$. An n-ary ECG Γ whose vertices are the set $\{\gamma_1, ..., \gamma_n\}$ θ-generalizes an n-ary EG Φ with vertices $\{\phi_1, ..., \phi_n\}$ if each $\gamma_i$ θ-generalizes the corresponding $\phi_i$ and the temporal relationship between any $\phi_i, \phi_j \in \Phi$ is the same as for the corresponding $\gamma_i, \gamma_j \in \Gamma$. An n-ary ECG represents an n-ary event class if it generalizes at least λ n-ary EGs. We model λ as an exponentially decreasing function of n in order to allow larger n-ary ECGs to θ-generalize fewer n-ary EGs. Using these definitions, we finally formalize event classes as maximal event class graphs. We define a maximal event class graph (MECG) as an event class graph which generalizes some set of EGs such that no other ECG which contains it generalizes this set, i.e. every MECG generalizes a set of EGs which are not generalized by some larger ECG. The procedure for computing MECGs involves two stages. In the first stage, unary ECGs with a statistically significant number of EG instantiations are found. In the second stage, these unary ECGs are iteratively used to build larger and larger ECGs (with a statistically significant number of instantiations), until a final set of MECGs is obtained. In this manner we discover event classes as MECGs from the level-1 activity graph. Having found all the MECGs, we give them names $F_1(\bar{X}), ..., F_k(\bar{X})$, where $\bar{X}$ is a sequence of variables in the
4 Note that we use capitalized/bold letters for variables and small letters for instances.
Figure 3. A hierarchy of object categories.
MECGs, in some canonical ordering of nodes in each MECG. In Section 3, where we were purely concerned with inducing an object taxonomy from the event definitions, we ignored the internal structure of an MECG and used just these F̄_i(X̄), which can be defined as predicates from each of the MECGs.
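To make the θ-generalization test concrete, the following is a minimal Python sketch under an illustrative encoding that is not from the paper: an event graph is a pair (labels, rel) mapping vertices to their labels η(v) and ordered vertex pairs to their relation ρ(e), and a substitution θ is searched for by brute force.

```python
from itertools import permutations

def theta_generalizes(ecg, eg):
    """Return a substitution theta (ECG variable -> EG instance) under which
    the ECG is isomorphic to the EG, or None if no such substitution exists."""
    ecg_labels, ecg_rel = ecg
    eg_labels, eg_rel = eg
    variables, instances = list(ecg_labels), list(eg_labels)
    if len(variables) != len(instances):
        return None
    for image in permutations(instances):        # brute-force search over bindings
        theta = dict(zip(variables, image))
        labels_ok = all(ecg_labels[v] == eg_labels[theta[v]] for v in variables)
        edges_ok = {(theta[u], theta[v]) for (u, v) in ecg_rel} == set(eg_rel)
        rels_ok = edges_ok and all(
            ecg_rel[(u, v)] == eg_rel[(theta[u], theta[v])] for (u, v) in ecg_rel
        )
        if labels_ok and rels_ok:
            return theta
    return None
```

Exact isomorphism search of this kind is exponential in the number of vertices; the similarity metrics mentioned as future work in Section 7 would relax exactly this test.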
6 Experiments
We demonstrate our framework using a video taken with a toy (plastic) kitchen setup. We have chosen a constrained environment for the moment, in order to minimize the complexities arising in a real kitchen as a result of cluttered backgrounds, flickering lights, shiny surfaces, multiple shadows, etc. We have further simplified the problem by focusing only on the hand (not the entire person) along with the other objects in the kitchen scene, and by taking care that the actions of the cook do not create complications arising, for instance, from full occlusion of the objects involved. However, despite such simplifications, a large number of noisy patches are produced by the object discovery module, making the learning problem challenging. The video is taken with a static overhead camera that focuses on the scene. The scene consists of hands simulating the preparation of sandwiches, hot drinks, cutting vegetables and cooking vegetable dishes, lasting around 10 minutes. The video contains exactly one instance of each of these preparations. After applying event and object learning, we obtain the object hierarchy in fig. 3. While our procedure outputs a hierarchy of object IDs, we replace these labels with the corresponding objects from the video in order to visualize the results. It can be observed that the proposed framework has been able to differentiate between broader categories such as food items and containers and, interestingly, to separate noisy patches from all other objects. Finer levels of granularity are captured in the grouping which separates a slice of white bread from another group consisting of vegetables. A distinction between plates, pans and spoons is also clear from the hierarchy. It can therefore be concluded that the learned categories and taxonomy are intuitive and correspond to a functional classification of objects.
7 Summary and Future Work A framework for learning object and event categories from video has been introduced. This framework offers a general way of representing activities in terms of spatio-temporal graphs. Techniques for
mining events from this graph and then learning functional object categories from these events have been proposed in this work. Our experiments show that our framework offers a promising approach toward learning functional categories. In the future, we plan to extend this framework in several directions. At present, event generalisation requires exact graph isomorphism. We plan to extend event classes to generalize a larger set of event instances by experimenting with similarity metrics between our event graphs. This will allow our approach to exploit a greater variety of video input to learn event and object taxonomies, and to cope better with noise (which might also intervene during an event instance). In contrast to almost all work in object recognition, which learns categories from perceptual features, we have tackled the little-researched problem of learning categories from function. However, there is clearly scope to use the learned functional categories to supervise visual-appearance-based object learning.
REFERENCES
[1] D. Parikh and T. Chen, ‘Unsupervised learning of hierarchical semantics of objects (hSOs)’, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR ’07), 1–8, (2007).
[2] A.P. Fern, R.L. Givan, and J.M. Siskind, ‘Specific-to-general learning for temporal events with application to learning event definitions from video’, Journal of Artificial Intelligence Research (JAIR), 17, 379–449, (2002).
[3] S. Hongeng, ‘Unsupervised learning of multi-object event classes’, in Proc. 15th British Machine Vision Conference (BMVC ’04), London, UK, 487–496, (2004).
[4] R. Hamid, S. Maddi, A. Bobick, and I. Essa, ‘Structure from statistics: unsupervised activity analysis using suffix trees’, in Proc. Int’l Conf. on Computer Vision, (2007).
[5] B.C. Russell, W.T. Freeman, A.A. Efros, J. Sivic, and A. Zisserman, ‘Using multiple segmentations to discover objects and their extent in image collections’, in CVPR ’06: Proc. of Comp. Soc. Conf. on Computer Vision and Pattern Recognition, 1605–1614, (2006).
[6] B.C.S. Sanders, R.C. Nelson, and R. Sukthankar, ‘A theory of the quasi-static world’, in Proc. 16th Int’l Conf. on Pattern Recognition (ICPR ’02), (2002).
[7] T. Southey and J.J. Little, ‘Object discovery using motion, appearance and shape’, in Cognitive Robotics Workshop, AAAI, (2006).
[8] M. Veloso, P. Rybski, and F. von Hundelshausen, ‘Focus: A generalized method for object discovery for robots that observe and interact with humans’, in Proc. Conf. on Human-Robot Interaction, (2006).
[9] W. Zhao, R. Chellappa, P.J. Phillips, and A. Rosenfeld, ‘Face recognition: A literature survey’, ACM Computing Surveys, 399–458, (2003).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-611
Sequential spatial reasoning in images based on pre-attention mechanisms and fuzzy attribute graphs Geoffroy Fouquier1 and Jamal Atif2 and Isabelle Bloch1 Abstract. Spatial relations play a crucial role in model-based image recognition and interpretation due to their stability compared to many other image appearance characteristics, and graphs are well adapted to represent such information. Sequential methods for knowledge-based recognition of structures require defining the order in which the structures have to be recognized, which can be expressed as the optimization of a path in the representation graph. We propose to integrate pre-attention mechanisms into the optimization criterion, in the form of a saliency map, by reasoning on the saliency of spatial areas defined by spatial relations. Such mechanisms extract knowledge from an image without prior object recognition and do not require any a priori knowledge about the image. Therefore, pre-attentional mechanisms provide useful knowledge for object segmentation and recognition. The derived algorithms are applied to brain image understanding.
1 Introduction
Sequential segmentation is a useful approach for knowledge-based object recognition in which objects are segmented in a predefined order, starting from the simplest object to segment and ending with the most difficult one. The segmentation and recognition of each object is then based on a generic model of the scene and relies on the previously recognized objects. This approach, as developed e.g. in [3], requires defining the order according to which the objects have to be recognized, and the choice of the most appropriate order is one of the difficulties raised by this approach. Here, the recognition and the segmentation of the objects of the scene are performed at the same time. The sequence of objects may be expressed as a path in a graph, where each node of the graph represents an object. In this paper, we propose a new approach to this problem, integrating information extracted from the data based on the notion of saliency. The visual system is usually modeled using pre-attentional and attentional mechanisms. Basically, the purpose of the pre-attentional step is to guide the attentional step to select salient parts of the scene. This selection allows the attentional process to focus only on the salient part (object or region) and thus reduces the computational cost of this mechanism. We can easily draw some similarities between the iterative segmentation scheme and the visual system: the pre-attentional mechanism could correspond to the selection of the next object to segment, and the attentional mechanism to the segmentation of an object of the scene (and its interpretation). Thus the iterative segmentation framework is viewed as a scene exploration and analysis process.
1 TELECOM ParisTech (ENST), CNRS-LTCI UMR 5141, Paris, France, email: {geoffroy.fouquier, isabelle.bloch}@enst.fr
2 IRD-Cayenne/UAG, email: atif@cayenne.ird.fr
Our contribution is to introduce a pre-attentional mechanism into the optimization of the segmentation path for a sequential image segmentation process. This article is organized as follows. We first present in Section 2 how to represent the knowledge composing the generic model of the scene. In Section 3, a brief overview of the modeling of the visual system is given, as well as a presentation of the pre-attentional mechanism used in the following sections. We then present in Section 4 a way to evaluate the information provided by this pre-attentional mechanism, and Section 5 presents a way to integrate the saliency map into the segmentation process. Experiments and results on an example of brain image understanding are presented in Section 6, and Section 7 draws some conclusions.
2 Knowledge representation
Graphs are well adapted to represent generic knowledge, such as spatial relations between the objects of a scene. In the sequential segmentation framework, the generic model of the scene is modeled as a graph where each vertex represents an object of the scene and each edge represents one or more spatial relations between two objects. We introduce the following notations. Let Σ_V, Σ_E be the sets of vertex labels and edge labels, respectively. Let V be a finite nonempty set of vertices, L_v be a vertex interpreter L_v : V → Σ_V, E be a set of ordered pairs of vertices called edges, and L_e be an edge interpreter L_e : E → Σ_E. Then G = (V, L_v, E, L_e) is a labeled graph with directed edges. For v ∈ V and e ∈ V × V, δ(v, e) is a transition function that returns the vertex v′ such that e = (v, v′). For v ∈ V, A(v) returns the set of edges adjacent to v. Finally, p = (v_1, v_2, ..., v_n) is a path of length n, labeled as l_p = (v_1, e(v_1, v_2), v_2, ..., v_n). A knowledge base KB defines all the spatial relations existing between vertices in the graph: KB = {v_i R v_j : v_i, v_j ∈ V, R ∈ R} and e = (v_1, v_2) ∈ E ⇐⇒ ∃R ∈ R, (v_1 R v_2) ∈ KB, where R is the set of relations. In the following, we use fuzzy representations of the spatial relations, since they are appropriate to model the intrinsic imprecision of several relations (such as “close to”, “behind”, etc.), the potential variability (even if it is reduced in normal cases) and the flexibility required for spatial reasoning [2]. Here, the representation of a spatial relation is computed as the region of space in which the relation R to an object A is satisfied. The membership degree of each point corresponds to the satisfaction degree of the relation at this point. Figure 2 (b,c) presents an example of a structure and the region of space corresponding to “to the right of” this structure. A directed edge between two vertices v_1 and v_2 carries at least one spatial relation between these objects. An edge interpreter associates to each edge a fuzzy set μ_Rel, defined in the spatial domain S,
representing the conjunctive merging of all the representations of the spatial relations carried by this edge to a reference structure. Since there is at least one spatial relation carried by an edge, μ_Rel cannot be empty. Let μ_Ri^e, i = 1, ..., n_e, be the n_e relations carried by an edge e. Then μ_Rel^e is expressed as: μ_Rel^e = ⊤_{i=1..n_e}(μ_Ri^e), with ⊤ a t-norm (fuzzy conjunction) [4]. Since objects are sequentially segmented, we propose to focus attention by using the known spatial relations with previously segmented objects. The set of target objects is filtered as the set of unsegmented objects which have a spatial relation with a previously segmented object. The set of segmented objects is likewise filtered as the set of objects which have a spatial relation with an unsegmented object of interest. The “search area” is thus defined by merging the representations of the known spatial relations between previously segmented objects which have an edge in the graph with the target object. We now describe the modeling of the main relations that we use: distances and directional relative positions. A distance relation can be defined as a fuzzy interval f of trapezoidal shape on R+. A fuzzy subset μ_d of the image space S can then be derived by combining f with a distance map d_A to the reference object A: ∀x ∈ S, μ_d(x) = f(d_A(x)), where d_A(x) = inf_{y∈A} d(x, y). The relation “close to” can be defined as a function of the distance between two sets: μ_close(A, B) = h(d(A, B)), where d(A, B) denotes the minimal distance between points of A and B: d(A, B) = inf_{x∈A, y∈B} d(x, y), and h is a decreasing function of d, from R+ into [0, 1]. We assume that A ∩ B = ∅. The relation of adjacency can likewise be defined as a “very close to” relation, leading to a degree of adjacency instead of a Boolean value, making it more robust to small errors. Directional relations are represented using the “fuzzy landscape approach” [1]. A morphological dilation δ_να by a fuzzy structuring element ν_α representing the semantics of the relation “in direction α” is applied to the reference object A: μ_α = δ_να(A), where ν_α is defined, for x in S given in polar coordinates (ρ, θ), as: ν_α(x) = g(|θ − α|), where g is a decreasing function from [0, π] to [0, 1], and |θ − α| is defined modulo π. This definition extends to 3D by using two angles to define a direction. The example in Figure 2 (b,c) has been obtained using this definition. Other relations can be modeled in a similar way [2]. These models are generic, but the membership functions depend on a few parameters that have to be tuned for each application domain according to the semantics of the relations in that domain.
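As an illustration, here is a minimal sketch of the distance-based relation in Python/SciPy; the trapezoid parameters and the function name are placeholder choices, since the paper tunes the membership functions per application domain.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def fuzzy_close_to(mask_A, d_plateau=5.0, d_max=20.0):
    """Fuzzy region 'close to A': mu_d(x) = f(d_A(x)), with f a decreasing
    trapezoidal profile (1 up to d_plateau, falling linearly to 0 at d_max)."""
    # d_A: for each point, the distance to the nearest point of A; A's points
    # are set to 0 so the Euclidean distance transform measures distance to A.
    d_A = distance_transform_edt(~mask_A.astype(bool))
    ramp = (d_max - d_A) / (d_max - d_plateau)
    return np.clip(np.where(d_A <= d_plateau, 1.0, ramp), 0.0, 1.0)
```

The directional relation “in direction α” can be sketched analogously by replacing the distance map with the angle |θ − α| between each point and the reference object, mapped through the decreasing function g.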
3 Saliency Maps
Among the pre-attentional mechanisms, we focus on the saliency map as defined by Koch and Ullman [6]. This mechanism allows selecting areas using some basic features easily computable on every type of image. Figure 2 presents a saliency map and its restriction around an object, which allows exploring the area of the image around the object. This approach uses three basic features: intensity, color and orientation. For each feature, the difference between a location and its immediate surroundings is computed. For intensity, this is the difference of contrast. For color, two color oppositions are studied: between red and green on the one hand, and between blue and yellow on the other hand. For orientation, four directions are studied with Gabor filters. Overall, seven features are considered. Nine scale spaces are created with dyadic Gaussian pyramids for each feature, and six maps are derived by center-surround difference between a fine scale c in {2, 3, 4} and a coarse scale of the pyramid
s = c+d, with d in {3, 4}. Finally, all maps corresponding to a same feature are normalized, and a conspicuity map per feature (the sum of all corresponding maps) is computed. Then the three conspicuity maps are merged with a weighted mean to produce the saliency map. Figure 1 presents an example of a saliency map.
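A rough sketch of the intensity channel of this computation follows; it is a simplification of the published scheme (the normalization step and the color and orientation channels are omitted), and the helper names are ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def dyadic_pyramid(img, levels=9):
    """Gaussian pyramid: blur, then downsample by 2 at each level."""
    pyr = [img.astype(float)]
    for _ in range(levels - 1):
        pyr.append(zoom(gaussian_filter(pyr[-1], sigma=1.0), 0.5))
    return pyr

def center_surround(pyr, c, s):
    """Feature map: |fine level c minus coarse level s upsampled to c's shape|."""
    fine, coarse = pyr[c], pyr[s]
    up = zoom(coarse, [a / b for a, b in zip(fine.shape, coarse.shape)])
    h, w = min(fine.shape[0], up.shape[0]), min(fine.shape[1], up.shape[1])
    return np.abs(fine[:h, :w] - up[:h, :w])

# The six intensity maps: c in {2, 3, 4} and s = c + d with d in {3, 4};
# after normalization, their sum gives the intensity conspicuity map.
pyramid = dyadic_pyramid(np.random.rand(256, 256))   # stand-in image
intensity_maps = [center_surround(pyramid, c, c + d)
                  for c in (2, 3, 4) for d in (3, 4)]
```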
Figure 1. Lena and the corresponding saliency map (dark: not salient; bright: most salient parts).
This approach is a data-driven, bottom-up approach, and the only top-down connection is the occlusion (inhibition) of the most salient location. But more top-down connections are required to define proto-objects [7], a recently presented extension of the original method. In this case, the saliency map is computed as in the original method, but once the most salient location is detected, a feedback connection allows finding which conspicuity map, and then which feature map, produces (or contributes most to) this salient location. A proto-object is then defined as the connected component at the location of the highest value of the saliency map, on the map which produces it (a pixel belongs to the component if one of its neighbors is in the component and its value is higher than a threshold).
4 Evaluating saliency on manually segmented structures
The sequential segmentation framework with the optimized segmentation path described in [5] uses generic knowledge and a segmented database, and therefore cannot take into account the intrinsic segmentation difficulty of each object. These difficulties vary with the object features: shape, homogeneity, texture or boundaries, and image noise. Some generic rules could be constructed, e.g., that one object is more difficult to segment than another, but this kind of rule is not necessarily true for each image, even in a restricted application domain. We consider that saliency information is directly related to segmentation difficulty, because an object with a salient border will be much simpler to segment than an object with a less salient border. Therefore, we propose a methodology to derive the difficulty of segmentation from saliency information and to compare the areas of saliency corresponding to the previously segmented objects. The area of saliency for an object corresponds to the saliency map masked by the segmentation (a binary map) of this object and possibly its surroundings. Depending on the class of segmentation algorithms, we may not be interested in the same parts of the objects. If we consider an edge-based segmentation algorithm, then the most important area to take into account for the image segmentation is the border of the object. In this case, the interesting part of the object should be extracted, for example, as the dilated segmentation of the object, in order to take into account the surroundings of the border. In a region-based segmentation, the whole object is extracted depending on a homogeneity criterion. The saliency map is masked, in this case, by the extracted object.
Figure 2. (a) A slice of a 3D Magnetic Resonance Image. (b) Right lateral ventricle. (c) Fuzzy subset corresponding to the spatial relation “right of” (b). (d) A slice of the saliency map of (a). (e) Saliency around the ventricle (dark: not salient; bright: most salient parts).
Once the saliency for the surroundings of each object has been extracted, a histogram of the saliency map is computed for each object. Once normalized, this gives a distribution of the saliency for each object. We therefore propose to estimate the difficulty of segmentation by comparing the histograms of saliency. In our experiments, we use the energy of the histogram as the comparison criterion. The energy of a histogram H with N bins is computed as: energy(H) = Σ_{n=1}^{N} h(n)², where h is the function that counts the number of occurrences of value n in the saliency map. Figure 5 presents two histograms of several objects from two images. This methodology is not used for segmentation itself (here we are trying to get rid of the need for a previously segmented database), but only to study the saliency of the different objects and to exhibit the potential interest of this type of measure.
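For instance, a minimal sketch of this criterion in Python; the bin count and the masking convention are our illustrative choices.

```python
import numpy as np

def histogram_energy(saliency_map, region_mask, bins=256):
    """energy(H) = sum_n h(n)^2 over the normalized histogram of the saliency
    values inside the region (e.g., an object's dilated segmentation)."""
    values = saliency_map[region_mask.astype(bool)]
    h, _ = np.histogram(values, bins=bins, range=(0, 256))
    p = h / max(h.sum(), 1)          # normalize to a distribution
    return float(np.sum(p ** 2))
```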
5 Using saliency for image interpretation
Approaches relying on the shape of the target object, like in [5], make the assumption that the generic model is always valid, i.e., that all objects from the generic model are always present and no new object can be taken into account. Here, the exploration relies only on the previously recognized objects and not on the shape of the target object, which allows dealing with changes in the model. Image segmentation is seen as a scene exploration process, where only a small region of space is analyzed at a given time, i.e., objects are segmented individually. Moreover, the exploration of a new area of space uses the previously explored areas: the segmented objects are used to segment the remaining parts of the scene. The process is guided by a pre-attentional mechanism, here a saliency map, which indicates the most salient area of space in the search domain. This area is computed using the already known part of the scene and the spatial relations existing between these objects and the objects that are still to be found. Figure 3 presents the general scheme of the method. We first present how the graph is filtered to compute the search area, then the process of selecting the next object to segment. In the following, the original image is denoted by I. The vertices of the graph are divided into two disjoint groups: V = V_seg ∪ V_tar. At the beginning of the process, a first object is considered as known and segmented: V_seg = {v_init}. This object can be detected using saliency in the image, or other information (in brain imaging, for example, the lateral ventricle can be segmented using a completely different scheme). The recognition of an object thus amounts to moving a vertex from the set of target vertices to the set of segmented vertices, and the vertex to segment must be directly connected to the set of already segmented vertices. An iteration of the sequential segmentation is expressed as a function of the previously segmented objects V_seg, the chosen next object to segment v̂, the saliency map of the image sal_I, the original image I, and E_f, the spatial relations between the two sets of objects, already segmented and to be segmented, respectively:
V_seg^i = seqseg(V_seg^{i−1}, v̂, sal_I, I, E_f^{i−1})
where the superscript i denotes the iteration. Accordingly, the set of target vertices is filtered so as to keep only the vertices connected to the already segmented set of vertices. Likewise, the latter set is filtered to the subset of vertices connected by an edge to the set of target objects. The set of edges is filtered accordingly. The obtained subgraph forms a bipartite graph composed of both sets of known and target objects, and of the set of edges representing the spatial relations between the two groups of vertices:
V_fs = {v_1 ∈ V_seg | ∃v_2 ∈ V_tar, (v_1, v_2) ∈ E}
V_ft = {v_2 ∈ V_tar | ∃v_1 ∈ V_seg, (v_1, v_2) ∈ E}
E_f = {(v_t, v_s) | v_t ∈ V_ft, v_s ∈ V_fs}
For each edge e in E_f, the edge interpreter produces μ_Rel^e. The area of space of the search domain is defined as the merging of the supports of all edge representations given by the edge interpreter:
μ_sd = ⊥_{e∈E_f}(μ_Rel^e)
with ⊥ a t-conorm (fuzzy disjunction) [4]. The binary map corresponding to the search domain gives an area of space which includes the spatial locations of all the target objects (hence a disjunctive combination). Note that this spatial location could cover a large part of the image space, particularly if the only spatial relation between two objects is a directional relation. The search domain sd is simply defined as:
sd = support(μ_sd)
We now present the process of selecting a target vertex by analyzing the saliency in the search domain. The filtering of the graph gives two groups of vertices, V_fs and V_ft, and we have to choose in V_ft the next vertex (and so the object that the vertex represents) to recognize. For each candidate vertex v, its estimated spatial location is defined by the merging of the spatial relations connecting this vertex to the previously recognized vertices:
loc_v = ⊤_{e∈(A(v)∩E_f)}(μ_Rel^e)
with ⊤ a t-norm. This estimated spatial location of a vertex is then combined with the search domain, to extract the saliency in the area of the estimated location of the target object and its surroundings:
saliency_v = ⊤(loc_v, sd, sal_I)
A histogram H_v of this area is then computed, and the next object to segment is selected by analyzing this histogram. Among other measures, the energy of the histogram (defined previously) is kept as the selection criterion; it allows selecting the most salient area and thus the next object to segment:
v̂ = arg max_{v∈V_ft} energy(H_v)
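Putting the filtering and the selection together, a compact sketch of one selection step follows; max and min stand in for the t-conorm and t-norm, the edge orientation is simplified to (segmented, target) pairs, and all names are illustrative rather than from the authors' implementation.

```python
import numpy as np

def select_next_object(V_seg, V_tar, E, mu_rel, sal_I):
    """One selection step: build the search domain sd from the filtered
    bipartite subgraph, then pick the target vertex whose estimated
    location is most salient (histogram energy criterion)."""
    Ef = [(v1, v2) for (v1, v2) in E if v1 in V_seg and v2 in V_tar]
    if not Ef:
        return None
    mu_sd = np.maximum.reduce([mu_rel[e] for e in Ef])   # t-conorm: max
    sd = (mu_sd > 0).astype(float)                       # support(mu_sd)
    best_v, best_energy = None, -1.0
    for v in {v2 for (_, v2) in Ef}:
        loc_v = np.minimum.reduce([mu_rel[e] for e in Ef if e[1] == v])  # t-norm: min
        area = np.minimum(loc_v, sd) * sal_I             # saliency at the estimated location
        h, _ = np.histogram(area[area > 0], bins=256)
        p = h / max(h.sum(), 1)                          # normalized histogram
        energy = float(np.sum(p ** 2))                   # energy(H) = sum h(n)^2
        if energy > best_energy:
            best_v, best_energy = v, energy
    return best_v
```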
Figure 3. Block diagram of the proposed method to include a pre-attentional mechanism into sequential segmentation.
The exploration of the scene then consists in moving a vertex from the set of target vertices to the set of known vertices; the selection of the moved vertex is realized by comparing the saliency of each object area of the search domain, which corresponds to a model-driven exploration of the scene. This method allows us to directly take into account the knowledge given by the current image and does not rely on a representation of the target objects during the process. The segmentation of the object is expressed as a function of the selected object to segment v̂ (chosen with the saliency-based criterion), its spatial relations with the previously segmented objects, and the original image:
seg_v̂ = segment(v̂, loc_v̂, I)
Finally, the set of segmented objects is updated:
V_seg^i = V_seg^{i−1} ∪ {v̂} and V_target^i = V_target^{i−1} \ {v̂}
6 Application to human brain structure recognition
Saliency maps on 3D MRI. Saliency maps, especially as defined by Koch and Ullman, are usually computed on 2D natural images with a resolution sufficient to produce the requested scales of the dyadic pyramid. In the case of 3D magnetic resonance images (MRI), the resolution is often small. The IBSR database3 images used in our experiments have size 256 × 256 × 128. We limit our pyramid to 7 scales (including the original scale). The fine scales used to compute maps are 1, 2 and 3. The coarse scales are the fine scales plus a δ ∈ {2, 3}, i.e., 1 + 2, 1 + 3, 2 + 2, 2 + 3, etc. Finally, the saliency map is computed at the size of the third level of the dyadic pyramid. 3D MRI provides only one channel, which is treated as intensity in the computation. Since there is no color channel, the color features are simply removed. For orientation, we use a similar approach as in 2D, but on 3 different planes, defined by the axes x and y for the first plane, x and z for the second, and y and z for the last one. We considered 4 directions per plane and removed the duplicates. Finally, 9 maps are extracted. Note that we could extract more planes, allowing more directions to be taken into account and thus better isotropy. Experiments have been conducted using a manually segmented database of human brain 3D MRI (the IBSR database), composed of 18 brain images with their segmentations. The parameters of the membership functions used to compute the representations of the spatial relations are learned on a database of healthy cases (IBSR) and pathological cases (5 different cases so far, corresponding to different types of brain tumor). Table 1 presents some relations used in our experiments.
Table 1. Some relations used in our experiments. LLV: left lateral ventricle, LCN: left caudate nucleus, LTH: left thalamus, LPU: left putamen.
v1    R          v2
LLV   RightOf    LCN
LLV   CloseTo    LCN
LLV   DownOf     LTH
LCN   RightOf    LPU
LCN   InFrontOf  LTH
LCN   UpOf       LTH
LTH   BehindOf   LCN
LTH   DownOf     LCN
LTH   RightOf    LPU
3 Internet Brain Segmentation Repository. The MR brain data sets and their manual segmentations were provided by the Center for Morphometric Analysis at Massachusetts General Hospital and are available at http://www.cma.mgh.harvard.edu/ibsr/
Saliency on manually segmented structures. In our experiments, the area of saliency taken into account for each structure corresponds to the 3D binary map of the segmentation of the object, dilated by an elementary structuring element in 6-connectivity. The saliency map is normalized between 0 and 255. The histogram in Figure 4 presents the saliency for each of the three structures on all images, and it shows the variation of saliency, although the IBSR data set is quite uniform. This variation shows that the measure of saliency captures specific information about each image. Table 2 presents saliency measures for three anatomical structures of the human brain, plus the same measure for the white matter and the gray matter. These measures (energy of the histogram) are always higher for the three anatomical structures. Figure 5 presents some histograms of saliency for these structures. Histograms of saliency for gray and white matter are in most cases wider and lower than histograms for the other structures, particularly the histograms of the caudate nucleus and putamen. Thus, there is more saliency in the area of the anatomical structures than in areas of gray or white matter, which do not carry much information. Comparing structures, it appears that the thalamus generally has lower values (it has less well defined boundaries). Hence it can be expected that its segmentation
will be more difficult.
Table 2. Saliency measures (energy of the saliency histogram) for 3 anatomical structures, white matter (LWM) and gray matter (LGM), for all images of the IBSR database. LCN: left caudate nucleus, LTH: left thalamus, LPU: left putamen.
LCN     LTH     LPU     LWM     LGM
0.065   0.057   0.068   0.026   0.015
0.097   0.064   0.095   0.041   0.020
0.039   0.033   0.042   0.027   0.017
0.050   0.031   0.054   0.026   0.017
0.038   0.028   0.107   0.027   0.018
0.054   0.038   0.099   0.038   0.025
0.039   0.024   0.046   0.023   0.018
0.040   0.026   0.046   0.020   0.014
0.039   0.026   0.061   0.026   0.020
0.045   0.030   0.060   0.027   0.014
0.037   0.025   0.048   0.019   0.011
0.033   0.029   0.032   0.026   0.017
0.037   0.033   0.069   0.031   0.020
0.046   0.030   0.061   0.025   0.017
0.033   0.026   0.044   0.017   0.014
0.032   0.025   0.044   0.022   0.015
0.045   0.032   0.049   0.022   0.020
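A sketch of how the saliency area of a structure can be extracted as described above, assuming SciPy; the function name is ours.

```python
import numpy as np
from scipy.ndimage import binary_dilation, generate_binary_structure

def structure_saliency_values(saliency, segmentation):
    """Saliency values in a structure's surrounding: the 3D binary segmentation
    dilated by an elementary structuring element in 6-connectivity."""
    se = generate_binary_structure(rank=3, connectivity=1)   # 6-connectivity in 3D
    region = binary_dilation(segmentation.astype(bool), structure=se)
    return saliency[region]
```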
Figure 4. The histograms of the saliency of each structure for all images in the database.
Sequential segmentation. Starting from the lateral ventricle, we look for the next structure to segment. Table 3 presents the measures of saliency for the two structures connected to the lateral ventricle in the graph, the caudate nucleus and the thalamus, and the same measure after the segmentation of the first structure. For all the images of the IBSR database, the same path is selected, with some variation in the criterion values. The resulting path corresponds to the path used in [3], defined intuitively, in a supervised way, thus with visual hints. It is hence very satisfactory to find the same path automatically using a saliency feature. The IBSR base is also a quite homogeneous database, and all images have been registered, lowering the differences between the images. Experiments on images with a higher variability, including pathological ones, are currently being conducted. Figure 6 presents a typical segmentation using the resulting path.
Figure 5. Histograms of saliency for 4 anatomical structures, white matter and gray matter of the left hemisphere in a 3D MRI (IBSR 02). In this case, the saliency is high for all structures; the ventricle and caudate nucleus saliency histograms are clearly distinct from the putamen and thalamus ones. The saliency of white matter and gray matter is lower than the saliency of the internal structures.
Table 3. Measure of saliency for two successive selections, for each image in the IBSR database. The initial structure is the left lateral ventricle.
1st selection: LLV →     2nd selection: (LCN, LLV) →
LCN     LTH              LTH     LPU
0.035   0.016            0.015   0.012
0.048   0.023            0.022   0.017
0.018   0.011            0.011   0.009
0.018   0.011            0.011   0.010
0.017   0.011            0.011   0.009
0.022   0.013            0.013   0.012
0.017   0.011            0.011   0.010
0.016   0.011            0.011   0.010
0.021   0.014            0.014   0.013
0.018   0.013            0.012   0.010
0.017   0.010            0.010   0.009
0.017   0.010            0.010   0.009
0.019   0.012            0.012   0.011
0.017   0.011            0.010   0.009
0.017   0.010            0.010   0.009
0.014   0.010            0.010   0.010
0.019   0.014            0.014   0.013
7 Conclusion
We have presented a sequential segmentation framework viewed as a scene exploration process and guided by a pre-attentional mechanism, here a saliency map. Preliminary results show that saliency provides intrinsic information about the image, usable for its segmentation. Further work will be done on a larger graph with more structures and relations between them.
Figure 6. Typical segmentation using the path found in our experiments.
REFERENCES
[1] I. Bloch, ‘Fuzzy Relative Position between Objects in Image Processing: a Morphological Approach’, IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(7), 657–664, (1999).
[2] I. Bloch, ‘Fuzzy Spatial Relationships for Image Processing and Interpretation: A Review’, Image and Vision Computing, 23(2), 89–110, (2005).
[3] O. Colliot, O. Camara, and I. Bloch, ‘Integration of Fuzzy Spatial Relations in Deformable Models - Application to Brain MRI Segmentation’, Pattern Recognition, 39, 1401–1414, (2006).
[4] D. Dubois and H. Prade, Fuzzy Sets and Systems: Theory and Applications, Academic Press, New York, 1980.
[5] G. Fouquier, J. Atif, and I. Bloch, ‘Local Reasoning in Fuzzy Attribute Graphs for Optimizing Sequential Segmentation’, in 6th IAPR-TC15 Workshop on Graph-based Representations in Pattern Recognition, GbR’07, volume 4538 of LNCS, pp. 138–147, Springer, Alicante, Spain, (Jun 2007).
[6] L. Itti, C. Koch, and E. Niebur, ‘A model of saliency-based visual attention for rapid scene analysis’, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254–1259, (Nov. 1998).
[7] D. Walther and C. Koch, ‘Modeling attention to salient proto-objects’, Neural Networks, 19(9), 1395–1407, (Nov. 2006).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-616
Automatic Configuration of Multi-Robot Systems: Planning for Multiple Steps Robert Lundh and Lars Karlsson and Alessandro Saffiotti 1 Abstract. We consider multi-robot systems where robots need to cooperate tightly by sharing functionalities with each other. There are methods for automatically configuring a multi-robot system for tight cooperation, but they only produce a single configuration. In this paper, we show how methods for automatic configuration can be integrated with methods for task planning in order to produce a complete plan where each step is a configuration. We also consider the issues of monitoring and replanning in this context, and we demonstrate our approach on a real multi-robot system, the PEIS-Ecology.
1 Introduction
One essential property of planning is that it is possible to detect, beforehand, whether a task can be executed or not. It is common sense in traditional action planning that if there is a plan that achieves the goal, we know that it is possible to accomplish the task; if we cannot find a plan, we cannot accomplish the task. For the planning problem we address in this paper, it is not enough to find an action plan for a task. Why this is the case will be explained later in this section. We assume that in a multi-robot system, each robot has a set of capabilities. If the robots are heterogeneous, they have different capabilities, and can thus provide a number of different functionalities which can be used to cooperate in different ways. For example, in a navigation task a robot must know its own position relative to the goal position, i.e., it needs to be localized. This position information can either be provided by the robot's own sensors, or another robot can provide it by tracking the first robot and sending the estimated position. We call configuration any way to instantiate and connect the different functionalities available in the multi-robot system. An action like move the robot Pippi to the living room can be implemented as a configuration. Different actions typically correspond to different configurations. Moreover, the same action can often be performed using different configurations, depending on the availability and cost of the functional resources. The ability to automatically configure a multi-robot system in different ways for an action is the key to its flexibility and robustness. Configurations typically implement one action at a time, and if a task requires several steps (actions) to be achieved, there is also a need for several configurations to be executed one after the other. This can be considered as two different problems: (1) finding a sequence of steps (a plan), and (2) finding configurations for the individual steps. However, the two problems are not independent of each other. When problem 1 is considered, it is not possible to know, beforehand, that the generated plan can achieve the goal of the task unless problem 2 is also considered. That is, there might be steps in the plan for
1 Center for Applied Autonomous Sensor Systems, Örebro University, Sweden. email: {robert.lundh, lars.karlsson, alessandro.saffiotti}@aass.oru.se
which no configuration can be found, and that would make the plan non-executable. This could happen in our earlier approach [7], where these two problems were considered in sequence: an action plan was first generated, and then configurations were generated online as they were needed. This paper proposes a better integration of problems 1 and 2. The approach considers both problems at planning time and can tell beforehand whether a task is executable or not. The single-step configuration problem has been studied in several research areas, e.g., single-robot task performance [8, 5], network robot systems [1, 3], and cooperative robotics [10]. However, none of these approaches consider sequences of configurations. There are also some works on integrating task planning with more detailed types of reasoning, such as the aSyMov planner [2], which combines symbolic and geometric reasoning. The rest of the paper is organized as follows. In Section 2 we give a reminder about the notion of functional configurations. In Section 3 we describe different solutions to integrated action planning and configuration generation. Section 4 details the different parts of the approach, and Section 5 presents an illustrative experiment.
2 Functional Configurations
For the configuration part of our approach we use the approach proposed in [6, 7]. We here give a brief description of the concept of functional configurations. We assume that the world can be in a number of different states. The set of all states is denoted S. There is a number of robots r1, . . . , rn. The properties of the robots, such as what sensors they are equipped with and their current positions, are considered to be part of the current state s0. Robots are assumed to have modular functionalities that can be accessed and used independently, across and within the robots. A functionality f is an operator that performs computation, sensing or actuation. It is characterized by the following elements: • A specification of inputs I to be provided by other functionalities, including information about domain (e.g., video images), timing (e.g., 25 fps), etc. • A specification of outputs O provided to other functionalities, also containing domain and timing information. • A set of causal preconditions Pr: conditions in the environment that have to hold in order for the functionality to be operational. • A set of causal postconditions Po: conditions in the environment which the functionality is expected to achieve. • A specification of costs Cost, e.g., computation and energy. • A body Φ, containing the code to be executed. This is typically a continuous loop, getting input and producing output.
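A minimal sketch of these elements as data structures (Python; the field names are illustrative and not taken from the PEIS implementation):

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Tuple

@dataclass(frozen=True)
class Functionality:
    name: str
    inputs: FrozenSet[str]           # I: input signatures expected from others
    outputs: FrozenSet[str]          # O: output signatures offered to others
    preconditions: FrozenSet[str]    # Pr: causal conditions that must hold to operate
    postconditions: FrozenSet[str]   # Po: causal conditions expected to be achieved
    cost: float = 0.0                # Cost: e.g., computation and energy
    body: Optional[Callable] = None  # Phi: the continuous input/output loop

@dataclass
class Configuration:
    functionalities: List[Functionality]
    # A channel routes an output of one functionality to an input of another.
    channels: List[Tuple[str, str, str, str]]   # (src_func, output, dst_func, input)
```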
Figure 1. An example configuration.
A channel ch transfers data from an output of a functionality to an input of another functionality. A configuration C is a pair ⟨F, Ch⟩, where F is a set of functionalities and Ch is a set of channels. An important property of a configuration is that all the components in it are connected “in the right way”. We call this property admissibility, and distinguish two kinds: a configuration is information admissible if each input of each functionality is connected to a compatible output of another functionality; it is causally admissible if all preconditions of all functionalities hold in the current world state. A precise definition of these properties can be found in [6]. As an example of a configuration (see Fig. 1), consider a task in which robot B helps robot A to navigate. Robot A has two functionalities: a navigation controller and a wheel actuator. Robot B has four functionalities: a camera, an object tracker, a laser, and a scan-match localization algorithm. The camera is connected to the tracker to obtain the position of A relative to B. The laser is connected to the localization to obtain the absolute position of B. These positions are combined to get the absolute position of A, which is sent to the controller on A that provides motion commands to the wheels. Configuration problem. Let Σ be a multi-robot system, and let D be a domain describing, in some formalism, all the functionalities that exist in Σ. D implicitly defines the set C(D) of all the configurations that can be built in Σ (both admissible and not admissible). Let A denote an action (or task), and s denote the current state. A configuration problem ⟨A, D, s⟩ for Σ is the problem of finding a configuration c ∈ C(D), admissible in state s, to perform A. Configuration planning. To find a solution to a configuration problem we use a configuration planner [6]. The configuration planner uses techniques inspired by hierarchical planning, in particular the SHOP planner [9], in order to combine functionalities into admissible configurations that solve specific tasks. This is done by searching the space of configurations to find one which is admissible in the current state and which has the lowest cost. The configuration planner takes as input a domain that describes the existing functionalities, a state of the available functionalities, and a goal (action). The configuration planner returns a configuration description, which essentially consists of a set of functionality names and a set of channels describing how the functionalities can be connected. It also returns the pre- and postconditions and the total cost of the configuration.
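Under the structures sketched above, the two admissibility properties can be expressed roughly as follows; this is a simplification, since real input/output compatibility also covers domains and timing.

```python
def information_admissible(conf: Configuration) -> bool:
    """Each input of each functionality is fed by a channel from a
    compatible output of another functionality."""
    fed = {(dst, inp) for (_, _, dst, inp) in conf.channels}
    return all((f.name, i) in fed for f in conf.functionalities for i in f.inputs)

def causally_admissible(conf: Configuration, state) -> bool:
    """All preconditions of all functionalities hold in the current world state."""
    return all(p in state for f in conf.functionalities for p in f.preconditions)
```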
3 Integrated action and configuration planning
The configuration problem above is concerned only with finding a configuration for one action. However, in practice most tasks require more than one action to be completed. For instance, if the robot wants to wake up a person, the robot must first reach the bedroom, then move close to the bed, before it can wake the person up.
Figure 2. Different ways to combine action and configuration planning. (a) Independent. (b) Fully integrated. (c) Loosely coupled.
Different configurations may be required for each of these actions. We call such a plan, where each action is a configuration, a configuration plan. A configuration plan is a sequence of configurations CP = c1, . . . , ck, where k ≥ 0. Note that from now on we reserve the term task to denote the top-level task, and use the term action to denote each individual sub-task achieved by each configuration. A configuration plan is admissible if and only if each ci is admissible in the state si−1 it will be executed in. Note that each configuration can also change the state according to its postconditions. Thus, a domain D can be considered to define a state-transition system ⟨S, C, γ⟩ with states S, configurations C = C(D), and a transition function γ : S × C → S defined according to the pre- and postconditions of the configurations. The state-transition function defines the states s1, . . . , sk in which configurations are executed. Thus, the domain D implicitly defines the set CP(D) of all the configuration plans that can be built in Σ (both admissible and not admissible). Let then T denote a task (or goal state), and s0 denote the initial state. A configuration plan problem ⟨T, D, s0⟩ is the problem of finding a configuration plan CP ∈ CP(D) to perform T which is admissible from the starting state s0. In the remaining part we detail and discuss solutions to the configuration plan problem. The job of an action planner is typically to find a sequence of atomic actions a1, . . . , ak that achieves a goal or task T. From a configuration perspective, each action ai can then be seen as an abstraction of the set of configurations {ci1, . . . , cin} that can implement it. Hence, combining an action planner with a configuration planner lets the robots deal with tasks that require more than one configuration/action to be performed. There are several ways this combination could be done. These ways can be described with the following variables: (1) whether the decisions about what actions to perform (i.e., the action planning) should be taken at planning time or execution time; (2) whether the actions should be expanded into configurations (i.e., the configuration planning) at planning time or at execution time; (3) whether the action and configuration planning should be done independently of each other or not. We here present three different settings for the variables above. Independent action and configuration planning. In [7], a simple approach to combining an action planner and a configuration planner is presented. It works by first calling the action planner to find an action plan a1, . . . , ak for solving a particular task. That is, the decision about which actions to perform (1) is taken at planning time. This plan is then executed action by action. For each action ai that is performed, a suitable configuration ci is generated by the configuration planner at the time when the action must be ex-
ecuted. Thus, for (2), the decision about when to expand actions into configurations is taken at execution time. The action planning decisions and the configuration planning decisions are taken independently of each other. Fully integrated action and configuration planning. The second way is to have the planners fully integrated. Both the decisions about what actions to perform (1) and the expansion of the actions into configurations (2) are taken at planning time. The decisions for 1 and 2 are fully interdependent, i.e., the configuration planner is called immediately to generate configurations for each action that is considered during search, so the system works directly with configuration plans c1, . . . , ck. In this way it is possible to prune parts of the search space based on the availability of configurations and to create only admissible configuration plans. Loosely coupled action and configuration planning. In this paper, we present an approach based on the idea of generating an action plan and configurations for this plan before starting to execute it. First a complete action plan a1, . . . , ak is generated, and then, for that plan, a configuration is generated for each action: c1, . . . , ck. That is, both the decision on the actions to perform (1) and the expansion of actions into configurations (2) are done at planning time, as in the fully integrated approach above. However, configuration generation is only done once a complete action plan has been found, in order to validate that plan. If the action plan is not valid (i.e., there are no configurations for all actions), control returns to action planning to generate an alternative action plan, taking into account information about the failed action and its state, and so on. In this way, it is possible to know if there is an admissible configuration plan for the generated action plan. In Fig. 2, the three different cases are shown side by side for comparison. The independent approach (Fig. 2a) assumes that the two planning problems can be addressed independently of each other. This approach has problems when an action cannot be expanded into a configuration at execution time. If this happens, a new action plan must be generated that fulfills the goal. Since some actions may be irreversible, there may be situations in which this solution would not be able to complete the task. Even if a new plan can be found, the fact that the actions in the failed plan were executed leads to suboptimal performance. The fully integrated approach (Fig. 2b) considers both planning problems simultaneously. It is possible to guarantee that the generated configuration plans are admissible and optimal. However, since configurations are generated for all actions in the search space (even those that do not lead to the goal), the complexity of the problem makes it unusable in most practical cases. The loosely coupled approach (Fig. 2c) can, like the fully integrated approach, guarantee that the generated configuration plan is admissible. It avoids the complexity problems of the integrated approach by only trying to generate configurations for actions that are on a path to the goal. Compared to the independent approach, the loosely coupled approach can reject bad action plans before they are actually executed, and find better alternatives. The price to pay is that global optimality of the configuration plan cannot be guaranteed in general.
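The admissibility of a whole configuration plan can be sketched directly from these definitions, reusing the admissibility checks above; the transition γ is approximated here by adding postconditions (deletions are omitted).

```python
def admissible_configuration_plan(plan, s0):
    """A configuration plan c_1..c_k is admissible iff each c_i is admissible
    in the state s_{i-1} produced by its predecessors."""
    state = set(s0)
    for c in plan:
        if not (information_admissible(c) and causally_admissible(c, state)):
            return False
        for f in c.functionalities:
            state |= set(f.postconditions)   # gamma(s, c): apply the effects
    return True
```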
4 Implementation
The loosely coupled action planning and configuration generation approach has been implemented and tested on a special case of a multi-robot system, called the PEIS-Ecology.
4.1 The PEIS-Ecology testbed The concept of PEIS-Ecology was originally proposed by Saffiotti and Broxvall [13]. The main constituent of a PEIS-Ecology is a physically embedded intelligent system, or PEIS. This is any computerized system interacting with the environment through sensors and/or actuators and including some degree of “intelligence”. A PEIS generalizes the notion of robot, and it can be as simple as a toaster or as complex as a humanoid robot. A PEIS-Ecology consists of a number of PEIS embedded in the same physical environment and endowed with a common communication and cooperation model. Communication relies on a shared tuple space: PEIS exchange information by publishing tuples and subscribing to tuples. Cooperation relies on the notion of linking functional components: each PEIS can use functionalities from other PEIS in the ecology to complement its own. The PEIS-Ecology model has been implemented in an open-source middleware, called the PEIS-kernel [11].
4.2 Top-level process
To be used in a practical multi-robot system, such as the PEIS-Ecology, action planning and configuration planning must be embedded in a larger process. This process must implement the integration of the two planners, and it must also consider the following aspects. First, both action planning and configuration planning depend on the current state of the environment and the system. Hence, this state should be dynamically acquired before planning is started. Second, when the action plan is executed, each generated configuration should be instantiated in the actual PEIS-Ecology, and the configuration execution should be monitored in order to decide when to switch to the next action and to detect possible failures. Fig. 3 gives an overall view of the top-level process. In this process, there are several “paths” for different situations. The solid arrows constitute the normal path, in which all the different steps (1-8) are completed without any discrepancies. The dotted arrows represent different recovery paths. The rest of this section details the different steps and paths of the top-level process. The top-level process is run by one single robot that configures the PEIS-Ecology to help it solve the top-level task. The steps concerning state acquisition, configuration deployment, and configuration execution and monitoring are also reported in [7].
4.3 Planning
As noted above, both action and configuration planning use state information to ensure that both action plans and configurations are admissible. This state consists of two parts: the system state and the world state. The system state contains information relative to the system itself, e.g., which functionalities are currently available and what their current cost is. The world state is a representation of the facts that currently hold in the environment, e.g., information about rooms and places, how they are connected, etc. To acquire the current (system and world) state from the PEIS-Ecology, we use the mechanisms provided by the PEIS-kernel. In order to generate action plans, we employ a state-of-the-art action planner called PTLplan [4]. It requires as input a domain and a world state. The domain describes all the actions potentially available, and it is hand-coded. The state, acquired right before planning is done, determines which actions are actually available in the current situation. An action plan consists of actions like “move(Pippi, bedroom)”, “dock(Pippi, bed)”, and “wakeup(Pippi, Johanna)”. This plan is
Figure 4. Left: A sketch of the PEIS-Home. Right: Astrid with the newspaper in the gripper.
Figure 3. Flow chart of the top-level process.
given to the configuration planner (step 3 in Fig. 3). In this step, a configuration is generated for each action in the action plan, thus creating a configuration plan. If there is a problem finding a configuration for an action, information about that action and its state is stored (step A in Fig. 3). The action planner is then called again; it removes that particular action in that particular state and tries to find an alternative sequence of actions that can achieve the task. If an alternative action plan is found, it is given to the configuration generator, which again tries to turn it into a configuration plan.
4.4 Execution
When a configuration plan is found, it is given to a sequencer (step 4 in Fig. 3) that is responsible for taking the next configuration in the configuration plan. When an action/configuration is reported to be completed (step 8), the sequencer takes the next configuration in the plan to deploy. Since all configurations are generated before the execution of the action plan, it is very important to verify that they are still admissible when it is time to execute them. To guarantee this, the state is dynamically acquired before the execution of each action (step 5). The preconditions of the configuration are then checked in the state (step 6). If they still hold, the configuration can be deployed (step 7). If they do not hold, an alternative configuration must be generated (step 3). If there is an alternative configuration, the postconditions of the alternative configuration must be compared with the postconditions of the initial configuration. If they are equal, the configuration can safely be added to the configuration plan and deployed. If they differ, the remaining part of the configuration plan must be regenerated to comply with the new configuration. In this case, the sequencer does not take the next action in step 4, but retries the same action. If in step 3 no alternative configuration was found, the information about
this action is stored in step A, and the action planner tries to find a new action plan (step 2) as described in the previous section. Once a configuration description is generated, it must be deployed on the PEIS-Ecology. This involves activating functionalities, setting up the channels between the functionalities, and subscribing to the appropriate signals from the functionalities to know when a configuration is completed or has failed. After a configuration has been deployed, execution (step 8 in Fig. 3) continues until the action is completed or fails. When a configuration is completed, the next one is selected (step 4). If a configuration fails during execution, the top-level process tries to generate an alternative configuration.
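The loosely coupled planning loop (steps 1-3 with the step-A fallback) can be sketched as follows; action_planner, config_planner and apply_postconditions are stand-ins for PTLplan, the configuration planner and the state transition, and states are assumed hashable (e.g., frozensets of facts).

```python
def plan_and_configure(task, state, action_planner, config_planner, apply_postconditions):
    """Generate an action plan (step 2), expand every action into a
    configuration (step 3), and replan with the failed (action, state)
    pair excluded (step A) until a full configuration plan is found
    or no action plan remains."""
    excluded = set()
    while True:
        action_plan = action_planner(task, state, excluded)   # step 2
        if action_plan is None:
            return None                      # no plan: the task is not executable
        config_plan, s = [], state
        for action in action_plan:                            # step 3
            c = config_planner(action, s)
            if c is None:                    # no configuration for this action
                excluded.add((action, s))    # step A
                break
            config_plan.append(c)
            s = apply_postconditions(s, c)
        else:
            return config_plan               # every action expanded: plan validated
```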
5 An illustrative experiment
We have performed an experiment to show that (and how) the combined planner can handle situations where there are actions for which there is no configuration (this may occur both at planning and at execution time). To facilitate comparisons with our previous approach [7], we repeat the scenario presented in that paper, where a robot wakes up a person. For the experimental part, we have used a physical test-bed facility, called the PEIS-Home, which looks like a typical apartment of about 25 m². It consists of a living-room, a bedroom and a small kitchen. The PEIS-Home is equipped with a communication and computation infrastructure, and with a number of PEIS. The following PEIS are of particular importance for our experiments. Pippi and Astrid: two PeopleBot indoor robots from ActivMedia Robotics (see Fig. 4, right). Each one runs an instance of the Thinking Cap (TC), an architecture for autonomous robot control based on fuzzy logic [14], and an instance of the Player program [12], which provides a low-level interface to the robot's sensors and actuators. The two robots are identical except that Astrid is equipped with a laser range finder and Pippi is not. The Home Security Monitor (HSM): a stationary computer connected to a set of web-cameras mounted in the ceiling. In addition to other monitoring tasks, not relevant here, the HSM provides a PEIS-component that is able to track a robot and localize it in the PEIS-Home. The HSM also hosts an action planner and a configuration planner, and the reconfigurations of the PEIS-Ecology in these experiments are done from here. Note however that they could just as well be done elsewhere, e.g., in Pippi.
The experiment unfolds as follows:
a. At start-up, Pippi is located in the living-room and Astrid in the kitchen. When the morning paper arrives, the HSM wants to wake up Johanna, who is sleeping in the bedroom, and give it to her.
b. With this task, the configuration process acquires the current state (step 1 in Fig. 3). For this state and task, an action plan is generated (step 2). This plan has the actions: dock-to(Pippi, entrance), take(Pippi, newspaper), move-to(Pippi, bedroom), dock-to(Pippi, bed), wake-up(Pippi, Johanna).
c. In step 3, the search for a configuration for each action is started. For the first three actions, configurations are found. The first and third actions (dock-to(entrance), move-to(bedroom)) use a camera mounted in the ceiling for localization. For the fourth action (dock-to(bed)), the search fails, since no configuration can be found: the ceiling camera used in the other actions can only track robots in the living-room and kitchen, and Pippi has no other means of localization. The information about the failed action is stored (step A in Fig. 3), and the action planner is called again.
d. The action planner finds an alternative plan with the following actions: move-to(Astrid, living-room), dock-to(Astrid, entrance), take(Astrid, newspaper), move-to(Astrid, bedroom), dock-to(Astrid, bed), wake-up(Astrid, Johanna). When revisiting step 3 with the new action plan, it is possible to find a configuration for each action. Unlike Pippi, Astrid is able to localize on its own using a laser range finder and scan-matching techniques.
e. The first action/configuration, in which Astrid uses the ceiling camera to localize, is taken by the sequencer (step 4); it has a lower cost than using the laser. This configuration is then verified (step 6), deployed (step 7), and executed (step 8). When arriving at the living-room, the navigation module signals completion of the action and the next action is prepared for execution.
f. The state is dynamically acquired (step 5). To demonstrate the behavior of the system under dynamically changing conditions, we manually made the ceiling camera unavailable. Thus, in the verification step, the configuration preconditions for docking to the entrance using the ceiling camera do not hold. An alternative configuration is generated (step 3) in which the laser is used for localization instead. Since this new configuration has the same postconditions as the original configuration, it can be deployed and executed without regenerating the configuration plan.
g. The remaining actions for taking the newspaper, docking to the bed, waking up Johanna and delivering the newspaper proceed without any complications, and the task is completed.
The critical point of this experiment is step c, where a configuration cannot be found for action dock-to(Pippi, bed) and the action planner is called again to find a new action plan. In the approach with independent action planning and configuration [7], Pippi would have started to execute the action plan and would not have discovered that it cannot achieve the goal until it reached the bedroom. The HSM would then try to find a new plan to reach the goal. It cannot simply generate the same plan as above, where Astrid gets the newspaper at the entrance and delivers it to Johanna, since Pippi is now holding the newspaper in the bedroom. If there is an action for giving a newspaper between robots, the HSM may find an alternative plan; otherwise it will fail.
6 Conclusions
We have presented an approach that, by combining different planning techniques, is able to find a solution for tasks that require sequences
of configurations to be completed. For this purpose we employ two different planners: one for action planning [4] and another for configuration planning [6]. The planners are loosely integrated, i.e., configuration planning is used to validate and correct action plans. With this integration, it is possible to guarantee that the execution of a task is not started if there is no admissible configuration plan. In other words, it is possible to know beforehand whether a plan is executable or not. Using a loose integration also makes it easy to replace the current planners with other generation techniques if this is desirable. We have demonstrated the approach in the PEIS-Ecology framework, but it applies to generic multi-robot systems as long as the robots are able to share their functionalities with each other. An important limitation of the current implementation is that we only consider the execution of a single top-level task. In general, several tasks might be performed concurrently, and new tasks might dynamically appear. A natural extension of the current framework would be to use task allocation techniques to assign different tasks to different configuration processes. With such an extension, issues such as resource handling, conflict resolution and deadlocks must also be considered. Our next step is to consider multiple top-level tasks.
ACKNOWLEDGEMENTS This work was supported by the Swedish National Graduate School in Computer Science (CUGS).
REFERENCES
[1] D. Baker, G. McKee, and P. Schenker, 'Network robotics, a framework for dynamic distributed architectures', in Proc of the IEEE/RSJ Int Conf on Intelligent Robots and Systems, pp. 1768–1773, (2004).
[2] S. Cambon, F. Gravot, and R. Alami, 'A robot task planner that merges symbolic and geometric reasoning', in Proc of the European Conf on AI, pp. 895–899, (2004).
[3] M. Gritti, M. Broxvall, and A. Saffiotti, 'Reactive self-configuration of an ecology of robots', in ICRA Workshop on Network Robot Systems, (2007).
[4] L. Karlsson, 'Conditional progressive planning under uncertainty', in Proc of the Int Joint Conf on Artificial Intelligence (IJCAI), pp. 431–438, (2001).
[5] D. Kim, S. Park, Y. Jin, H. Chang, Y.-S. Park, I.-Y. Ko, K. Lee, J. Lee, Y.-C. Park, and S. Lee, 'SHAGE: a framework for self-managed robot software', in Proc of the Int Workshop on Self-Adaptation and Self-Managing Systems, (2006).
[6] R. Lundh, L. Karlsson, and A. Saffiotti, 'Plan-based configuration of a group of robots', in Proc of the 17th European Conf on Artificial Intelligence (ECAI), pp. 683–687, (2006).
[7] R. Lundh, L. Karlsson, and A. Saffiotti, 'Dynamic self-configuration of an ecology of robots', in Proc of the IEEE/RSJ Int Conf on Intelligent Robots and Systems, pp. 3403–3409, (2007).
[8] B. Morisset and M. Ghallab, 'Learning how to combine sensory-motor functions into a robust behavior', Artificial Intelligence, 172(4-5), 392–412, (2008).
[9] D. Nau, Y. Cao, A. Lotem, and H. Munoz-Avila, 'SHOP: simple hierarchical ordered planner', in Proc of the Int Joint Conf on Artificial Intelligence (IJCAI), pp. 968–973, (1999).
[10] L. E. Parker and F. Tang, 'Building multi-robot coalitions through automated task solution synthesis', Proc of the IEEE, special issue on Multi-Robot Systems, 94(7), 1289–1305, (2006).
[11] The PEIS Ecology Project. Official web site. www.aass.oru.se/~peis/.
[12] Player/Stage Project. playerstage.sourceforge.net/.
[13] A. Saffiotti and M. Broxvall, 'PEIS ecologies: Ambient intelligence meets autonomous robotics', in Proc of the Int Conf on Smart Objects and Ambient Intelligence (sOc-EUSAI), pp. 275–280, (2005).
[14] A. Saffiotti, K. Konolige, and E. H. Ruspini, 'A multivalued-logic approach to integrating planning and control', Artificial Intelligence, 76(1-2), 481–526, (1995).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-621
Structure segmentation and recognition in images guided by structural constraint propagation
Olivier Nempont¹ and Jamal Atif² and Elsa Angelini¹ and Isabelle Bloch¹
Abstract. In some application domains, such as medical imaging, the objects that compose the scene are known, as well as some of their properties and their spatial arrangement. We can take advantage of this knowledge to perform the segmentation and recognition of structures in medical images. We propose here to formalize this problem as a constraint network, and we perform the segmentation and recognition by iterative domain reductions, the domains being sets of regions. For computational purposes we represent the domains by their upper and lower bounds, and we iteratively reduce the domains by updating their bounds. We show some preliminary results on normal and pathological brain images.
1 INTRODUCTION
Image segmentation and recognition is a key problem in scene interpretation. In some application domains, such as medical imaging, the objects that compose the scene are known, as well as some of their properties and their spatial arrangement. This knowledge may be properly encoded as a symbolic graph, and two main approaches can then be derived. The first one consists in matching this graph representation with image regions obtained from a preliminary segmentation (e.g. [5]). Since it is usually difficult to segment the image into semantically meaningful entities, this type of approach often relies on an over-segmentation, which makes the matching more complex (no isomorphism can be expected). The second type of approach uses the graph as a guide in a sequential process. In [4], the structures are sequentially segmented using a deformable model, which is constrained to fulfill some spatial relations with previously segmented structures. However, the result is highly dependent on the segmentation order, and the segmentation of one structure cannot benefit from partial information available about structures that have not been segmented yet. In this paper, we propose a new method to overcome these limitations. The idea is to express the problem as a constraint propagation process, exploiting the capability of constraint networks to solve combinatorial problems [18]. The propagation can be performed either by adding or simplifying constraints, or by reducing the domains of variables. In the scope of qualitative spatial reasoning, the first option has been investigated in particular to solve satisfiability problems, for instance with RCC-8 relations [16] or qualitative relative positions [10]. We propose here to investigate the second option, i.e. the iterative reduction of the variable domains. We first recall in Section 2 some definitions on structural representations. Section 3 is the core of the paper: we define the constraint network, domains, domain bounds, the structural constraints and the contracting operators for several types of structural knowledge. In Section 4 we present a propagation process and a decision process based on minimal surface extraction. In Section 5 some preliminary segmentation and recognition results are presented on brain magnetic resonance images (MRI).
¹ Telecom ParisTech, CNRS UMR 5141 LTCI, Paris, email: {olivier.nempont, isabelle.bloch, elsa.angelini}@telecom-paristech.fr
² Unité ESPACE S140, IRD-Cayenne/UAG, Guyane Française, email: jamal.atif@gmail.com
2 PRELIMINARIES
Structural Knowledge Representation – The structural arrangement of anatomical structures is known and almost stable, even in the presence of a pathology. This knowledge, supposed to be consistent, can be appropriately encoded by a hypergraph [7] where the vertices correspond to spatial objects and the edges (between one or several nodes) may represent:
• known properties of objects, such as connectivity or an a priori range of volumes,
• relative positions between structures,
• appearance properties, such as homogeneity or contrast.
Such characteristics depend on the imaging modality (MRI in our example). Since such knowledge is usually expressed in linguistic terms (in anatomical textbooks for instance [19]), fuzzy sets constitute an appealing framework for its formal modeling: to represent spatial relations, and to account for different types of imprecision, related to the imperfections of the image and to the intrinsic vagueness of some relations [1]. Membership functions defining these fuzzy sets can be learned from a database of examples.
Fuzzy Sets [6] – Let X be a bounded subset of $\mathbb{Z}^n$. A fuzzy set on X will be denoted by its membership function $\mu : X \to [0,1]$. We denote α-cuts by $\mu_\alpha$ and by $\mathcal{F}$ the set of fuzzy sets defined on X. $(\mathcal{F}, \le)$ is a complete lattice for the usual order on fuzzy sets. The supremum ∨ and infimum ∧ are the max and min respectively. The smallest element is denoted by $0_\mathcal{F}$ and the largest by $1_\mathcal{F}$. We denote the fuzzy complementation by $c(\mu)(x) = 1 - \mu(x)$, the Lukasiewicz t-norm by $\top(x,y) = \max(0, x+y-1)$ and t-conorm by $\bot(x,y) = \min(1, x+y)$, for x, y in [0, 1].
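As a small illustration of these definitions, the following snippet implements the complementation, the Lukasiewicz t-norm and t-conorm, and α-cuts for membership functions stored as arrays; it is a toy sketch, not taken from the paper.

```python
import numpy as np

def complement(mu):          # c(mu)(x) = 1 - mu(x)
    return 1.0 - mu

def t_norm(x, y):            # Lukasiewicz t-norm: max(0, x + y - 1)
    return np.maximum(0.0, x + y - 1.0)

def t_conorm(x, y):          # Lukasiewicz t-conorm: min(1, x + y)
    return np.minimum(1.0, x + y)

def alpha_cut(mu, alpha):    # crisp set {x | mu(x) >= alpha}
    return mu >= alpha

mu1 = np.array([0.2, 0.7, 1.0, 0.4])
mu2 = np.array([0.5, 0.9, 0.3, 0.0])
sup = np.maximum(mu1, mu2)   # supremum in the lattice (F, <=)
inf = np.minimum(mu1, mu2)   # infimum
```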
3 STRUCTURAL RECOGNITION PROBLEM AS A CONSTRAINT NETWORK
3.1 Structural segmentation and recognition problem
Let $I : X \to \mathbb{R}^+$ be a grey level image. We want to extract a set of N structures $\chi = \{O_i \mid i \in [1..N]\}$ present in that image. Each of these
variables $O_i$ is represented as a fuzzy subset $\mu_i \in \mathcal{F}$ of X and takes values in a domain $D_i \subseteq \mathcal{F}$. The set of domains associated with χ is denoted by D. This recognition problem is constrained by the prior knowledge described in Section 2. Let us assume for instance that the knowledge base contains the relation "A is to the right of B". The recognition then amounts to finding two fuzzy sets $\mu_1$ and $\mu_2$ satisfying the binary constraint $C^{dir}_{A,B}(\mu_1, \mu_2) = 1$. The formal expression of these constraints is described in Section 3.3 for several types of relations. We will denote by C the set of constraints. Our segmentation and recognition problem can thus be associated with a constraint network $\langle \chi, D, C \rangle$. A solution $\{\mu_i \mid \mu_i \in D_i, i \in [1..N]\}$ of our problem has to fulfill all constraints. Ideally this problem would have a unique solution; however, it is generally under-constrained and different solutions are possible. Through contracting operators we will simplify our problem to obtain domains as close as possible to the set of solutions. In the following we always assume that the problem is satisfiable.
3.2 Domain definition
The definition above involves the representation and the manipulation of domains which are subsets of $\mathcal{F}$. In practice, membership values are discretized, and if k is the cardinality of the current discretization of [0, 1] and n the cardinality of X, the cardinality of $\mathcal{F}$ is then $k^n$ ($10^{131072}$ for the 2D examples presented in Section 5). Handling such a set is generally not computationally tractable and we have to consider a simplified version of it. In [15], the authors represent this subset by its Minimum Bounding Rectangle (MBR), i.e. the smallest rectangle in 2D that includes all elements of the domain. This very compact representation is nevertheless not able to capture the geometry of objects and provides a poor representation (consider for instance a diagonal line) that will limit the efficiency of the constraint propagation process. Considering the lattice structure of $\mathcal{F}$, we propose here to define the domain bounds as the supremum and infimum of fuzzy sets over the domain. Let $D_A \subseteq \mathcal{F}$ be the domain associated with an object A. We define the upper bound $\overline{A}$ of $D_A$ as $\overline{A} = \vee\{\nu \in D_A\}$; it can also be interpreted as an over-estimation of $\mu_A$. The lower bound $\underline{A}$ is defined as $\underline{A} = \wedge\{\nu \in D_A\}$ and is an under-estimation of $\mu_A$. We can notice that $\forall \nu \in D_A, \underline{A} \le \nu \le \overline{A}$. For instance, a tiny domain for the left lateral ventricle LVl (delineated in Figure 1(a)) is defined as the six fuzzy sets in (b); note that the third one is $\mu_{LVl}$. The lower and upper bounds $(\underline{LVl}, \overline{LVl})$ of this domain are presented in (c). Based on these notations, we represent the domain associated with a structure A by its bounds: $(\underline{A}, \overline{A}) = \{\nu \in \mathcal{F} \mid \underline{A} \le \nu \le \overline{A}\}$. Note that if $\underline{A} \not\le \overline{A}$, the domain $(\underline{A}, \overline{A})$ is empty and the problem is unsatisfiable.

Figure 1. A cropped axial view of a brain MRI. (a) Contour of the left lateral ventricle (LVl). (b) A domain for LVl that contains six fuzzy sets. (c) Lower bound $\underline{LVl}$ and upper bound $\overline{LVl}$.

Let $(\underline{A}^1, \overline{A}^1)$ and $(\underline{A}^2, \overline{A}^2)$ be two non-empty domains for the structure A. We consider the following partial order: $(\underline{A}^1, \overline{A}^1) \preceq (\underline{A}^2, \overline{A}^2)$ if $\forall x \in X, \underline{A}^1(x) \ge \underline{A}^2(x)$ and $\overline{A}^1(x) \le \overline{A}^2(x)$. The associated supremum and infimum operators are respectively defined as $(\underline{A}^1, \overline{A}^1) \vee (\underline{A}^2, \overline{A}^2) = (\underline{A}^1 \wedge \underline{A}^2, \overline{A}^1 \vee \overline{A}^2)$ and $(\underline{A}^1, \overline{A}^1) \wedge (\underline{A}^2, \overline{A}^2) = (\underline{A}^1 \vee \underline{A}^2, \overline{A}^1 \wedge \overline{A}^2)$.
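The bound representation translates directly into code: a domain is stored as a pair of arrays, and the emptiness test and the lattice operations on domains reduce to pointwise comparisons. The sketch below is an assumption-level illustration of this idea, not the authors' implementation.

```python
import numpy as np

class Domain:
    """A domain (lower, upper) standing for {nu | lower <= nu <= upper}."""

    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper

    def is_empty(self):
        # empty as soon as the lower bound exceeds the upper bound somewhere
        return bool(np.any(self.lower > self.upper))

    def meet(self, other):
        # infimum for the partial order on domains: tightest common domain
        return Domain(np.maximum(self.lower, other.lower),
                      np.minimum(self.upper, other.upper))

    def join(self, other):
        # supremum: loosest domain containing both
        return Domain(np.minimum(self.lower, other.lower),
                      np.maximum(self.upper, other.upper))
```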
3.3 Contracting operators
3.3.1 General issues
The constraints involved in the knowledge base are expressed as symbolic relations. Each constraint is defined as a function $C : \mathcal{F}^k \to \{0,1\}$ if k objects are involved in the relation. As detailed below, it will be expressed in terms of the fuzzy sets representing the objects and the spatial or appearance relations. Due to the size of the domains, contracting operators that exhaustively browse the domains (to achieve arc consistency for instance) cannot be applied. We thus define weaker contracting operators that compute new domain bounds from the initial domain bounds. A contracting operator is written as
$$\frac{\langle \psi; D; C \rangle}{\langle \psi; D'; C \rangle},$$
where ψ is the set of variables involved in the set of constraints C, and D and D' are the associated domains represented by their bounds, with $D' \preceq D$. Notice that the contracting operators will generally achieve neither arc consistency nor 2B-consistency [9]: the domain may contain two values that fulfill all constraints, whereas their supremum or infimum does not necessarily.
3.3.2 Directional relative position
In [1] a method to characterize the directional relative position between objects using mathematical morphology was proposed. Suppose for instance that the caudate nucleus CNl (delineated in Figure 2(a)) is located on the right of the left ventricle LVl (delineated by a dashed line). The relation "on the right" can be characterized by a structuring element ν. The fuzzy dilation $\delta_\nu(\mu_{LVl})$ of $\mu_{LVl}$ by ν (displayed in (b)) defines a fuzzy set that corresponds to the points on the right of LVl. We consider that such a relation from an object A to an object B is satisfied if it is satisfied for all points of B, and we also impose that B is included in the complement of A. The associated constraint can be defined as:
$$C^{dir}_{A,B}(\mu_1, \mu_2) = \begin{cases} 1 & \text{if } \mu_2 \le \top(\delta_\nu(\mu_1), c(\mu_1)), \\ 0 & \text{otherwise.} \end{cases}$$
Suppose that the objects A and B are respectively defined over the domains $(\underline{A}, \overline{A})$ and $(\underline{B}, \overline{B})$. The elements μ of $(\underline{B}, \overline{B})$ that satisfy $C^{dir}_{A,B}$ according to the current domain of A are such that $\exists \zeta \in (\underline{A}, \overline{A}), \mu \le \top(\delta_\nu(\zeta), c(\zeta))$, hence $\mu \le \top(\delta_\nu(\overline{A}), c(\underline{A}))$, since the dilation and ⊤ are increasing and the complementation is decreasing. The contracting operator associated with the constraint $C^{dir}_{A,B}$ is derived from this inequality.
DIRECTION CONTRACTING OPERATOR:
$$\frac{\langle A, B; (\underline{A}, \overline{A}), (\underline{B}, \overline{B}); C^{dir}_{A,B} \rangle}{\langle A, B; (\underline{A}, \overline{A}), (\underline{B}, \overline{B} \wedge \top(\delta_\nu(\overline{A}), c(\underline{A}))); C^{dir}_{A,B} \rangle}$$
Considering the same example, Figure 2 shows the upper bounds $\overline{LVl}$ (c) and $\overline{CNl}$ (d) of the domains of LVl and CNl (the lower bound is here the empty set). The dilation $\delta_\nu(\overline{LVl})$ is displayed in (e), and we can see in (f) the updated upper bound $\overline{CNl}$. The definition of the initial bounds will be addressed in Section 4.
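To make the operator concrete, here is a naive 1-D sketch: a fuzzy dilation of the upper bound of A by the structuring element ν, combined by the Lukasiewicz t-norm with the complement of the lower bound of A, shrinks the upper bound of B. The dilation below is a simplistic stand-in for the morphological operations actually used in [1], given only as an assumption-level illustration.

```python
import numpy as np

def fuzzy_dilate(mu, nu):
    # naive fuzzy dilation: out(x) = max_dx min(mu(x - dx), nu(dx))
    out = np.zeros(len(mu))
    for x in range(len(mu)):
        for dx in range(len(nu)):
            y = x - dx
            if 0 <= y < len(mu):
                out[x] = max(out[x], min(mu[y], nu[dx]))
    return out

def lukasiewicz(x, y):
    return np.maximum(0.0, x + y - 1.0)

def contract_direction(A_low, A_up, B_low, B_up, nu):
    # B's upper bound is intersected with T(dilate(upper(A)), c(lower(A)))
    allowed = lukasiewicz(fuzzy_dilate(A_up, nu), 1.0 - A_low)
    return B_low, np.minimum(B_up, allowed)  # only the upper bound shrinks
```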
Figure 2. A cropped axial view of a brain MRI. (a) Contours of the left lateral ventricle (LVl) and the left caudate nucleus (CNl). (b) Fuzzy set that represents the points on the right of LVl. (c) $\overline{LVl}$. (d) $\overline{CNl}$. (e) On the right of $\overline{LVl}$. (f) $\overline{CNl}$ updated.

3.3.3 Distances
Distances from fuzzy objects may be computed using mathematical morphology [1]. Let us assume that we have some knowledge about the distance between two objects A and B, which can be modeled as a fuzzy interval. The region of space satisfying such a relation to a reference object $\mu_1$ is defined as the set difference between two dilations, using two structuring elements $\nu_1$ and $\nu_2$ defined in the spatial domain and derived from the fuzzy interval: $\top(c(\delta_{\nu_1}(\mu_1)), \delta_{\nu_2}(\mu_1))$. Two fuzzy sets $\mu_1$ and $\mu_2$ satisfy the distance constraint between A and B if:
$$C^{dist}_{A,B}(\mu_1, \mu_2) = \begin{cases} 1 & \text{if } \mu_2 \le \top(c(\delta_{\nu_1}(\mu_1)), \delta_{\nu_2}(\mu_1)), \\ 0 & \text{otherwise.} \end{cases}$$
DISTANCE CONTRACTING OPERATOR:
$$\frac{\langle A, B; (\underline{A}, \overline{A}), (\underline{B}, \overline{B}); C^{dist}_{A,B} \rangle}{\langle A, B; (\underline{A}, \overline{A}), (\underline{B}, \overline{B} \wedge \top(c(\delta_{\nu_1}(\underline{A})), \delta_{\nu_2}(\overline{A}))); C^{dist}_{A,B} \rangle}$$

3.3.4 Inclusion
Consider now two objects A and B with A included in B. The associated constraint can be expressed as:
$$C^{in}_{A,B}(\mu_1, \mu_2) = \begin{cases} 1 & \text{if } \mu_1 \le \mu_2, \\ 0 & \text{otherwise.} \end{cases}$$
INCLUSION CONTRACTING OPERATOR:
$$\frac{\langle A, B; (\underline{A}, \overline{A}), (\underline{B}, \overline{B}); C^{in}_{A,B} \rangle}{\langle A, B; (\underline{A}, \overline{A} \wedge \overline{B}), (\underline{B} \vee \underline{A}, \overline{B}); C^{in}_{A,B} \rangle}$$
The inclusion prior can be extended to a partition prior, for instance if an object A can be decomposed into subparts $\{B_i\}$.

3.3.5 Connectivity
If A is a connected object, its domain can be restricted to connected fuzzy sets (definitions of fuzzy connectivity can be found in [17, 14]). We denote by $\mathcal{H}$, $\mathcal{H} \subseteq \mathcal{F}$, the set of connected fuzzy sets.
$$C^{conn}_{A}(\mu_1) = \begin{cases} 1 & \text{if } \mu_1 \in \mathcal{H}, \\ 0 & \text{otherwise.} \end{cases}$$
A new upper bound can be obtained as $\xi^1_A(\overline{A}) = \bigvee\{\nu \in \mathcal{H} \mid \underline{A} \le \nu \le \overline{A}\}$. However, it can be shown that this filter is not robust (a small error on $\overline{A}$ may cause a large error on the result). As discussed in [14], we prefer the following formulation: $\xi^2_A(\overline{A}) = \bigvee\{\nu \in \mathcal{H} \mid \nu \le \overline{A} \text{ and } \max_{x \in X} \nu(x) \le \mu_\le(\underline{A}, \nu)\}$, where $\mu_\le$ stands for the Lukasiewicz implicator, i.e. $\mu_\le(\underline{A}, \nu) = \min_{x \in X} \min(1, 1 - \underline{A}(x) + \nu(x))$.
CONNECTIVITY CONTRACTING OPERATOR:
$$\frac{\langle A; (\underline{A}, \overline{A}); C^{conn}_{A} \rangle}{\langle A; (\underline{A}, \xi^2_A(\overline{A})); C^{conn}_{A} \rangle}$$

3.3.6 Volume
A volume prior is represented as a membership function $\mu_{V_{min}} : \mathbb{R}^+ \to [0,1]$. The constraint is formulated as (see [14] for details):
$$C^{vol}_{A}(\mu) = \begin{cases} 1 & \text{if } \max_{x \in X} \mu(x) \le \max_{v \in \mathbb{R}^+} \min(\mu_V(\mu)(v), \mu_{V_{min}}(v)), \\ 0 & \text{otherwise,} \end{cases}$$
where $\mu_V(\mu)(v) = \sup\{\alpha \mid |\mu_\alpha| \ge v\}$, $|\mu_\alpha|$ denoting the cardinality (i.e. the volume) of the α-cut $\mu_\alpha$. The reduction of the domain to the fuzzy sets that satisfy this prior will generally not change the bounds. However, if we also suppose that the object is connected, the upper bound can be filtered according to $\xi_{\mu_{V_{min}}}(\overline{A}) = \bigvee\{\nu \in \mathcal{H} \mid \nu \le \overline{A} \text{ and } C^{vol}_A(\nu) = 1\}$.
VOLUME AND CONNECTIVITY CONTRACTING OPERATOR:
$$\frac{\langle A; (\underline{A}, \overline{A}); C^{conn}_{A} \wedge C^{vol}_{A} \rangle}{\langle A; (\underline{A}, \xi_{\mu_{V_{min}}}(\overline{A})); C^{conn}_{A} \wedge C^{vol}_{A} \rangle}$$

3.3.7 Adjacency
A degree of adjacency between A and B can be defined as [1]: $\mu_{adj}(\mu_A, \mu_B) = \sup_{x,y \in X} \min(\mu_A(x), \mu_B(y), n(x,y))$, where $n(x,y)$ stands for a connectivity degree between two points x and y of X. We define the following constraint:
$$C^{adj}_{A,B}(\mu_1, \mu_2) = \begin{cases} 1 & \text{if } \min(\max_{x \in X} \mu_1(x), \max_{x \in X} \mu_2(x)) = \mu_{adj}(\mu_1, \mu_2), \\ 0 & \text{otherwise.} \end{cases}$$
As in the volume case, a domain reduction by an adjacency constraint does not affect its bounds. Therefore, we also consider adjacency jointly with a connectivity prior, and define the following filter: $\xi^{adj}_A(\overline{B}) = \bigvee\{\nu \in \mathcal{H} \mid \nu \le \overline{B} \text{ and } C^{adj}_{A,B}(\overline{A}, \nu) = 1\}$.
ADJACENCY AND CONNECTIVITY CONTRACTING OPERATOR:
$$\frac{\langle A, B; (\underline{A}, \overline{A}), (\underline{B}, \overline{B}); C^{adj}_{A,B} \wedge C^{conn}_{A} \wedge C^{conn}_{B} \rangle}{\langle A, B; (\underline{A}, \xi^{adj}_B(\overline{A})), (\underline{B}, \xi^{adj}_A(\overline{B})); C^{adj}_{A,B} \wedge C^{conn}_{A} \wedge C^{conn}_{B} \rangle}$$

3.3.8 Contrast
The following constraint will play a key role in the propagation process, since it will be computed from image data. We suppose here
that the contrast between the structures is roughly known and stable, which is the case in MRI (the lateral ventricles are for instance hypo-intense compared with the white matter on T1-weighted MRI). We first define the grey level membership function associated with a spatial object as $\mu^I_A(v) = \sup_{x \in X, I(x)=v} \mu_A(x)$, where I is the intensity function and v a grey level value (conversely, a spatial membership function μ can be obtained from a grey level one $\mu^I$ as $\mu(x) = \mu^I \circ I(x)$). We rely on Michelson's definition of contrast [12]: $c = \frac{v_1 - v_2}{v_1 + v_2}$, where $v_1$ and $v_2$ are two grey levels. According to the extension principle [20], we obtain the following membership function for the contrast between two fuzzy objects A and B, with grey level membership functions $\mu^I_A$ and $\mu^I_B$:
$$\mu^c_{A,B}(c) = \sup_{(v_1,v_2) \in \mathbb{R}^{+2},\, c = \frac{v_1 - v_2}{v_1 + v_2}} \min(\mu^I_A(v_1), \mu^I_B(v_2)).$$
Conversely, if we consider a contrast prior $\mu^c_{A,B}$, we can obtain the set of grey levels that satisfy this contrast prior from object A as $\mu^I(v) = \sup_{(v_1,v_2) \in \mathbb{R}^{+2},\, v = v_1 v_2} \min(\mu^I_A(v_1), \mu^{k^{-1}}_{A,B}(v_2))$ with $\mu^{k^{-1}}_{A,B}(v) = \sup_{c \in [-1,1],\, v = \frac{1-c}{1+c}} \mu^c_{A,B}(c)$, and from object B as $\mu^I(v) = \sup_{(v_1,v_2) \in \mathbb{R}^{+2},\, v = v_1 v_2} \min(\mu^I_B(v_1), \mu^{k}_{A,B}(v_2))$ with $\mu^k_{A,B}(v) = \sup_{c \in [-1,1],\, v = \frac{1+c}{1-c}} \mu^c_{A,B}(c)$.
$$C^{cont}_{A,B}(\mu_1, \mu_2) = \begin{cases} 1 & \text{if } \forall v \in \mathbb{R}^+, \ \mu^I_1(v) \le \sup_{(v_1,v_2) \in \mathbb{R}^{+2},\, v = v_1 v_2} \min(\mu^I_2(v_1), \mu^k_{A,B}(v_2)) \\ & \text{and } \forall v \in \mathbb{R}^+, \ \mu^I_2(v) \le \sup_{(v_1,v_2) \in \mathbb{R}^{+2},\, v = v_1 v_2} \min(\mu^I_1(v_1), \mu^{k^{-1}}_{A,B}(v_2)), \\ 0 & \text{otherwise.} \end{cases}$$
CONTRAST CONTRACTING OPERATOR:
$$\frac{\langle A, B; (\underline{A}, \overline{A}), (\underline{B}, \overline{B}); C^{cont}_{A,B} \rangle}{\langle A, B; (\underline{A}, \overline{A} \wedge (\sup_{v = v_1 v_2} \min(\mu^I_{\overline{B}}(v_1), \mu^k_{A,B}(v_2)) \circ I)), (\underline{B}, \overline{B} \wedge (\sup_{v = v_1 v_2} \min(\mu^I_{\overline{A}}(v_1), \mu^{k^{-1}}_{A,B}(v_2)) \circ I)); C^{cont}_{A,B} \rangle}$$
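As an illustration of the extension-principle construction above, the contrast membership function can be computed by brute force from two grey-level membership arrays. The binning of the contrast axis is an implementation choice made here for the example, not a detail from the paper.

```python
import numpy as np

def contrast_membership(mu_I_A, mu_I_B, n_bins=41):
    # mu^c_{A,B}(c) = sup over (v1, v2) with c = (v1 - v2) / (v1 + v2)
    #                 of min(mu^I_A(v1), mu^I_B(v2))
    bins = np.linspace(-1.0, 1.0, n_bins)
    mu_c = np.zeros(n_bins)
    for v1, a in enumerate(mu_I_A):
        for v2, b in enumerate(mu_I_B):
            if v1 + v2 == 0:
                continue  # contrast undefined for v1 = v2 = 0
            c = (v1 - v2) / (v1 + v2)             # Michelson contrast
            k = int(np.argmin(np.abs(bins - c)))  # nearest bin
            mu_c[k] = max(mu_c[k], min(a, b))
    return bins, mu_c
```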
4 CONSTRAINT PROPAGATION
We describe here a simple propagation algorithm to perform the segmentation and recognition of a set of structures χ. First we initialize the domains of these structures to $(0_\mathcal{F}, 1_\mathcal{F})$ and we restrict the set of constraints to those that involve only variables in χ. The constraints are then sequentially applied to reduce the variable domains, i.e. to reduce the upper bound and increase the lower one. The constraints could be applied sequentially without any ordering; however, in most cases the constraint computation would be useless and time-consuming. Different factors may influence the benefit of computing a constraint. Among them we consider the amount of change (since the last computation) of the bounds of the variables involved in the constraint³ and the computation cost CC of the constraint (a function of the complexity of each involved operation, such as dilations). We define a priority P for each constraint, initialized to $P(0) = \frac{card(X)}{CC}$. At each step of the propagation process the highest-priority constraint is selected and the associated contracting operator is computed. The priority of the constraint is then set to 0. The application of this contracting operator may induce changes on the domains of its variables. When this occurs, the priority P of each constraint that depends on one of the changed variables is updated as follows:
$$P(i+1) = P(i) + \frac{\sum_{x \in X} (\overline{A}^1(x) - \overline{A}^2(x)) + (\underline{A}^2(x) - \underline{A}^1(x))}{CC},$$
where $(\underline{A}^1, \overline{A}^1)$ and $(\underline{A}^2, \overline{A}^2)$ are respectively the domains before and after a change on the variable, and P(i) is the priority value at step i. The process stops when the priority of all constraints is equal to 0.
³ In the AC-3 algorithm [11], the list of constraints to update would correspond to those with a non-zero amount of change.
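The scheduling policy can be implemented with a max-priority queue over constraints, as in the following sketch; the constraint interface (a contract method returning the amount of change per variable, size and cost attributes) is a hypothetical abstraction of the operators in Section 3.3, not the authors' code.

```python
import heapq

def propagate(constraints, domains):
    # P(0) = card(X) / CC, here abstracted as size / cost
    prio = {c: c.size / c.cost for c in constraints}
    heap = [(-p, id(c), c) for c, p in prio.items()]
    heapq.heapify(heap)
    while heap:
        neg_p, _, c = heapq.heappop(heap)
        if -neg_p != prio[c] or prio[c] == 0.0:
            continue                        # stale or exhausted entry
        prio[c] = 0.0                       # applied: priority reset to 0
        changes = c.contract(domains)       # apply the contracting operator
        for var, amount in changes.items():
            for d in var.constraints:       # constraints sharing this variable
                if d is not c:
                    prio[d] += amount / d.cost   # P(i+1) = P(i) + change / CC
                    heapq.heappush(heap, (-prio[d], id(d), d))
    return domains                          # stops when all priorities are 0
```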
Figure 3. (a) $\overline{LVl}$. (b) $\mu^I_{\overline{LVl}}$. (c) Original $\mu^I_{\overline{WMl}}$ (plain) and updated one (dashed). $\overline{WMl}$ before (d) and after (e) application of the contrast contracting operator.
This is illustrated in Figure 3. Suppose for instance that the fuzzy set displayed in (a) is the upper bound $\overline{LVl}$ of the domain of the left lateral ventricle. The associated grey level membership function $\mu^I_{\overline{LVl}}$ is shown in (b). An upper bound $\overline{WMl}$ for the left white matter structures is displayed in (d) (the contour of WMl is also shown) and $\mu^I_{\overline{WMl}}$ in (c). The application of the contrast contracting operator restricts $\mu^I_{\overline{WMl}}$ to the membership function in (c) (dashed), which corresponds to the updated $\overline{WMl}$ (e).
Figure 4. $\overline{LVl}$ (left) and $\underline{LVl}$ (right) at steps 0 (a), 500 (b), 1000 (c), 2500 (d), 10000 (e) and 20000 (f) of the propagation process. The target object LVl is delineated.
Ideally the upper and lower bounds of the different domains will converge to the same fuzzy set. However this will generally not occur and there remains some indecision at least on object boundaries. Even if the propagation significantly reduces the search space, it is still time consuming to apply a backtracking algorithm to extract an optimal solution according to some cost function. Therefore we propose to refine the segmentation of each structure by using the method proposed in [13], based on minimal surface optimization [3]. The segmentation problem consists in finding the closed curve that minimizes a metric based on the obtained bounds. This can efficiently be solved using a graph-cuts based method [2] for instance.
5 PRELIMINARY RESULTS ON NORMAL AND PATHOLOGICAL BRAIN
We illustrate here some preliminary results on 2D brain MRI. Our knowledge base contains about 3000 relations involving 34 variables that correspond to structures visible on MRI. The left caudate nucleus, for instance, is strictly on the right of the left lateral ventricle, fairly on the left of the putamen, much brighter than the lateral ventricle, darker than the white matter and somewhat darker than the putamen. We now describe the recognition process for a few structures of the 2D brain MRI presented in Figure 5(a). We suppose that the brain was previously extracted. The associated domain is defined as a singleton; its lower and upper bounds are thus equal. We initialized all other domains to $(0_\mathcal{F}, 1_\mathcal{F})$. The propagation is then performed, completing in about 5 hours on a 3.0 GHz Pentium 4 CPU. We show in Figure 4 the upper and lower bounds of the left lateral ventricle at different steps of the propagation process. The prior information provides a good discrimination from other structures, and the upper and lower bounds are close to the solution at the end of the propagation. The extraction of a crisp segmentation can then easily be performed using the method in [13]. We show in Figure 5(b) the segmentation results for the internal structures. We also show a result on a case affected by a brain tumor in Figure 5(c-d). The tumor induces various degrees of deformation and may also involve structural modifications. The case presented here is affected by a cortical tumor, which was previously extracted [8]. We modify the knowledge base only to state that the tumor is a subpart of the brain; we do not modify the other relations. The segmentation results for internal structures are shown in Figure 5(d). We can observe that the result remains correct, despite the shape modification induced on some structures by the tumor.
6 CONCLUSION
We have proposed in this paper a new formulation of the segmentation and recognition task, in the case of a known structural arrangement, as the resolution of a constraint network. Preliminary results were shown on 2D brain MRI. They illustrate that the constraint propagation is very efficient in providing domain bounds close to the objects, thus considerably reducing the search space. Future work aims at improving the efficiency of the propagation process to make it applicable in 3D cases. A deeper study of pathological cases will also be performed, in particular to account for strong structural changes in the internal structures potentially induced by subcortical tumors.
ACKNOWLEDGEMENTS This work has been partly supported by a grant from INCA.
Figure 5. (a) 2D T1 weighted brain MRI. (b) Cropped view of segmentation results for the internal structures. (c) 2D MRI of a brain affected by a tumor. (d) Segmentation results for internal structures.
REFERENCES
[1] I. Bloch, 'Spatial Reasoning under Imprecision using Fuzzy Set Theory, Formal Logics and Mathematical Morphology', International Journal of Approximate Reasoning, 41, 77–95, (2006).
[2] Y. Boykov and V. Kolmogorov, 'Computing geodesics and minimal surfaces via graph cuts', in IEEE International Conference on Computer Vision, ICCV, pp. 26–33, Nice, France, (jun 2003).
[3] V. Caselles, R. Kimmel, and G. Sapiro, 'Geodesic active contours', in IEEE International Conference on Computer Vision, ICCV, pp. 694–699, Boston, MA, USA, (1995).
[4] O. Colliot, O. Camara, and I. Bloch, 'Integration of Fuzzy Spatial Relations in Deformable Models - Application to Brain MRI Segmentation', Pattern Recognition, 39, 1401–1414, (2006).
[5] A. Deruyver, 'Adaptive pyramid and semantic graph: knowledge driven segmentation', in Graph-based Representations in Pattern Recognition, GbR, volume LNCS 3434, pp. 213–223, Poitiers, France, (apr 2005).
[6] D. Dubois and H. Prade, Fuzzy Sets and Systems: Theory and Applications, Academic Press, New York, 1980.
[7] C. Hudelot, J. Atif, O. Nempont, B. Batrancourt, E. Angelini, and I. Bloch, 'GRAFIP: a Framework for the Representation of Healthy and Pathological Anatomical and Functional Cerebral Information', in Human Brain Mapping, HBM, Florence, Italy, (jun 2006).
[8] H. Khotanlou, O. Colliot, J. Atif, and I. Bloch, '3D Brain Tumor Segmentation in MRI Using Fuzzy Classification, Symmetry Analysis and Spatially Constrained Deformable Models', to appear in Fuzzy Sets and Systems.
[9] O. Lhomme, 'Consistency Techniques for Numeric CSPs', in International Joint Conference on Artificial Intelligence, IJCAI, pp. 232–238, Chambéry, France, (1993).
[10] G. Ligozat, 'Reasoning about Cardinal Directions', Journal of Visual Languages and Computing, 9(1), 23–44, (1998).
[11] A.K. Mackworth, 'Consistency in networks of relations', Artificial Intelligence, 8(1), 99–118, (feb 1977).
[12] A. Michelson, Studies in Optics, Chicago University Press, 1927.
[13] O. Nempont, J. Atif, E. Angelini, and I. Bloch, 'Combining Radiometric and Spatial Structural Information in a New Metric for Minimal Surface Segmentation', in Information Processing in Medical Imaging, IPMI, volume LNCS 4584, pp. 283–295, Kerkrade, The Netherlands, (jul 2007).
[14] O. Nempont, J. Atif, E. Angelini, and I. Bloch, 'A New Fuzzy Connectivity Class. Application to Structural Recognition in Images', in Discrete Geometry for Computer Imagery, DGCI, volume LNCS 4992, pp. 446–457, Lyon, France, (2008).
[15] D. Papadias, T. Sellis, Y. Theodoridis, and M.J. Egenhofer, Topological relations in the world of minimum bounding rectangles: a study with R-trees, ACM Press, New York, NY, USA, 1995.
[16] J. Renz and B. Nebel, 'On the complexity of qualitative spatial reasoning: A maximal tractable fragment of the Region Connection Calculus', Artificial Intelligence, 108(1-2), 69–123, (1999).
[17] A. Rosenfeld, 'Fuzzy Digital Topology', Information and Control, 40, 76–87, (1979).
[18] F. Rossi, P. Van Beek, and T. Walsh, Handbook of Constraint Programming, Elsevier Science, 2006.
[19] S.G. Waxman, Correlative Neuroanatomy, McGraw-Hill, New York, 24th edn., 2000.
[20] L. A. Zadeh, 'The Concept of a Linguistic Variable and its Application to Approximate Reasoning', Information Sciences, 8, 199–249, (1975).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-626
Theoretical Study of Ant-based Algorithms for Multi-Agent Patrolling
Arnaud Glad and Olivier Simonin and Olivier Buffet and François Charpillet¹
Abstract. This paper addresses the multi-agent patrolling problem, which consists, for a set of autonomous agents, in visiting all the places of an unknown environment as regularly as possible. The proposed approach is based on the ant paradigm: each agent can only mark and move according to its local perception of the environment. We study EVAW, a pheromone-based variant of the EVAP [3] and VAW [12] algorithms. The main novelty of the paper is the proof of some emergent spatial properties of the proposed algorithm. In particular, we show that the obtained cycles are necessarily of the same length, which ensures an efficient spatial distribution of the agents. We also report some experimental results and discuss open questions concerning the proposed algorithm.
¹ INRIA/Nancy University, Loria Lab., MAIA project, Nancy, France, email: firstname.lastname@loria.fr
1 INTRODUCTION
Deploying autonomous agents or robots in unknown or dynamic environments is a challenging problem for a growing number of tasks (e.g. military surveillance, rescue after natural disasters, etc.). In this paper we address an important task: the patrolling of an unknown environment. It consists of several agents in charge of the surveillance of a limited area. We suppose that this area is not known in advance and that the number of agents can change dynamically, so we are looking for a patrolling approach that provides adaptability and robustness. To address such a challenge we study a bio-inspired algorithm that mimics ant mechanisms. Ants provide decentralized algorithms relying on very simple individual behaviors [6]. A particularity of ants is their ability to use the environment as a shared memory by dropping and sensing pheromones, defining temporary information (due to the evaporation process). Such a paradigm has been used to define several pheromone-based algorithms and meta-heuristics to deal with spatial or, more generally, distributed problems [5, 2, 4, 9, 10]. The patrolling problem can be defined, for a group of agents, as the problem of visiting a set of places while minimizing the time between two consecutive visits. This time is called idleness. For about ten years, several models have been proposed to deal with patrolling. Most of these approaches search offline for a policy and consider a priori known environments represented as graphs [8, 1, 7]. On the contrary, few models have been proposed to deal with unknown and dynamic environments and online computation. We can mention Wagner et al. [13, 11], who proposed ant-based algorithms (ant-walks) for the covering problem. In these papers they explored the capabilities of self-organized systems in which each agent can only read and write integers on the edges of a graph. In this paper we study such systems when the environment is a grid. So we present
the EVAP algorithm, introduced in [3], which just uses the pheromone evaporation process, and we compare it to a variant of the VAW algorithm [12]. These algorithms exhibit interesting properties: after an exploration phase, agents self-organize into stable partial cycles of equal length that completely cover the environment. As a consequence, cells are visited at a very regular frequency. As this property is desirable in the patrolling problem, our main objective is to demonstrate it formally. The paper is organized as follows. In Section 2 we introduce the multi-agent patrolling problem. Section 3 presents the EVAP and VAW ant-based algorithms for dealing with covering and patrolling problems, and we show that they have similar behaviors. In Section 4 we study emergent spatial properties of EVAW, a combination of these two algorithms, by focusing on the emergence of optimal cycles. Before concluding, Section 5 discusses some open questions about the proposed approach.
2 THE PATROLLING PROBLEM
2.1 Definition
Patrolling consists in deploying several agents in order to visit at regular time intervals some defined places of an area. It aims at gathering reliable information, seeking objects and watching over places in order to defend them against any intrusion, etc. An efficient patrol in an environment requires that the delay between two consecutive visits of a given place is minimal. Related work on multi-agent patrolling generally considers that the environment is known, two-dimensional and that it can be reduced to a graph G(V, E) (V the nodes to be visited, E the arcs defining the valid paths between nodes).
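Since the evaluation of a patrol revolves around idleness, a small helper computing the worst-case and average idleness from visit logs may be useful; the input format below is a hypothetical convenience, not part of any cited system.

```python
def idleness(visits, horizon):
    """visits: dict node -> sorted list of visit times within [0, horizon]."""
    worst, total, count = 0.0, 0.0, 0
    for times in visits.values():
        gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
        gaps.append(horizon - times[-1])   # time elapsed since last visit
        worst = max(worst, max(gaps))
        total += sum(gaps)
        count += len(gaps)
    return worst, total / count            # worst-case and average idleness

print(idleness({"a": [0, 4, 8], "b": [1, 5, 9]}, horizon=10))
```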
2.2 Covering vs. Patrolling

Figure 1. Optimal covering is not necessarily optimal patrolling.
Covering aims, for one or multiple agents, at visiting each place of the environment once within the shortest possible time. Patrolling can then be intuitively considered as the process of repeatedly covering an environment. But a simple example shows that repeating an optimal solution to cover the environment is not necessarily
optimal for patrolling. Indeed, in the case of Figure 1 we have two optimal covers, but only the second one is an optimal patrol, since the last visited cell is adjacent to the first one. Covering approaches may thus not be relevant in the scope of the patrolling problem. In the next sections we address the patrolling problem using simple agents that cannot communicate directly.
3 ANT-INSPIRED ALGORITHMS
3.1 Presentation of the Algorithms
3.1.1 The EVAP Algorithm
The EVAP algorithm has been introduced in [3]. This algorithm solves the multi-agent patrolling problem even when the environment is unknown. It is based on a digital pheromone model in which pheromones are represented as numbers whose value decreases over time (simulating the evaporation process of biological pheromones). Agents evolve in a 2D grid. They can perceive and move to the four adjacent cells representing their neighborhood (noted N (x), x being the current cell). Algorithm 1 describes the individual behavior of each agent. When an agent visits a cell, it drops a quantity Qmax of pheromone, then moves according to the negative gradient of pheromone. As the environment evaporates pheromones, with rate ρ (see Algorithm 2), the remaining quantity in a cell x (noted q(x)) represents the time elapsed since its last visit. So, an agent’s local behavior is defined by moving to the cell of its neighborhood which has not been visited for the longest time.
Figure 2. 3D illustration of the EVAP algorithm (with one agent)
Algorithm 1 EVAP Agent (situated on cell x)
A) Find a cell y in N(x) such that q(y) = min_{w∈N(x)} q(w); in case of multiple choices, make a random choice
B) Move to cell y
C) Set q(y) ← Qmax (drop the maximum quantity of pheromone)
Algorithm 2 EVAP Environment
For every cell x of the environment:
If q(x) ≠ 0 then q(x) ← ρ·q(x) (ρ ∈ ]0, 1[)
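A compact, executable sketch of both rules together (the agent rule of Algorithm 1 and the evaporation rule of Algorithm 2) on a small grid might look as follows; the grid shape and constants are arbitrary choices for the example.

```python
import random

Q_MAX, RHO = 1.0, 0.9

def neighbors(cell, grid):
    x, y = cell
    cand = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [c for c in cand if c in grid]

def evap_step(grid, agents):
    for i, cell in enumerate(agents):
        nbrs = neighbors(cell, grid)
        q_min = min(grid[c] for c in nbrs)
        dest = random.choice([c for c in nbrs if grid[c] == q_min])  # step A
        agents[i] = dest                                             # step B
        grid[dest] = Q_MAX                                           # step C
    for c in grid:               # environment: pheromone evaporation
        grid[c] *= RHO           # a no-op on cells where q(c) = 0

grid = {(x, y): 0.0 for x in range(5) for y in range(5)}
agents = [(0, 0)]
for _ in range(200):
    evap_step(grid, agents)
```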
3.1.2 The Vertex-Ant-Walk (VAW) Algorithm
In this section, we present an earlier version of the VAW algorithm (noted VAW0 in the rest of the paper), introduced by Wagner and coauthors in an appendix of [12]. The local behavior of the agents is the same as in the EVAP algorithm (gradient descent), but the dropped information is the date s(x) of the visit instead of a quantity of pheromone. So, in the VAW0 algorithm, agents must have synchronised time counters (same frequency) and start at the same time with counter t = 0.

Algorithm 3 Vertex-ant-walk0 (ant situated on cell x)
A) Find a cell y in N(x) such that s(y) = min_{w∈N(x)} s(w); in case of multiple choices, make a random choice
B) Set s(x) ← t
C) Move to cell y
D) t = t + 1

3.2 Comparison of the EVAP and VAW0 Algorithms
Let us compare both algorithms. One can see that the next cell selected by an agent is the same in both algorithms (step A). Indeed, agents follow the numerical gradient, choosing in the surrounding neighborhood the cell with the minimum value, so they necessarily choose the one which has not been visited for the longest time. Concerning the numerical fields q and s built by the algorithms, both allow one to express the elapsed time δt(x) since the last visit of a cell x:
$$\delta t(x) = \log(q(x)/Q_{max})/\log(\rho) \text{ in EVAP}, \qquad \delta t(x) = t - s(x) \text{ in VAW}_0.$$
It is then possible to express q(x) as a function of s(x) and reciprocally: there is clearly a bijection between the EVAP evaporation function and the VAW0 time function. So, we can freely swap the time computation functions of these two algorithms. However, it is important to note that, in the multi-agent case, EVAP and VAW0 are not strictly equivalent, as steps B and C are not performed in the same order: EVAP agents move and then drop pheromone, whereas VAW0 agents drop pheromone and then move to the next cell. As a consequence, two EVAP agents may only meet on the same cell in very particular topologies. On the contrary, VAW0 agents may find themselves on the same cell more often and then follow each other until some random choice has to be made. This subtle difference leads to a more efficient exploration with EVAP. We prefer EVAP because it favors exploration, yet VAW0's time computation function is easier to manipulate. As a result, we propose, and will study, the EVAW algorithm (Exploring VAW), which uses EVAP's order of operations with VAW0's time function (see Algorithm 4). Note that EVAP and EVAW exhibit identical behaviors for the same initial conditions and the same random seed.

Algorithm 4 EVAW Agent (situated on cell x)
A) Find a cell y in N(x) such that s(y) = min_{w∈N(x)} s(w); in case of multiple choices, make a random choice
B) Move to cell y
C) Set s(y) ← t
D) t = t + 1
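The bijection between the two time fields is easy to check numerically: if the pheromone field is generated from visit dates by q(x) = Qmax · ρ^(t − s(x)), then EVAP's elapsed-time formula recovers exactly VAW0's t − s(x). A short verification:

```python
import math

Q_MAX, RHO = 1.0, 0.9
t, s_x = 57, 42
q_x = Q_MAX * RHO ** (t - s_x)          # pheromone left since the last visit

dt_evap = math.log(q_x / Q_MAX) / math.log(RHO)
dt_vaw = t - s_x
assert abs(dt_evap - dt_vaw) < 1e-9     # both equal 15
```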
3.3 Known Properties
In [12], Wagner et al. proved that k VAW0 agents cover the environment in bounded time tk . This proof can be extended to show
that the algorithm performs the patrolling task (each cell will be visited at most every $t_k$ time steps). These results are also valid for the EVAW algorithm. Like Wagner et al., we have experimentally observed that the agents self-organize, so that each of them reaches a stable cycle. A cycle ζ is a finite sequence of adjacent cells that the agent repeatedly covers, some cells possibly appearing several times in the sequence. We are interested in formally studying those cycles. Before considering the multi-agent case in the next section, we start by giving a result in the single-agent case. In [11], Wagner et al. present a VAW variant (which we call VAW1) in which ants smell traces made up of a pair (μ, τ), where μ is the number of visits to the cell so far and τ the last time the cell was visited. Considering a single agent, they proved that, once a Hamiltonian cycle² has been reached, the ant repeats it forever. Using the same proof schema, we now show the same result for the EVAW algorithm. We note $s_t(x)$ the value of cell x at time t.
Proof: Assume that ζ is a Hamiltonian cycle and denote by $\zeta(t) = (x_t, x_{t+1}, \ldots, x_{t+n})$ the sequence of n+1 consecutive vertices in the tour, starting at $x_t$. The next tour starts at time t+n+1 and only depends on the gradient values along the vertices. So, to prove that the cycle is stable, we have to prove that, for vertices u, v, if $s_t(u) > s_t(v)$ then $s_{t+n}(u) > s_{t+n}(v)$. This is true since, for all u, $s_{t+n}(u) = s_t(u) + n$. So if a single Hamiltonian cycle is obtained, it remains stable forever. In the next section we study the stability of cycles (Hamiltonian or not) when several agents interact in the same environment.
² A cycle is Hamiltonian when each cell is visited exactly once.

4 STUDY OF THE MULTI-AGENT CASE

4.1 Introduction
In the multi-agent setting, cycles only interact in pairs, so we will focus on the two-agent case. We suppose for now that both agents (agt1 and agt2) remain on their own cycles (ζ1 and ζ2, of respective lengths l1 and l2). These cycles are neighbors by at least two adjacent cells. We note (c1, c2) a couple of adjacent cells such that c1 ∈ ζ1 and c2 ∈ ζ2 (see Fig. 3).

Figure 3. Two cycles of different lengths connecting in cells (c1, c2).

We will now show that the obtained cycles cannot be stable if they have different lengths, and then study the stability of equal-length cycles.

4.2 Instability of Cycles of Different Lengths
We suppose l1 < l2. Each time agt1 visits c1, it continues its cycle on cell c′1 (see Fig. 3). We make the assumption that c′1 appears only once in the cycle (which is in particular verified in Hamiltonian cycles). As a result, at time t, when agt1 is in c1, we have:
$$s_t(c'_1) = s_t(c_1) - l_1 + 1 = t - l_1 + 1. \tag{1}$$

Lemma. Under these conditions, two distinct cycles, patrolled each by one EVAW agent, will not be maintained if they have different lengths.

Proof. If agt2 breaks its cycle first, the problem is solved. Let us therefore consider that this is not the case and observe what happens for agt1. Agent agt1 goes to cycle ζ2 (on cell c2) if and only if it is in cell c1 at time t and
$$s_t(c'_1) \ge s_t(c_2). \tag{2}$$
This inequality relies on the EVAW agent behavior, which ensures that an agent always moves to its minimal neighbor cell. We therefore have to show that inequality (2) becomes true in finite time. The property that both agents visit c1 and c2 alternately infinitely often would be written $t_2 \le t_1 \le t_2 + l_2 \le t_1 + l_1 \le \cdots \le t_2 + k \cdot l_2 \le t_1 + k \cdot l_1$, where $t_2$ and $t_1$ are two reference visit dates (agt2 visiting c2 just before agt1 visits c1). This inequality obviously holds only if l1 = l2. Thus, there exist two dates $t_1$ of a visit of agt1 in c1 ($s_{t_1}(c_1) = t_1$) and $t_2$ of a visit of agt2 in c2 ($s_{t_2}(c_2) = t_2$) such that $t_2 \le t_1 < t_1 + l_1 < t_2 + l_2$. We can then write (using Equation 1):
$$s_{t_1}(c'_1) = t_1 - l_1 + 1,$$
$$s_{t_1+l_1}(c'_1) = (t_1 + l_1) - l_1 + 1 = t_1 + 1,$$
$$s_{t_1}(c_2) = s_{t_2}(c_2) = t_2 \text{ (because } t_1 < t_2 + l_2\text{), and}$$
$$s_{t_1+l_1}(c_2) = s_{t_2}(c_2) = t_2 \text{ (because } t_1 + l_1 < t_2 + l_2\text{).}$$
Then, at $t_1 + l_1$, we have (using Eq. 2):
$$s_{t_1+l_1}(c'_1) = t_1 + 1 > t_2 = s_{t_1+l_1}(c_2).$$
So, agt1 changes to cycle ζ2. ∎

Note that, as we take into account only cell c2, the previous result does not depend on the direction of agt2's walk. Another remark concerns the stability of n cycles created by n agents: the stability of the system can only be obtained if all cycles have the same length.
4.3 Stability of Equal Length Cycles
From now on we consider that l1 = l2. Will cycles ζ1 and ζ2 be maintained? We show that some patterns are fixed points and others are not. Let us start with an illustrated example. Figure 4 presents an environment in which two cycles have emerged, and that will persist, i.e. a fixed point was attained. Such a solution illustrates the emergence of an optimal patrolling with two agents. Fig. 4-b shows step 7 and Fig. 4-c shows step 15 (i.e. after one more turn). One can see that the difference of values between adjacent cells from one cycle to the next remains the same.

Figure 4. A fixed point composed of two cycles of equal length.

We show below that, under defined conditions, when agents converge to distinct cycles of equal length, the cycles will be stable.

Remark – When both cycles have the same length, an agent has a choice between two options (see Fig. 5) if and only if it sees not only the tail of its own cycle, but also the tail of the other agent's cycle. We will try to find out in which situations such a choice is possible by first studying a special case where both cycles are contiguous on half of their length, as depicted in Figures 6-a and 7-a. In this setting, we distinguish two cases depending on whether both agents run along their boundary in opposite or similar directions.

Figure 5. Two cycles of equal length that cannot be maintained.

Agents Going in Opposite Directions – Because the length of the boundary is half the length of their cycles, agt1 and agt2 meet each other at some point along this boundary. Then, they can either always end up on a couple of neighbouring cells (c1, c2), so that each remains on its own cycle (see Fig. 6-b), or they always "miss" each other, so that they both see each other's tail and have the choice to switch cycles or not (see Fig. 6-c). As a consequence, the agents have one chance out of two to have stable cycles.

Figure 6. Agents going in opposite directions along their boundary: a) general view, b) agents meeting, c) agents missing.

Agents Going in Similar Directions – Both agents "follow each other". In most cases the distance between agt1 and agt2 is different from 1, so that they never see each other's tail (Fig. 7-b) and remain stable. Otherwise, one agent (say agt1) is in front of the other (agt2) and may switch to agt2's cycle, which then has to find another path to follow (Fig. 7-c).

Figure 7. Agents going in similar directions along their boundary: a) general view, b) stable cycles, c) unstable cycles.

Non-Continuous Boundary – The same reasoning can be extended to more complex settings where the boundary is not made of a single segment as in the previous examples. Fig. 8-a shows two agents which have reached stable cycles whose boundary is made of five segments.

Beyond Two Agents – The same reasoning can also be extended to more than two agents by considering boundaries in pairs, as illustrated in Fig. 8-b.

Figure 8. Solutions with a) complex boundaries and b) more than two agents.

4.4 Shared Cycles
Up to now, cycles were distinct, meaning that each cell belonged to a single cycle. However, EVAW agents can also reach cycles where some cells are visited by different agents.

Common Cycle – We distinguish a first case where several agents cover a common cycle. Figure 9-a illustrates such a situation. Trivially, both agents describe a cycle with the same length as the other.

Figure 9. Solutions with a) a common cycle and b) two overlapping cycles.

Overlapping Cycles – A second case is that of agents whose cycles share only a subset of their cells. Experimentally, this case seems to appear more frequently than common cycles. Fig. 9-b gives an example of cycles overlapping on the central cell of the environment.
5 DISCUSSION
We have demonstrated that the obtained cycles can only stabilize if they have the same length. As a consequence, the EVAP algorithm ensures a balanced spatial distribution of agents in the environment. Indeed, the average and worst-case idlenesses are minimized, which is a desired property in the context of patrolling.
Wagner et al. [11] asked whether VAW1, when used with a single agent and in an environment allowing Hamiltonian cycles, can converge to a non-Hamiltonian cycle. Our experiments with EVAW raise the same question, as we never found a counterexample. It is interesting to note that, in a multi-agent setting, EVAW may reach suboptimal solutions when the environment is Hamiltonian (i.e. when it can be covered by a set of non-overlapping Hamiltonian sub-cycles). Yet the length of the resulting cycles is always close to the Hamiltonian one. We also observed the formation of optimal or close-to-optimal cycles in non-Hamiltonian environments. In this last case, some agents follow a path that crosses itself in order to extend it and ensure that all cycles have the same length.
Although we have proved that EVAW achieves the patrolling task (agents repeatedly visiting all cells), a theoretical proof that cycles are necessarily obtained is still missing. Furthermore, we plan in future work to study the mechanism leading systematically to an organization in cycles, even if the time to converge to a stable solution is huge. The objective is to possibly improve the algorithm so as to find better solutions, or to find good solutions faster.
Concerning a real implementation of EVAP and VAW0, both require that some computational entities be synchronized:
• the "smart cells" in the case of EVAP, and
• the mobile robots for VAW0.
Even though computations take place in different entities in each algorithm, both rely on digital marks (possibly based on sensor networks or future dust sensors) as a shared memory. Patrolling algorithms and pervasive technologies will have to evolve jointly so as to provide a real-world solution to the patrolling problem. Real-world settings will also add constraints such as limited resources, robot avoidance and human-robot interaction.
These algorithms should also be considered for offline path-planning: they are known to compare with state-of-the-art algorithms for finding Hamiltonian cycles in a graph [11]. It has also been shown experimentally in [3] that the number of agents asymptotically increases the performance up to a limit value. However, the robustness of the algorithms still needs to be demonstrated in the face of perturbations such as:
• dynamic changes in the graph, as studied in [13],
• asynchronicity between the cells or the robots' clocks,
• noisy observations and uncertain actions.
Under such perturbations, some theoretical questions remain open:
• Will EVAW always self-organize into a set of cycles?
• Could we compute a complexity bound for cycle formation?
• If EVAW does not converge to a set of cycles, is the patrolling still guaranteed?
• Could we bound the average/maximum idleness?
6
CONCLUSION
In this paper we investigated emergent behaviors occurring in ant-based algorithms defined for the multi-agent patrolling problem. Such theoretical results are still rare in the reactive MAS community. We presented and compared two similar algorithms: EVAP [3] and VAW0 [12]. We then introduced EVAW for practical reasons, using it both for theoretical and experimental studies. The main novelty of the paper is the theoretical study of the stability of the cycles generated by the algorithm. Whereas Wagner et al. only considered Hamiltonian cycles in a single-agent setting, we proved that, in the multi-agent case, only cycles of the same length can persist as limit cycles. We then identified patterns ensuring that several cycles of the same length will remain stable forever. We also presented and discussed different spatial self-organizations. In future work, we plan to generalize our results and continue the theoretical study of the emergent behaviors of EVAW. In particular, we want to go deeper into the analysis of the mechanisms underlying cycle formation. We also plan to work on experimental and theoretical bounds of the algorithm’s complexity. Concerning applications, we are currently experimenting with this algorithm on simulated drones involved in military base surveillance (SMAART DGA project).
REFERENCES
[1] A. L. Almeida, P. M. Castro, T. R. Menezes, and G. L. Ramalho, ‘Combining idleness and distance to design heuristic agents for the patrolling task’, in II Brazilian Workshop in Games and Digital Entertainment, pp. 33–40, (2003).
[2] R. Beckers, O. E. Holland, and J.-L. Deneubourg, ‘From local actions to global tasks: stigmergy and collective robotics’, in Artificial Life IV: Proc. of the 4th Int. Workshop on the Synthesis and the Simulation of Living Systems, MIT Press, (1994).
[3] H. Chu, A. Glad, O. Simonin, F. Sempe, A. Drogoul, and F. Charpillet, ‘Swarm approaches for the patrolling problem, information propagation vs. pheromone evaporation’, in ICTAI’07, IEEE International Conference on Tools with Artificial Intelligence, pp. 442–449, (2007).
[4] A. Colorni, M. Dorigo, and V. Maniezzo, ‘Distributed optimization by ant colonies’, in Proceedings of ECAL’91, European Conference on Artificial Life, pp. 134–142, Paris, (1991). Elsevier.
[5] A. Drogoul and J. Ferber, ‘From Tom Thumb to the dockers: some experiments with foraging robots’, in 2nd Int. Conf. on Simulation of Adaptive Behaviors, pp. 451–459, Honolulu, (1992).
[6] T. H. Labella, M. Dorigo, and J.-L. Deneubourg, ‘Division of labor in a group of robots inspired by ants’ foraging behavior’, ACM Transactions on Autonomous and Adaptive Systems, 1, 4–25, (2006).
[7] F. Lauri and F. Charpillet, ‘Ant colony optimization applied to the multi-agent patrolling problem’, in IEEE Swarm Intelligence Symposium, (2006).
[8] A. Machado, G. Ramalho, J.-D. Zucker, and A. Drogoul, ‘Multi-agent patrolling: an empirical analysis of alternative architectures’, in Third International Workshop on Multi-Agent Based Simulation, pp. 155–170, (2002).
[9] J. A. Sauter, R. Matthews, H. V. D. Parunak, and S. Brueckner, ‘Evolving adaptive pheromone path planning mechanisms’, in Proc. of AAMAS’02, pp. 434–440, (2002).
[10] J. A. Sauter, R. Matthews, H. V. D. Parunak, and S. Brueckner, ‘Performance of digital pheromones for swarming vehicle control’, in Proc. of AAMAS’05, pp. 903–910, (2005).
[11] I. Wagner and A. Bruckstein, ‘Hamiltonian(t) - an ant-inspired heuristic for recognizing hamiltonian graphs’, in Ant-Algorithms Session, CEC’99 / International Joint Conference on Neural Networks, (1999).
[12] I. Wagner, M. Lindenbaum, and A. Bruckstein, ‘Distributed covering by ant-robots using evaporating traces’, IEEE Transactions on Robotics and Automation, 15, 918–933, (1999).
[13] I. Wagner, M. Lindenbaum, and A. Bruckstein, ‘Ants: agents, networks, trees and subgraphs’, Future Generation Computer Systems Journal, 16(8), 915–926, (2000).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-631
631
Incremental Component-Based Construction and Verification of a Robotic System Ananda Basu1 and Matthieu Gallien2 and Charles Lesire2 and Thanh-Hung Nguyen1 and Saddek Bensalem1 and Félix Ingrand2 and Joseph Sifakis1 Abstract. Autonomous robots are complex systems that require the interaction/cooperation of numerous heterogeneous software components. Nowadays, robots are critical systems and must meet safety properties, including in particular temporal and real-time constraints. We present a methodology for modeling and analyzing a robotic system using the BIP component framework, integrated with an existing framework and architecture: the LAAS Architecture for Autonomous Systems, based on GenoM. The BIP componentization approach has been successfully used in other domains. In this study, we show how it can be seamlessly integrated into the preexisting methodology. We present the componentization of the functional level of a robot, the synthesis of an execution controller, as well as validation techniques for checking essential “safety” properties.
1
Introduction
A central idea in systems engineering is that complex systems are built by assembling components (building blocks). Components are systems characterized by an abstraction that is adequate for composition and re-use. It is possible to obtain large components by composing simpler ones. Component-based design confers many advantages, such as reuse of solutions, modular analysis and validation, reconfigurability, controllability, etc. Autonomous robots are complex systems that require the interaction/cooperation of numerous heterogeneous software components. They are critical systems, as they must meet safety properties including, in particular, temporal and real-time constraints. Component-based design relies on the separation between coordination and computation. Systems are built from units processing sequential code, insulated from concurrent execution issues. The isolation of coordination mechanisms allows a global treatment and analysis. One of the main limitations of the current state of the art is the lack of a unified paradigm for describing and analyzing the information flow between components. Such a paradigm would allow system designers and implementers to formulate their solutions in terms of tangible, well-founded and organized concepts, instead of using dispersed coordination mechanisms such as semaphores, monitors, message passing, remote calls, protocols, etc. It would in particular allow a comparison of otherwise unrelated architectural solutions, and could be a basis for evaluating them and deriving implementations in terms of specific coordination mechanisms. The designers of complex systems such as autonomous robots need scalable analysis techniques to guarantee essential properties
1 VERIMAG, CNRS/University Joseph Fourier, Grenoble, France
2 LAAS/CNRS, University of Toulouse, Toulouse, France
such as those mentioned above. To cope with complexity, these techniques are applied to component-based descriptions of the system. Global properties are enforced by construction or can be inferred from component properties. Furthermore, componentized descriptions provide a basis for reconfiguration and evolutivity. We present an incremental componentization methodology and technique which seamlessly integrate with the already existing LAAS architecture for autonomous robots. The methodology considers that the global system architecture can be obtained as the hierarchical composition of larger components from a small set of classes of atomic components. Atomic components are units processing sequential code that offer interactions through their interface. The technique is based on the use of the Behavior-Interaction-Priority (BIP) [2] component framework, which encompasses incremental composition of heterogeneous real-time components. The main contributions of the paper include:
• A methodology for componentizing and architecting autonomous robot systems, applied to the existing LAAS architecture.
• Composition techniques for organizing and enforcing complex event-based interactions using the BIP framework.
• Validation techniques for checking essential properties, including scalable compositional techniques relying on the analysis of the interactions between components.
The paper is structured as follows. In Section 2 we illustrate, with a real example, the preexisting architecture (based on GenoM [6]) of autonomous robotic software developed at LAAS. From this architecture, we identify the atomic components used for the componentization of the robot software in BIP. Section 3 provides a succinct description of the BIP component framework. Section 4 presents a methodology for building the BIP model of existing GenoM functional modules and their integration with the rest of the software. Controller synthesis results as well as “safety” property analyses are also presented. Section 5 concludes the paper with a state of the art, an analysis of the current results and future work directions.
2
Modular Architecture for Autonomous Systems
At LAAS, researchers have developed a framework, a global architecture, that enables the integration of processes with different temporal properties and different representations. This architecture decomposes the robot system into three main levels, having different temporal constraints and manipulating different data representations [1]. It is used on a number of robots (e.g. DALA, an iRobot ATRV) and is shown in Fig. 1. The levels in this architecture are:
Figure 1. An instance of the LAAS architecture for the DALA Robot.
• a functional level: it includes all the basic built-in robot action and perception capacities. These processing functions and control loops (e.g., image processing, obstacle avoidance, motion control, etc.) are encapsulated into controllable communicating modules developed using GenoM3. Each module provides services, which can be activated by the decisional level according to the current tasks, and posters, containing data produced by the module for others (modules or the decisional level) to use.
• a decisional level: this level includes the capacities of producing the task plan and supervising its execution, while being at the same time reactive to events from the functional level.
• an execution control level, at the interface between the decisional and the functional levels, which controls the proper execution of the services according to safety constraints and rules, and prevents functional modules from unforeseen interactions leading to catastrophic outcomes. In recent years, we have used the R2C [14] to play this role, yet it was programmed on top of the existing functional modules, controlling their service execution and interactions, but not the internal execution of the modules themselves.
The organization of the overall system in layers, and of the functional level in modules, is definitely a plus with respect to ease of integration and reusability. Yet an architecture and some tools are not “enough” to warrant the sound and safe behavior of the overall system. In this paper, the componentization method we propose allows us to synthesize a controller for the overall execution of all the functional modules, and enforces by construction the constraints and rules between the various functional modules. Hence, the ultimate goal of this work is to implement both the current functional level and the execution control level with BIP.
Figure 2. A GenoM module organization.
2.1
GenoM Functional Modules
Each module of the LAAS architecture functional level is responsible for a function of the robot. Complex modalities (such as navigation) can be obtained by having modules “work” together. For example, in Fig. 1 (which only shows the data flow of the functional level), there is an explicit periodic processing loop. The module Laser RF acquires the laser range finder data and stores it in the poster Scan, from which Aspect builds the obstacle map Obs. The module NDD (responsible for the navigation) avoids these obstacles while periodically producing a Speed reference to reach a given target from the current position Pos produced by POM. Finally, this Speed reference is used by RFLEX, which controls the speed of the robot’s wheels and also produces the odometry position used by POM to generate the current position.4
All these modules are built using a unique generic canvas (Fig. 2), which is then instantiated for a particular robot function. Each module can execute several services started upon upper-level requests. The module can send information relative to the executed requests to the client (such as the final report), or share data with other modules using posters. E.g., the NDD module provides six services corresponding to initializations of the navigation algorithm (SetParams, SetDataSource and SetSpeed), launching and stopping the path computation toward a given goal (Stop and GoTo), and a permanent service (Permanent). To execute this path, NDD exports the Speed poster, which contains the speed reference.
The services are managed by a control task responsible for launching corresponding activities within execution tasks. Control and execution tasks share data using internal data structures (IDS). Moreover, execution tasks have periods in which the several associated activities are scheduled. Periods need not have a fixed length if some services are aperiodic.
Fig. 3 presents the automaton of an activity. Activity states correspond to the execution of particular elementary code (codels) available through libraries, dedicated either to initializing some parameters (START state), to executing the activity (EXEC state), or to safely ending the activity, leading to resetting parameters, sending error signals, etc.
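As a reading aid, the module organization just described can be summarized by a small data model; the class and field names below are illustrative assumptions of ours, not GenoM’s actual API.

```python
# Hedged sketch of the GenoM module organization described above
# (control task, services, posters, IDS). Names are illustrative.
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class Poster:
    name: str                      # e.g. "Speed", readable by other modules
    data: object = None

@dataclass
class Service:
    name: str                      # e.g. "GoTo", started upon a request
    activity: Callable[[], None]   # the codels run within an execution task

@dataclass
class Module:
    name: str
    services: Dict[str, Service] = field(default_factory=dict)
    posters: Dict[str, Poster] = field(default_factory=dict)
    ids: dict = field(default_factory=dict)   # internal data structures

    def request(self, service_name: str) -> None:
        """Control task: launch the activity of the requested service."""
        self.services[service_name].activity()

ndd = Module("NDD")
ndd.posters["Speed"] = Poster("Speed")

def goto_activity() -> None:
    # would periodically compute a speed reference and export it
    ndd.posters["Speed"].data = 0.5

ndd.services["GoTo"] = Service("GoTo", goto_activity)
ndd.request("GoTo")   # the decisional level activates a service
```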
3 The GenoM tool can be freely downloaded from: http://softs.laas.fr/openrobots/wiki/genom

3
The BIP Component Framework
BIP [2] is a software framework for modeling heterogeneous real-time components. The BIP component model is the superposition of three layers: the lower layer describes the behavior of a component as a set of transitions (i.e. a finite state automaton extended with
4 This particular setup will serve as an example throughout the rest of the paper.
5 The BIP tool-set can be downloaded from: http://www-verimag.imag.fr/~async/BIP/bip.html
Figure 3. Execution automaton of an activity.

Figure 4. An example of an atomic component in BIP.
data); the intermediate layer includes connectors describing the interactions between transitions of the layer underneath; the upper layer consists of a set of priority rules used to describe scheduling policies for interactions. Such a layering offers a clear separation between component behavior and the structure of a system (interactions and priorities).
BIP allows the hierarchical construction of compound components from atomic ones by using connectors and priorities. An atomic component consists of a set of ports used for synchronization with other components, a set of transitions and a set of local variables. Transitions describe the behavior of the component. They are represented as a labeled relation between control states. Fig. 4 shows an example of an atomic component with two ports in, out, variables x, y, and control states empty, full. At control state empty, the transition labeled in is possible if 0 < x. When an interaction through in takes place, the variable x is eventually modified and a new value for y is computed. From control state full, the transition labeled out can occur.
Connectors specify the interactions between the atomic components. A connector consists of a set of ports of the atomic components which may interact. If all the ports of a connector are incomplete, then synchronization is by rendezvous: only one interaction is possible, the interaction including all the ports of the connector. If a connector has one complete port, then synchronization is by broadcast: the complete port may synchronize with the other ports of the connector, and the possible interactions are the non-empty sublists containing this complete port. Connectors thus define the feasible interactions and in particular model the two basic modes of synchronization, rendezvous and broadcast. Priorities in BIP are a set of rules used to filter interactions amongst the feasible ones.
The model of a system is represented as a BIP compound component which defines new components from existing components (atoms or compounds) by creating their instances, specifying the connectors between them and the priorities. The BIP framework consists of a language and a toolset, including a front-end for editing and parsing BIP programs and a dedicated platform for model validation. The platform consists of an engine and a software infrastructure for executing simulation traces of models. It also allows state-space exploration and provides access to model-checking tools like Evaluator [10]. This makes it possible to validate BIP models and ensure that they meet properties such as deadlock-freedom, state invariants and schedulability. The back-end, which is the BIP engine, has been entirely implemented in C++ on Linux to allow a smooth integration of components with behavior expressed using plain C/C++ code.

Figure 5. BIP model of a service.
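To fix ideas, the following is a minimal sketch of the Fig. 4 atomic component as a guarded transition system, together with a rendezvous check; the encoding (dictionaries for transitions, a simple all-ports-enabled test) is our own illustrative assumption, not the BIP language itself.

```python
# Hedged sketch of the Fig. 4 atomic component as a guarded transition
# system. This encoding is illustrative; it is not the BIP language.

class Atom:
    def __init__(self):
        self.state = "empty"
        self.x, self.y = 1, 0
        # port -> (source state, guard, update, target state)
        self.transitions = {
            "in":  ("empty", lambda: 0 < self.x, self._do_in,  "full"),
            "out": ("full",  lambda: True,       lambda: None, "empty"),
        }

    def _do_in(self):
        self.y = self.x + 1          # y := f(x); f chosen arbitrarily here

    def enabled(self, port):
        src, guard, _, _ = self.transitions[port]
        return self.state == src and guard()

    def fire(self, port):
        src, guard, update, dst = self.transitions[port]
        assert self.state == src and guard()
        update()
        self.state = dst

def rendezvous(connector_ports, atoms):
    """A connector of incomplete ports: the interaction fires only if
    every involved port is enabled (strong synchronization)."""
    if all(a.enabled(p) for a, p in zip(atoms, connector_ports)):
        for a, p in zip(atoms, connector_ports):
            a.fire(p)

a = Atom()
rendezvous(["in"], [a])    # fires: empty -> full, y computed
rendezvous(["out"], [a])   # fires: full -> empty
```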
4
The Functional Layer in BIP
The LAAS architecture makes use of a generic module for its functional layer. If we model this generic module and its components in BIP, and if we then instantiate it and connect the existing “codels” to the resulting component, then we have a BIP model of the GenoM modules. Adding the BIP model of the interactions between the modules then gives us a BIP model of the overall functional layer. In order to formalize the componentization approach, we propose the following mapping (+ for one component or more, and . for composing components):
functional level ::= (module)+
module ::= (service)+ . (execution task) . (poster)+
service ::= (service controller) . (activity)
execution task ::= (timer) . (scheduler activity)
As shown in Fig. 5, a component modeling a generic Service is obtained by composing the atomic components service controller and activity; a sketch of this composition is given below. The left sub-component represents the execution task of a service. It is launched by synchronization through port trigger. The service controller then controls the validity of the parameters of the request (if available) and will either reject the request or start the activity by synchronizing with the activity component (right sub-component). In each state, the status of the execution task is available by synchronizing through port status. The activity will then wait for execution (i.e. synchronization on the exec port with the control task) and will either safely end, fail, or abort. Each of the transitions control, start, exec, fail, finish and inter may call an external function. The service components are further composed with execution task and poster components to obtain a module component, as shown in Fig. 6.
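As an illustration of the mapping above, here is a minimal sketch of a Service compound built from a controller and an activity synchronizing on a shared start port; the states, port names and the stepping loop are simplifying assumptions of ours, not the exact BIP model of Fig. 5.

```python
# Hedged sketch: a Service as the composition of a service controller and
# an activity, synchronized on "start" (rendezvous). Simplified on purpose.

class Controller:
    def __init__(self):
        self.state = "idle"
    def trigger(self, params_ok=True):
        # check the request parameters, then either reject or offer "start"
        self.state = "starting" if params_ok else "idle"
    def offers_start(self):
        return self.state == "starting"
    def started(self):
        self.state = "running"

class Activity:
    def __init__(self):
        self.state = "idle"
    def offers_start(self):
        return self.state == "idle"
    def start(self):
        self.state = "exec"
    def finish(self):
        self.state = "ended"

class Service:
    """Compound component: the start rendezvous fires only when both
    sub-components offer their start port."""
    def __init__(self):
        self.controller, self.activity = Controller(), Activity()
    def step(self):
        if self.controller.offers_start() and self.activity.offers_start():
            self.controller.started()
            self.activity.start()

svc = Service()
svc.controller.trigger(params_ok=True)   # a request arrives
svc.step()                               # the start rendezvous fires
svc.activity.finish()
```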
4.1
A Functional Module in BIP
The full BIP description of the functional level of the robot, which consists of several modules, is beyond the scope of this paper. We rather focus on the modeling of the NDD module. The NDD module contains six services, a poster and a control task as sub-components, together with the connectors between them, as shown in Fig. 8. The control task wakes up periodically (managed by the bottom-left component with alternating sleep and trigger transitions) and always triggers the Permanent service at the beginning of each period.
Figure 6. A componentized GenoM module.
4 EXPERIMENTAL RESULTS
We used the CLIPS [3] production rule engine in order to apply thirteen entailments over the LUBM [11] university ontology. Five extensional datasets Di were generated, each one of approximately 12,000 triples. Table 2 depicts the time needed to apply the dynamic and the generic rules over different dataset sizes. The dynamic approach generates about 300 rules and, despite the great number of rules, the ABOX reasoning procedure terminates considerably faster than with the generic approach, where only 13 rules are applied.

Table 2. Dynamic and generic ABOX reasoning times.

Triples   Dynamic (sec)   Generic (sec)
12,000    36.750          86.063
24,000    61.078          167.000
36,000    84.797          255.859
48,000    107.406         393.109
60,000    129.719         512.312
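To illustrate why the dynamic rules are faster, consider a hedged Python rendering of the rdfs7 entailment (subproperty propagation): the generic rule must join the ABOX against the TBOX on every match, whereas a rule generated from a known TBOX fact has a single condition. The triple encoding and function names are our own assumptions, not the paper’s CLIPS rules.

```python
# Hedged sketch: generic vs. dynamically generated ABOX rule for the
# rdfs7 entailment (if p subPropertyOf q and p(x,y) then q(x,y)).
# Triple encoding is illustrative.

tbox = {("hasHead", "rdfs:subPropertyOf", "worksFor")}
abox = {("alice", "hasHead", "bob")}

def generic_rdfs7(tbox, abox):
    # two conditional elements: joins ABOX triples against the TBOX
    return {(x, q, y)
            for (x, p, y) in abox
            for (p2, _, q) in tbox if p2 == p}

def make_dynamic_rdfs7(tbox):
    # the TBOX is compiled away: one single-condition rule per
    # subproperty fact, so a rule engine avoids the join at run time
    rules = [(p, q) for (p, _, q) in tbox]
    def apply_rules(abox):
        return {(x, q, y)
                for (x, p, y) in abox
                for (p2, q) in rules if p2 == p}
    return apply_rules

# Both derive ("alice", "worksFor", "bob"):
assert generic_rdfs7(tbox, abox) == make_dynamic_rdfs7(tbox)(abox)
```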
5 RELATED WORK
To the best of our knowledge, the existing rule-based reasoners that use entailments follow the generic methodology, that is, both the TBOX and the ABOX entailments are generic and ontology-independent. SweetProlog [9], Jena [12] and OWLIM [8] are some example systems that are based on general-purpose rule engines, e.g. Prolog, or on rule engines built from scratch, such as the TRREE engine of OWLIM. Notice that the default Jena rule engine for OWL reasoning is a hybrid implementation, using forward-chaining rules in order to generate backward-chaining rules.
6 CONCLUSIONS
In this paper we presented a methodology for performing rule-based OWL reasoning based on generic TBOX and dynamic ABOX entailment rules. In this way, we are able to use the TBOX rules as the basis for generating domain-dependent ABOX inferencing rules. The main characteristic of these rules is that they join fewer conditional elements in their body, achieving better activation times in rule engines than their corresponding generic entailments. Currently we are working on combining a rule engine with a DL reasoner in order to dynamically generate ABOX inferencing rules based on the inferencing capabilities of the DL paradigm.
ACKNOWLEDGEMENTS
This work was partially supported by a PENED program (EPAN M.8.3.1, No. 03ΕΔ73), jointly funded by the European Union and the Greek Government (General Secretariat of Research and Technology/GSRT).
REFERENCES [1] G. Antoniou, C.V. Damasio, B. Grosof, I. Horrocks, M. Kifer, J. Maluszynski, P.F. Patel-Schneider, Combining Rules and Ontologies. A Survey, Reasoning on the Web with Rules and Semantics, REWERSE Deliverables, 2005. [2] F. Baader, U. Sattler, An Overview of Tableau Algorithms for Description Logics, Studia Logica, vol. 69, pp. 5-40, 2001 [3] CLIPS, http://www.ghg.net/clips [4] B. Grosof, I. Horrocks, R. Volz, S. Decker, Description logic programs: Combining logic programs with description logics, WWW 2003, pp. 48–57. ACM, 2003. [5] P. Hitzler, J. Angele, B. Motik, R. Studer, Bridging the Paradigm Gap with Rules for OWL. In Proc. of the W3C Workshop on Rule Languages for Interoperability, Washington, USA, 2005 [6] I. Horrocks, P.F. Patel-Schneider, A Proposal for an OWL Rules Language, 13th Int. WWW Conf., ACM, New York (2004) [7] H.J. Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, Journal of Web Semantics, vol. 3, pp. 79-115, 2005 [8] A. Kiryakov, D. Ognyanov, D. Manov, OWLIM - a Pragmatic Semantic Repository for OWL, Proc. Workshop Scalable Semantic Web Knowledge Base Systems, USA, 2005 [9] L. Laera, V. Tamma, T.B. Capon, G. Semeraro, SweetProlog: A System to Integrate Ontologies and Rules, Rules and Rule Markup Languages for the Semantic Web, 2004. [10] A.Y. Levy, M.-C. Rousset, Combining Horn rules and description logics in CARIN, Artificial Intelligence, 104(1-2), 165–209 (1998). [11] Y. Guo, Z. Pan, J. Heflin, LUBM: A Benchmark for OWL Knowledge Base Systems, Journal of Web Semantics, 3(2), pp. 158-182, 2005 [12] B. McBride, Jena, Implementing the RDF Model and Syntax Specification, 2nd International Workshop on the Semantic Web, Hong Kong, China, 2001 [13] B. Motik, I. Horrocks, R. Rosati, U. Sattler, Can OWL and Logic Live Together Happily Ever After?, Proc. 5th ISWC, Athens, USA, 2006 [14] R. Rosati, On the decidability and complexity of integrating ontologies and rules, Web Semantics: Science, Services and Agents on the World Wide Web, vol. 3(1), pp. 61-73, July 2005. [15] D. Tsarkov, A. Riazanov, S. Bechhofer, I. Horrocks, Using Vampire to reason with OWL, International Semantic Web Conference, pp. 471-485, 2004. [16] Web Ontology Language - OWL, http://www.w3.org/2004/OWL/
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-733
Computability and Complexity Issues of Extended RDF
Anastasia Analyti1 and Grigoris Antoniou1,2 and Carlos Viegas Damásio3 and Gerd Wagner4
Abstract. ERDF stable model semantics is a recently proposed semantics for ERDF ontologies and a faithful extension of RDFS semantics on RDF graphs. Unfortunately, ERDF stable model semantics is in general undecidable. In this paper, we elaborate on the computability and complexity issues of the ERDF stable model semantics.
1
Introduction
Rules constitute the next layer over the ontology languages of the Semantic Web, allowing arbitrary interaction of variables in the head and body of the rules. In [1], the Semantic Web language RDFS [4] is extended to accommodate the two negations of Partial Logic, namely weak negation ∼ (expressing negation-as-failure or non-truth) and strong negation ¬ (expressing explicit negative information or falsity), as well as derivation rules. The new language is called Extended RDF (ERDF). In [1], the stable model semantics of ERDF ontologies is developed, based on Partial Logic, extending the model-theoretic semantics of RDFS. Intuitively, an ERDF ontology is the combination of (i) an ERDF graph G containing (implicitly existentially quantified) positive and negative information, and (ii) an ERDF program P containing derivation rules, with possibly all connectives ∼, ¬, ⊃, ∧, ∨, ∀, ∃ in the body of a rule, and strong negation ¬ in the head of a rule. ERDF enables the combination of closed-world (non-monotonic) and open-world (monotonic) reasoning in the same framework, through the presence of weak negation (in the body of the rules) and the new metaclasses erdf:TotalProperty and erdf:TotalClass, respectively. In [1], it is shown that stable model entailment conservatively extends RDFS entailment from RDF graphs to ERDF ontologies. Unfortunately, satisfiability and entailment under the ERDF stable model semantics are in general undecidable. In this paper, we elaborate on the computability and complexity issues of the ERDF stable model semantics. Additionally, we propose a slightly modified semantics on ERDF ontologies, called the ERDF #n-stable model semantics, that is also a faithful extension of RDFS semantics on RDF graphs and achieves decidability.
2
Stable Model Semantics of ERDF Ontologies
In this Section, we briefly review ERDF ontologies and their stable model semantics. Details and examples can be found in [1]. A (Web) vocabulary V is a set of URI references and/or literals (plain or typed). We denote the set of all URI references by URI.
1 Institute of Computer Science, FORTH-ICS, Crete, Greece, e-mail: analyti@ics.forth.gr
2 Department of Computer Science, University of Crete, Greece
3 CENTRIA, Departamento de Informatica, Faculdade de Ciencias e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal
4 Inst. of Informatics, Brandenburg Univ. of Technology at Cottbus, Germany
We consider a set of variable symbols Var such that URI, Var, and the set of literals are pairwise disjoint. In our examples, variable symbols are prefixed by “?”. Let V be a vocabulary. An ERDF triple over V is an expression of the form p(s, o) or ¬p(s, o), where s, o ∈ V ∪ Var are called subject and object, respectively, and p ∈ V ∩ URI is called property. An ERDF graph G is a set of ERDF triples over some vocabulary V. We denote the variables appearing in G by Var(G), and the set of URI references and literals appearing in G by VG. Let V be a vocabulary. We denote by L(V) the smallest set that contains the ERDF triples over V and is closed with respect to the following conditions: if F, G ∈ L(V) then {∼F, F ∧ G, F ∨ G, F ⊃ G, ∃xF, ∀xF} ⊆ L(V), where x ∈ Var. An ERDF formula over V is an element of L(V). Intuitively, an ERDF graph G represents an existentially quantified conjunction of ERDF triples. Specifically, let G = {t1, ..., tm} be an ERDF graph, and let Var(G) = {x1, ..., xk}. Then, G represents the ERDF formula formula(G) = ∃?x1, ..., ∃?xk t1 ∧ ... ∧ tm. Existentially quantified variables in ERDF graphs are handled by skolemization. Let G be an ERDF graph. The skolemization function of G is a 1:1 mapping skG : Var(G) → URI, where for each x ∈ Var(G), skG(x) is an artificial URI, denoted by G:x. The skolemization of G, denoted by sk(G), is the ground ERDF graph derived from G after replacing each x ∈ Var(G) by skG(x). An ERDF rule r over a vocabulary V is an expression of the form Concl(r) ← Cond(r), where Cond(r) ∈ L(V) ∪ {true} and Concl(r) is an ERDF triple or false. An ERDF program is a set of ERDF rules. We denote the set of URI references and literals appearing in P by VP. An ERDF ontology is a pair O = ⟨G, P⟩, where G is an ERDF graph and P is an ERDF program. The vocabulary of RDF, VRDF, is a set of URI references in the rdf: namespace [4]. The vocabulary of RDFS, VRDFS, is a set of URI references in the rdfs: namespace [4]. The vocabulary of ERDF is defined as VERDF = {erdf:TotalClass, erdf:TotalProperty}. Intuitively, instances of the metaclass erdf:TotalClass are classes c that satisfy totalness, meaning that, at the interpretation level, each statement rdf:type(x, c) is either true or explicitly false. Similarly, instances of the metaclass erdf:TotalProperty are properties p that satisfy totalness, meaning that, at the interpretation level, each statement p(x, y) is either true or explicitly false. Let O = ⟨G, P⟩ be an ERDF ontology. The vocabulary of O is defined as VO = Vsk(G) ∪ VP ∪ VRDF ∪ VRDFS ∪ VERDF. In [1], the set of (ERDF) stable models of O is defined, denoted by Mst(O). Each stable model M of O (i) interprets the terms in VO, and (ii) assigns intended truth and falsity extensions to the classes and properties in VO (satisfying all semantic conditions of an RDFS interpretation [4] on VO, as well as new semantic conditions particular to ERDF). M is generated through a sequence of steps. Intuitively,
734
A. Analyti et al. / Computability and Complexity Issues of Extended RDF
starting from an intended interpretation of sk(G), a stratified sequence of rule applications is produced, where all applied rules remain applicable throughout the generation of the stable model M. Let M ∈ Mst(O) and let F be an ERDF formula or an ERDF graph. In [1], the model relation M |= F is defined. We say that O entails F under the (ERDF) stable model semantics, denoted O |=st F, iff for all M ∈ Mst(O), M |= F.
As an example, consider a class ex:Wine whose instances are wines and a property ex:likes(X, Y) indicating that person X likes wine Y. Assume now that we want to select wines for a dinner such that, for each guest, there is on the table exactly one wine that she/he likes. Let the class ex:Guest indicate the persons that will be invited to the dinner and let the class ex:SelectedWine indicate the wines chosen to be served. An ERDF program P that describes this wine selection problem is the following5,6:
id(?x, ?x) ← true.
rdf:type(?y, SelectedWine) ← rdf:type(?x, Guest), rdf:type(?y, Wine), likes(?x, ?y), ∀?z (rdf:type(?z, SelectedWine), ∼id(?y, ?z) ⊃ ∼likes(?x, ?z)).
Consider now the ERDF graph G, containing the factual information:
G = {rdf:type(Carlos, Guest), rdf:type(Gerd, Guest), rdf:type(Riesling, Wine), rdf:type(Retsina, Wine), likes(Gerd, Riesling), likes(Gerd, Retsina), likes(Carlos, Retsina)}.
Then, the ERDF ontology O = ⟨G, P⟩ has only one stable model M, for which it holds that M |= rdf:type(Retsina, SelectedWine) ∧ ∼rdf:type(Riesling, SelectedWine). This is because (i) both Gerd and Carlos like Retsina and (ii) Carlos likes only Retsina. Obviously, O |=st rdf:type(Retsina, SelectedWine) ∧ ∼rdf:type(Riesling, SelectedWine).
Proposition 2.1 Let G, G′ be RDF graphs such that VG ∩ VERDF = ∅, VG′ ∩ VERDF = ∅, and VG′ ∩ skG(Var(G)) = ∅. It holds: G |=RDFS G′ iff ⟨G, ∅⟩ |=st G′.
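The unique stable model of this example can be reproduced with a small brute-force check; the encoding below (guests mapped to the wines they like, and the “exactly one selected liked wine per guest” test taken from the informal problem statement) is our own approximation, not the stable-model construction of [1].

```python
# Hedged sketch: enumerate wine selections and keep those satisfying the
# wine-selection requirement stated above (exactly one liked selected
# wine per guest). Encoding ours.
from itertools import combinations

guests = {"carlos": {"retsina"}, "gerd": {"riesling", "retsina"}}
wines = {"riesling", "retsina"}

def acceptable(sel):
    # every guest must like exactly one wine of the selection
    return all(len(liked & sel) == 1 for liked in guests.values())

models = [set(c) for r in range(len(wines) + 1)
          for c in combinations(sorted(wines), r) if acceptable(set(c))]
print(models)   # [{'retsina'}] -- matching the unique stable model above
```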
3
Computability and Complexity Issues
In [1], it is shown that satisfiability and entailment under the ERDF stable model semantics are in general undecidable. The proof of undecidability exploits a reduction from the unbounded tiling problem, the existence of a solution to which is known to be undecidable [2]. Note that since each constraint false ← F that appears in an ERDF ontology O can be replaced by the rule ¬t ← F, where t is an RDF, RDFS, or ERDF axiomatic triple, the presence of constraints in O does not affect decidability. An ERDF formula F is called simple if it has the form t1 ∧ ... ∧ tk ∧ ∼tk+1 ∧ ... ∧ ∼tm, where each ti, i = 1, ..., m, is an ERDF triple. An ERDF program P is called simple if for all r ∈ P, Cond(r) is a simple ERDF formula or true. An ERDF ontology O = ⟨G, P⟩ is called simple if P is a simple ERDF program. A simple ERDF ontology O (resp. ERDF program P) is called objective if no weak negation appears in O (resp. P). The reduction in [1] shows that ERDF stable model satisfiability and entailment remain undecidable, even if (i) O = ⟨G, P⟩ is a simple ERDF ontology, and (ii) the terms erdf:TotalClass and erdf:TotalProperty do not appear in O (i.e., (VG ∪ VP) ∩ VERDF = ∅). However, we will show that satisfiability and entailment under the ERDF stable model semantics are decidable if (i) O is an objective ERDF ontology, and (ii) the entailed formula is an ERDF d-formula.
5 To improve readability, we ignore the example namespace ex:.
6 Commas “,” in the body of the rules indicate conjunction ∧.
Let F be an ERDF formula. We say that F is an ERDF d-formula iff (i) F is a disjunction of existentially quantified conjunctions of ERDF triples, and (ii) FVar(F) = ∅. For example, let F = (∃?x rdf:type(?x, Vertex) ∧ rdf:type(?x, Red)) ∨ (∃?x rdf:type(?x, Vertex) ∧ ¬rdf:type(?x, Blue)). Then, F is an ERDF d-formula. It is easy to see that if G is an ERDF graph then formula(G) is an ERDF d-formula.
Proposition 3.1 Let G, G′ be ERDF graphs, let P be an objective ERDF program, let Fd be an ERDF d-formula, and let F be an ERDF formula.
1. The problem of establishing whether O = ⟨G, P⟩ has a stable model is NP-complete w.r.t. (|P| + 1) ∗ (|Vsk(G)| + |VP|).
2. The problems of establishing whether: (i) ⟨G, P⟩ |=st G′, (ii) ⟨G, P⟩ |=st Fd, and (iii) ⟨G, P⟩ |=st F, where P = ∅, are co-NP-complete w.r.t. (|P| + 1) ∗ (|Vsk(G)| + |VP|).
The hardness part of the above complexity results can be proved by a reduction from the Graph 3-Colorability problem, a classical NP-complete problem. Moreover, membership of the above problems in NP or co-NP can be proved by showing that, of the infinite set of rdf:_i terms (i ∈ IN), only a finite subset needs to be considered for solving the corresponding problem. The following proposition shows that even if O = ⟨G, P⟩ is an objective ERDF ontology, entailment of a general ERDF formula F under the ERDF stable model semantics is still undecidable. This result can also be proved by a reduction from the unbounded tiling problem [2].
Proposition 3.2 Let G be an ERDF graph, let P be an objective program, and let F be an ERDF formula. The problem of establishing whether ⟨G, P⟩ |=st F is in general undecidable.
Let O be an ERDF ontology (with weak negation possibly appearing in the program rules). The source of undecidability of the ERDF stable model semantics of O is the fact that VRDF is infinite. Thus, the vocabulary of O is also infinite (note that {rdf:_i | i ≥ 1} ⊆ VRDF ⊆ VO). Therefore, we slightly modify the definition of the ERDF stable model semantics, based on a redefinition of the vocabulary of an ERDF ontology, which now becomes finite. We call the modified semantics the ERDF #n-stable model semantics (for n ∈ IN). Let n ∈ IN and VO#n = VO − {rdf:_i | i > n}. We define the ERDF #n-stable model semantics of O similarly to the ERDF stable model semantics of O, but now only the interpretation of the terms in VO#n is considered. The ERDF #n-stable model semantics also extends RDFS entailment from RDF graphs to ERDF ontologies. Query answering under the ERDF #n-stable model semantics is decidable. Moreover, if O is a simple ERDF ontology then query answering under the ERDF #n-stable model semantics reduces to query answering under the answer set semantics [3] for an extended logic program Π^{#n}_O. Finally, we would like to mention that the complexity results of Proposition 3.1 also hold for the ERDF #n-stable model semantics.
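The decidability device is just a truncation of the infinite container-membership vocabulary; a minimal sketch of VO#n, with an illustrative URI encoding of the rdf:_i terms:

```python
# Hedged sketch of the #n vocabulary truncation described above: V_O is
# infinite only because of the rdf:_i terms, so keeping rdf:_i for i <= n
# yields the finite vocabulary V_O^{#n}. Encoding is illustrative.

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def is_container_membership(term):
    return term.startswith(RDF_NS + "_")

def truncate(vocabulary, n):
    """Drop rdf:_i terms with i > n; everything else is kept."""
    kept = set()
    for t in vocabulary:
        if is_container_membership(t):
            i = int(t[len(RDF_NS) + 1:])
            if i > n:
                continue
        kept.add(t)
    return kept

v = {RDF_NS + "type", RDF_NS + "_1", RDF_NS + "_2", RDF_NS + "_42"}
print(sorted(truncate(v, 2)))   # rdf:type, rdf:_1 and rdf:_2 survive
```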
REFERENCES
[1] A. Analyti, G. Antoniou, C. V. Damásio, and G. Wagner. Extended RDF as a Semantic Foundation of Rule Markup Languages. Journal of Artificial Intelligence Research (JAIR), 32:37–94, 2008.
[2] R. Berger. The Undecidability of the Domino Problem. Memoirs of the American Mathematical Society, 66:1–72, 1966.
[3] M. Gelfond and V. Lifschitz. Logic Programs with Classical Negation. In ICLP’90, pages 579–597, 1990.
[4] P. Hayes. RDF Semantics. W3C Recommendation, 10 February 2004. Available at http://www.w3.org/TR/2004/REC-rdf-mt-20040210/.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-735
Automated Web Services Composition Using Extended Representation of Planning Domain
Mohamad El Falou1 and Maroua Bouzid1 and Abdel-Illah Mouaddib1 and Thierry Vidal2
1
INTRODUCTION
Web services (WS) are distributed software components that can be exposed and invoked over the Internet using standard protocols. They communicate with their clients and with other WS by sending XML-based messages over the Internet. Artificial Intelligence planning techniques can help solve the WS composition problem: services can be modelled as actions, and the business process as a planning problem connecting the WS. The main contribution of this paper is the extension of the model of actions to handle the creation or elimination of objects as effects of actions. This contribution allows us to answer new and more expressive requests, called implicit requests, in which goals may contain objects that have been generated by the plan.
2
Related Works
The work on web service composition at the University of Trento presented in [5] translates WS into state transition systems (STSs). After translating the WS, the system constructs a parallel product which combines the n STSs; this parallel product allows the n services to evolve concurrently. They use the Model Based Planner MBP [1], based on model checking techniques [6]. The drawback of this approach is that the parallel product must be recalculated whenever we add or remove a service from the domain. In [3] an approach called GOLOG, based on the situation calculus, is presented. GOLOG composes web services by applying logical inference techniques on pre-defined plan templates. Finally, in [7] the authors define a translation from DAML-S process models to SHOP2 domains, and from DAML-S composition tasks to SHOP2 planning problems. SHOP2 is a planner well suited to working with the Process Model in a Hierarchical Task Network (HTN) setting. HTN planning builds plans through task decomposition. All the approaches cited above suppose that the domain objects are static: there is no way to create or eliminate objects. Furthermore, all defined requirements for the composite Web Services are defined as explicit queries.
3
Motivating Example
Let us consider a set of WS which are intended to deal with files, images and tracks, as follows:
1. WS1 translates file languages. It has two services: fr2en (en2ar), which translates files from French (English) to English (Arabic).
1 University of Caen, France, email: melfalou, bouzid, mouaddib@info.unicaen.fr
2 IRISA - INRIA Rennes, France, email: thierry.vidal@irisa.fr
2. WS2 transforms text file formats. It has two services: latex2doc (doc2pdf), which transforms files from latex (doc) to doc (pdf) format.
3. WS3 merges files. It has two services: mergepdf (mergedoc), which merges two pdf (doc) files into a third one.
As an example, suppose that we have two files: the first in doc format written in English, the second in latex format written in French, and we want to obtain a file which contains the content of the two files translated into Arabic. The existing approaches dedicated to WS composition cannot express or deal with this kind of problem. To overcome this limitation, we propose an approach where the specification language of the domain is an extension of the specification language PDDL [4], and the WS composition mechanism is based on two planning mechanisms: Tree-search and GraphPlan.
4
Formal Framework
Our formal framework is based on extended Planning-Graph techniques [2] allowing the creation and elimination of objects when executing services (actions). Contrary to classical approaches, where a state is defined as a set of predicates, a state in our domain is defined by a set of objects together with properties of and relations between these objects; we extend the definition of actions to allow the generation and elimination of objects in the environment, the assignment of new predicates to objects and the definition of new relations between them.
4.1
Preliminaries and Definitions
The domain D = (C, P) is defined by a set of WS C = (WS1, WS2, ..., WSn), which we call a community of Web Services, and a set of predicate types P = {p1, p2, ..., pn} specifying the possible properties of objects and relations between them. A state q = (V, P) of the plan execution is defined by a set of objects V with their types, and a set of predicates P specifying the properties of these objects and the relationships between them. In Section 3, the initial state is specified as: q0 = ({(F1: file), (F2: file)}, {(doc F1), (en F1), (latex F2), (fr F2)}), where F1, F2 are objects (files), file is a type, and doc, en, latex and fr are properties. A Web Service WSi is defined by WSi = (Ti, Ai, Si), which are respectively the type, attributes and services of the WS. A service in WSi is defined by S_i^k = (Pin_i^k, Pout_i^k, Pinout_i^k, Prec_i^k, Effects_i^k), which are respectively the input, output and input-output objects, and the preconditions and effects of the service execution. The service mergepdf of WS3 is defined as follows:
• Pin3 = { (F1: file), (F2: file) }
• Pout3 = { (F: file) }
• Pinout3 = { }
• Prec3 = { (pdf F1), (pdf F2) }
• Effect−3 = { (pdf F1), (pdf F2) }
• Effect+3 = { (pdf F), (merge F F1 F2) }
A plan is defined as a sequence of sets of services, where every set is called a partial plan. More formally, Π = ⟨π1, π2, ..., πn⟩ is a plan such that ∀i ∈ [1..n], πi = (si1, ..., sin) is a partial plan of independent services, and each sik is instantiated with real objects of the domain. One solution plan for the problem introduced in Section 3 is Π = ⟨π1, π2, π3, π4, π5⟩ where:
• π1 = (fr2en[F1], en2ar[F2])
• π2 = (en2ar[F1], doc2pdf[F2])
• π3 = (latex2doc[F1])
• π4 = (doc2pdf[F1])
• π5 = ([#F0] = mergepdf[F1, F2])
A request R = (D, q0, g) is defined, for a domain D of WS, by the initial state q0 and the goal state g. The initial and final states are defined by a set of objects and a set of associated predicates (V, P). In the previous example, q0 = [{(F1: file), (F2: file)}, {(doc F1), (en F1), (latex F2), (fr F2)}] and g = [{(#F0: file)}, {(pdf #F0), (ar #F0), (merge F1 F2 #F0)}]. The aim of the symbol # before the name of an object is to state that it is a generated object (in the output set of the executed service); any other object of the same type whose name begins with # can replace it in the domain.
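A hedged sketch of this extended state/service model, showing how applying mergepdf creates a fresh object; the frozenset encoding and the name counter are our own illustrative assumptions, not the authors’ implementation.

```python
# Hedged sketch of the extended planning model: states carry objects, and
# service effects may create new ones. Encoding is illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    objects: frozenset        # {("F1", "file"), ...}
    predicates: frozenset     # {("pdf", "F1"), ...}

counter = 0
def fresh(prefix="#F"):
    """Generate a new object name, mirroring the # convention above."""
    global counter
    name = f"{prefix}{counter}"
    counter += 1
    return name

def apply_mergepdf(state, f1, f2):
    """mergepdf: preconditions (pdf f1), (pdf f2); creates a new file F
    with effects -{(pdf f1),(pdf f2)} and +{(pdf F),(merge F f1 f2)}."""
    assert ("pdf", f1) in state.predicates and ("pdf", f2) in state.predicates
    f = fresh()
    return State(
        objects=state.objects | {(f, "file")},
        predicates=(state.predicates - {("pdf", f1), ("pdf", f2)})
                   | {("pdf", f), ("merge", f, f1, f2)},
    )

s = State(frozenset({("F1", "file"), ("F2", "file")}),
          frozenset({("pdf", "F1"), ("pdf", "F2")}))
s2 = apply_mergepdf(s, "F1", "F2")   # s2 now contains the new object #F0
```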
5
Planning Algorithm
We have implemented two algorithms to build a solution to our problem. The first is based on the classical Tree-search algorithm, and the second on the GraphPlan method. The basic idea behind the Tree-search algorithm is to apply, from the initial state, all executable services. By doing this (expanding a state) we obtain a set of new states S; if the goal is in S, a solution is found. If not, based on the Tree-search strategy, we select one of the unexpanded states. If all states have been expanded, we report failure. Using this algorithm, we obtain a sequential plan of services, which we then transform into a sequence of partial plans (sets of independent services) Π = ⟨π1, π2, ..., πn⟩.
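A minimal sketch of this forward search, reusing the State/apply style above; goal_satisfied, the service triples and the depth bound stand in for the paper’s goal test and precondition checks, and are assumptions of ours. A FIFO queue gives the width strategy of Table 1; replacing it with a stack (popping from the right) gives the depth strategy.

```python
# Hedged sketch of the Tree-search composition algorithm: forward search
# over states, applying every executable service at each expansion.
from collections import deque

def tree_search(initial, services, goal_satisfied, max_depth=10):
    """services: iterable of (name, is_executable, apply_fn) triples.
    Returns the sequence of service names reaching the goal, or None."""
    frontier = deque([(initial, [])])
    while frontier:
        state, plan = frontier.popleft()   # FIFO = width strategy
        if goal_satisfied(state):
            return plan
        if len(plan) >= max_depth:
            continue
        for name, is_executable, apply_fn in services:
            if is_executable(state):       # precondition check
                frontier.append((apply_fn(state), plan + [name]))
    return None
```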
5.1
GraphPlan algorithm
The GraphPlan algorithm performs a procedure close to iterative deepening, discovering a new part of the search space at each iteration. It iteratively expands the planning graph by one level, then searches backward from the last level of the graph for a solution. The first expansion, however, proceeds to a level Pi in which all the goal propositions are included, no pair of them is mutex, and the set of services executed for reaching g is not mutex; and so on, until reaching P0 (in which case a plan is found) or until failure (Pi = Pi+1 and no plan is found).
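The outer loop of this procedure can be sketched as follows; expand, goals_reachable, backward_extract and last_level are placeholders for the standard GraphPlan operations, an illustrative interface rather than the authors’ code.

```python
# Hedged sketch of the GraphPlan outer loop: expand one level at a time,
# try backward extraction, and stop at a level fixpoint (no solution).

def graphplan(graph, goals):
    """graph: a planning-graph object exposing the operations named
    below (illustrative interface, not a concrete library)."""
    while True:
        if graph.goals_reachable(goals):       # goals present, non-mutex
            plan = graph.backward_extract(goals)
            if plan is not None:
                return plan                    # sequence of partial plans
        previous = graph.last_level()
        graph.expand()                         # add one proposition level
        if graph.last_level() == previous:     # fixpoint: P_i = P_{i+1}
            return None                        # no plan exists
```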
6
Implementation and Results
By implementing the Tree-search and GraphPlan algorithms, we show that our new approach to WS composition under implicit requests is feasible. In our implementation we use a part of the PDDL language and extend it to fit our model.
Table 1. Results of the Tree-search algorithm.

Problem  Objects  Depth strategy:       Width strategy:
                  nodes    plan size    nodes    plan size
P1       1        6        4            9        4
P2       1        4        0            4        0
P3       2        8        7            232      6
P4       2        9        8            2585     8
P5       3        13       12           > 4200   (*)
P6       3        14       13           583      7
P7       4        18       17           > 2900   (*)
(*) solution not found
We tested our algorithms on 7 examples that contain many objects and many types of variables (files, tracks and images). P2 is the problem given in Section 3. Table 1 gives the number of initial objects of the different problems, and the number of expanded nodes and the plan size under the depth and width strategies. We have 16 available services in the domain (illustrated in Section 3). From these results we observe that the depth strategy is very effective: in a few seconds we obtain a plan by expanding a small number of nodes. Using the GraphPlan algorithm, we obtain solutions for simple problems, but not for complex problems that contain a high number of objects: applying the extended Graph-plan techniques leads to a combinatorial explosion, due to the execution of services that create new objects at each level.
7
Conclusion and Perspective
In this paper, we give an extended view of the Web Service composition problem by modelling it as a planning problem. We propose an extended model of services in order to answer composition problems that require the creation and elimination of objects as effects of the execution of a service. We also overcome the limitations of other approaches by giving a dynamic and distributed definition of our domain: this allows us to add, remove and/or replace services without recalculating other parts of the domain. Finally, our model overcomes the limitation of pre-defined plans by defining the implicit request only through an initial and a goal state.
REFERENCES
[1] P. Bertoli, A. Cimatti, M. Pistore, M. Roveri, and P. Traverso, ‘MBP: a model based planner’, in Proc. of the IJCAI’01 Workshop on Planning under Uncertainty and Incomplete Information, Seattle, (2001).
[2] M. Ghallab, D. Nau, and P. Traverso, Automated Planning: Theory and Practice, Morgan Kaufmann Publishers, (2005).
[3] S. McIlraith and T. Son, ‘Adapting Golog for composition of semantic web services’, (2002).
[4] M. Ghallab, A. Howe, C. Knoblock, D. McDermott, A. Ram, M. Veloso, D. Weld, and D. Wilkins, ‘PDDL — the planning domain definition language’, (1998).
[5] M. Pistore, P. Bertoli, F. Barbon, D. Shaparau, and P. Traverso, ‘Planning and monitoring web service composition’, ICAPS 2004.
[6] M. Pistore and P. Traverso, ‘Planning as model checking for extended goals in non-deterministic domains’, pp. 479–486, (2001).
[7] D. Wu, E. Sirin, J. Hendler, D. Nau, and B. Parsia, ‘Automating DAML-S web services composition using SHOP2’, (2003).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-737
Propositional merging operators based on set-theoretic closeness
Patricia Everaere1 and Sébastien Konieczny2 and Pierre Marquis3
Abstract. In the propositional setting, a well-studied family of merging operators are the distance-based ones: the models of the merged base are the interpretations closest to the given profile. Closeness is, in this context, measured as a number resulting from the aggregation of the distances to each base of the profile. In this work we define a new family of propositional merging operators, close to such distance-based merging operators, but relying on a set-theoretic definition of closeness, already at work in several revision/update operators from the literature. We study a specific merging operator of this family, obtained by considering set-product as the aggregation function.
1
Introduction
Information merging is a very important task in artificial intelligence: the issue is to determine the beliefs, or the goals, of a group of agents from their individual points of view. Much work has been devoted to the definition of merging operators in the propositional case [11, 9, 1, 8, 10]. In [8] a set of postulates is proposed to characterize different families of merging operators, and several families of operators satisfying those postulates are defined. Such operators are called model-based merging operators because, basically, they select the models of a given integrity constraint (i.e. a formula encoding laws, norms, etc., used for constraining the result of the merging) that are the closest ones to the given profile of belief/goal bases of the group. Often, those operators are defined from a distance between interpretations, which intuitively indicates how conflicting they are. This distance between interpretations induces a distance between an interpretation and a base, which indicates how plausible/satisfactory the interpretation is with respect to the base. Once such distances are computed, an aggregation function is used to define the overall distance of each model (of the integrity constraints) to the profile. Semantically, the models of the result of the merging are the models of the integrity constraints closest to the profile. A commonly-used distance between interpretations is the Hamming distance (also called Dalal distance [3]). The Hamming distance between two interpretations is the number of propositional variables the two interpretations disagree on. The amount of conflict between two interpretations is thus assessed as the number of atoms whose truth values must be flipped in one interpretation in order to
1 Université Lille-Nord de France, LIFL, CNRS UMR 8022, France, email: patricia.everaere@univ-lille1.fr
2 CNRS UMR 8188, CRIL, Université Lille-Nord de France, Artois, France, email: konieczny@cril.fr
3 Université Lille-Nord de France, Artois, CRIL, CNRS UMR 8188, France, email: marquis@cril.fr
make it identical to the second one. Such a distance is very meaningful when no extra information on the epistemic states of the agents is available. The major problem with distance-based merging operators is that evaluating the closeness between two interpretations as a number may lead to losing too much information. The conflicting variables themselves (and not only how many there are) can prove significant. In particular, when variables express real-world properties, it can be the case that some variables are more important than others, or that some variables are logically connected. In those cases, distances are not fully satisfactory. As an alternative to distances, an interesting measure for evaluating the closeness of two interpretations is diff, the symmetrical difference between them. Instead of evaluating the degree of conflict between two interpretations as the number of variables on which they differ (as is the case with the Hamming distance), the diff measure assesses it as the set of such variables. In this work, we consider the family of propositional merging operators based on the diff measure. We specifically focus on the operator Δdiff,⊕ from this family, obtained by considering set-product as the aggregation function. We evaluate it with respect to three criteria: logical properties, strategy-proofness and complexity.
2
A Diff-Based Merging Operator: Δdiff,⊕
The key idea underlying our approach consists in evaluating the degree of conflict between two interpretations ω and ω′ as the set of variables on which they differ: diff(ω, ω′) = {p ∈ P | ω(p) ≠ ω′(p)}. This definition has already been used in the belief revision/update literature to define a number of operators [6, 13, 12, 2, 14]. As for distances, we can straightforwardly define, using diff, a notion of closeness between an interpretation and a base, as the minimal closeness between the interpretation and the models of the base. Of course, since diff outputs a set instead of a number, set-inclusion has to be used as the minimality criterion: diff(ω, K) = min({diff(ω, ω′) | ω′ |= K}, ⊆). So the closeness between an interpretation ω and a base K is measured as the set of the minimal sets (for set inclusion) of propositional variables which have to be flipped in ω to make it a model of K. Now, we need to aggregate these measures in order to define a global notion of closeness between an interpretation and a profile. This is the aim of the aggregation functions. Of course, the usual functions at work for distance-based operators cannot be used here, simply because we do not deal with numbers but with sets. Several aggregation functions can be considered in our setting. For space reasons, we focus on a single one in this paper. We consider
set-product ⊕ as an aggregation function: for two sets of sets E and E′, E ⊕ E′ = {c ∪ c′ | c ∈ E and c′ ∈ E′}.
Definition 1 Let E = {K1, . . . , Kn} be a profile and ω an interpretation. The closeness between ω and E is given by: diff(ω, E) = min({⊕Ki∈E diff(ω, Ki)}, ⊆).
By construction, each element of diff(ω, E) is a minimal set c of variables (a conflict set) such that, for each base Ki, ω can be transformed into a model of Ki by flipping in ω the variables of c. Finally, we define a merging operator Δdiff,⊕ which picks up the models of the integrity constraints whose closeness to the profile E contains at least one of the minimal (w.r.t. ⊆) conflict sets:
Definition 2 Let E = {K1, K2, . . . , Kn} be a profile and μ an integrity constraint. Then diffμ(E) = min({diff(ω, E) | ω |= μ}, ⊆) and [Δ^{diff,⊕}_μ(E)] = {ω |= μ | ∃c ∈ diff(ω, E) s.t. c ∈ diffμ(E)}.
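A hedged executable reading of Definitions 1 and 2, with interpretations as variable-assignment dicts and bases given by their model sets; the encoding is ours, not the authors’ implementation.

```python
# Hedged sketch of diff-based merging. Interpretations are dicts over a
# fixed variable set; a base is represented by its set of models.

def diff(w1, w2):
    return frozenset(p for p in w1 if w1[p] != w2[p])

def min_incl(sets):
    """Keep the sets that are minimal w.r.t. set inclusion."""
    return {s for s in sets if not any(t < s for t in sets)}

def diff_base(w, models_K):
    return min_incl({diff(w, wp) for wp in models_K})

def set_product(E1, E2):
    return {c1 | c2 for c1 in E1 for c2 in E2}

def closeness(w, profile):
    """diff(w, E): set-product over the bases, then inclusion-minimal."""
    acc = {frozenset()}
    for models_K in profile:
        acc = set_product(acc, diff_base(w, models_K))
    return min_incl(acc)

def merge(models_mu, profile):
    """Models of the merged base: those models of mu whose closeness
    contains a globally inclusion-minimal conflict set (Definition 2)."""
    global_min = min_incl({c for w in models_mu
                           for c in closeness(w, profile)})
    return [w for w in models_mu if closeness(w, profile) & global_min]
```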
3
Properties of Δdiff,⊕
Δdiff,⊕ satisfies most of the logical properties proposed in [8]:
Proposition 1 Δdiff,⊕ satisfies (IC0), (IC1), (IC2), (IC3), (IC4) and (IC7). It does not satisfy (IC5), (IC6) and (IC8).
Δdiff,⊕ does not satisfy (IC5) and (IC6), which are postulates capturing aggregation properties. This is not surprising since, unlike distance-based operators (such as the ones based on the Hamming distance), Δdiff,⊕ keeps a justification of the minimality of an interpretation (as a conflict set). Beyond the IC postulates, Δdiff,⊕ also satisfies an interesting additional logical property:
Definition 3 A merging operator Δ satisfies the temperance property iff for every profile {K1, . . . , Kn}: Δ({K1, . . . , Kn}) is consistent with each Ki (temperance).
Proposition 2 Δdiff,⊕ satisfies (temperance).
This proposition shows that the merged base obtained using Δdiff,⊕ is consistent with every base of the profile (when there is no integrity constraint). This proposition also gives an additional explanation for the fact that Δdiff,⊕ does not satisfy (IC6), since temperance is not compatible with this postulate.
Proposition 3 There is no merging operator satisfying both (IC2), (IC6) and (temperance).
It is worth noting that the temperance property is not satisfied by many merging operators. In particular, as implied by the previous proposition, none of the IC merging operators satisfies (temperance). Interestingly, the temperance property shows that Δdiff,⊕ can be viewed as a kind of negotiation operator, which can be used for determining the most consensual parts of the bases of all agents. Let us now investigate how robust Δdiff,⊕ is with respect to manipulation. Intuitively, a merging operator is strategy-proof if and only if, given the beliefs/goals of the other agents, reporting untruthful beliefs/goals does not enable an agent to improve her satisfaction. A formal counterpart of this idea is given in [4, 5].
Proposition 4 In the general case, Δdiff,⊕ is not strategy-proof for any of the three indexes idw, ids and ip. When there is no integrity constraint (i.e., μ ≡ ⊤), Δdiff,⊕ is strategy-proof for idw, but still not strategy-proof for ids or ip.
Most of the model-based operators are not strategy-proof, even in very restricted situations [5]. For example, Δ^{dH,Σ} and Δ^{dH,Gmin}, which are the best model-based operators with respect to strategy-proofness, are not strategy-proof for i_dw, even if μ ≡ ⊤. Δ^{diff,⊕} performs better than any of them in this respect. Let us now consider the complexity issue for the inference problem from a Δ^{diff,⊕}-merged base.

Proposition 5 MERGE(Δ^{diff,⊕}) is Π^p_2-complete. Hardness still holds under the restriction where E contains a single base K consisting of a conjunction of propositional variables, and α is a propositional variable.

This result shows that Δ^{diff,⊕} is computationally harder than the usual distance-based operators, but is at the same complexity level as many formula-based operators [7].
4 Conclusion
In this work we have introduced a family of model-based merging operators relying on a set-theoretic measure of conflict. We focused on set-product as an aggregation function and considered the corresponding operator Δ^{diff,⊕}. A feature of this operator, typically not shared by existing model-based operators, is that it satisfies the temperance property and, as a consequence, is strategy-proof for the weak drastic index when there are no integrity constraints. The price to be paid is a higher complexity than for the usual model-based operators (though similar to that of formula-based merging operators [5]).
ACKNOWLEDGEMENTS

The authors have been partly supported by the ANR project PHAC (ANR-05-BLAN-0384).
REFERENCES

[1] C. Baral, S. Kraus, J. Minker, and V. S. Subrahmanian, 'Combining knowledge bases consisting of first-order theories', Computational Intelligence, 8(1), 45–71, (1992).
[2] A. Borgida, 'Language features for flexible handling of exceptions in information systems', ACM Trans. on Database Syst., 10, 563–603, (1985).
[3] M. Dalal, 'Investigations into a theory of knowledge base revision: preliminary report', in Proc. of AAAI'88, pp. 475–479, (1988).
[4] P. Everaere, S. Konieczny, and P. Marquis, 'On merging strategy-proofness', in Proc. of KR'04, pp. 357–367, (2004).
[5] P. Everaere, S. Konieczny, and P. Marquis, 'The strategy-proofness landscape of merging', Journal of Artificial Intelligence Research, 28, 49–105, (2007).
[6] H. Katsuno and A. O. Mendelzon, 'Propositional knowledge base revision and minimal change', Artificial Intelligence, 52, 263–294, (1991).
[7] S. Konieczny, J. Lang, and P. Marquis, 'DA2 merging operators', Artificial Intelligence, 157, 49–79, (2004).
[8] S. Konieczny and R. Pino Pérez, 'Merging information under constraints: a logical framework', Journal of Logic and Computation, 12(5), 773–808, (2002).
[9] P. Liberatore and M. Schaerf, 'Arbitration (or how to merge knowledge bases)', IEEE Transactions on Knowledge and Data Engineering, 10(1), 76–90, (1998).
[10] T. Meyer, P. Pozos Parra, and L. Perrussel, 'Mediation using m-states', in Proc. of ECSQARU'05, pp. 489–500, (2005).
[11] P. Z. Revesz, 'On the semantics of arbitration', International Journal of Algebra and Computation, 7(2), 133–160, (1997).
[12] K. Satoh, 'Non-monotonic reasoning by minimal belief revision', in Proceedings of the International Conference on Fifth Generation Computer Systems, pp. 455–462, (1988).
[13] A. Weber, 'Updating propositional formulas', in Proceedings of the First Conference on Expert Database Systems, pp. 487–500, (1986).
[14] M. Winslett, 'Reasoning about action using a possible models approach', in Proc. of AAAI'88, pp. 89–93, (1988).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-739
Partial and Informative Common Subsumers in Description Logics
Simona Colucci 1,2, Eugenio Di Sciascio 1, Francesco Maria Donini 3 and Eufemia Tinelli 4

Abstract. Least Common Subsumers in Description Logics have shown their usefulness for discovering commonalities among all concepts of a collection. Several applications are nevertheless focused on searching for properties shared by significant portions of a collection rather than by the collection as a whole. Actually, this is an issue we faced in a real case scenario that provided the initial motivation for this study, namely the process of Core Competence extraction in knowledge-intensive companies. The paper defines four reasoning services for the identification of meaningful common subsumers describing partial commonalities in a collection. In particular, Common Subsumers adding informative content to the Least Common Subsumer are investigated, with reference to different DLs.
1 Introduction
Least Common Subsumers (LCSs) were originally proposed by Cohen, Borgida and Hirsh [5] as a novel reasoning service for the Description Logic underlying Classic [4]. By definition, for a collection of concept descriptions, their LCS represents the most specific concept description subsuming all of the elements of the collection. The usefulness of such an inference task has been shown in several application classes, varying from learning from examples [6, 7, 10], to similarity-based Information Retrieval [12, 13] and bottom-up construction of knowledge bases [1]. Nevertheless, there are some problems for which the computation of the LCS does not provide solutions. The LCS in fact intuitively represents properties shared by all the elements of a given collection. In several applications, instead, such sharing is not required to be full: in other words, we could be interested in finding a concept description subsuming only a portion of the elements in the collection. Different perspectives on the introduced problem may be taken: if the LCS of the collection is the universal concept, we can determine the concept description subsuming a number m of concept descriptions in the collection, where m is the maximum cardinality of subsets of the collection for which a common subsumer non-equivalent to the universal concept exists. We give the name Best Common Subsumer to such a concept description, in analogy with the LCS. Alternatively, we could be interested in determining a concept description subsuming at least k elements in the collection, where k is a threshold value established a priori on the basis of a decisional process dependent on the application domain. We give such a concept description the name k-Common Subsumer (k-CS). In particular, the search should focus on those k-CSs adding informative content to the LCS: we
1 SisInfLab–Politecnico di Bari, Bari, Italy
2 D.O.O.M. s.r.l., Matera, Italy
3 Università della Tuscia, Viterbo, Italy
4 Università di Bari, Bari, Italy
call Informative k-Common Subsumer (IkCS) a k-CS more specific than the LCS of the collection. Here we define the k-CS, the IkCS, the BCS and one further service (the Best Informative Common Subsumer), and give some computation results for different DLs, namely ALN, EL and ALE.
2 Definitions
The definition of the four novel services relies on the Least Common Subsumer definition, which we recall in the following.

Definition 1 (LCS, [7]) Let C1, . . . , Cn be n concepts in a DL L. An LCS of C1, . . . , Cn, denoted by LCS(C1, . . . , Cn), is a concept E in L such that the following conditions hold: (i) Ci ⊑ E for i = 1, . . . , n; (ii) E is the least L-concept satisfying (i), i.e., if E′ is an L-concept satisfying Ci ⊑ E′ for all i = 1, . . . , n, then E ⊑ E′.

We define in the following a new concept, which represents the commonalities of k concepts out of the n in a collection of DL concepts.

Definition 2 (k-CS) Let C1, . . . , Cn be n concepts in a DL L, and let k < n. A k-Common Subsumer (k-CS) of C1, . . . , Cn is a concept D such that D is an LCS of k concepts among C1, . . . , Cn.

Among k-Common Subsumers we distinguish concepts adding informative content to the LCS of the investigated collection.

Definition 3 (IkCS) Let C1, . . . , Cn be n concepts in a DL L, and let k < n. An Informative k-Common Subsumer (IkCS) of C1, . . . , Cn is a k-CS E such that E is strictly subsumed by LCS(C1, . . . , Cn).

Some Informative k-Common Subsumers are peculiar in subsuming the maximum number of concepts in the collection, with such a maximum less than the cardinality n of the collection. We therefore define in what follows:

Definition 4 (BICS) Let C1, . . . , Cn be n concepts in a DL L. A Best Informative Common Subsumer (BICS) of C1, . . . , Cn is a concept B such that B is an Informative k-CS of C1, . . . , Cn, and for every k < j ≤ n no j-CS is informative.

For collections whose LCS is equivalent to the universal concept, the following definition also makes sense:

Definition 5 (BCS) Let C1, . . . , Cn be n concepts in a DL L. A Best Common Subsumer (BCS) of C1, . . . , Cn is a concept S such that S is a k-CS of C1, . . . , Cn, and for every k < j ≤ n every j-CS ≡ ⊤.

Proposition 1 If LCS(C1, . . . , Cn) ≡ ⊤, every BCS is a BICS.
Even though the services defined above may appear quite similar to each other at first sight, it has to be underlined that they deal with different problems:

k-CS: can be computed for every collection of elements; it finds least common subsumers of k elements among the n belonging to the collection;

IkCS: describes those k-CSs adding informative content to the one provided by the LCS, i.e., more specific than the LCS. Observe that an IkCS does not exist when every subset of k concepts has the same LCS as the one of all C1, . . . , Cn;

BICS: describes IkCSs subsuming h concepts, such that h is the maximum cardinality of subsets of the collection for which an IkCS exists. A BICS does not exist if and only if Ci ≡ Cj for all i, j = 1, . . . , n;

BCS: may be computed only for collections admitting only an LCS equivalent to the universal concept; it finds k-CSs such that k is the maximum cardinality of subsets of the collection for which an LCS not equivalent to ⊤ exists.
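Before turning to complexity, the relationships among the four sets can be made concrete with a naive enumeration sketch (Python). The DL-specific tests `lcs`, `equivalent` and `is_top` are assumed oracles here, so this is hypothetical glue code mirroring the definitions, not the dedicated algorithm of [8]; the subset enumeration also makes the exponential behaviour stated in Theorem 1 below directly visible.

```python
from itertools import combinations

def k_cs(concepts, k, lcs):
    """Lk: the LCSs of all k-element subsets of the collection."""
    return [lcs(sub) for sub in combinations(concepts, k)]

def informative_k_cs(concepts, k, lcs, equivalent):
    """Ik: k-CSs strictly more specific than the LCS of the whole collection."""
    full_lcs = lcs(tuple(concepts))
    return [d for d in k_cs(concepts, k, lcs) if not equivalent(d, full_lcs)]

def bcs(concepts, lcs, is_top):
    """B: k-CSs for the largest k < n admitting a non-trivial LCS; meaningful
    when the LCS of the whole collection is the universal concept."""
    for k in range(len(concepts) - 1, 1, -1):
        found = [d for d in k_cs(concepts, k, lcs) if not is_top(d)]
        if found:
            return k, found
    return None
```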
3 Computation

The complexity of computing the common subsumers defined in Section 2 depends on the specific DL in which the collection is represented. We will therefore separate the results for three different DLs in the following. Nevertheless, some results are common to every DL, like the following theorem, which deals with the cardinality of the set of k-CSs, given a collection of concepts in a DL L.

Theorem 1 For some sets of n concepts C1, . . . , Cn in a DL L, and for some k < n, there are exponentially many k-CSs of C1, . . . , Cn.

The following theorem, instead, focuses on the complexity of finding a BCS w.r.t. that of computing an LCS.

Theorem 2 Let m be the sum of the sizes of C1, . . . , Cn. Then finding a BCS of C1, . . . , Cn amounts to the computation of O(m^2) subsumption tests in L, plus the computation of one LCS.

Both theorems are proved in [8]. Hereafter, regardless of the DL employed for the representation of concepts, we will refer to the solution sets for the introduced reasoning services by the names: B for the set of BCSs, BI for the set of BICSs, Ik for the set of IkCSs, given k < n, and Lk for the set of k-CSs, given k < n. For a collection of concept descriptions in ALN, an algorithm can be defined computing the solution sets [8]. Complexity results for this algorithm are claimed in the following theorem.

Theorem 3 Let C1, . . . , Cn, T be n concepts and a simple TBox in ALN, let m be the sum of the sizes of C1, . . . , Cn, and let S(s) be a monotone function bounding the cost of deciding C ⊑_T D in ALN, whose argument s is |C| + |D| + |T|. The computation of the solution sets B, BI, Lk, Ik for a collection of concept descriptions in ALN is then a problem in O(m^2 + (S(m))^2).

Baader et al. [2] showed that, by taking into account existential restrictions, the n-ary LCS operation is exponential, even for the small DL EL, and even when shortening possible repetitions by using a TBox [3]. The computation results for the determination of the solution sets of a concept collection in EL and ALE are affected by the results for LCS:

Theorem 4 The computation of the solution sets B, BI, Lk, Ik for a collection of concept descriptions in EL or ALE may be reduced to the problem of computing the LCS of the subsets of the collection, and may then grow exponentially in the size of the collection.

For computing Lk it is sufficient to compute, for every subset {i1, . . . , ik} ⊆ {1, . . . , n}, the concept LCS(Ci1, . . . , Cik). The same holds for Ik, excluding those LCS(Ci1, . . . , Cik) which are equivalent to LCS(C1, . . . , Cn). For the computation of the sets B and BI, instead, an algorithm can be defined [8], based on the one proposed by Küsters and Molitor [11] for LCS computation.

4 Conclusions

Motivated by a real-world application need (finding Core Competence in knowledge-intensive companies), we defined and investigated novel reasoning services finding commonalities among portions of a collection of concepts in ALN, EL and ALE. For all three studied languages a computation algorithm has been designed. The computation algorithm for ALN has also been implemented in the framework of IMPAKT, a novel and optimized knowledge-based system for competence and skill management [9], which will be released late this year by D.O.O.M. s.r.l.

Acknowledgment

We thank Franz Baader for helpful discussions. This work has been supported in part by projects EU-FP6-IST-26896, PE 013 Innovative models for customer profiling and PS 092 DIPIS.

REFERENCES

[1] F. Baader and R. Küsters. Computing the least common subsumer and the most specific concept in the presence of cyclic ALN-concept descriptions. In Proc. of KI-98, volume 1504 of LNCS, pages 129–140, Bremen, Germany, 1998. Springer-Verlag.
[2] F. Baader, R. Küsters, and R. Molitor. Computing least common subsumers in description logics with existential restrictions. Technical Report LTCS-Report 98-09, RWTH Aachen, 1998.
[3] F. Baader and A.-Y. Turhan. On the problem of computing small representations of least common subsumers. In Proc. of KI 2002, volume 2479 of LNAI, Aachen, Germany, 2002. Springer-Verlag.
[4] A. Borgida, R. J. Brachman, D. L. McGuinness, and L. Alperin Resnick. CLASSIC: A structural data model for objects. In Proc. of ACM SIGMOD, pages 59–67, 1989.
[5] W. Cohen, A. Borgida, and H. Hirsh. Computing least common subsumers in description logics. In Proc. of AAAI-92, pages 754–760. AAAI Press, 1992.
[6] W. Cohen and H. Hirsh. The learnability of description logics with equality constraints. Machine Learning, 17(2-3):169–199, 1994.
[7] W. Cohen and H. Hirsh. Learning the CLASSIC description logic: Theoretical and experimental results. In Proc. of KR'94, pages 121–133, 1994.
[8] S. Colucci, E. Di Sciascio, and F. M. Donini. Partial and informative common subsumers of concept collections in description logics. In Proc. of DL 2008, 2008.
[9] S. Colucci, T. Di Noia, E. Di Sciascio, F. M. Donini, and A. Ragone. Semantic-based skill management for automated task assignment and courseware composition. Journal of Universal Computer Science, 13(9):1184–1212, 2007.
[10] M. Frazier and L. Pitt. CLASSIC learning. In Proc. of the 7th Annual ACM Conference on Computational Learning Theory, pages 23–34, New Brunswick, New Jersey, 1994. ACM Press and Addison Wesley.
[11] R. Küsters and R. Molitor. Structural subsumption and least common subsumers in a description logic with existential and number restrictions. Studia Logica, 81:227–259, 2005.
[12] T. Mantay, R. Möller, and A. Kaplunova. Computing probabilistic least common subsumers in description logics. In KI - Künstliche Intelligenz, volume 1701 of LNCS, pages 89–100. Springer-Verlag, 1999.
[13] R. Möller, V. Haarslev, and B. Neumann. Semantics-based information retrieval. In Proc. of IT&KNOWS-98, Vienna, Budapest, 1998.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-741
Prime Implicate-based Belief Revision Operators 1
Meghyn Bienvenu 2, Andreas Herzig 3 and Guilin Qi 4

1 INTRODUCTION
A belief revision operator can be seen as a function which takes as input a set of beliefs K and an input formula ϕ and outputs a new set of beliefs K ⋆ ϕ. Many of the belief revision schemes that have been defined in the literature require additional input. The extra information they need comes in various forms: relations over subsets of the sets of beliefs [2], epistemic entrenchment relations [1], systems of spheres [6], faithful orderings [7], etc. In many applications we do not have such background information, which is why there is a need for revision operators which give good results without it. Unfortunately, the above approaches appear ill-suited to cases where we do not have any information regarding the relative importance of different beliefs or models. For example, if we accord equal importance to each of the beliefs (or each model of the beliefs or non-beliefs), which seems the most reasonable thing to do if we have no preference information, then these approaches result in the infamous drastic revision operator which gives up all old beliefs whenever the incoming information contradicts them. All of the above belief revision schemes are insensitive to syntax: logically equivalent sets of beliefs are revised in the same way, and logically equivalent input formulas lead to the same result. The so-called formula-based approaches, like the full meet [5, 4] and cardinality-maximizing base revision operators [5, 9], abandon the postulate of insensitivity to syntax, and allow e.g. the set of beliefs K1 = {a, b} to be revised differently from K2 = {a ∧ b}. Such approaches can do without extra information: they do not collapse into the drastic revision operator. There are only very few belief revision operators that are both insensitive to syntax and independent of extra information. The most prominent one is Dalal's [3]. It is often called model-based: revision is identified with a move from the models of K to those models of ϕ that are closest in terms of the Hamming distance. Two other model-based revision operators exist: Weber's [12] and Satoh's [11]. In this paper we propose two revision operators which are formula-based yet syntax-insensitive, and do not rely on background information. Our operators are obtained by first replacing the belief base by its set of prime implicates and then applying either the full meet or the cardinality-maximizing base revision operators. The prime implicates of a belief base, defined as its logically strongest clausal consequences, can be seen as the primitive semantic components of the belief base, from which all other beliefs can be derived. We argue that when no extra information is available, prime implicates provide a natural and interesting way of representing a set of beliefs. Moreover, the fact that equivalent sets of formulae have the same sets of
1 The third author acknowledges partial support by the EU under the IST project NeOn (IST-2006-027595, http://www.neon-project.org/).
2 IRIT-Université Paul Sabatier, France, bienvenu@irit.fr
3 IRIT-CNRS, France, herzig@irit.fr
4 Institute AIFB, Universität Karlsruhe, Germany, gqi@aifb.uni-karlsruhe.de
prime implicates guarantees the syntax-insensitivity of our operators.
2 FORMAL PRELIMINARIES
We consider a propositional language built out of a finite set of atoms and the usual Boolean connectives. We suppose the latter includes the 0-ary connective ⊥. We will use V(ϕ) to refer to the set of atoms occurring in ϕ. A belief base is a finite set of propositional formulae. Where convenient, we will identify a belief base with the conjunction of its elements. We will use ⋁K to denote the disjunction of the elements in the belief base K. A literal is either an atom or the negation of an atom, and a clause is a disjunction of literals. Prime implicates (cf. [8]) are defined as the logically strongest clausal consequences of a formula. By definition, if π is a prime implicate of ϕ, then so too are all clauses equivalent to π. To simplify the presentation, we will choose a representative for each equivalence class of clauses, and we let Π(ϕ) denote the set of representatives of equivalence classes of prime implicates of ϕ. We define the minimal language of a formula ϕ, written V0(ϕ), to be the set of atoms occurring in every formula ϕ′ which is equivalent to ϕ. A set {A1, . . . , An} of sets of atoms is a splitting of a belief base K if and only if the Ai partition V0(K) and there exist formulae ϕ1, . . . , ϕn such that K ≡ ⋀_{i=1}^{n} ϕi and V(ϕi) ⊆ Ai for all i. A splitting {A1, . . . , An} of K is a finest splitting of K just in case, whenever {A′1, . . . , A′p} is another splitting of K, for every Ai there is some A′j such that Ai ⊆ A′j. It was shown in [10] that every belief base has a unique finest splitting. We will use K⊥ϕ and K⊥_Card ϕ to denote respectively the set of inclusion- and cardinality-maximal subsets of K consistent with ¬ϕ.
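As an aside on how a splitting can be found in practice, the following Python sketch groups atoms that co-occur in some prime implicate of K (prime implicates are assumed given, as iterables of (atom, polarity) literals). Under the usual connection between prime implicates and language splitting, the resulting blocks yield a splitting of V0(K); we take it here, as a working assumption for illustration only, that this gives the finest one.

```python
def splitting_from_prime_implicates(prime_imps):
    """Union-find over atoms: two atoms land in the same block whenever
    they co-occur in some prime implicate of K."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for clause in prime_imps:
        roots = [find(atom) for atom, _polarity in clause]
        for r in roots[1:]:
            parent[find(roots[0])] = r

    blocks = {}
    for atom in parent:
        blocks.setdefault(find(atom), set()).add(atom)
    return list(blocks.values())

# K = {a ∨ b, c}: prime implicates a ∨ b and c give blocks {a, b} and {c}
print(splitting_from_prime_implicates([[("a", True), ("b", True)],
                                        [("c", True)]]))
```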
3 PROPOSED REVISION OPERATORS
Our first revision operator ⋆_Π conjoins the input ϕ with the disjunction of the maximal subsets of Π(K) consistent with ϕ. It is essentially the same as the syntactic full meet base revision operator [5, 4], except that instead of dealing directly with the formulae in the belief base we deal with the prime implicates of the belief base.

Definition 1. Let K be a belief base and ϕ be a formula. Then the prime implicate-based full meet revision operator, written ⋆_Π, is defined as follows:

K ⋆_Π ϕ = ϕ ∧ ⋁(Π(K)⊥¬ϕ)

We illustrate the functioning of our operator on some examples:

Example 2. Let K = {a ∨ b, a ∨ c} and ϕ = ¬a ∧ ¬b. We have Π(K) = K, and Π(K)⊥¬ϕ = {{a ∨ c}}, so the result of revising K by ϕ is ¬a ∧ ¬b ∧ (a ∨ c) ≡ ¬a ∧ ¬b ∧ c.
Example 3. Let K = {a ∨ c, ¬b ∨ d, ¬a ∨ b} and let ϕ = ¬c ∧ ¬d. Then Π(K) = {a ∨ c, ¬b ∨ d, ¬a ∨ b, b ∨ c, ¬a ∨ d, c ∨ d}. The maximal subsets of Π(K) consistent with ϕ are P1 = {a ∨ c, ¬b ∨ d}, P2 = {a ∨ c, ¬a ∨ b, b ∨ c}, P3 = {¬b ∨ d, ¬a ∨ b, ¬a ∨ d}, and P4 = {¬a ∨ b, b ∨ c, ¬a ∨ d}. Now P1 ∧ ¬c ∧ ¬d ≡ a ∧ ¬b ∧ ¬c ∧ ¬d, P2 ∧ ¬c ∧ ¬d ≡ a ∧ b ∧ ¬c ∧ ¬d, P3 ∧ ¬c ∧ ¬d ≡ ¬a ∧ ¬b ∧ ¬c ∧ ¬d, and P4 ∧ ¬c ∧ ¬d ≡ ¬a ∧ b ∧ ¬c ∧ ¬d, so K ⋆_Π ϕ ≡ ¬c ∧ ¬d.

In the last example, none of the prime implicates from K can be inferred from the revised base K ⋆_Π ϕ. This is because our operator takes the disjunction of all the inclusion-maximal subsets consistent with the revision formula, which means that those prime implicates which do not appear in every inclusion-maximal subset can be lost when we take the disjunction. The solution lies in selecting only some of the inclusion-maximal subsets. If we have no information regarding the importance of different beliefs, as we assume here, there is no sure way of choosing among the subsets. One reasonable heuristic is to accord equal importance to each of the prime implicates, and hence to prefer those subsets which contain the most prime implicates. This leads us to propose a second revision operator which selects only the cardinality-maximal subsets consistent with the revision formula.

Definition 4. Let K be a belief base and ϕ be a formula. Then the prime implicate-based cardinality-maximizing revision operator, written ⋆_{Π,Card}, is defined as follows:

K ⋆_{Π,Card} ϕ = ϕ ∧ ⋁(Π(K)⊥_Card ¬ϕ)

The operator ⋆_{Π,Card} can be seen as a syntax-insensitive version of the cardinality-maximizing base revision operator [5, 9].

Example 5. Let K and ϕ be as in Example 3. P2, P3, and P4 are the cardinality-maximal subsets that are consistent with ϕ. So we have K ⋆_{Π,Card} ϕ ≡ (¬a ∨ b) ∧ ¬c ∧ ¬d, which is logically stronger than the ¬c ∧ ¬d obtained using ⋆_Π.
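Both definitions can be checked on toy instances such as Examples 2-5 with a brute-force sketch (Python; all names are ours). Formulas are represented by their sets of models over a small atom set, given as frozensets of true atoms; the code enumerates clauses and subsets explicitly, so it mirrors the definitions rather than any practical procedure. On Example 2's inputs it returns the single model {c}, i.e., ¬a ∧ ¬b ∧ c.

```python
from itertools import combinations

def prime_implicates(models, atoms):
    """Subset-minimal clauses (sets of (atom, polarity) literals) entailed
    by the formula whose models are given."""
    literals = [(a, True) for a in atoms] + [(a, False) for a in atoms]
    def sat(m, clause):
        return any((a in m) == pos for a, pos in clause)
    entailed = [frozenset(c) for r in range(1, len(atoms) + 1)
                for c in combinations(literals, r)
                if len({a for a, _ in c}) == r       # skip tautologies
                and all(sat(m, c) for m in models)]
    return [c for c in entailed if not any(d < c for d in entailed)]

def sat_all(m, clauses):
    return all(any((a in m) == pos for a, pos in c) for c in clauses)

def full_meet_pi_revision(pi, phi_models):
    """Models of K ⋆Π ϕ = ϕ ∧ ⋁(Π(K)⊥¬ϕ), per Definition 1."""
    consistent = [set(s) for r in range(len(pi) + 1)
                  for s in combinations(pi, r)
                  if any(sat_all(m, s) for m in phi_models)]
    maximal = [s for s in consistent if not any(s < t for t in consistent)]
    return {m for m in phi_models if any(sat_all(m, s) for s in maximal)}

def card_pi_revision(pi, phi_models):
    """Models of K ⋆Π,Card ϕ, per Definition 4: cardinality-maximal subsets."""
    consistent = [set(s) for r in range(len(pi) + 1)
                  for s in combinations(pi, r)
                  if any(sat_all(m, s) for m in phi_models)]
    best = max(len(s) for s in consistent)
    return {m for m in phi_models
            if any(sat_all(m, s) for s in consistent if len(s) == best)}
```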
3.1 Properties of Our Operators
Revision operators are often judged based on whether they satisfy the well-known AGM postulates [2]. These postulates are formulated for logically closed sets of formulae (belief sets), but they can be modified so as to apply to belief bases. The modified postulates (omitted for lack of space) are known as the KM postulates [7]. Our first operator satisfies the first five KM postulates but fails to satisfy the last one.

Proposition 6. ⋆_Π satisfies KM1-KM5, but falsifies KM6.

This proposition is not surprising since Katsuno and Mendelzon showed in [7] that KM6 ensures that the faithful assignment corresponding to the revision operator is a total pre-order.5 As our prime implicate-based full meet operator uses inclusion to compare subsets of prime implicates, it induces a partial and not a total pre-order over the set of interpretations. Katsuno and Mendelzon argued however in [7] that requiring the faithful assignment to be total may be too strong in practice, and they proposed to replace KM6 with weaker postulates KM7 and KM8. Since they are less well-known, we recall them here:

KM7 If K ⋆ ϕ1 |= ϕ2 and K ⋆ ϕ2 |= ϕ1 then K ⋆ ϕ1 ≡ K ⋆ ϕ2.
5 A faithful assignment maps a belief base K to a pre-order ≤K over the set of all interpretations of the language.
KM8 (K ⋆ ϕ1) ∧ (K ⋆ ϕ2) |= K ⋆ (ϕ1 ∨ ϕ2).

We show that both of these postulates are satisfied by our operator.

Proposition 7. ⋆_Π satisfies KM7 and KM8.

Our cardinality-based operator satisfies all KM postulates.

Proposition 8. ⋆_{Π,Card} satisfies KM1-KM6.

The AGM/KM postulates have been criticized for admitting revision operators that discard beliefs that have no real connection with the incoming information. For instance, there are AGM/KM operators for which (a ∧ b) ⋆ ¬a ⊭ b, even though intuitively we expect b to survive the revision. In an attempt to remedy this, Parikh [10] proposed an additional postulate which can be formulated as follows:

Relevance If K is satisfiable, K |= ϕ, and K ⋆ ψ ⊭ ϕ, then there is some set of atoms A in the finest splitting of K such that both V0(ϕ) ∩ A ≠ ∅ and V0(ψ) ∩ A ≠ ∅.

We can show that our revision operators satisfy this postulate:

Proposition 9. ⋆_Π and ⋆_{Π,Card} satisfy Relevance.
3.2 Comparison With Other Operators
The following proposition concerns the relation between our operators and the model-based operators mentioned in the introduction.

Proposition 10.
1. Our operators sometimes yield logically stronger revised bases than the Dalal, Weber, and Satoh operators.
2. Our revision operators sometimes yield logically weaker revised bases than the Dalal and Satoh operators.

Proof. For (1), consider Example 2. For (2), consider Example 3.
REFERENCES

[1] P. Gärdenfors, Knowledge in Flux: Modeling the Dynamics of Epistemic States, MIT Press, 1988.
[2] C. Alchourrón, P. Gärdenfors, and D. Makinson, 'On the logic of theory change: Partial meet contraction and revision functions', Journal of Symbolic Logic, 50(2), 510–530, (1985).
[3] M. Dalal, 'Investigations into a theory of knowledge base revision', in Proceedings of the Seventh National Conference on Artificial Intelligence, pp. 475–479, (1988).
[4] R. Fagin, J. Ullman, and M. Vardi, 'On the semantics of updates in databases', in Proceedings of the Second ACM SIGACT-SIGMOD Symposium on Principles of Database Systems (PODS-83), pp. 352–365, (1983).
[5] M. L. Ginsberg, 'Counterfactuals', Artificial Intelligence, 30(1), 35–79, (1986).
[6] A. Grove, 'Two modelings for theory change', Journal of Philosophical Logic, (1988).
[7] H. Katsuno and A. Mendelzon, 'Propositional knowledge base revision and minimal change', Artificial Intelligence, 52(3), 263–294, (1991).
[8] P. Marquis, Handbook on Defeasible Reasoning and Uncertainty Management Systems, volume 5, chapter Consequence Finding Algorithms, 41–145, Kluwer, 2000.
[9] B. Nebel, Handbook of Defeasible Reasoning and Uncertainty Management Systems, Volume 3: Belief Change, chapter How Hard is it to Revise a Belief Base?, Kluwer, 1998.
[10] R. Parikh, Logic, Language, and Computation, volume 2, chapter Beliefs, belief revision, and splitting languages, CSLI Publications, 1999.
[11] K. Satoh, 'Nonmonotonic reasoning by minimal belief revision', in Proceedings of the International Conference on Fifth Generation Computer Systems (FGCS-88), pp. 455–462, (1988).
[12] A. Weber, 'Updating propositional formulas', in Proceedings of the First Conference on Expert Database Systems, pp. 487–500, (1986).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-743
Approximate structure preserving semantic matching
Fausto Giunchiglia 1, Mikalai Yatskevich 1, Fiona McNeill 2, Pavel Shvaiko 1, Juan Pane 1 and Paolo Besana 2

Abstract. Typical ontology matching applications, such as ontology integration, focus on the computation of correspondences holding between the nodes of two graph-like structures, e.g., between concepts in two ontologies. However, there are applications, such as web service integration, where we may need to establish whether full graph structures correspond to one another globally, preserving certain structural properties of the graphs being considered. The goal of this paper is to provide a new matching operation, called structure preserving matching. This operation takes two graph-like structures and produces a set of correspondences between those nodes of the graphs that correspond semantically to one another, (i) still preserving a set of structural properties of the graphs being matched, and (ii) only if the graphs are globally similar to one another. We present a novel approximate structure preserving matching approach that implements this operation. It is based on a formal theory of abstraction and on a tree edit distance measure. We have evaluated our solution with encouraging results.
1 INTRODUCTION
Many varied solutions of matching have been proposed so far [1].3 In this paper we focus on a particular type of matching, namely structure preserving matching. Similarly to conventional ontology matching, structure preserving matching finds correspondences between semantically related nodes of the graphs. Differently from it, it preserves a set of structural properties (e.g., the vertical ordering of nodes) and establishes whether two graphs are globally similar. These characteristics of matching are required in web service integration applications, see, e.g., [5]. Let us consider an example of approximate structure preserving matching between two web services: get wine(Region, Country, Color, Price, Number of bottles) and get wine(Region(Country, Area), Colour, Cost, Year, Quantity), see Figure 1. In this case the first web service description requires the fourth argument of the get wine function (Color) to be matched to the second argument (Colour) of the get wine function in the second description. Also, Region in T2 is defined as a function with two arguments (Country and Area), while in T1, Region is an argument of get wine. Thus, Region in T1 must be passed to T2 as the value of the Area argument of the Region function. Moreover, Year in T2 has no corresponding term in T1. Notice that detecting these correspondences would not have been possible in the case of exact matching, by its definition. In order to guarantee a successful web service integration, we are only interested in the correspondences holding among the nodes of the trees underlying the given web services in the case when the web
1 University of Trento, Italy, email: {fausto,yatskevi,pavel,pane}@dit.unitn.it
2 University of Edinburgh, Scotland, email: {f.j.mcneill,p.besana}@ed.ac.uk
3 See http://www.ontologymatching.org for complete information on the topic.
[Figure 1: Two approximately matched web services represented as trees — T1: get wine(Region, Country, Price, Color, Number of bottles); T2: get wine(Region(Country, Area), Colour, Cost, Year, Quantity). Functions are in rectangles with rounded corners; they are connected to their arguments by dashed lines. Node correspondences are indicated by arrows.]
services themselves are similar enough. At the same time, the correspondences have to preserve two structural properties of the descriptions being matched: (i) functions have to be matched to functions and (ii) variables to variables. Thus, for example, Region in T1 is not linked to Region in T2. Finally, let us suppose that the correspondences in the example of Figure 1 are aggregated into a single similarity measure between the trees under consideration, e.g., 0.62. If this global similarity measure is higher than an empirically established threshold (e.g., 0.5), the web services under scrutiny are considered to be similar enough, and the set of correspondences shown in Figure 1 is further used for the actual web service integration.
2 THE APPROACH
The matching process is organized in two steps: (i) node matching and (ii) tree matching. Node matching solves the semantic heterogeneity problem by considering only labels at nodes and contextual information of the trees. We use here the S-Match system [4]. Technically, two nodes n1 ∈ T1 and n2 ∈ T2 match iff c@n1 R c@n2 holds, where c@n1 and c@n2 are the concepts at nodes n1 and n2, and R ∈ {=, ⊑, ⊒}. In semantic matching [2] as implemented in the S-Match system [4], the key idea is that the relations, e.g., equivalence and subsumption, between nodes are determined by (i) expressing the entities of the ontologies as logical formulas and (ii) reducing the matching problem to a logical validity problem. Specifically, the entities are translated into logical formulas which explicitly express the concept descriptions as encoded in the ontology structure and in external resources, such as WordNet. This allows for a translation of the matching problem into a logical validity problem, which can then be efficiently resolved using sound and complete state-of-the-art satisfiability solvers. Notice that the result of this stage is a set of one-to-many correspondences holding between the nodes of the trees. For example, initially Region in T1 is matched to both Region and Area in T2. Tree matching exploits the results of the node matching and the structure of the trees to find if these globally match each other as
follows.

Matching via abstraction. Given the correspondences produced by the node matching and based on the work in [3], the following abstraction operations are used in order to select only those correspondences that preserve the desired properties, namely that functions are matched to functions and variables to variables:

Predicate: Two or more predicates are merged, typically to the least general generalization in the predicate type hierarchy, e.g., Bottle(X) + Container(X) → Container(X). We call Container(X) a predicate abstraction of Bottle(X), written Container(X) ≽Pd Bottle(X). Conversely, we call Bottle(X) a predicate refinement of Container(X), written Bottle(X) ⪯Pd Container(X).

Domain: Two or more terms are merged, typically by moving the functions or constants to the least general generalization in the domain type hierarchy, e.g., Acura + Nissan → Nissan. Similarly to the previous item, we call Nissan a domain abstraction of Acura, written Nissan ≽D Acura.

Propositional: One or more arguments are dropped, e.g., Bottle(A) → Bottle. We call Bottle a propositional abstraction of Bottle(A), written Bottle ≽P Bottle(A).

For example, predicate and domain abstraction/refinement operations do not convert a function into a variable. Therefore, the one-to-many correspondences returned by the node matching should be further filtered based on the allowed abstraction/refinement operations. For instance, the correspondence that binds Region in T1 and Region in T2 should be discarded, while the correspondence that binds Region in T1 and Area in T2 should be preserved.

Tree edit distance via abstraction operations. We look for a composition of the abstraction/refinement operations allowed for the given relation R that is necessary to convert one tree into the other. We represent abstraction/refinement operations as tree edit distance operations applied to the term trees. The tree edit distance problem involves three operations: (i) vertex deletion (υ → λ), (ii) vertex insertion (λ → υ), and (iii) vertex replacement (υ → ω) [6]. Our proposal is to restrict the formulation of the tree edit distance problem in order to reflect the semantics of first-order terms. In particular, we redefine the tree edit distance operations in a way that allows them to have a one-to-one correspondence to the abstraction/refinement operations, see Table 1.

Global similarity between trees. Since we compute the composition of the abstraction/refinement operations that are necessary to convert one term tree into the other, we are interested in the minimal cost of this composition. The global similarity between two trees is computed as shown in Eq. 1, where S stands for the set of the allowed tree edit operations, ki stands for the number of operations of type i necessary to convert one tree into the other, and Costi defines the cost of operation type i, see Table 1:

TreeSim(T1, T2) = 1 − min(Σ_{i∈S} ki × Costi) / max(sizeof(T1), sizeof(T2))   (1)
Table 1: The correspondence between abstraction operations, tree edit operations and costs.

| Abstraction operation | Tree edit operation | Precondition | Cost(T1 = T2) | Cost(T1 ⪯ T2) | Cost(T1 ≽ T2) |
| t1 ≽Pd t2 | a → b | a ≽ b; a and b correspond to predicates | 1 | ∞ | 1 |
| t1 ≽D t2 | a → b | a ≽ b; a and b correspond to functions or constants | 1 | ∞ | 1 |
| t1 ≽P t2 | a → λ | a corresponds to predicates, functions or constants | 1 | ∞ | 1 |
| t1 ⪯Pd t2 | a → b | a ⪯ b; a and b correspond to predicates | 1 | 1 | ∞ |
| t1 ⪯D t2 | a → b | a ⪯ b; a and b correspond to functions or constants | 1 | 1 | ∞ |
| t1 ⪯P t2 | a → λ | a corresponds to predicates, functions or constants | 1 | 1 | ∞ |
| t1 = t2 | a = b | a = b; a and b correspond to predicates, functions or constants | 0 | 0 | 0 |
The highest value of TreeSim computed for Cost(T1 = T2), Cost(T1 ⪯ T2) and Cost(T1 ≽ T2) is selected as the one ultimately returned. In the case of the example of Figure 1, when we match T1 with T2, TreeSim is 0.62 for both Cost(T1 = T2) and Cost(T1 ⪯ T2).
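As a sketch of Eq. (1), assuming the minimal-cost composition of operations has already been found, the aggregation itself is a one-liner (Python; the function name, input format, and the example's tree sizes and operation counts are ours, chosen only for illustration):

```python
def tree_sim(op_counts, size_t1, size_t2):
    """Eq. (1): 1 minus the minimal total edit cost, normalized by the
    larger tree size; op_counts maps operation type -> (k_i, Cost_i)."""
    total_cost = sum(k * cost for k, cost in op_counts.values())
    return 1 - total_cost / max(size_t1, size_t2)

# e.g., five replacements of cost 1 between trees of 13 and 12 nodes:
print(round(tree_sim({"replace": (5, 1)}, 13, 12), 2))  # 0.62
```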
3 EVALUATION
We have evaluated our approach on different versions of the SUMO and AKT ontologies.4 These are both first-order ontologies, out of which 132 pairs of trees (first-order logic terms) were used. The matching quality results are shown in Figure 2. Note that F-measure values exceed 70% for the given range of cut-off thresholds. The average execution time per matching task on a standard laptop was 93 ms.
Figure 2: Evaluation results.
4 CONCLUSIONS
We have presented an approximate structure preserving semantic matching approach that implements the structure preserving matching operation. It is based on a theory of abstraction and a tree edit distance. We have evaluated our solution with encouraging results. Future work includes conducting extensive and comparative testing in real-world scenarios.

Acknowledgements. We appreciate support from the OpenKnowledge European STREP (FP6-027253).
REFERENCES

[1] J. Euzenat and P. Shvaiko, Ontology Matching, Springer, 2007.
[2] F. Giunchiglia and P. Shvaiko, 'Semantic matching', The Knowledge Engineering Review, 18(3), (2003).
[3] F. Giunchiglia and T. Walsh, 'A theory of abstraction', Artificial Intelligence, 57(2-3), (1992).
[4] F. Giunchiglia, M. Yatskevich, and P. Shvaiko, 'Semantic matching: Algorithms and implementation', Journal on Data Semantics, IX, (2007).
[5] M. Klusch, B. Fries, and K. Sycara, 'Automated semantic web service discovery with OWLS-MX', in Proceedings of AAMAS, (2006).
[6] K.-C. Tai, 'The tree-to-tree correction problem', Journal of the ACM, 26(3), (1979).
4 See http://dream.inf.ed.ac.uk/projects/dor/ for full versions of these ontologies and an analysis of their differences.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-745
Discovering Temporal Knowledge from a Crisscross of Timed Observations
Nabil Benayadi and Marc Le Goc 1

Abstract. This paper is concerned with discovering temporal knowledge from a sequence of timed observations provided by a system monitoring a dynamic process. The discovery process is based on the Stochastic Approach framework, where a series of timed observations is represented with a Markov chain. From this representation, a set of timed sequential binary relations between discrete event classes is discovered with abductive reasoning and represented as abstract chronicle models. To reduce the search space as close as possible to the potential relations between the process variables, we propose to characterize a set of series of timed observations with a unique measure of the homogeneity of the crisscross of class occurrences, and to use this measure to prune abstract chronicle models.

1 LSIS, University Aix-Marseille III, France. e-mail: {nabil.benayadi, marc.legoc}@lsis.org
1 INTRODUCTION
When supervising and monitoring dynamic processes, very large amounts of timed messages (alarms or simple records) are generated and collected in databases. Mining these databases allows one to discover the underlying relations between the variables that govern the dynamics of the process. This paper addresses this problem in the framework of the Stochastic Approach [2], where a timed message is considered as a timed observation that is represented with an occurrence of a discrete event class Ci = {(xi, δi)} linking a variable xi and a constant δi. The BJT4T algorithm represents a set of sequences Ω of discrete event class occurrences with a first-order Markov chain and uses abductive reasoning to identify the set of the most probable timed sequential binary relations between classes. A timed sequential binary relation R(Ci, Cj, [τ−ij, τ+ij]) is an oriented relation Ci → Cj between two classes Ci and Cj that is time-constrained with the interval [τ−ij, τ+ij]. A set M = {(Ci → Cj)} of timed sequential binary relations constitutes an abstract chronicle model, which is used by the BJT4S algorithm (BJT for Signatures) to look for the n-ary relations in Ω. Since the search space is generally very large, measures of the "interestingness" of a timed relation are required to focus on the minimal set of hypotheses. To this aim, we define a measure, called the BJ-measure, of the homogeneity of the crisscross (i.e. interlacing) of series of timed observations, a temporal version of the J-measure of [3].
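As a minimal illustration of the counting that this representation requires (a hypothetical sketch; the actual BJT4T algorithm builds a full first-order Markov chain), the occurrence numbers N(Ci) and N(Ci, Co) used in the next section can be tallied from consecutive class occurrences of a sequence:

```python
from collections import Counter

def occurrence_counts(omega):
    """Tally N(Ci) and, over consecutive occurrences, N(Ci, Co)."""
    singles = Counter(omega)
    pairs = Counter(zip(omega, omega[1:]))
    return singles, pairs

singles, pairs = occurrence_counts(["C1", "C2", "C1", "C2", "C3"])
print(singles["C1"], pairs[("C1", "C2")])  # 2 2
```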
2 The BJ-Measure
Let Ω be a sequence of |Ω| occurrences of a set of classes Ci ∈ Cω, let Ci and Co be two classes in Cω, and let N(Ci), N(Co) and N(Ci, Co) be respectively the number of occurrences in Ω of the classes Ci and Co and of the couple (Ci, Co). According to the memoryless property of a Markov chain, a timed sequential binary relation Ci → Co is associated with a discrete memoryless channel [1] that links the values of two random binary variables X = {Ci, ¬Ci} and Y = {Co, ¬Co}, where ¬Ci ≡ Cω − {Ci} and ¬Co ≡ Cω − {Co}, so that: p(Ci) = N(Ci)/|Ω|, p(Co) = N(Co)/|Ω|, p(Co|Ci) = N(Ci, Co)/N(Ci), and p(¬Co|Ci) = 1 − p(Co|Ci). The "j" function of the J-measure can be adapted to define a BJL-measure that evaluates the homogeneity of the crisscross towards the future (i.e. from the Ci class to the Co class) and a BJW-measure that evaluates the homogeneity of the crisscross towards the past (i.e. from the Co class to the Ci class). These two measures will then be combined to define the BJ-measure of a timed sequential binary relation Ci → Co.

Definition 1 Considering a timed sequential binary relation Ci → Co such that p(Co|Ci) > p(Co), the BJL(Ci → Co) measure is given by the following formula:

BJL(Ci → Co) = p(Co|Ci) × log2(p(Co|Ci) / p(Co)) + ((1 − p(Co|Ci)) / (|Cω| − 1)) × log2((1 − p(Co|Ci)) / (1 − p(Co)))   (1)

where |Cω| is the number of event classes in ω. The BJL-measure has the following properties:
• if p(Co|Ci) ≤ p(Co) then BJL(Ci → Co) = 0;
• if the sequence ω consists only of occurrences of the two classes Ci and Co (|Cω| = 2), BJL(Ci → Co) behaves like the j-measure;
• for p(Co|Ci) > p(Co), BJL(Ci → Co) increases when N(Cω) increases;
• for p(Co|Ci) = 1, BJL(Ci → Co) is maximal (= log2(1/p(Co))).

Definition 2 Considering a timed sequential binary relation Ci → Co such that p(Ci|Co) > p(Ci), the BJW(Ci → Co) measure is given by the following formula:

BJW(Ci → Co) = p(Ci|Co) × log2(p(Ci|Co) / p(Ci)) + ((1 − p(Ci|Co)) / (|Cω| − 1)) × log2((1 − p(Ci|Co)) / (1 − p(Ci)))   (2)

A noticeable property is that BJW(Ci → Co) is null at the same point as BJL(Ci → Co). This property of symmetry is a consequence of Bayes' rule: p(Co|Ci)/p(Co) = p(Ci|Co)/p(Ci).

Figure 1 shows the BJW(Ci → Co) (abscissa) and the corresponding BJL(Ci → Co) (ordinate) for different ratios θ = N(Ci)/N(Co). When the numbers of occurrences of the classes Ci and Co are equal (i.e. θ = 1), BJL(Ci → Co) = BJW(Ci → Co) and the corresponding curve is the diagonal. The maximum point of the diagonal corresponds to a perfectly homogeneous crisscross of occurrences with N(Ci, Co) = N(Ci) = N(Co): each occurrence of the Ci class is followed by an occurrence of the Co class, and each occurrence of the Co class is preceded by an occurrence of the Ci class. The minimum point of the diagonal (i.e. the origin) corresponds to BJL(Ci → Co) = BJW(Ci → Co) = 0: the occurrences of the Ci and the Co classes are not interlaced. Note also that the curves of Figure 1 corresponding to θ and 1/θ are symmetric with respect to the diagonal.

[Figure 1: BJL (ordinate) versus BJW (abscissa) for different ratios θ = N(Ci)/N(Co).]

The BJ-measure aims to provide a general means to evaluate and to represent the homogeneity of the crisscross of any series of class occurrences.

Definition 3 The BJ-measure of a timed sequential binary relation Ci → Co is the norm of the vector (BJL(Ci → Co), BJW(Ci → Co)):

BJM(Ci → Co) = √(BJL(Ci → Co)² + BJW(Ci → Co)²)   (3)

The BJ-measure depends on the ratio θ = N(Ci)/N(Co), which makes the comparison between two crisscrosses difficult. This is the aim of the α(Ci → Co) function.

Definition 4 The α(Ci → Co) function provides the value, projected into the interval [0.5, 1], that BJM(Ci → Co) would take if θ = N(Ci)/N(Co) were equal to 1:

α(Ci → Co) = BJM(Ci → Co) / (2 × max(BJM(Ci → Co))) + 0.5   (4)

where max(BJM(Ci → Co)) is the maximal value of the BJ-measure for a given θ (i.e. when N(Ci, Co) = min(N(Ci), N(Co)) for any N(Ci) and N(Co)). The α(Ci → Co) function is illustrated by the red squares along the diagonal of Figure 1 when N(Ci) = N(Co) = 100:
• α(Ci → Co) = 1 when each of the 100 occurrences of the class Ci is followed by one and only one of the 100 occurrences of the class Co, and inversely (perfect crisscross);
• α(Ci → Co) = 0.99 when 99 of the 100 occurrences of the class Ci are followed by one of the 100 occurrences of the class Co;
• α(Ci → Co) = 0.75 when 75 of the 100 occurrences of the class Ci are followed by one of the 100 occurrences of the class Co;
• α(Ci → Co) = 0.5 when 50 of the 100 occurrences of the class Ci are followed by one of the 100 occurrences of the class Co.

The α function thus provides a simple means to interpret the BJ-measure of a crisscross of a series of timed observations.
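Definitions 1-4 transcribe directly into code. The following is a sketch (function and argument names are ours); max(BJM) is obtained by plugging in N(Ci, Co) = min(N(Ci), N(Co)), as stated after Definition 4, and the second term of Eq. (1) is dropped at p(Co|Ci) = 1, where its limit is zero.

```python
from math import log2, sqrt

def bjl(n_i, n_o, n_io, n_seq, n_classes):
    """BJL(Ci -> Co), Eq. (1); zero when p(Co|Ci) <= p(Co)."""
    p_o = n_o / n_seq          # p(Co)
    p_oi = n_io / n_i          # p(Co|Ci)
    if p_oi <= p_o:
        return 0.0
    value = p_oi * log2(p_oi / p_o)
    if p_oi < 1.0:             # second term vanishes in the limit p(Co|Ci)=1
        value += ((1 - p_oi) / (n_classes - 1)) * log2((1 - p_oi) / (1 - p_o))
    return value

def bjw(n_i, n_o, n_io, n_seq, n_classes):
    """BJW(Ci -> Co), Eq. (2): BJL with the roles of Ci and Co swapped."""
    return bjl(n_o, n_i, n_io, n_seq, n_classes)

def bjm(n_i, n_o, n_io, n_seq, n_classes):
    """BJ-measure, Eq. (3): norm of the (BJL, BJW) vector."""
    return sqrt(bjl(n_i, n_o, n_io, n_seq, n_classes) ** 2 +
                bjw(n_i, n_o, n_io, n_seq, n_classes) ** 2)

def alpha(n_i, n_o, n_io, n_seq, n_classes):
    """Alpha, Eq. (4): BJM rescaled into [0.5, 1] for the given theta."""
    best = bjm(n_i, n_o, min(n_i, n_o), n_seq, n_classes)
    return bjm(n_i, n_o, n_io, n_seq, n_classes) / (2 * best) + 0.5
```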
3 Application to the SACHEM system

Our approach has been applied to sequences generated by the SACHEM knowledge-based system, developed at the end of the 20th century to help operators monitor, diagnose and control the blast furnace [2]. We are interested in the omega variable, which reveals the quality of the management of the whole blast furnace. The studied sequence contains 7682 occurrences of 45 discrete event classes of the SACHEM system at Fos-Sur-Mer (France), from 08/01/2001 to 31/12/2001. For the 1463 class linked to the omega variable, the BJT4T algorithm provides a chronicle model with 20^5 = 3,200,000 sequential binary relations. Applying the BJ-measure to prune this tree, the BJT4P algorithm produces a tree with 195 nodes (the pruning method is given in [2]). The reduction factor is greater than 16,000, and the pruned tree can then be used by the BJT4S algorithm to look for the set of n-ary relations observed in the sequence. When substituting each class with the corresponding variable, this set becomes graph (b) of Figure 2. The only difference with the expert's knowledge formulated in 1995 (graph a) is the direction of the relation between the variables FT and BD. This result shows that the branches with a high BJ-measure have a strong potential to reveal knowledge about the relations between the variables of a process. Note that the same result is observed with the Apache system, a clone of SACHEM designed to monitor and diagnose a galvanization bath.

[Figure 2: The expert's relations of 1995 (a) and the discovered relations of 2007 (b) between the variables TGS, FT, SS, BD and ω.]
REFERENCES

[1] C. E. Shannon, 'A mathematical theory of communication', Bell System Technical Journal, 27, 379–423, (1948); reprinted in C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, (1949).
[2] M. Le Goc and N. Benayadi, 'Discovering experts' knowledge from sequences of discrete event class occurrences', in Proceedings of the 10th International Conference on Enterprise Information Systems (ICEIS 2008), (June 12-16, 2008).
[3] P. Smyth and R. M. Goodman, 'An information theoretic approach to rule induction from databases', IEEE Transactions on Knowledge and Data Engineering, 4, 301–316, (1992).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-747
Fred meets Tweety
Antonis Kakas 1, Loizos Michael 2 and Rob Miller 3

Abstract. We propose a framework that brings together two major forms of default reasoning in Artificial Intelligence: applying default property classification rules in static domains, and default persistence of properties in temporal domains. Particular attention is paid to the central problem of qualification. We illustrate how previous semantics developed independently for the two separate forms of default reasoning naturally lead to the integration that we propose, and how this gives rise to domains where different types of knowledge interact and qualify each other while preserving elaboration tolerance.
1 Introduction
Tweety is watching as we load the gun, wait, and then shoot Fred. Should we conclude that Tweety will fly away as birds normally do when they hear a loud noise as that normally produced by shooting a loaded gun? It depends on whether Tweety can fly! This belief, in turn, depends on whether Tweety is only known to be a bird, or also known to be a penguin. What can we conclude about Fred if after the act of shooting Fred we observe that Tweety is still on the ground? In this problem of “Fred meets Tweety” we need to bring together two major forms of default reasoning that have been extensively studied on their own in A.I., but have rarely been addressed in the same formalism. These are default property classification as applied to inheritance systems [4, 10], and default persistence central to temporal reasoning in theories of Reasoning about Action and Change (RAC) [3, 9, 11]. How can a formalism synthesize the reasoning encompassed within each of these two forms of default reasoning? Central to these two (and indeed all) forms of default reasoning is the qualification problem: default conclusions are qualified by information that can block the application of the default inference. Recent work has shown the importance for RAC theories to properly account for different forms of qualification [5, 12]. In our problem of integrating the default reasoning of property classification into RAC, we study how a static default theory expressing known default relationships between fluents can endogenously qualify the reasoning about actions and change, so that the application of causal laws and default persistence is properly adjusted by this static theory.
2 Knowledge Qualification
One of the first knowledge qualification problems formally studied in A.I. relates to the Frame Problem (see, e.g., [11]) of how the causal change properly qualifies the default persistence. In the archetypical Yale Shooting Problem domain [3], a turkey named Fred is initially alive, and one asks whether it is still alive after loading a gun, waiting, and then shooting Fred. The lapse of time cannot cause the gun
1 University of Cyprus, P. O. Box 20537, CY-1678, Cyprus. e-mail: antonis@ucy.ac.cy
2 Harvard University, Cambridge, MA 02138, U.S.A. e-mail: loizos@eecs.harvard.edu
3 University College London, London WC1E 6BT, U.K. e-mail: rsm@ucl.ac.uk
to become unloaded. Default persistence is qualified only by known events and known causal laws linked to these events. The consideration of indirect action effects gave rise to the Ramification Problem (see, e.g., [7]) of how these effects are generated and qualify persistence. Static knowledge expressing domain constraints was introduced to encode such indirect action effects. In early solutions to the Ramification Problem a direct action effect would cause this static knowledge to be violated, unless a minimal set of indirect action effects were also assumed so as to maintain consistency [7, 8]. Thus, given the static knowledge that "dead birds do not walk", the shooting action causing Fred to be dead would also indirectly cause Fred to stop walking, qualifying the persistence of the latter property. Subsequent work examined default causal knowledge, bringing into focus the Qualification Problem (see, e.g., [12]; this is not to be confused with the broader sense of the term qualification that we use) of how such default causal knowledge is qualified by domain constraints. In some solutions to the Qualification Problem, the role of static knowledge within the domain description was identified as that of endogenously qualifying causal knowledge, as opposed to aiding causal knowledge in qualifying persistence [5]. Observations after action occurrences also qualify causal change when the two are in conflict, a problem known as the Exogenous Qualification Problem (see, e.g., [5]). Independently of the above, another qualification problem was examined in the context of Default Static Theories [10], which consider how observed facts qualify default static knowledge. In the typical domain one asks whether Tweety is able to fly, when it is only known to be a bird. In the absence of any explicit information on whether Tweety is able to fly, the theory predicts that it is, but retracts this prediction once the extra fact that Tweety is a penguin is added. In this paper we investigate temporal domains that incorporate (possibly) default static theories. The technical challenge lies in understanding how the four types of knowledge in a domain, three of which may now be default, interact and qualify each other. To illustrate some of these interactions we employ the syntax of the action description language ME [5]. Strict static knowledge is encoded in propositional logic. Default static knowledge is encoded in terms of default rules of the form "φ ⇝ ψ", where φ, ψ are propositional formulas; an informal reading of such default rules suffices for this section. Formulas with variables are used as a shorthand notation for the set of all of their groundings over a finite domain of constants.

ClapHands causes Noise
Noise causes Fly(x)
Noise causes ¬Noise
Penguin(Tweety) holds-at 1
ClapHands occurs-at 3
ClapHands occurs-at 7
static theory:
(1) Penguin(x) ⇝ ¬CanFly(x)
(2) Penguin(x) → Bird(x)
(3) Bird(x) ⇝ CanFly(x)   (rule (1) overrides rule (3))
(4) ¬CanFly(x) → ¬Fly(x)
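To make the interplay concrete, here is a toy rendering of the static theory above in Python. The names and the qualification policy are our own simplification (not the semantics of Section 3); it shows how rule (1) overrides rule (3), and how an observation of Fly exogenously qualifies rule (1) itself.

```python
def static_extension(facts):
    """facts: set of literals such as {"Penguin"} or {"Bird", "Fly"}."""
    s = set(facts)
    if "Penguin" in s:
        s.add("Bird")                          # strict rule (2)
    # default rule (1): Penguin ~> not CanFly, unless an observation of
    # Fly qualifies the default (exogenous qualification):
    if "Penguin" in s and "Fly" not in s:
        s.add("~CanFly")
    # default rule (3): Bird ~> CanFly, overridden by rule (1):
    elif "Bird" in s:
        s.add("CanFly")
    # strict rule (4): ~CanFly -> ~Fly, which is what blocks the causal
    # generation of Fly by the ClapHands action in the temporal domain:
    if "~CanFly" in s:
        s.add("~Fly")
    return s

print(static_extension({"Bird"}))            # adds 'CanFly'
print(static_extension({"Penguin"}))         # adds '~CanFly' and '~Fly'
print(static_extension({"Penguin", "Fly"}))  # observation qualifies rule (1)
```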
The default persistence of “Penguin(Tweety) holds-at 1” implies, through the static theory, that “¬CanFly(Tweety)” holds everywhere. This, then, qualifies the causal generation of “Fly(Tweety)” by the 4
action “ClapHands” at time-points 3 and 7. If, on the other hand, the observation “Fly(Tweety) holds-at 5” is added, then the static theory is itself qualified, and no longer qualifies the causal generation of “Fly(Tweety)”. Note, however, that Tweety flies for an exogenous reason. If an action at time-point 6 were to cause Tweety to stop flying, this would release the static theory's default conclusion that penguins do not fly. The action “ClapHands occurs-at 7” would then be qualified and would not cause Tweety to fly again. What would happen if “Noise” was caused at time-point 3 because Fred, a turkey that is initially alive, was shot, and we only knew that Tweety is a bird? Then we would conclude that Fred is dead from time-point 3 onwards, and also that Tweety is flying. If, however, one observes “¬Fly(Tweety) holds-at 4”, then whether Fred is dead depends on why Tweety did not fly after Fred was shot! The observation by itself does not explain why the causal laws that would normally cause Tweety to fly were qualified. An endogenous explanation would be that Tweety is a penguin, and “Fly(Tweety)” is qualified from being caused. An exogenous explanation would be that Tweety could not fly due to exceptional circumstances (e.g., an injury). However, Tweety might not have flown because the shooting action failed to cause a noise, or because it failed altogether. Different conclusions on Fred's status might be appropriate depending on the explanation.
3
Formal Semantics of Integration
Four different types of information present in a framework of RAC interact and qualify each other: (i) information generated by default persistence, (ii) action laws that qualify default persistence, (iii) static default laws of fluent relationships that can qualify these action laws, and (iv) observations that can qualify any of these. This hierarchy of information comes full circle, as the bottom layer of default persistence of observations (which carry the primary role of qualification) can also qualify the static theory. Due to the cyclical nature of the qualifications, we develop the formal semantics in two steps. For the temporal semantics we follow the semantics of ME [5], which accounts for the qualification of causal knowledge by a given strict static theory. Causal knowledge in ME is qualified so as to ensure that the static theory is never violated at the observable time scale. We extend that semantics by proposing that the qualification comes from an external set α(T) of admissible states that might depend on the time-point T. Thus, we end up with a semantics that, given an externally provided admissibility requirement α, computes the temporal evolution of states so as to ensure that the state of the world at time-point T always lies within the set of admissible states α(T). The details of the temporal semantics of ME are largely orthogonal to the next step of determining how α is computed. An externally qualified model of a domain description D given an admissibility requirement α is any mapping of time-points to states such that (1) the world is initially in an admissible state; (2) it changes in an admissible manner; and it holds that (3.i) literals not caused to change persist, and (3.ii) caused change is realized. The admissibility requirement is determined by the static theory after being qualified by the combined effect of observations and persistence. We model this effect by considering virtual extensions of a domain D that contain additional virtual observations. Virtual observations are not meant to capture abnormal situations, but rather persistence of known observations from other time-points. The minimization of virtual observations that we impose later guarantees that known observations persist only as needed to achieve this effect. At every time-point T, we consider the static theory and the observations (including virtual ones) at T. The extensions of this default theory determine a particular set of admissible states α(T).
An internally qualified model of a domain description D is an externally qualified model of D given this admissibility requirement α. Given a domain description D, we consider its virtual extensions that have internally qualified models. Among those, we choose the ones with a minimal set of virtual observations. The internally qualified models of these virtual extensions of D are the models of D. Observations in our semantics act as the knowledge that bootstraps reasoning. Since every other type of knowledge is amenable to qualification, a strong elaboration tolerance result can be established. Theorem 1 (Elaboration Tolerance Theorem) Let D be a consistent domain, D′ a domain with no observations, and D ∪ D′ their union, where the static theories of D and D′ are merged together to form the single static theory of D ∪ D′. We assume that the static theory of D ∪ D′ is consistent. Then, D ∪ D′ is a consistent domain.
4
Concluding Remarks
We have presented an integrated formalism for reasoning with both default static and default causal knowledge, two problems that have been extensively studied in isolation from each other. The proposed solution applies to domains where the static knowledge is “stronger” than the causal knowledge, and qualifies excessive change caused by the latter. A more detailed exposition of our developed formalism, including a tentative solution of how to encode causal laws that are “stronger” than the static knowledge, appears in [6]. Our future research agenda includes further investigation of such “strong” causal knowledge, and of how “strong” static knowledge can generate extra (rather than block) causal change. We also plan to develop computational models corresponding to the presented theoretical framework, using, for example, ideas from argumentation. Although we are unaware of any previous work explicitly introducing Fred to Tweety, much work has been done on the use of default reasoning in inferring causal change. In the context of the Qualification Problem see [2, 12]. For distinguishing between default and non-default causal rules in the context of the Language C+ see [1].
REFERENCES
[1] S. Chintabathina, M. Gelfond, and R. Watson, ‘Defeasible Laws, Parallel Actions, and Reasoning about Resources’, in Proc. of Commonsense’07, pp. 35–40, (2007).
[2] P. Doherty, J. Gustafsson, L. Karlsson, and J. Kvarnström, ‘TAL: Temporal Action Logics Language Specification and Tutorial’, ETAI, 2(3–4), 273–306, (1998).
[3] S. Hanks and D. McDermott, ‘Nonmonotonic Logic and Temporal Projection’, AIJ, 33(3), 379–412, (1987).
[4] J. Horty, R. Thomason, and D. Touretzky, ‘A Skeptical Theory of Inheritance in Nonmonotonic Semantic Networks’, AIJ, 42(2–3), 311–348, (1990).
[5] A. Kakas, L. Michael, and R. Miller, ‘Modular-E: An Elaboration Tolerant Approach to the Ramification and Qualification Problems’, in Proc. of LPNMR’05, pp. 211–226, (2005).
[6] A. Kakas, L. Michael, and R. Miller, ‘Fred meets Tweety’, in Proc. of CogRob’08, (2008).
[7] F. Lin, ‘Embracing Causality in Specifying the Indirect Effects of Actions’, in Proc. of IJCAI’95, pp. 1985–1991, (1995).
[8] F. Lin and R. Reiter, ‘State Constraints Revisited’, J. of Logic and Comp., 4(5), 655–678, (1994).
[9] J. McCarthy and P. Hayes, ‘Some Philosophical Problems from the Standpoint of Artificial Intelligence’, Mach. Intel., 4, 463–502, (1969).
[10] R. Reiter, ‘A Logic for Default Reasoning’, AIJ, 13(1–2), 81–132, (1980).
[11] M. Shanahan, Solving the Frame Problem: A Mathematical Investigation of the Common Sense Law of Inertia, MIT Press, 1997.
[12] M. Thielscher, ‘The Qualification Problem: A Solution to the Problem of Anomalous Models’, AIJ, 131(1–2), 1–37, (2001).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-749
Definability in Logic and Rough Set Theory 1 Tuan-Fang Fan2 and Churn-Jung Liau3 and Duen-Ren Liu4 Abstract. Rough set theory is an effective tool for data mining. According to the theory, a concept is definable if it can be written as a Boolean combination of equivalence classes induced from classification attributes. On the other hand, definability in logic has been explicated by Beth’s theorem. In this paper, we propose two data representation formalisms, called first-order data logic (FODL) and attribute value-sorted logic (AVSL), respectively. Based on these logics, we explore the relationship between logical definability and rough set definability.
1 This work was partially supported by NSC (Taiwan) under grant 95-2221-E-001-029-MY3.
2 Department of Computer Science and Information Engineering, National Penghu University, Penghu 880, Taiwan, email: dffan@npu.edu.tw, and Institute of Information Management, National Chiao-Tung University, Hsinchu 300, Taiwan, email: tffan.iim92g@nctu.edu.tw
3 Institute of Information Science, Academia Sinica, Taipei 115, Taiwan, email: liaucj@iis.sinica.edu.tw
4 Institute of Information Management, National Chiao-Tung University, Hsinchu 300, Taiwan, email: dliu@iim.nctu.edu.tw
1
Introduction
In recent years, knowledge discovery in databases (KDD) and data mining have received more and more attention because of their practical applications. The rough set theory proposed by Pawlak provides an effective tool for extracting knowledge from data tables [3]. To represent and reason about the extracted knowledge, a decision logic (DL) is also proposed in [3]. The semantics of the logic is defined in a Tarskian style through the notions of models and satisfaction. While DL is an instance of propositional logic, we can also represent knowledge in data tables by using first-order logic (FOL) [2] or many-sorted first-order logic (MSFOL). In this paper, we investigate the definability of concepts in the context of these alternative logical descriptions of data tables. In the next section, we review rough set theory, with the emphasis on the notion of definability. Then, in Sections 3 and 4, we propose first-order data logic and attribute value-sorted logic for the description of data tables respectively, and discuss the relationship between logical definability and rough set definability in the context of these logics. We conclude the paper in Section 5.
2
Rough Set Theory—A Review
The basic construct of rough set theory is an approximation space, which is defined as a pair (U, R), where U is the universe and R ⊆ U × U is an equivalence relation on U. We can write an equivalence class of R as [x]_R if it contains the element x. Note that [x]_R = [y]_R iff (x, y) ∈ R. In philosophy, the extension of a concept is defined as the objects that are instances of the concept. Following this terminology, a subset of the universe is called a concept or a category in rough set theory. Given an approximation space (U, R), each equivalence class of R is called an R-basic category or R-basic concept, and any union of R-basic categories is called an R-category. Now, for an arbitrary concept X ⊆ U, we are interested in the definability of X by using R-basic categories. We say that X is R-definable if X is an R-category; otherwise X is R-undefinable. The R-definable concepts are also called R-exact sets, whereas R-undefinable concepts are said to be R-inexact or R-rough. A rough set can be approximated by two exact sets, called the lower approximation and the upper approximation of X, respectively, defined as follows:
$\underline{R}X = \{x \in U \mid [x]_R \subseteq X\}$,
$\overline{R}X = \{x \in U \mid [x]_R \cap X \neq \emptyset\}$.
Obviously, a set X is R-definable iff $\underline{R}X = \overline{R}X$. In data mining problems, the equivalence relation is determined by the attributes (features) used to classify objects. Two objects are equivalent if they have the same values in every such attribute. Thus, intuitively, a concept is definable in rough set theory if it can be precisely described by such attributes.
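As a quick illustration (our own toy table, not from the paper), the following Python sketch computes the two approximations and the definability test just stated:

```python
def partition(universe, attrs, value):
    """Equivalence classes of indiscernibility: two objects fall in the
    same class iff they agree on every attribute in attrs."""
    classes = {}
    for x in universe:
        key = tuple(value[x][a] for a in attrs)
        classes.setdefault(key, set()).add(x)
    return list(classes.values())

def approximations(universe, attrs, value, X):
    lower, upper = set(), set()
    for c in partition(universe, attrs, value):
        if c <= X:      # [x]_R contained in X
            lower |= c
        if c & X:       # [x]_R meets X
            upper |= c
    return lower, upper

value = {1: {"colour": "red"}, 2: {"colour": "red"},
         3: {"colour": "blue"}, 4: {"colour": "blue"}}
lo, up = approximations({1, 2, 3, 4}, ["colour"], value, X={1, 2, 3})
print(lo, up, lo == up)   # {1, 2} {1, 2, 3, 4} False: X is R-rough
```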
3
Definability in First-order Data Logic
To describe data tables by (a fragment of) FOL, we use an instance of function-free monadic predicate logic, called first-order data logic (FODL). The alphabet (or vocabulary) of FODL consists of a set of constant symbols, a finite set of monadic predicate symbols, a set of variables, Boolean connectives (¬, ∧, ∨, ⊃, ≡), and the quantifiers (∀, ∃). The syntax and semantics of FODL are the same as those of ordinary FOL [2]. Based on FODL, we can formulate the definability of a concept in rough set theory precisely. In the language of FODL, a concept corresponds to a predicate, and the equivalence relation in an approximation space can be determined by a set of predicates. Let S be a subset of predicates. Then the following formula defines an indiscernibility relation (with respect to S):
$\eta_S(x, y) = \bigwedge_{P \in S} (P(x) \equiv P(y))$.
Given an arbitrary predicate P, we can define two formulas corresponding to the lower and upper approximations of P:
$\underline{P}_S(x) = \forall y(\eta_S(x, y) \supset P(y))$,
$\overline{P}_S(x) = \exists y(\eta_S(x, y) \wedge P(y))$.
Let Γ be an FODL theory that contains only predicate symbols in S ∪ {P}. Then we say that P is S-definable with respect to Γ if Γ |= $\forall x(\underline{P}_S(x) \equiv \overline{P}_S(x))$, where |= denotes the semantic consequence relation in FODL.
In classical logic, the definability of a predicate is explicated by the well-known Beth definability theorem [1]. The theorem states that explicit definability is equivalent to implicit definability. Let Γ be an FODL theory that contains only predicate symbols in S ∪ {P}. Then Γ explicitly defines P if there exists a wff ϕ(x) that contains only predicate symbols in S such that
Γ |= ∀x(ϕ(x) ≡ P(x)).
We say that Γ implicitly defines P if for any A, B ∈ Mod(Γ) such that Q^A = Q^B for all Q ∈ S, we have P^A = P^B, where Mod(Γ) is the set of models of Γ. In effect, the implicit definability of a predicate P means the possibility of uniquely characterizing P. The primary objective of this paper is to establish the relationship between logical definability and rough set definability.
Theorem 1 Let Γ be an FODL theory that contains only predicate symbols in S ∪ {P}. Then the explicit (or implicit) definability of P in Γ implies that P is S-definable with respect to Γ.
4
Definability in Attribute Value-sorted Logic
In FODL, a monadic predicate intuitively corresponds to an attribute-value pair. However, in many cases, the number of possible values for an attribute may be infinite. In such infinite-domain cases, an infinite number of predicates must be available in FODL; but since the indiscernibility wff $\eta_S$ can only be defined with respect to a finite subset of predicates S, this is sometimes inadequate. To circumvent such difficulties, we can use many-sorted first-order logic (MSFOL) as the data representation formalism.
4.1
Syntax and semantics
We use a special instance of MSFOL, called attribute value-sorted logic (AVSL), to describe data tables. The set of sorts for AVSL is Σ = {σi | i ∈ I} ∪ {σu}, where I is an index set. The sort σu is called the object sort and each σi is called an attribute value sort. As in the case of FODL, the alphabet (or vocabulary) of AVSL consists of constant symbols, predicate symbols, variables, and logical symbols. The only difference is that, in AVSL, a rank function is used to assign a rank to constant symbols, predicate symbols, and variables. The rank of a constant symbol or a variable is an element of Σ, and the rank of a predicate symbol is in Σ^k if its arity is k. A constant (resp. variable) of rank σu is called an object constant (resp. variable); otherwise, it is called an attribute domain constant (resp. variable). We assume that the set of predicate symbols is the union of a set of monadic predicates and the set of dyadic predicates {Ri | i ∈ I}. For each i ∈ I, Ri is of rank (σu, σi), and called an attribute predicate. Also, a monadic predicate of rank σu is called a concept predicate; and for each i ∈ I, a monadic predicate of rank σi is called a value predicate. Now, a term is either a constant or a variable, and the rank of the term is that of the constant or variable. If P is a predicate of rank (σ1, · · · , σk) and t1, t2, · · · , tk are terms of ranks σ1, σ2, · · · , σk respectively, then P(t1, t2, · · · , tk) is an atomic formula (k = 1, 2). The formation rules for compound wffs are the same as those for ordinary FOL [2].
4.2
Logical definability
Analogous to the case of FODL, we can formulate the definability of a rough concept in AVSL. Let x and y be object variables, v be an attribute domain variable, and S be a subset of the index set I. Then we can define the indiscernibility formula (with respect to S) as:
$\varepsilon_S(x, y) = \bigwedge_{i \in S} \forall v (R_i(x, v) \equiv R_i(y, v))$.
Again, given an arbitrary concept predicate P, we can define two formulas corresponding to its lower and upper approximations:
$\underline{\varepsilon P}_S(x) = \forall y(\varepsilon_S(x, y) \supset P(y))$,
$\overline{\varepsilon P}_S(x) = \exists y(\varepsilon_S(x, y) \wedge P(y))$.
Let Γ be an AVSL theory that contains only predicate symbols in {Ri | i ∈ S} ∪ {P}. Then we say that P is indiscernibly S-definable with respect to Γ if Γ |= $\forall x(\underline{\varepsilon P}_S(x) \equiv \overline{\varepsilon P}_S(x))$. The definition of the explicit and implicit definability of P in Γ is the same as that in the FODL case and, analogously, we have the following theorem.
Theorem 2 Let Γ be an AVSL theory that contains only predicate symbols in {Ri | i ∈ S} ∪ {P}. Then the explicit definability of P in Γ implies that P is indiscernibly S-definable with respect to Γ.
In addition to Pawlak's approximation space, the notion of tolerance approximation spaces has been proposed in [4] to cope with the problem of imprecise boundary regions in rough set theory. The definability of a concept in a tolerance approximation space can also be formulated in AVSL. First, let x, y, v and S be defined as above. Then the tolerance formula (with respect to S) is
$\tau_S(x, y) = \bigwedge_{i \in S} \exists v (R_i(x, v) \wedge R_i(y, v))$.
Second, the lower and upper approximations of a concept predicate P are defined as follows:
$\underline{\tau P}_S(x) = \forall y(\tau_S(x, y) \supset P(y))$,
$\overline{\tau P}_S(x) = \exists y(\tau_S(x, y) \wedge P(y))$.
Finally, let Γ be an AVSL theory that contains only predicate symbols in S ∪ {P} such that $\{\bigwedge_{i \in S} \forall x \exists v R_i(x, v)\} \subseteq \Gamma$. Then we say that P is tolerantly S-definable with respect to Γ if Γ |= $\forall x(\underline{\tau P}_S(x) \equiv \overline{\tau P}_S(x))$. Note that, to ensure the reflexivity of the tolerance relation, ∀x∃v Ri(x, v) is included in Γ for each i ∈ S. However, logical definability no longer implies rough set definability in terms of the tolerance approximation space.
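For contrast with the equivalence-based case, here is a sketch of the tolerance-based approximations (our toy example; set-valued attributes are just one natural way a reflexive, symmetric but non-transitive relation arises):

```python
def tolerant(x, y, attrs, value):
    """x and y are tolerant iff they share at least one value on every
    attribute in attrs (attribute values are sets here)."""
    return all(value[x][a] & value[y][a] for a in attrs)

def tolerance_approximations(universe, attrs, value, X):
    lower, upper = set(), set()
    for x in universe:
        nbhd = {y for y in universe if tolerant(x, y, attrs, value)}
        if nbhd <= X:
            lower.add(x)
        if nbhd & X:
            upper.add(x)
    return lower, upper

value = {1: {"lang": {"en"}}, 2: {"lang": {"en", "fr"}}, 3: {"lang": {"fr"}}}
print(tolerance_approximations({1, 2, 3}, ["lang"], value, X={1, 2}))
# ({1}, {1, 2, 3}): the relation is not transitive, so boundaries widen.
```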
5
Conclusion
In this paper, we propose using FODL and AVSL for logical descriptions of data tables. Based on these logics, we precisely formulate the notion of definability in rough set theory and discuss its relationship to explicit and implicit definability in classical logic.
REFERENCES
[1] E.W. Beth. On Padoa's method in the theory of definition. Indagationes Math., 15:330–339, 1953.
[2] E. Mendelson. Introduction to Mathematical Logic. Chapman & Hall/CRC, fourth edition, 1997.
[3] Z. Pawlak. Rough Sets–Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.
[4] A. Skowron and J. Stepaniuk. Tolerance approximation spaces. Fundamenta Informaticae, 27(2/3):245–253, 1996.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-751
WikiTaxonomy: A Large Scale Knowledge Resource Simone Paolo Ponzetto1 and Michael Strube1 Abstract. We present a taxonomy automatically generated from the system of categories in Wikipedia. Categories in the resource are identified as either classes or instances and included in a large subsumption, i.e. isa, hierarchy. The taxonomy is made available in RDFS format to the research community, e.g. for direct use within AI applications or to bootstrap the process of manual ontology creation.
1
INTRODUCTION
Advances in the development of knowledge intensive AI systems crucially depend on the availability of large coverage, machine readable knowledge sources. While tremendous progress in AI has been made in the last decades by investigating data-driven inference methods, we believe that further advancement ultimately depends also on the free access to large repositories of structured knowledge on which these inference techniques can be applied. In this article we approach the problem by using Wikipedia. We present methods for deriving a large coverage taxonomy of classes and instances from the network of categories in Wikipedia and present the RDF Schema we make freely available to the research community.
2
METHODS
We apply in sequence the methods described in Ponzetto & Strube [8] and Zirn et al. [13] in order to generate a semantic network from the system of categories in Wikipedia.
1. We label the relations between category pairs as isa and notisa. This way the category network, which per se is merely a hierarchical thematic categorization of the topics of articles, is transformed into a subsumption hierarchy with a well-defined semantics.
2. We classify categories as either classes or instances in order to distinguish between isa subsumption and instance-of relations.
2.1
Deriving a taxonomy from Wikipedia
In [8] we presented a set of lightweight heuristics for distinguishing between isa and notisa links in the Wikipedia category network. Syntax-based methods label category links based on string matching of syntactic components of the category labels. They use a full syntactic parse of the category labels to check whether category label pairs share the same lexical head2 (head matching) or the head of a category label occurs as a modifier in another one (modifier matching).
1 EML Research gGmbH, Schloss-Wolfsbrunnenweg 33, 69118 Heidelberg, Germany. Website: http://www.eml-research.de/nlp
2 The head of a phrase is the word that determines the syntactic type of the overall phrase of which it is a member. In the case of category labels, it is the main noun of the label, e.g. the noun Scientists for the category label SCIENTISTS WHO COMMITTED SUICIDE.
Connectivity-based methods reason on the structure and connectivity of the categorization network. Instance categorization applies the method from [10], which identifies instances from Wikipedia pages, to those categories referring to the same entities as the pages. Redundant categorization labels category pairs as being in an isa relation by looking for directly connected categories that redundantly have a page in common. Lexico-syntactic based methods use lexico-syntactic patterns applied to large text corpora (e.g. Wikipedia itself) to identify isa [4] and part-of relations [2], the latter providing evidence that the relation is not an isa relation. A majority voting scheme based on the number of hits for each set of patterns is used to decide whether the relation is isa or not. Inference-based methods propagate the previously found relations based on the properties of multiple inheritance and transitivity of the isa relation. These methods generate 105,418 isa links from a network of 127,325 categories and 267,707 links. We achieve a score of 87.9 balanced F-measure when evaluating the taxonomy against the subset of ResearchCyc [6] to which the categories can be mapped.
2.2
Distinguishing between classes and instances
Zirn et al. [13] go one step further than [8] and classify categories as instances or classes. This step yields a taxonomy with finer grained semantics, and it is necessary since the network contains many categories whose reference is an entity, e.g. the MICROSOFT category3, rather than a property of a set of individuals, e.g. MULTINATIONAL COMPANIES. Similarly to [8], they devise a set of heuristics on which to decide the reference type of a category label and combine the best performing methods for each class into a voting scheme. Given a category c with label l, c is classified as either an instance or a class by the first satisfied criterion.
1. Page & Plural: if no page titled l exists and the lexical head of l is plural, then c is a class.
2. Capitalization & NER: else if l is capitalized and has been recognized by a Named Entity Recognizer as a named entity, then c is an instance.
3. Page: else if no page titled l exists, then c is a class.
4. Plural: else if the head of l is plural, then c is a class.
5. Structure: else if c has no sub-category, then it is a class.
6. Capitalization: else if l is capitalized, then c is an instance.
7. Default: else c is a class.
Using the same category network from [8], this pipeline of heuristics is shown to classify 111,652 class and 15,472 instance categories with an accuracy of 84.5% when evaluated against ResearchCyc.
3 We use Sans Serif for words and queries, CAPITALS for Wikipedia pages and SMALL CAPS for Wikipedia categories.
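Read operationally, the cascade amounts to the following sketch; the five predicate functions are assumed helpers standing in for the page lookup, the syntactic parser and the named entity recognizer used by Zirn et al.:

```python
def reference_type(label, has_page, is_plural, is_capitalized,
                   is_named_entity, has_subcategory):
    """Classify a category label as 'class' or 'instance' by the first
    satisfied criterion of the cascade described above."""
    if not has_page(label) and is_plural(label):
        return "class"       # 1. Page & Plural
    if is_capitalized(label) and is_named_entity(label):
        return "instance"    # 2. Capitalization & NER
    if not has_page(label):
        return "class"       # 3. Page
    if is_plural(label):
        return "class"       # 4. Plural
    if not has_subcategory(label):
        return "class"       # 5. Structure
    if is_capitalized(label):
        return "instance"    # 6. Capitalization
    return "class"           # 7. Default
```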
3
WIKITAXONOMY
We applied the methods from [8] and [13] using the English Wikipedia database dump from 25 September 2006. The extracted taxonomy was converted into RDF Schema [3, RDFS] using the Jena Semantic Web Framework4. RDFS has a very limited semantics and serves mostly as foundation for other Semantic Web languages. Nevertheless it suffices in the present scenario of data exchange where we have only a set of classes in a hierarchical relation. RDFS in addition provides compatibility with free ontology editors such as Protégé [5] for visualization, additional manual editing or conversion to richer knowledge representation languages such as OWL [7]. Figure 1 shows a sample fragment of the WikiTaxonomy in RDFS format. In the RDFS data model Wikipedia categories are represented as resources (i.e. a list of rdf:Description elements) and the subsumption relation is modeled straightforwardly using the rdfs:subClassOf property. A human readable version of the name of the category is given via the rdfs:label property and a link to the on-line version of the corresponding page is provided using the rdfs:comment property. In order to distinguish between categories which are instances or classes we use the rdf:type predicate to state whether a resource is a class or an individual of a class. In addition, the distinction is also given in the resource identifier, i.e. the URI-reference.
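A fragment of the same shape can be produced in a few lines; the sketch below uses Python's rdflib instead of Jena, and the base URI and category names are invented for illustration:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

WT = Namespace("http://example.org/wikitaxonomy/")  # hypothetical base URI
g = Graph()

companies = WT["class/Multinational_companies"]
microsoft = WT["instance/Microsoft"]

g.add((companies, RDF.type, RDFS.Class))
g.add((companies, RDFS.label, Literal("Multinational companies")))
g.add((companies, RDFS.subClassOf, WT["class/Companies"]))
g.add((microsoft, RDF.type, companies))       # instance-of, not subClassOf
g.add((microsoft, RDFS.label, Literal("Microsoft")))
g.add((microsoft, RDFS.comment,
       Literal("http://en.wikipedia.org/wiki/Category:Microsoft")))

print(g.serialize(format="xml"))
```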
4
RELATED WORK
Researchers working in information extraction have recently begun to use Wikipedia as a resource for automatically deriving structured semantic content. Suchanek et al. build the YAGO system [10] by merging WordNet and Wikipedia: the isa hierarchy of WordNet is populated with instances taken from Wikipedia pages. Auer et al. present the DBpedia system [1] which generates RDF statements by extracting the attribute-value pairs contained in the infoboxes of the Wikipedia pages (i.e. the tables summarizing the most important attributes of the entity referred to by the page), e.g. the pair capital=[[Berlin]] from the GERMANY page. Wu & Weld show in [11] how to augment Wikipedia with automatically extracted information. They propose to ‘autonomously semantify’ Wikipedia by (1) extracting new facts from its text via a cascade of Conditional Random Field models; (2) adding new hyperlinks to the articles’ text by finding the target articles that nouns refer to. Wu & Weld’s Kylin Ontology Generator (KOG) [12] is the work closest to ours. Their system builds a subsumption hierarchy of classes by combining Wikipedia infoboxes with WordNet using statistical-relational learning. Each infobox template, e.g. Infobox Country for countries,
represents a class and the slots of the template are considered as the attributes of the class. KOG uses Markov Logic Networks [9] in order to jointly predict both the subsumption relation between classes and their mapping to WordNet. While KOG represents a theoretically sounder methodology than [8] and [13], the lightweight heuristics from the latter are straightforward to implement and show that, when given high quality semi-structured input as in the case of Wikipedia, large coverage semantic networks can be generated by using simple heuristics which capture the conventions governing its public editorial base.
4 http://jena.sourceforge.net
ACKNOWLEDGEMENTS This work has been funded by the Klaus Tschira Foundation, Heidelberg, Germany. The first author has been supported by a KTF grant (09.003.2004).
REFERENCES
[1] Sören Auer, Christian Bizer, Jens Lehmann, Georgi Kobilarov, Richard Cyganiak, and Zachary Ives, ‘DBpedia: A nucleus for a Web of open data’, in Proc. of ISWC 2007 + ASWC 2007, pp. 722–735, (2007).
[2] Matthew Berland and Eugene Charniak, ‘Finding parts in very large corpora’, in Proc. of ACL-99, pp. 57–64, (1999).
[3] Dan Brickley and Ramanathan V. Guha, ‘RDF vocabulary description language 1.0: RDF schema’, Technical report, W3C, (2004). http://www.w3.org/TR/rdf-schema.
[4] Marti A. Hearst, ‘Automatic acquisition of hyponyms from large text corpora’, in Proc. of COLING-92, pp. 539–545, (1992).
[5] Holger Knublauch, Ray W. Fergerson, Natalya Fridman Noy, and Mark A. Musen, ‘The Protégé OWL plugin: an open development environment for semantic web applications’, in Proc. of ISWC 2004, pp. 229–243, (2004).
[6] Douglas B. Lenat and R. V. Guha, Building Large Knowledge-Based Systems: Representation and Inference in the CYC Project, Addison-Wesley, Reading, Mass., 1990.
[7] Peter F. Patel-Schneider, Patrick Hayes, and Ian Horrocks, ‘OWL Web Ontology Language semantics and abstract syntax’, Technical report, W3C, (2004). http://www.w3.org/TR/owl-semantics.
[8] Simone Paolo Ponzetto and Michael Strube, ‘Deriving a large scale taxonomy from Wikipedia’, in Proc. of AAAI-07, pp. 1440–1445, (2007).
[9] Matthew Richardson and Pedro Domingos, ‘Markov logic networks’, Machine Learning, 62, 107–136, (2006).
[10] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum, ‘YAGO: A core of semantic knowledge’, in Proc. of WWW-07, pp. 697–706, (2007).
[11] Fei Wu and Daniel Weld, ‘Automatically semantifying Wikipedia’, in Proc. of CIKM-07, pp. 41–50, (2007).
[12] Fei Wu and Daniel Weld, ‘Automatically refining the Wikipedia infobox ontology’, in Proc. of WWW-08, (2008).
[13] Cäcilia Zirn, Vivi Nastase, and Michael Strube, ‘Distinguishing between instances and classes in the Wikipedia taxonomy’, in Proc. of ESWC-08, pp. 376–387, (2008).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-753
Computing ε-Optimal Strategies in Bridge and Other Games of Sequential Outcome Pavel Cejnar1 1 INTRODUCTION Bridge is a card game for 4 players in 2 teams, line NS against line WE, consisting of 2 phases. In the first phase players try to agree on a game contract; in the second, one of the players tries to fulfil the contract against the other line, playing cards in rounds and counting won tricks. A Nash equilibrium, or a profile of optimal strategies for each player with no incentive to deviate, exists in Bridge; however, it seems to be difficult to compute for such a large game. We therefore often search for at least an ε-Nash equilibrium, or ε-optimal strategies, bringing an outcome worse than the optimum by no more than ε. More references to ε-Nash equilibria and the rare results about equilibrium strategies in Bridge or a similar card game can be found in [2]. Promising results were published for another card game, Poker, in [3]. Studying the differences between Bridge and Poker, the ratio of symmetries seems to be worse [1], and because of this Bridge seems to fall outside the class of games to which the GameShrink algorithm is applicable. In Bridge in the second phase players do not get an immediate result as in Poker, but they play cards sequentially in rounds, uncovering more information to other players and trying to win tricks, which form the outcome of the game, a sequential outcome. Based on this fact and on the minimax and alpha-beta techniques (originally known from games of perfect information) we present two algorithms, 2.1 and 2.2, which together with the Brown-Robinson method [4] are suitable to compute ε-optimal strategies in a defined class of games containing Bridge. We also present a test-bed for such a class, a reduced variant of Bridge, and the effect of imperfect information on the strategy of players.
2 ALGORITHMS 2.1 AND 2.2 As a test-bed for the later defined class, and to simplify the presented algorithms, we first construct a reduced variant of Bridge. The reduction includes: a game only for 2 players (A and B, representing the lines NS and WE and sharing information between their hands), the card count (4 or 5 cards in each of 4 suits), the fixed first phase and the different outcome functions (either as a mean value of won tricks or a probability to fulfil a contract). Given a distribution of cards between A and B, for a 16 card game there exist 70 different configurations of cards between the N and S hands and another 70 between W and E. The details can be found in [1]. Trying to find ε-optimal strategies for such a game we will use the Brown-Robinson method [4]: constructing the sequences of pure strategies a_1, a_2, ..., a_i and b_1, b_2, ..., b_i for players A and B, where a_i and b_i are best strategies against avg(b_1, ..., b_{i−1}) and avg(a_1, ..., a_{i−1}); then
1 Department of Theoretical Computer Science, Charles University in Prague, Czech Republic, email: cejnar@kti.mff.cuni.cz
avg(a_1, ..., a_i) and avg(b_1, ..., b_i) converge to equilibrium strategies for i → ∞. To use this method we need an algorithm to construct an optimal strategy (say for player A) against a given opponent strategy. We use a game tree where each node has its index S_i, where S_i lists all cards m_j m_k m_l ... played up to this node and S_0 means the root node. In each node where player A is on turn, he has to select a card to play in each of his acceptable configurations (those that do not imply breaking the rules). Let C_{A,i}, C_{B,j} be acceptable configurations of player A and player B of index i and j in a given situation, let m_k be an acceptable move in a given situation, and let P(X) and P(X|Y) be the probability of X and the conditional probability of X given Y. Then algorithm 2.1 (based on the minimax algorithm) runs as follows:
1. At the beginning player B is on turn and P(C_{B,i}|S_0) is known for each i. Each time player B is on turn (in S_k) we know P(C_{B,i}|S_k) and we know P(m_m|C_{B,i} & S_k) for each m and i; we then compute P(C_{B,i}|S_k + m_m) using the Bayes rule. If player A is on turn and plays m_n, then we set P(C_{B,i}|S_k + m_n) = P(C_{B,i}|S_k). Thus we know P(C_{B,i}) during the whole game.
2. After the end of the game, there is only one C_{A,i} and C_{B,j}. The outcome for them is obvious: we set outcome(C_{A,i}, S_l), assuming the cards S_l were played.
3. Knowing the outcome for each node n moves from the end of the game, for each node S_k that is n + 1 moves from the end of the game: If player A is on turn, we select the action bringing the best outcome, and thus we set outcome(C_{A,j}, S_k) = max_n outcome(C_{A,j}, S_k + m_n). In case of equality we choose at random among all moves with the best outcome. If player B is on turn, we know P(C_{B,l}|S_k) for each l and we know his strategy, so we simply compute outcome(C_{A,j}, S_k) = Σ_n outcome(C_{A,j}, S_k + m_n) P(m_n|S_k), where P(m_n|S_k) = Σ_q P(m_n|C_{B,q} & S_k) P(C_{B,q}|S_k).
4. To obtain a valid strategy we have to define player A's actions in acceptable nodes where P(C_{B,i}|S_k) = 0 for each i. In such a situation we assume player B made a mistake. We then assume player B still plays his strategy and made the minimal possible number of mistakes. We are then able to compute P(C_{B,i}|S_k) and apply item 3.
Algorithm 2.1 runs in time linear in the size of the game tree [1] and finds an optimal strategy for player A against a given strategy of player B [1]. The construction of the strategy for player B is similar. Traversing the whole game tree in each iteration of the Brown-Robinson method has several weaknesses. It evaluates subtrees of nodes that we can prove to have the same outcome each time. It also evaluates subtrees of nodes which are dominated in every C_{A,i} or C_{B,j}.
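Step 1 is an ordinary Bayesian filtering update over the opponent's configurations; a minimal sketch (with invented numbers standing in for P(m|C_{B,i} & S_k), which would come from B's known strategy) is:

```python
def bayes_update(p_config, p_move_given_config, move):
    """P(C_i | S+m) is proportional to P(m | C_i & S) * P(C_i | S)."""
    post = {c: p_move_given_config[c].get(move, 0.0) * p
            for c, p in p_config.items()}
    z = sum(post.values())
    return {c: v / z for c, v in post.items()} if z else post

p_config = {"C1": 0.5, "C2": 0.5}                    # priors P(C_i | S_k)
p_move = {"C1": {"spade_ace": 0.8}, "C2": {"spade_ace": 0.2}}
print(bayes_update(p_config, p_move, "spade_ace"))   # {'C1': 0.8, 'C2': 0.2}
```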
To handle this we construct algorithm 2.2, which modifies the alpha-beta technique (of games of perfect information) and can effectively reduce the whole game tree, and thus the time for each iteration of the Brown-Robinson method and algorithm 2.1. Treating Bridge as a zero-sum game, for each node S_i that will not be deleted let α(S_i) = min_j outcome(C_{A,j}, S_i), the minimal guaranteed outcome of player A, and let β(S_i) = max_j outcome(C_{A,j}, S_i), the maximal possible outcome. For each node in general let α_o(S_i) be the minimal public (known to all players) guaranteed outcome of player A regardless of his future play, and let β_o(S_i) be the maximal possible one; let α_p(S_i) and β_p(S_i) be the values propagated to node S_i. Let α_h(S_i) be an estimate of α(S_i) satisfying α_o(S_i) ≤ α_h(S_i) ≤ α(S_i); using the property of sequential outcome, we can see it as the number of tricks we can take right away regardless of player B's play. Let β_h(S_i) be an estimate of β(S_i) satisfying β_o(S_i) ≥ β_h(S_i) ≥ β(S_i). Then algorithm 2.2 runs as follows:
1. Set α_p(S_0) = α_o(S_0) and β_p(S_0) = β_o(S_0).
2. If player A is on turn in node S_i, compute α_h(S_i) = min_{C_{A,j}} α_h(S_i & C_{A,j}), i.e. the estimate of the α(S_i) value. If α_h(S_i) > min(β_p(S_i), β_o(S_i)), mark this node as CUT, estimate α(S_i) as α_h(S_i) and β(S_i) as β_o(S_i), and do not traverse the subtree. If α_h(S_i) = β_o(S_i), mark this node as SAME, save the value and do not traverse the subtree (we are able to compute the strategy fast). Otherwise set α_p(S_i + m_k) = max(α_p(S_i), α_h(S_i)) and β_p(S_i + m_k) = min(β_p(S_i), β_o(S_i)) for all m_k in S_i. If player B is on turn in node S_i, compute β_h(S_i) = max_{C_{B,j}} β_h(S_i & C_{B,j}). If β_h(S_i) < max(α_p(S_i), α_o(S_i)), mark this node as CUT, estimate β(S_i) as β_h(S_i) and α(S_i) as α_o(S_i), and do not traverse the subtree. If β_h(S_i) = α_o(S_i), mark this node as SAME, save the value and do not traverse the subtree. Otherwise set β_p(S_i + m_k) = min(β_p(S_i), β_h(S_i)) and α_p(S_i + m_k) = max(α_p(S_i), α_o(S_i)) for all m_k in S_i.
3. In terminal nodes set α(S_i) = β(S_i) = outcome(C_{A,j}, S_i) for the remaining C_{A,j}.
4. If player A is on turn in node S_i and α(S_i + m_k) and β(S_i + m_k) are evaluated for all m_k, compute α(S_i) = min_{C_{A,j}} max_{m_{l,C_{A,j}}} α(S_i + m_{l,C_{A,j}}), where m_{l,C_{A,j}} runs over all acceptable moves in C_{A,j}. If there exists a move m_m where β(S_i + m_m) < α(S_i), then delete this node and its subtree. Then compute β(S_i) = max_{m_n} β(S_i + m_n), where m_n runs over all acceptable direct subnodes (excluding deleted ones). If α(S_i) = β(S_i), mark this node as SAME, save the value and delete its subtree. If player B is on turn in node S_i and α(S_i + m_k) and β(S_i + m_k) are evaluated for all m_k, compute β(S_i) = max_{C_{B,o}} min_{m_{l,C_{B,o}}} β(S_i + m_{l,C_{B,o}}), where m_{l,C_{B,o}} runs over all acceptable moves in C_{B,o}. If there exists a move m_m where α(S_i + m_m) > β(S_i), then delete this node and its subtree. Then compute α(S_i) = min_{m_n} α(S_i + m_n), where m_n runs over all acceptable direct subnodes (excluding deleted ones). If β(S_i) = α(S_i), mark this node as SAME, save the value and delete its subtree.
We could go further and delete configurations in nodes where a better move exists; however, it would add overhead to remember them. Each equilibrium we find on a game tree reduced by algorithm 2.2 can be transformed to an equilibrium of the original game tree [2]. The running time of this algorithm is no worse than linear in the size of the game tree (when items 1 and 2 are skipped).
However, the computation of α_h(S_i) and β_h(S_i) in games of sequential outcome seems to be very fast (counting tricks won without a loss of initiative) and saves additional time. The following lemma allows us to stop the iterative process and find an ε-optimal strategy (proof in [1]): Let s_A, s_B be strategies of player A and player B, let o_A, o_B be optimal strategies of player A and player B against s_B and s_A, and let |outcome(s_A, o_B) − outcome(s_A, s_B)| < ε_A and |outcome(o_A, s_B) − outcome(s_A, s_B)| < ε_B; then the outcome of player A (and player B) will differ by no more than ε_A + ε_B compared to equilibrium strategies. A more detailed view and a C++ implementation of the method and algorithms are also presented in [1]. The reader can find there a detailed discussion of data structures and an extension of the algorithms to a game with more independent players.
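The lemma is precisely the stopping rule one would wrap around the Brown-Robinson iteration; the following sketch runs it on a toy zero-sum matrix game (matching pennies, not Bridge) and stops once both best responses improve on the current averages by less than ε:

```python
import numpy as np

M = np.array([[1.0, -1.0], [-1.0, 1.0]])   # row player's payoffs

def fictitious_play(M, eps=1e-2, max_iter=100000):
    n, m = M.shape
    a_counts, b_counts = np.zeros(n), np.zeros(m)
    a_counts[0] += 1
    b_counts[0] += 1
    for _ in range(max_iter):
        a_avg = a_counts / a_counts.sum()
        b_avg = b_counts / b_counts.sum()
        val = a_avg @ M @ b_avg
        eps_a = (M @ b_avg).max() - val      # gain of A's best response
        eps_b = val - (a_avg @ M).min()      # gain of B's best response
        if eps_a < eps and eps_b < eps:      # the lemma's stopping rule
            break
        a_counts[np.argmax(M @ b_avg)] += 1  # best pure reply of A
        b_counts[np.argmin(a_avg @ M)] += 1  # best pure reply of B
    return a_avg, b_avg, val

print(fictitious_play(M))   # averages approach ((0.5, 0.5), (0.5, 0.5), 0)
```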
3 GAMES OF SEQUENTIAL OUTCOME
Depending on the construction of the algorithms, we define the class of games that are most suitable for the presented method as games meeting all of the following: It is a finite zero-sum game. It is a sequential game of perfect recall. It consists of three phases. In the first phase each player receives a finite set of private signals from a finite set of signals, and it is given which players share information about their private signals. The other phases consist of a number of rounds. Each player plays once per round, and the players take turns in a known order. The player on turn announces a public signal (to all players) and private signals (to given players). The set of available signals depends on all (public and private) signals received before. In the third phase, in each round, each player also announces at least one public signal bringing new information about the private signals received at the beginning. The guaranteed outcome of the players depends on all signals announced before, and in the third phase of the game it is a monotonic function which rises for at least one player after each round. After the end of the game no private information remains.
Figure 1. We computed 100 iterations on a 6.27% sample of all distributions in a 16 card reduced variant of Bridge, which took 487 hours on a Pentium 2GHz with 1GB RAM, with ε between 0 and 0.09. The figure shows the ordered difference in the trick outcome of strategies played with imperfect information of player B's configurations against the outcome of strategies with perfect knowledge (logarithmic scale on the y axis). Other results and examples of precomputed strategies can be found in [2, 1].
REFERENCES
[1] P. Cejnar, Bridge - Computing Optimal Strategies, Master's thesis, Faculty of Mathematics and Physics, Charles University, Prague, 2008. Supervisor V. Majerech.
[2] P. Cejnar, ‘Computing ε-Optimal Strategies in Bridge and Other Games of Sequential Outcome (extended version)’, http://kti.mff.cuni.cz/~cejnar/papers/ECAI2008extended.pdf, 2008.
[3] A. Gilpin and T. Sandholm, ‘Finding Equilibria in Large Sequential Games of Imperfect Information’, in Proc. of the ACM Conference on Electronic Commerce (EC’06), Ann Arbor, MI, (2006).
[4] J. Robinson, ‘An Iterative Method of Solving a Game’, Annals of Mathematics, 54, (1951).
2. Machine Learning
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-757
Classifier Combination Using a Class-indifferent Method Yaxin Bi1 and Shenli Wu 1 and Pang Xiong2 and Xuhui Shen 2 Abstract. In this paper we present a novel approach to combining classifiers in the Dempster-Shafer theory framework. This approach models each output given by classifiers as a list of ranked decisions (classes), which is partitioned into a new evidence structure called a triplet. Resulting triplets are then combined by Dempster’s rule. With a triplet, its first subset contains a decision corresponding to the largest numeric value of classes, the second subset corresponds to the second largest numeric value and the third subset represents uncertainty information in determining the support for the former two decisions. We carry out a comparative analysis with the combination methods of majority voting, stacking and boosting on the UCI benchmark data to demonstrate the advantage of our approach.
1
INTRODUCTION
The choice or design of a method for combining classifier decisions is a challenging task in ensemble learning, and various methods have been developed in the past decades. Kuncheva in [2] roughly characterizes combination methods, based on the forms of classifier outputs, into two categories. The first category is that in which the combination of decisions is performed on single class labels, such as majority voting [1] and Bayesian probability [8]. The second category is concerned with the utilization of continuous values (probabilities) corresponding to class labels. One typical method, often called a class-aligned method, is based on the same classes from different classifiers in calculating the support for classes. This group includes meta-learning, i.e. stacking, where combining functions are learnt from continuous values of class labels [4], as well as the linear sum and order statistics (mean, minimum and maximum) [6]. An alternative group of methods, called class-indifferent methods, make use of as much information as possible obtained from single classes and sets of classes in calculating the support for each class [2]. Formally, suppose we are given a classifier ϕ and a new instance d; the classification task is to decide, using ϕ, whether instance d belongs to class ci ∈ C. Instead of a single-class assignment, the classifier output can be denoted by ϕ(d) = {s1, · · · , s|C|}, where si is a numeric value that can be regarded as a class-conditional probability (posterior probability). Given an ensemble of classifiers ϕ1, ϕ2, · · · , ϕM, all classifier outputs can be organized into a matrix called a decision profile, depicted in Figure 1. Based on the decision profile, class-aligned methods calculate the support for class cj using only the DP(d)'s jth column, i.e. s1j, s2j, · · · , sMj, regardless of what the support for the other classes is. In contrast, class-indifferent methods use an entire decision
1 School of Computing and Mathematics, University of Ulster, Co. Antrim, BT37 0QB, UK, email: {y.bi, s.wu1}@ulster.ac.uk
2 Institute of Earthquake Science, China Earthquake Administration, Beijing, 100036, China
Figure 1. A decision profile for instance d generated by ϕ1 (d), ϕ2 (d), · · · , ϕM (d)
profile as a set of intermediate feature vectors to constrain a class decision, such as computing a covariance matrix for some classes [2]. In this study, we consider a class-indifferent method based on the Dempster-Shafer theory of evidence [5], which is slightly different from the one above. We do not use an entire decision profile to compute the degrees of support for every class. Instead we select 2 classes from each ϕi(d) according to their numeric values and restructure them into a new list composed of three subsets of C, represented by the novel evidence structure of a triplet. For each triplet, its first subset contains the class with the largest value, the second contains the class with the second largest value, and the third one is the whole set C. In this way, the decision profile in Figure 1 is restructured into a triplet decision profile, shown in Figure 2, where each column no longer corresponds to the same class. The degree of support for each class is computed through combining all triplets in a decision profile by Dempster's rule of combination [5].
Figure 2. A triplet decision profile for instance d derived from DP (d) in Figure 1
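The following Python sketch illustrates the triplet construction and Dempster's rule over such triplets; turning the two largest outputs directly into masses is a simplification of the paper's evidence structure, and the scores are invented:

```python
from itertools import product

FRAME = frozenset({"c1", "c2", "c3"})

def triplet(scores):
    """Keep the two top-scored classes as focal singletons; the
    remaining mass goes to the whole frame, modelling uncertainty."""
    first, second = sorted(scores, key=scores.get, reverse=True)[:2]
    return {frozenset({first}): scores[first],
            frozenset({second}): scores[second],
            FRAME: 1.0 - scores[first] - scores[second]}

def dempster(m1, m2):
    """Dempster's rule: combine intersecting focal elements and
    renormalize by the total non-conflicting mass."""
    combined, conflict = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        if a & b:
            combined[a & b] = combined.get(a & b, 0.0) + x * y
        else:
            conflict += x * y
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

m1 = triplet({"c1": 0.6, "c2": 0.3, "c3": 0.1})
m2 = triplet({"c1": 0.5, "c2": 0.1, "c3": 0.4})
print(dempster(m1, m2))   # mass concentrates on frozenset({'c1'})
```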
As an example, we consider a case where there is a five classifier ensemble for a three class problem as shown in Figure 3. This figure presents the classifier outputs for a given input d from the ensemble and combined results using different rules. The winning class for each combination rule is shown in bold. It can be seen that different classes win for different combination rules. For example, class
1 wins four of the combination rules, class 2 wins three and class 3 wins only one rule. In particular, when the class-aligned methods, the sum rule and the mean rule, cannot distinguish between classes 1 and 2, Dempster's rule is able to make a distinction between them by taking account of the support of the other classes. This demonstrates the advantage of the class-indifferent method.
Figure 3. Example of class-aligned methods and class-indifferent methods
2 EXPERIMENTAL EVALUATION
To evaluate our method, we used thirteen data sets downloaded from the UCI machine learning repository, including anneal, audiology, balance, car, glass, autos, iris, letter, heart, segment, soybean, wine and zoo. For individual classifiers, we used thirteen learning algorithms, including AOD, NaiveBayes, SMO, IB1, IBk, KStar, DecisionStump, J48, RandomForest, DecisionTable, JRip, NNge and PART, all of which were taken from the Waikato Environment for Knowledge Analysis (Weka) version 3.4. For the meta classifiers (stacking), we chose multi-response linear regression (MLR), and we also chose AdaBoostingM1 to compare with our method. Parameters used for each algorithm were at the Weka default settings [7]. Six groups of experiments are reported here. These include 1) assessing all the algorithms; 2) combining the individual classifiers using DS; 3) combining the individual classifiers using MV; 4) combining J48, NaiveBayes, MLR and KStar by MLR [4]; 5) combining the best, the second best and the third best individual classifiers (SMO, IBk and NNge) by MLR; and 6) experimenting with AdaBoostingM1 where the best individual classifier SMO is used as the base classifier.
To compare the classification accuracies between the individual classifiers and the combined classifiers across all the data sets, we employed the ranking statistics in terms of the win/draw/loose record [3]. The win/draw/loose record presents three values: the number of data sets for which classifier A obtained better, equal, or worse results than classifier B with respect to classification accuracy. Classification accuracies were measured by the averaged F-measure [7]. Six groups of experimental results are summarized in Table 1.

Table 1. Accuracies of the best INDIVIDUAL classifier, best combined classifiers based on TRIPLET using DS and MV, along with MLRs (STACK1, 2 correspond to the settings (5) and (6)) and AdaBoostingM1 (BOOSTING corresponds to setting (7)) over the thirteen data sets

Dataset          Individual  Triplet  MV      Boosting  Stack1  Stack2
Anneal           80.23       81.57    81.14   77.35     72.77   75.34
Audiology        48.67       57.44    54.30   45.16     32.89   32.19
Balance          65.67       63.17    62.72   93.17     62.73   68.49
Car              89.62       94.29    91.75   92.60     86.18   90.03
Glass            65.36       66.81    66.69   65.97     58.41   57.77
Autos            77.59       79.28    77.94   77.32     75.34   77.32
Iris             95.33       96.67    96.67   98.00     94.67   94.00
Letter           92.05       92.91    92.77   92.53     92.03   92.53
Cleveland        35.48       37.09    34.37   31.91     35.13   31.87
Segment          96.69       97.35    96.55   96.57     96.59   95.85
Soybean          95.89       96.88    96.17   95.50     95.25   95.20
Wine             98.90       100.00   98.97   98.38     98.90   98.32
Zoo              90.62       93.61    93.61   89.43     82.57   83.64
Average          79.39       81.30    80.28   81.07     75.65   76.35
Win/Draw/Loose   -           12/0/1   10/0/3  5/0/8     0/1/12  3/0/10
Significant win  -           7        4       3         0       0

The bottom of the table provides summary statistics comparing the performance of the best individual classifiers with the best combined classifiers across the data sets. It can be observed that the accuracy of the combined classifiers based on the triplet structure using DS is better than the five others on average. It has more wins relative to losses than the best combined classifiers using MV, boosting and stacking, compared with the best individual classifiers. This observation is further supported by the statistically significant wins, in which the triplet has three more wins than MV, four more wins than AdaBoostingM1, and seven more wins than MLR.
REFERENCES
[1] R.P.W. Duin and D.M.J. Tax, ‘Experiments with classifier combining rules’, in Multiple Classifier Systems, J. Kittler and F. Roli, eds, pp. 16–29, (2000).
[2] L. Kuncheva, ‘Combining classifiers: Soft computing solutions’, in Pattern Recognition: From Classical to Modern Approaches, S.K. Pal and A. Pal (eds), pp. 427–451, (2001).
[3] P. Melville and R.J. Mooney, ‘Constructing diverse classifier ensembles using artificial training examples’, in Proc. of IJCAI-2003, pp. 405–510, (2003).
[4] A.K. Seewald, ‘How to make stacking better and faster while also taking care of an unknown weakness’, in Proceedings of ICML’02, pp. 554–561, (2002).
[5] G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, Princeton, New Jersey, 1st edition, 1976.
[6] K. Tumer and J. Ghosh, ‘On combining classifiers’, Pattern Analysis and Applications, 6(1), 41–46, (2002).
[7] I.H. Witten and E. Frank, Data Mining: Practical machine learning tools and techniques, Morgan Kaufmann, 2nd edition, 2005.
[8] L. Xu, A. Krzyzak, and C.Y. Suen, ‘Several methods for combining multiple classifiers and their applications in handwritten character recognition’, IEEE Trans. on System, Man and Cybernetics, 2(3), 418–435, (1992).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-759
Reinforcement Learning with Classifier Selection for Focused Crawling Ioannis Partalas1, Georgios Paliouras2, Ioannis Vlahavas1 Abstract. Focused crawlers are programs that wander in the Web, using its graph structure, and gather pages that belong to a specific topic. The most critical task in Focused Crawling is the scoring of the URLs as it designates the path that the crawler will follow, and thus its effectiveness. In this paper we propose a novel scheme for assigning scores to the URLs, based on the Reinforcement Learning (RL) framework. The proposed approach learns to select the best classifier for ordering the URLs. This formulation reduces the size of the search space for the RL method and makes the problem tractable. We evaluate the proposed approach on-line on a number of topics, which offers a realistic view of its performance, comparing it also with a RL method and a simple but effective classifier-based crawler. The results demonstrate the strength of the proposed approach.
1 Introduction In this paper we propose a novel adaptive focused crawler that is based on the RL framework [5]. More specifically, RL is employed for selecting an appropriate classifier that will in turn evaluate the links that the crawler must follow. The introduction of link classifiers reduces the size of the search space for the RL method and makes the problem tractable. We evaluate the proposed approach on a number of topics, comparing it with an RL approach from the literature and a classifier-based crawler. The results demonstrate the robustness and the efficiency of the proposed approach.
2 Reinforcement Learning with Classifier Selection In this work we propose an adaptive approach, dubbed Reinforcement Learning with Classifier Selection (RLwCS), to evaluate URLs, based on the RL framework. RLwCS maintains a pool of classifiers, H = {h1 , . . . , hk }, that can be used for URL evaluation, and seeks a policy for selecting the best classifier, ht , for a page to perform the evaluation task. In other words, the crawler must select dynamically a classifier for each page, according to the characteristics of the page. We solve this problem using an RL approach. In our case, there are just two classes, as a URL or page can be relevant or not to a specific topic. We represent the problem of selecting a classifier for evaluating the URLs, as an RL process. The state is defined as the page that is currently retrieved by the agent, on the basis that the perception of the environment arises mainly by the pages retrieved at any given time. Actions are the different classifiers, ht ∈ H. We add an extra 1 2
action which is denoted as S and combines the classifiers in a majority scheme. The set of actions is thus H ∪ {S}. The state transitions are deterministic, as the probability of moving to a page when selecting a classifier for evaluation is equal to 1. The selected classifier is the one that scores the URLs of a visited page. More specifically, each URL receives the classifier's score that it belongs to the relevant class. The reward for selecting a classifier depends on the relevance of the page that the crawler visits. If the page is relevant, the reward is 1; otherwise the reward is 0. Thus, we seek to find an optimal policy for mapping pages to classifiers in order to maximize the accumulated reward received over time. The mechanism that is used for training the RL module is the Q-learning algorithm [6]. Q-learning finds an optimal policy based on the action-value function Q(s, a). The Q function expresses the benefit of following the action a when in state s. In our case the value of selecting a classifier in a specific page is associated with the expected relevance of the next page (state) that the crawler will fetch. Next, we need to define the features that will be used to represent both the states and the actions. Based on the literature of focused crawling we chose the following features to represent a state-action pair:
• Relevance score of a page with respect to the specific domain.
• Relevance score of the page, computed by the selected classifier (action).
• Average relevance score of the parents of the page that is crawled.
• Hub score.
We employ function approximation to tackle the problem of the large state-action space. A well-known method is the combination of Q-learning with eligibility traces, Q(λ), and gradient descent function approximation [5]. Additionally, linear methods are used to approximate and represent the value function. Further details about the function approximation algorithm that we used can be found in [5].
1 Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece, email: {partalas,vlahavas}@csd.auth.gr
2 Institute of Informatics and Telecommunications, National Centre for Scientific Research ”Demokritos”, email: paliourg@iit.demokritos.gr
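A minimal sketch of this machinery, i.e. linear Q-learning over the four features above (omitting the eligibility traces of Q(λ); feature values and constants are placeholders, not the authors' code), is:

```python
import numpy as np

N_FEATURES = 4   # page relevance, classifier score, parent relevance, hub

def q_value(w, phi):
    return float(w @ phi)

def q_update(w, phi, reward, next_best_q, alpha=0.1, gamma=0.9):
    """Gradient-descent TD update: w += alpha * delta * grad_w Q(s, a),
    where grad_w Q is simply phi for a linear approximator."""
    delta = reward + gamma * next_best_q - q_value(w, phi)
    return w + alpha * delta * phi

w = np.zeros(N_FEATURES)
phi = np.array([0.7, 0.9, 0.4, 0.2])   # features of one (page, classifier)
w = q_update(w, phi, reward=1.0, next_best_q=0.0)  # fetched page relevant
print(w)
```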
3 Experimental Setup We constructed a number of topic-specific datasets following the procedure that is described in [2]. Table 1 shows the topics that we selected for experimentation3 . For each topic URL we downloaded the corresponding page and constructed the instances based on the textual information. More specifically, for each document downloaded we produced the TF-IDF vectors using the weighted scheme proposed by Salton and Buckley [3]. Each instance of the on-topic and off-topic documents is named relevant or irrelevant respectively.4 3 4
3 http://dmoz.org
4 The datasets created are available at http://mlkd.csd.auth.gr/fcrawling.html
Table 1. ODP topics.
Topic                                                          Number of URLs
Shopping/Auctions/Antiques and Collectibles/                               62
Health/Medicine/Osteopathy/                                               166
Games/Video Games/Puzzle/Tetris-like/                                      72
News/Weather/Air Quality/                                                 114
Science/Astronomy/Amateur/Astrophotography and CCD Imaging/               196
Health/Medicine/Informatics/Telemedicine/                                  64
Sports/Winter Sports/Snowboarding/                                        179
Sports/Hockey/Ice Hockey/                                                 239
Arts/Literature/Periods and Movements/                                    275
Health/Alternative/Aromatherapy/                                          103
After creating the set of relevant and irrelevant instances we train the classifiers for each topic; these will form the action set for RLwCS, with the addition of the extra action that combines the opinions of the classifiers using the majority scheme. For an instance x the output of the majority scheme is $S(x) = \arg\max_{c_j} \sum_{m=1}^{k} h_m(x, c_j)$, where $h_m$ outputs a probability for each class $c_j$, $j = 1, \ldots, n$. We trained four classifiers using the WEKA machine learning library [8]:
• Neural network (NN): 16 hidden nodes and learning rate 0.4.
• Support vector machine (SVM): polynomial kernel with degree 1.
• Naive Bayes (NB): with kernel estimation.
• Decision tree (DT): with Laplace smoothing and reduced error pruning.
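The extra majority action S can be sketched directly from its definition above. The snippet assumes each trained classifier exposes a predict_proba-style method returning a class probability distribution (the WEKA counterpart would be distributionForInstance); that interface is an assumption for illustration, not the authors' code.

```python
import numpy as np

def majority_action(classifiers, x):
    # S(x) = argmax_{c_j} sum_{m=1..k} h_m(x, c_j): sum the per-class
    # probability distributions of all classifiers and pick the best class.
    summed = sum(np.asarray(clf.predict_proba([x])[0]) for clf in classifiers)
    return int(np.argmax(summed))
```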
The proposed approach, RLwCS, is compared with a base crawler that uses an SVM to assign scores to the URLs, and with the Temporal Difference Focused Crawling (TD-FC) method [1]. The experiments for the crawlers are performed on-line in order to obtain a realistic estimate of their performance. We must note here that the majority of the approaches reported in the literature conducted their experiments offline in a managed environment. The online evaluation on a variety of topics allows us to perform more accurate statistical tests in order to detect significant differences in the performances of the crawlers. For the purposes of evaluation, we used two metrics analogous to the well-known precision and recall, namely harvest rate and target recall [4].
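A minimal sketch of the two metrics follows, assuming that the set of relevant pages (for harvest rate) and the set of known on-topic target pages (for target recall) are available for each topic, as in the evaluation framework of [4].

```python
def harvest_rate(crawled, relevant):
    # Fraction of crawled pages that are relevant to the topic.
    # crawled: list of URLs in crawl order; relevant: set of relevant URLs.
    return sum(url in relevant for url in crawled) / len(crawled)

def target_recall(crawled, targets):
    # Fraction of the known target set that has been retrieved so far.
    return len(set(crawled) & targets) / len(targets)
```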
Figure 1. Average harvest rate (a) and average target recall (b) for BFS, RLwCS and TD-FC, plotted against the number of crawled pages.
4 Results and Discussion
Figure 1(a) presents the average harvest rate of each algorithm over all topics, against the number of crawled pages. We first notice that RLwCS clearly outperforms both BFS and TD-FC, as it manages to collect more relevant pages. In order to investigate whether the performance differences between RLwCS and the other two algorithms are significant, we use the Wilcoxon signed rank test [7]. We performed two tests, one for each paired comparison of RLwCS with each of the other algorithms on each topic, at a confidence level of 95%. The test was performed at various points during the crawling process, more specifically every 200 crawled pages. The test found that RLwCS is significantly better than all the other algorithms during the whole crawling process (200 to 3000 pages) on all topics. Another interesting observation is the fact that the proposed approach achieves a high harvest rate in the first 200 pages, which is a strong advantage in on-line crawling tasks where the crawler must gather relevant pages in a small time frame and with a small number of visited pages. Figure 1(b) shows the target recall curves for the competing algorithms, averaged across all topics. We notice again that the proposed approach obtains the highest values during the crawling process and outperforms the other two methods. Wilcoxon tests at a confidence level of 95% report significant differences only after the first 600 pages have been crawled. This is again a very encouraging result for the proposed approach.

5 Conclusions
In this paper we presented a novel Focused Crawling approach, named RLwCS, which is based on the Reinforcement Learning framework. The crawler learns to select an appropriate classifier for ordering the URLs of each Web page that it visits. We compared the proposed approach with the well-known Best-First Search crawler and a pure RL approach, on a number of topic-specific datasets. The crawlers were tested on-line, in order to obtain realistic measurements of their performance. The analysis of the results led to several interesting conclusions. The proposed approach manages to achieve good performance, outperforming BFS, which is considered in the literature to be a very effective crawler.

Acknowledgments
We would like to thank Ioannis Katakis for providing us with the source code for text processing and Michalis Lagoudakis for interesting discussions that led to this work. This work is partly funded by the Greek General Secretariat for Research and Technology, project Regional Innovation Pole of Central Macedonia.
REFERENCES
[1] A. Grigoriadis and G. Paliouras, 'Focused crawling using temporal difference-learning', in Proc. 3rd Hellenic Conference on Artificial Intelligence, pp. 142-153, (2004).
[2] Gautam Pant and Padmini Srinivasan, 'Learning to crawl: Comparing classification schemes', ACM Transactions on Information Systems, 23(4), 430-462, (2005).
[3] Gerard Salton and Christopher Buckley, 'Term-weighting approaches in automatic text retrieval', Information Processing and Management, 24(5), 513-523, (1988).
[4] P. Srinivasan, F. Menczer, and G. Pant, 'A general evaluation framework for topical crawlers', Information Retrieval, 8(3), 417-447, (2005).
[5] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1999.
[6] C.J. Watkins and P. Dayan, 'Q-learning', Machine Learning, 8, 279-292, (1992).
[7] F. Wilcoxon, 'Individual comparisons by ranking methods', Biometrics, 1, 80-83, (1945).
[8] I. H. Witten and E. Frank, Data Mining: Practical machine learning tools and techniques, 2nd Edition, Morgan Kaufmann, 2005.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-761
Intuitive Action Set Formation in Learning Classifier Systems with Memory Registers
L. Simões and M.C. Schut and E. Haasdijk1
Abstract. An important design goal in Learning Classifier Systems (LCS) is to equally reinforce those classifiers which cause the level of reward supplied by the environment. In this paper, we propose a new method for action set formation in LCS. When applied to a Zeroth Level Classifier System with Memory registers (ZCSM), our method allows the distribution of rewards among classifiers which result in the same memory state, rather than those encoding the same memory update action.
1 INTRODUCTION
This paper introduces a new method for action set formation (asf) in Learning Classifier Systems, and tests it in partially observable environments requiring memory. The operation of asf is responsible for choosing the classifiers that will receive the reward supplied by the environment for some performed action. When new classifiers are generated, the system has no way of knowing how good they are. Their strengths depend on the actions they take in the contexts under which they trigger, and on the other classifiers in the population with which they interact. As classifiers are added to the population, they are assigned an initial strength value. Then, by repeated usage, the strength update component gradually converges towards a better estimate of their qualities. But since the system has to perform at the same time as it is building its rule base, it is forced to act despite its uncertainty about the environment, selecting from among an ever-changing population of insufficiently tested classifiers. The method introduced here, iasf, eliminates some of the noise to which the quality estimation component is subjected, with the goal of improving system performance.
2 BACKGROUND
In the mid-1990s, Wilson [7] proposed ZCS as a simplification of Holland's original LCS [3]. Most importantly, he left out the message list which acted as memory in the original system. Thus, Wilson's models had no way of remembering previously encountered states and could not perform optimally in partially observable environments, where an agent can find itself in a state that is indistinguishable from another state even though the best action to undertake is not necessarily the same in both states. Wilson proposed [7] a solution for this problem in the form of memory registers to extend the classifiers. Cliff & Ross [2] follow this suggestion and implement ZCSM, extending ZCS with a memory mechanism. In their experiments they observed that ZCSM can efficiently exploit memory in partially observable environments.
Stone & Bull extensively compared ZCS to the more popular XCS in noisy, continuous-valued environments [6] and found that what makes XCS so good in deterministic environments (namely, its attempt to build a complete, maximally accurate and maximally general map of the payoff landscape) becomes a disadvantage as the level of noise in the environment increases. ZCS's partial map, focusing on high-rewarding niches in the payoff landscape, then becomes an advantage. This suggests ZCS as an adaptive control mechanism in multi-step, partially observable, stochastic real-world problems.
1 Department of Computer Science, Faculty of Sciences, VU University, Amsterdam, The Netherlands, email: {lfms, mc.schut, e.haasdijk}@few.vu.nl
3 INTUITIVE ACTION SET FORMATION
ZCS works on a population P of rules which together present a solution to the problem with which the system is faced. As it interacts with the environment, the system is triggered on reception of a sensory input. A match set M is then formed with all the rules in the population matching that input. From this set, a classifier is chosen by proportionate selection based on its strength, and its action is executed. With memory added as described in [2], rules prescribe an external action as well as a modification of the memory bits. It can be argued that the core of ZCS lies in the next stage, reinforcement, as it is responsible for incrementally learning the quality of the rules in the population, which will in turn determine the system's behaviour. The action set A includes those rules in M that advocated the same action as the chosen classifier. The rules in this action set share in the reward that results from the selected action (with the rationale that choosing any of those rules would have had the same effect). Rules in M that advocate a different action are penalised. Traditionally, A consists of those rules in M that match on a bitwise comparison with the action-part of the chosen classifier. Now, consider ZCSM, where operators on the memory state are added to the action part of the rules. Suppose, then, a situation where the memory state was 01, and remains the same after execution of some chosen classifier c, which advocated2 [0#]. Traditional action set formation would then have A include only those classifiers from M advocating this same memory operation ("set the first memory register to 0") as well as the same external action as the chosen classifier. However, all of the internal actions {##,#1,01} would result in exactly the same internal state. Not only would the system not reward any classifier in M having one of those internal actions (and the same external action) as the chosen classifier, it would actually penalise them. This seems to conflict with ZCS's goal of equally rewarding those classifiers which would cause the same level of reward supplied by the environment.
2 Disregarding the external output for simplicity.
Figure 1. Performance comparison in woods101 with 1 memory bit (steps to food vs. number of trials; ZCSM1, ZCSM1iasf, optimum).
This realisation prompted us to introduce a new variant of Cliff & Ross’ classifier system, ZCSMiasf , which compares classifiers based on the memory state which would result from their activation, rather than based on the memory operation. In this more intuitive scheme, any rule in M that prescribes the same external action as c and an internal action that leads to the same memory state (i.e., one of {##,#1,0#,01}) is included in A.
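A small sketch of the iasf rule: rules are grouped by the memory state their internal action would produce, rather than by a bitwise comparison of the memory-action strings. The rule representation below (external action, memory action) and the function names are illustrative assumptions.

```python
def apply_memory_action(mem_state, mem_action):
    # '#' leaves a memory register unchanged; '0'/'1' overwrite it.
    # apply_memory_action("01", "0#") == "01", as in the example above.
    return "".join(a if a != "#" else m
                   for m, a in zip(mem_state, mem_action))

def form_action_set(match_set, chosen, mem_state):
    # match_set: rules as (external_action, memory_action) pairs.
    # iasf: include every rule with the same external action whose
    # internal action leads to the same resulting memory state.
    target = apply_memory_action(mem_state, chosen[1])
    return [rule for rule in match_set
            if rule[0] == chosen[0]
            and apply_memory_action(mem_state, rule[1]) == target]
```

With mem_state "01" and chosen memory action "0#", this admits all of {##, #1, 0#, 01}, exactly the set discussed above.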
4 EXPERIMENTAL ANALYSIS
Experimental Design and Setup – To compare the performance of iasf against regular action set formation, we conducted a series of experiments in the well-known woods101 and woods102 environments [2, 5]. These are mazes where paths towards food locations must be learned; both mazes contain indistinguishable locations where the sensory information (i.e., the layout of the perceivable cells) is identical but the appropriate action differs. To tackle such situations, the agent's controller requires memory to be able to choose the correct action; merely reacting to sensory information cannot suffice. An experiment consists of 10,000 trials where, starting from a random location in the maze, the agent must reach the food. If the agent moves into the cell with food, it receives a reward from the environment and the next trial commences: the food is replaced and the agent is randomly relocated. The agent can see the directly adjacent cells and uses that information to decide on an action: where to move next. Following Bull & Hurst's suggestion, the system is then further tested for an additional 2,000 trials where "the Genetic Algorithm is switched off, reinforcement occurs as usual, and an action selection scheme is used which deterministically picks the action with the largest total fitness in M" [1]. Performance is measured as the moving average over the previous 50 trials of the number of steps it took to reach the food on each trial. See [7, 2] for more detailed descriptions of the experimental setup. We performed experiments with a memory size of 1 in woods101 and 8 in woods102, with Wilson's default parameter set for ZCS [7]. Given the more demanding characteristics of woods102, we used a larger population size (N = 2000) there. Results – Figures 1 and 2 show the results of experiments averaged over 30 runs; the lighter horizontal line shows the optimal average performance for each environment (2.9 steps for woods101 and 3.23 for woods102 [5]). The horizontal axes show the number of trials into the experiment. Analysis – Although the change in asf technique is an intuitive one, and one that fulfils the LCS design goal of equal credit assignment to the classifiers producing the level of reward coming from the environment, no benefit in performance can be gleaned from the results of our experiments. In both cases, ZCSMiasf performed at substantially the same level as traditional ZCSM; only in woods102 can we see some slight (not statistically significant) improvement.
Figure 2. Performance comparison in woods102 with 8 memory bits (steps to food vs. number of trials; ZCSM8, ZCSM8iasf, optimum).
Because this is the more challenging of the two environments [5], this may indicate that performance in more complex environments and tasks can benefit from iasf, but this remains an issue for further investigation.
5 CONCLUSIONS
We have extended the way action sets are formed in classifier systems with memory registers, taking them closer to the design goal of equal credit assignment to the classifiers whose actions cause the level of reward supplied by the environment. We have validated our extension experimentally in partially observable environments using the Zeroth Level Classifier System. The environments on which experiments were performed are well known in the existing literature on the subject. The experiments showed no significant improvement in performance; further investigation is required to see whether such improvement does occur in more complex environments. Still, the current results can be considered valuable, since the new method is more in line with the general design goal of equal credit assignment than the traditional method. In stochastic environments, where the ZCS algorithm has previously been shown to outperform the more widely known XCS [6], rule quality estimation can be expected to take on a more significant role, which leads us to think that our extension will provide more significant benefits in partially observable instances of those problems. Again, further investigations are required to validate this assumption.
REFERENCES [1] Larry Bull and Jacob Hurst, ‘ZCS redux’, Evolutionary Computation, 10(2), 185–205, (2002). [2] Dave Cliff and Susi Ross, ‘Adding temporary memory to zcs’, Adaptive Behavior, 3(2), 101–150, (1994). [3] John H. Holland, ‘Escaping brittleness: the possibilities of generalpurpose learning algorithms applied to parallel rule-based systems’, in Machine learning, an artificial intelligence approach, eds., R.S. Michalski, J.G. Carbonell, and T.M. Mitchell, volume 2, Morgan Kaufmann, (1986). [4] Pier Luca Lanzi, ‘An analysis of the memory mechanism of XCSM’, in Genetic Programming 1998: Proceedings of the Third Annual Conference, eds., John R. Koza, Wolfgang Banzhaf, Kumar Chellapilla, Kalyanmoy Deb, Marco Dorigo, David B. Fogel, Max H. Garzon, David E. Goldberg, Hitoshi Iba, and Rick Riolo, pp. 643–651, San Francisco, CA, USA, (22-25 July 1998). Morgan Kaufmann. [5] Pier Luca Lanzi and Stewart W. Wilson, ‘Toward optimal classifier system performance in non-markov environments’, Evolutionary Computation, 8(4), 393–418, (2000). [6] Christopher Stone and Larry Bull, ‘Comparing XCS and ZCS on noisy continuous-valued environments’, Technical Report UWELCSG05-002, Learning Classifier Systems Group, University of the West of England, Bristol, UK, (2005). [7] Stewart W. Wilson, ‘ZCS: A zeroth level classifier system’, Evolutionary Computation, 2(1), 1–18, (1994).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-763
An Ensemble of Classifiers for Coping with Recurring Contexts in Data Streams
Ioannis Katakis, Grigorios Tsoumakas and Ioannis Vlahavas1
Abstract. This paper proposes a general framework for classifying data streams by exploiting incremental clustering in order to dynamically build and update an ensemble of incremental classifiers. To achieve this, a transformation function that maps batches of examples into a new conceptual feature space is proposed. The clustering algorithm is then applied in order to group different concepts and identify recurring contexts. The ensemble is produced by maintaining a classifier for every concept discovered in the stream2.
1 INTRODUCTION
Recent advances in sensor, storage, processing and communication technologies have enabled the automated recording of data, leading to fast and continuous flows of information, referred to as data streams. The dynamic nature of data streams requires continuous or at least periodic updates of the current knowledge in order to ensure that it always includes the information content of the latest batch of data. This is important in applications where the concept of a target class and/or the data distribution changes over time. This phenomenon is commonly known as concept drift. A very special type of concept drift is that of recurring contexts [5]. In this case, concepts that appeared in the past may recur in the future. Although the phenomenon of reappearing concepts is very common in real-world problems (weather changes, buyer habits, etc.), only a few methods take it into consideration [3-5]. In this paper we propose an ensemble of classifiers that utilizes a new representation model for data streams suitable for problems with recurring contexts.
2 TRANSFORMATION FUNCTION
First, the data stream is separated into a number of small batches of examples. Each batch is transformed into a conceptual vector that is constructed out of a number of conceptual feature sets. Each feature set corresponds to a feature from the initial feature space. Let's assume that unlabeled (U) and labeled (L) examples are represented as vectors $\vec{x}_U = (x_1, x_2, \ldots, x_n)$ and $\vec{x}_L = (x_1, x_2, \ldots, x_n, c_j)$, where $x_i$ is the value of the feature $f_i$, and $c_j \in C$, with $C$ being the set of available classes. Let $B_U$ and $B_L$ be a batch of unlabeled and labeled instances of size $b$:
$$B_U = \{\vec{x}_{U(k)}, \vec{x}_{U(k+1)}, \ldots, \vec{x}_{U(k+b-1)}\}, \qquad B_L = \{\vec{x}_{L(k)}, \vec{x}_{L(k+1)}, \ldots, \vec{x}_{L(k+b-1)}\}$$
1 Department of Informatics, Aristotle University of Thessaloniki, 54124 Greece, email: {katak, greg, vlahavas}@csd.auth.gr
2 The full version of this paper as well as the datasets used for evaluation can be found at: http://mlkd.csd.auth.gr/concept_drift.html
Every batch of examples ($B_L$) is transformed into a conceptual vector $\vec{Z} = (z_1, z_2, \ldots, z_n)$, where the $z_i$ are the conceptual feature sets. For every batch $B_L$ and feature $f_i$ of the original feature space the conceptual feature sets are calculated as follows:
$$z_i = \begin{cases} \{P^v_{i,j} : j = 1..m,\ v \in V_i\} & \text{if } f_i \text{ is nominal} \\ \{(\mu_{i,j}, \sigma_{i,j}) : j = 1..m\} & \text{if } f_i \text{ is numeric} \end{cases}$$
where $P^v_{i,j} = P(f_i = v \mid c_j)$ and $i \in [1, n]$, $j \in [1, m]$, $v \in V_i$, with $V_i$ the set of values of the nominal attribute $f_i$. $P^v_{i,j}$ is taken to be equal to $n_{v,j} / n_j$, where $n_{v,j}$ is the number of samples of class $c_j$ having the value $v$ at attribute $i$ in batch $B_L$ and $n_j$ is the number of samples belonging to $c_j$ in batch $B_L$. For numeric attributes we use the mean ($\mu_{i,j}$) and standard deviation ($\sigma_{i,j}$) of attribute $f_i$ for samples of class $c_j$ in batch $B_L$. The notion behind this representation is that every element of the conceptual vectors expresses to what degree a feature characterizes a certain class. Consequently, the conceptual distance between two batches $B_L^{(\mu)}$ and $B_L^{(\nu)}$ can be defined as the Euclidean distance of the corresponding conceptual vectors:
$$\mathrm{ConDis}(B_L^{(\mu)}, B_L^{(\nu)}) = \mathrm{Euclidean}(\vec{Z}^{(\mu)}, \vec{Z}^{(\nu)}) = \left( \mathrm{dis}(z_1^{(\mu)}, z_1^{(\nu)})^2 + \ldots + \mathrm{dis}(z_n^{(\mu)}, z_n^{(\nu)})^2 \right)^{1/2}$$
where $\mathrm{dis}(z_i^{(\mu)}, z_i^{(\nu)})^2 = (\zeta_{i1}^{(\mu)} - \zeta_{i1}^{(\nu)})^2 + \ldots + (\zeta_{il}^{(\mu)} - \zeta_{il}^{(\nu)})^2$, $\zeta_{ij}^{(\mu)}$ is the $j$-th element of the $i$-th conceptual feature set of the vector $\mu$, and $l$ is the length of the feature set. This mapping procedure tries to ensure that the more conceptually similar two batches are, the closer in distance their corresponding conceptual vectors will be. The definition of this distance will also be beneficial for the clustering algorithm of the framework we present in the following section.
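To make the mapping concrete, here is a minimal sketch for nominal features only; the numeric case (per-class means and standard deviations) is analogous and omitted for brevity. The batch representation and function names are assumptions, not the authors' code.

```python
import math
from collections import defaultdict

def conceptual_vector(batch, n_features, classes, values):
    # batch: iterable of (x, c) pairs with x a tuple of nominal values.
    counts = defaultdict(int)        # (i, v, c) -> n_{v,c}
    class_counts = defaultdict(int)  # c -> n_c
    for x, c in batch:
        class_counts[c] += 1
        for i in range(n_features):
            counts[(i, x[i], c)] += 1
    # One entry P^v_{i,j} = n_{v,j} / n_j per (feature, value, class).
    return [counts[(i, v, c)] / class_counts[c] if class_counts[c] else 0.0
            for i in range(n_features)
            for v in values[i]
            for c in classes]

def con_dis(z_mu, z_nu):
    # ConDis: Euclidean distance between two conceptual vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_mu, z_nu)))
```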
3 THE CCP FRAMEWORK
The main components of the CCP (Conceptual Clustering and Prediction) framework (Fig. 1) are: a) a mapping function (M), that transforms data into conceptual vectors, b) an incremental clustering algorithm (R), that groups conceptual vectors into clusters, and c) an incremental classifier (h) for every concept discovered. The pseudocode of the framework can be seen in Fig. 2. What is maintained as time (t) passes is a set of clusters $G_t = \{g_1, g_2, \ldots, g_q\}$ and a set of corresponding classifiers $H_t = \{h_1, h_2, \ldots, h_q\}$. Classifier $h_i$ is trained from batches that belong conceptually to cluster $g_i$. Initially, $G_0 = \emptyset$, $H_0 = \emptyset$. By classifying the current batch according to the classifier built from the cluster of the previous batch we make a kind of locality assumption: we assume that successive batches (of small size) will most of the time belong to the same concept.
Fig. 1. Clustering conceptual vectors into concepts.

CCP Framework:
begin
  for i = 1 to ∞ do
    Z_{i-1} = M.getConceptualVectorOf(B_L(i-1))
    g′ = R.getClusterOf(Z_{i-1})
    R.update(Z_{i-1})
    h_{g′}.update(B_L(i-1))
    h_{g′}.classify(B_U(i))
end

Fig. 2. The main operation of the CCP framework.
4 EVALUATION
Datasets. The first two datasets (usenet1, usenet2) are based on the 20 newsgroups collection [1]. They simulate a stream of messages from different newsgroups that are sequentially presented to a user, who then labels them as interesting or junk, according to his/her personal interests. Table 1 shows which messages are considered interesting (+) or junk (−) in each time period. The third dataset is based on the Spam Assassin collection and contains both spam and legitimate messages.
Table 1. Datasets Usenet1 and Usenet2: each of the topics medicine, space and baseball is marked as interesting (+) or junk (−) in each time period (0-300, 301-600, 601-900, 901-1200, 1201-1500); the labelling alternates between periods so that earlier concepts recur later in the stream.

Methods. Evaluation involves the following methods. Simple Incremental Classifier (SIC): maintains only one classifier, which incrementally updates its knowledge. Time Window (TW): classifies incoming instances based on the knowledge of the latest N examples. Weighted Examples (WE): consists of an incremental classifier that supports weighted learning; bigger weights are assigned to more recent examples in order to focus on new concepts. An incremental naive Bayes classifier is used as the base classifier for the above methods. Our implementation of the CCP framework includes the mapping function discussed in Section 2, the Leader-Follower algorithm described in [2] as the clustering component, and an incremental naive Bayes classifier. Preliminary experiments showed that a batch size of around 50 instances is appropriate: larger batches invalidate the locality assumption, whereas smaller batches do not suffice for calculating the summary probabilistic statistics. The experiments also include a benchmark version of our framework (dubbed Oracle), where perfect clustering assignments are manually provided to the system. This allows the study of the maximum performance that can be achieved using the CCP framework.
Results. Table 2 shows the results of the experiments on the three datasets. We notice that even a basic implementation of CCP achieves better performance than all the other methods. Fig. 3 shows the average accuracy over fifty instances for the CCP and WE methods on the Usenet1 dataset. Note the sudden dives of WE's accuracy at drift time-points. In all cases, CCP manages to recover much faster from the drift. Most notably, at the last two drift points, CCP recognizes the recurrent theme and remains accurate. Finally, the performance of Oracle strongly underlines the fact that there is room for improvement by using more advanced incremental clustering algorithms.

Table 2. Accuracy of the four methods in the three datasets.
                        Usenet1   Usenet2   spam
Simple Incremental        0.59      0.73     0.75
Time Window (w=100)       0.56      0.60     0.60
Time Window (w=150)       0.59      0.62     0.64
Time Window (w=300)       0.58      0.70     0.62
CCP (Oracle)              0.81      0.80      -
CCP (Leader-Follower)     0.75      0.77     0.93
Weighted Examples         0.67      0.75     0.91
Fig. 3. Average accuracy over 50 instances for WE and CCP.
5 ACKNOWLEDGMENTS
This work was partially supported by a PENED program (EPAN M.8.3.1, No. 03ΕΔ73), jointly funded by the European Union and the Greek Government (General Secretariat of Research and Technology).
REFERENCES
[1] Asuncion, A. and Newman, D.J., UCI Machine Learning Repository. 2007, University of California, School of Information and Computer Science [www.ics.uci.edu/~mlearn/MLRepository.html]: Irvine, CA.
[2] Duda, R.O., Hart, P.E., and Stork, D.G., Pattern Classification. 2000: Wiley-Interscience.
[3] Forman, G., Tackling Concept Drift by Temporal Inductive Transfer. In 29th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2006. Washington, USA: p. 252-259.
[4] Harries, M.B., Sammut, C., and Horn, K., Extracting Hidden Context. Machine Learning, 1998. 32(2): p. 101-126.
[5] Widmer, G. and Kubat, M., Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning, 1996. 23(1): p. 69-101.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-765
Content-Based Social Network Analysis Paola Velardi† and Roberto Navigli† and Alessandro Cucchiarelli‡ and Mirco Curzi‡ Abstract. Relationships among actors in traditional social network analysis are modelled as a function of the quantity of relations (co-authorships, business relations, friendship, etc.). In contrast, within a business, social or research community, network analysts are interested in the communicative content exchanged by the community members, not merely in the number of relationships. In order to meet this need, this paper presents a novel social network model, in which the actors are not simply represented through the intensity of their mutual relationships, but also through the analysis and evolution of their shared interests. Text mining and clustering techniques are used to capture the content of communication and to identify the most popular topics.
1 SYSTEM DESCRIPTION
This paper presents a model for social network analysis in which, besides analyzing the quantity of relationships (co-authorships, business relations, friendship, etc.), we also analyze their communicative content. Text mining and clustering techniques are used to capture the content of communication and to identify the most popular themes. The social analyst is then able to perform a study of the network evolution in terms of the relevant themes of collaboration, the detection of new concepts gaining popularity, and the existence of popular themes that could benefit from better cooperation. The idea of modeling the content of social relationships is not entirely new. In [1] a method is proposed to discover "semantic clusters" in Google News, i.e. groups of people sharing the same topics of discussion. In [2] the authors propose the Author-Recipient-Topic model, a Bayesian network that captures topics and the directed social network of senders and recipients in a message-exchange context. In both cases the major weakness lies in the rather naive bag-of-words model used for extracting content from documents. For example, in [1] one of the topics around which groups of people are created is "said, bomb, police, london, attack", and in [2] an example is: "section, party, language, contract, …". This problem is common to many existing papers on topic clustering, where the focus is more on the clustering algorithm than on the selection of textual features. In our view, the usefulness of a content-based social analysis (CB-SA) is strongly related to the informative level and semantic cohesion of the learned topics. A simple bag-of-words model seems rather inadequate for capturing the meaning of social communications. Instead, we use a combination of natural language processing and machine learning techniques to obtain very informative clusters, representing the central "topics" of a community.
† Department of Computer Science, University of Roma "La Sapienza", Italy. e-mail: {velardi, navigli}@di.uniroma1.it.
‡ Department of Computer Science, Management and Automation (DIIGA), Polytechnic University of Marche, Italy. e-mail: {cucchiarelli, curzi}@diiga.univpm.it.
In summary (more details on the CB-SA model can be found in [3]), the analysis steps are the following:
1 Concept identification: The objective of this phase is to identify the emergent semantics of a community, i.e. the concepts that best characterize the content of the actors' communications. Concepts are extracted from available texts (hereafter referred to as the domain corpus) exchanged among the members of the community. We use our TermExtractor system [4], a freely available1 high-performing tool to extract the relevant terminology from single documents and entire corpora.
2 Computation of semantic similarity: First, a graph G=(V,E) is built, where V is the set of nodes representing terminological strings (hereafter also denoted as domain concepts) extracted as described in the previous phase, and E is the set of edges. An edge (tj, ti) is added to E if any of the following three conditions holds2: i) a relation holds between the concepts expressed by tj and ti in a domain ontology or thesaurus (e.g. ontology representation is a kind-of knowledge representation); ii) the term ti occurs in a textual definition of tj from a domain glossary (e.g. we add (ontology representation, ontology) to E, as ontology representation is defined as "the description of an ontology in a well-defined language"); iii) the two terms co-occur in the document corpus (we use the Dice coefficient). Given the graph G, for each pair of concepts tj and ti, we compute the set of chains in the graph, i.e. edge paths of length l (l = 1, ..., L, where L is the maximum path length) which connect the two concepts: $LC_l(t_j, t_i) = \{t_j \to t_1 \to t_2 \to \ldots \to t_{l-1} \to t_l \equiv t_i\}$. Finally, we compute the semantic similarity between tj and ti as a function of the corresponding lexical chains between the two concepts:
$$sim(t_j, t_i) = \sum_{l=1}^{L} e^{-l} \, \frac{|LC_l(t_j, t_i)|}{|LC_l(t_j)|} \qquad (1)$$
where $LC_l(t_j)$ denotes the set of all the lexical chains connecting $t_j$ to any other node (i.e. the union of the sets $LC_l(t_j, t_m)$ for all $t_m \in V \setminus \{t_j\}$). According to the above formula, the contribution of the lexical chains of length $l$ is given by the inverse of the exponential of $l$, weighted by the ratio of the number of lexical chains of length $l$ which connect $t_j$ to $t_i$ to the number which connect $t_j$ to any node in the graph. Each domain concept $t_j$ is then associated with an $n$-dimensional vector $x_j$, where $n$ is the total number of extracted concepts, and the $k$-th component of $x_j$ is $x_{jk} = sim(t_j, t_k)$. In the following, we denote with $X$ the space of instance vectors, where $|X| = |V| = n$.
1 http://lcl.uniroma1.it/termextractor
2 The availability of an ontology and glossary is not strictly required. However, we developed tools to facilitate their automatic acquisition [5].
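A brute-force sketch of formula (1): enumerate the simple paths of length at most L starting from t_j, bucket their endpoints by length, and combine the counts with the e^(-l) weighting. This is suitable only for small graphs, and the adjacency-dictionary representation of G is an assumption.

```python
import math
from collections import defaultdict

def chains_by_length(graph, start, L):
    # graph: dict mapping a term to the set of its neighbours in G=(V,E).
    ends = defaultdict(list)  # chain length l -> list of chain end nodes
    def dfs(node, path):
        l = len(path) - 1
        if l >= 1:
            ends[l].append(node)
        if l == L:
            return
        for nxt in graph.get(node, ()):
            if nxt not in path:  # simple paths only
                dfs(nxt, path + [nxt])
    dfs(start, [start])
    return ends

def sim(graph, t_j, t_i, L=3):
    # sim(t_j, t_i) = sum_l e^{-l} * |LC_l(t_j, t_i)| / |LC_l(t_j)|
    ends = chains_by_length(graph, t_j, L)
    return sum(math.exp(-l) * ends[l].count(t_i) / len(ends[l])
               for l in range(1, L + 1) if ends[l])
```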
3 Topic detection. The subsequent step, topic detection, is a clustering task: the objective is to organize concepts in groups, or clusters, so that concepts within a group are more similar to each other than concepts belonging to different clusters. We cluster the concept vectors in X using an empowered version of the k-means algorithm, the k-means++ method for optimal selection of the initial seeds [6]. The best clustering $\mathcal{C}$ is identified using the Silhouette Coefficient validity measure.
4 Social Network Analysis. Social network analysis is applied to the case of a research network, but the approach is fully general. Given the set G of research groups, the set D of members' publications and the collection V of domain concepts, pattern matching is used to tag each publication $d_i$ in D with a subset of domain concepts $V_i \subseteq V$. For any document $d_i$, we compute a vector $v_i$ of $k$ elements $y_{ih}$ (with $k = |\mathcal{C}|$) such that:
$$y_{ih} = \frac{l_{h,i}}{|C_h|} \sum_{j : x_j \in C_h} tf \cdot idf(t_j, d_i)$$
where $x_j$ is the similarity vector associated with concept $t_j$ (as defined in Section 2.2), $l_{h,i}$ is the number of concepts of $C_h$ found in $d_i$, and $tf \cdot idf(\cdot)$ is a standard measure for computing the relevance of a term $t_j$ in a document $d_i$ of a collection D. Therefore each $y_{ih}$ in $v_i$ measures the overlap of $d_i$ with the topic $C_h \in \mathcal{C}$. We finally define a vector $I_{g_i}$, which is the centroid of all the publication vectors of $g_i$. The Content-Based Social Network is then modelled through an undirected graph with:
• the nodes representing the groups $g_i$;
• the edges representing the similarity between nodes, measured by the cosine function:
$$\mathrm{cos\_sim}(g_i, g_j) = \cos(I_{g_i}, I_{g_j}) = \frac{I_{g_i} \cdot I_{g_j}}{\|I_{g_i}\| \, \|I_{g_j}\|} \qquad (2)$$

This formula models the semantic similarity between groups. Traditional and ad-hoc Social Network measures can then be used to support a thorough analysis of the community, as briefly discussed in the experimental section. We also implemented a graphical interface to assist the social analyst in the study of the network: the analyst can perform several tasks, e.g. select a topic and display the intensity of interest and the intensity of collaborations on this topic, show partners with common interests that do not cooperate, identify "central" topics and their shift in time, etc.
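Formula (2) is a plain cosine between group centroids; in code form (a sketch, with groups represented by their centroid vectors):

```python
import numpy as np

def cos_sim(I_gi, I_gj):
    # Cosine similarity between the centroid vectors of two groups.
    a, b = np.asarray(I_gi, float), np.asarray(I_gj, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```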
2 EXPERIMENTS
We applied our method to the study of the INTEROP NoE community, a research network now continuing within the European V_Lab on Enterprise Interoperability3. We collected 1452 full papers or abstracts authored by the INTEROP project members, belonging to 46 organizations. We automatically extracted 728 domain terms and then generated lexical chains, deriving semantic relations from the INTEROP ontology and co-occurrences from the domain papers and glossary. An excerpt of a similarity vector (the arguments are ordered by formula (1)) is:
activity_diagram = (class_diagram (1), process_analysis (0.630), software_engineering (0.493), enterprise_software (0.488), deployment_diagram (0.468), bpms_paradigm (0.467), workflow_model (0.444), model-driven_architecture (0.442), workflow_management (0.418), ...)

3 http://www.interop-vlab.eu/

Finally, the concept vectors built from concept chains were used to feed the k-means++ algorithm. The cluster validity measure was computed for increasing values of k, 50 ≤ k ≤ 300. Clustering results in the range 140 ≤ k ≤ 170 show the best Silhouette values. Figure 1 shows the best-rated cluster (according to its Silhouette) for k=150:

cluster 19 = { common_ontology, core_domain_ontology, core_ontology, domain_ontology, enterprise_ontology, federated_ontology, ontology_alignment, ontology_analysis, ontology_application, ontology_architecture, ontology_maintenance, ontology_mediation, ontology_merging, ontology_representation, ontology_validation, ontology_versioning, reference_ontology }

Figure 1. The best clusters obtained with k=150.

The model described in this paper allows the social analyst to extract information that is not available with standard social analysis tools. For the sake of brevity, we show only the example of Figure 2, in which nodes represent the groups (the node dimension is proportional to the number of publications of the associated group, in turn related to the dimension of the NoE research groups), bent edges represent the similarity of interests (formula (2)) and curved edges the co-authorship. In the figure, only edges above a user-defined threshold are shown. It is also possible to focus the analysis on a selectable subset of topics. The visualization is very useful for discovering groups that could potentially cooperate but do not actually have common activities, thus allowing a better coordination of the network. A lot of other relevant information can be extracted from the CB-SA model; the interested reader is referred to [3].

Figure 2. Groups with highest co-authorship and strongest common interests.
REFERENCES [1] J. Dhiraj and D. Gatica-Perez: ‘Discovering Groups of people in Google News’, Proc. Of HCM’06 , Santa Barbara, CA, USA, (2006) [2] A. McCallum, A. Corrada-Emmanuel and X. Wang: ‘Topic and Role Discovery in Social Networks’. Proc. Int. Joint Conf. on Artificial Intelligence, (2005). [3] P. Velardi, R. Navigli, A. Cucchiarelli and F. D’Antonio, and ‘A New content-based model for social network analysys’, Proc. of IEEE Int. Conf. on Semantic Computing, S. Clara, USA, August 2008 [4] F. Sclano and P. Velardi, ‘TermExtractor: a Web Application to Learn the Common Terminology of Interest Groups and Research Communities’, Proc. of 9th Conf. on Terminology and Artificial Intelligence (TIA 2007), Sophia Antinopolis, 2007. [5] P. Velardi, A. Cucchiarelli and M. Petit, ‘A Taxonomy learning Method and its Application to Characterize a Scientific Web Community’, IEEE Transaction on Data and Knowledge Engineering (TDKE), Vol. 19, N. 2, 180-191, (2007). [6] D. Arthur and S. Vassilivitski, ‘k-means++: The Advantages of Careful Seeding’, Proc. of the 18th ACM-SIAM Symp. on Discrete Algorithms, New Orleans, Louisiana, 1027-1035, 2007.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-767
Efficient Data Clustering by Local Density Approximation
Marc-Ismaël Akodjènou and Patrick Gallinari 1
Abstract. The clustering task is a key part of the data mining process. In today's context of massive data, methods with a computational complexity that is more than linear are unlikely to be applied practically. In this paper, we begin with a simple assumption: local projections of the data should allow one to distinguish local cluster structures. From there, we describe how to obtain "pure" local sub-groupings of points from projections on randomly chosen lines. The clustering of the data is obtained from the clustering of these sub-groupings. Our method has linear complexity in the dataset size, and requires only one pass over the original dataset. Being local in essence, it can handle the twisted geometries typical of many high-dimensional datasets. We describe the steps of our method and report encouraging results.
1 INTRODUCTION
Clustering is a well-known basic building block of many data mining processes. It consists in automatically identifying natural groupings of points in a dataset. In the past forty years, an abundant literature has flourished on the subject. Many methods, and variants of methods, have been proposed, each with their qualities and weaknesses. A recent survey can be found in [3]. The unceasingly increasing size of today's information-generating processes causes datasets to often have a large volume: a large number of instances and high dimensionality. These two problems have often been tackled separately in the literature. On one hand, handling high dimensions is tricky. High-dimensional data often contain non-convex, twisted cluster shapes and noise. It is very common that the clusters are not full-dimensional but have a low intrinsic dimensionality, and lie on non-linear manifolds. The concentration-of-measure phenomenon adds even more difficulty in that distances, Euclidean included, tend to lose their meaning. Many approaches based either on density or on distance suffer greatly from this phenomenon. Popular techniques which give good results, like Spectral Clustering, are unfortunately quadratically complex either in the dimension or in the number of instances. On the other hand, to cope with a massive number of instances, grid-based clustering methods rely on a partitioning of the data space into cells and then aggregate those cells to form the final clustering. The implicit assumption is that a particular cell is "pure", in that it contains only points from the same cluster. With a proper grid resolution the assumption is quite reasonable, and the time complexity is usually linear in the number of instances. Unfortunately, because of the density-based aggregation process, in general the performance degrades quickly as the dimension increases.
1 LIP6 - Université Paris 6 Pierre et Marie Curie, France, email: {Marc-Ismael.Akodjenou, Patrick.Gallinari}@lip6.fr
How could one have the best of both worlds: keeping linear complexity in N, but overcoming the curse of dimensionality linearly in d too? In this paper, we propose to keep, but relax, the notion of a cell in grid-based clustering into the notion of a "sub-grouping". A sub-grouping is a "pure" subset of points obtained with cheap projections of the data on local lines. Why a set of local lines? First, a line projection is computationally cheap. Second, the use of locality to overcome bad dimensionality effects is transversal to many clustering and dimensionality-reduction approaches. For example in the approach of [2], or in Subspace Clustering, local Euclidean distances prove to be pertinent even when they are globally inadequate. After the projection step, the clustering of the data is obtained through the clustering of the sub-groupings. The key assumption is that sub-groupings coming from the same cluster have common points. The method is designed to be of linear complexity in the dataset size. Moreover, aware of the access costs of today's databases, it requires only one pass over the original dataset. Throughout this paper, we will use the following notation: the dataset $X \subset \mathbb{R}^d$ is a matrix of N datapoints. M is the number of lines under consideration and m is the number of closest lines for each point. Sub-groupings are sets of indices $S_1, \ldots, S_P \subset [1, N]$. K is the number of clusters. The dot product is noted $\langle \cdot, \cdot \rangle$, the Euclidean distance $\|\cdot\|$, and the cardinality of a set $|\cdot|$.
2 CLUSTERING FROM SUB-GROUPINGS
Figure 1. Projections on nearest lines leave small dense zones on the lines.
The idea of our method is depicted in Figure 1. We pierce the dataset with M randomly-oriented lines. We then "shatter" the dataset by orthogonally projecting the points on the lines. As we will see, each particular point is close to only a small number of lines; we project each point only on its m closest lines (in the sense of the orthogonal distance). The small dense zones left by the projection of the data are likely to be "pure" in terms of cluster memberships. It is such a dense zone (precisely, the indices of the datapoints in it) that we call a sub-grouping S of points. As two sub-groupings issued from the same cluster are likely to have points in common, we propose to
cluster the subgroupings Sj first, and to deduce the clustering of the original data from the clustering of the Sj .
2.1 Projections and Sub-Groupings
What would be a set of lines likely to yield good sub-groupings? Under the linear-complexity constraint, all that is left is to choose lines at random. However, as most of the space is empty in high dimensions, it is reasonable to require that each line is close to at least one point. For this, we take a random datapoint in X to be the "origin" of the line. Precisely, we choose M lines $L_k$ defined by their origin-orientation pairs $(y_k, u_k)$. The $y_k \in X$ are taken randomly, and the vectors $u_k$ are taken randomly on the unit sphere. For a point $x$, its projection on the line $L_k$ is $\mathrm{proj}_k(x) = \langle x - y_k, u_k \rangle$ and its orthogonal distance to it is $\sqrt{\|x - y_k\|^2 - \langle x - y_k, u_k \rangle^2}$.
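In code, the line-sampling and projection step might look as follows. This is a vectorized NumPy sketch that materializes an N×M×d array, trading memory for clarity; a streaming implementation would process lines one at a time.

```python
import numpy as np

def random_lines(X, M, rng):
    # Each line L_k is an (origin y_k, unit orientation u_k) pair;
    # origins are drawn from the data so every line is close to a point.
    origins = X[rng.integers(len(X), size=M)]
    dirs = rng.normal(size=(M, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return origins, dirs

def project_on_closest_lines(X, origins, dirs, m):
    diff = X[:, None, :] - origins[None, :, :]      # (N, M, d)
    proj = np.einsum('nmd,md->nm', diff, dirs)      # <x - y_k, u_k>
    orth = np.sqrt(np.maximum((diff ** 2).sum(-1) - proj ** 2, 0.0))
    nearest = np.argsort(orth, axis=1)[:, :m]       # m closest lines per point
    return proj, nearest
```

Each point is then attached only to its m closest lines; for each line L_k, the set I_k collects the indices of the points it received.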
Figure 2. Typical distance-to-lines histogram for a point (proportion of lines vs. distance).

The sub-groupings we wish to find are small dense zones on the lines. Projecting all points on all lines (that is, ignoring the distances of points to lines) would not allow us to distinguish anything. Only the closest lines of a point are likely to give good dense zones. A typical histogram of the distances of a point to the lines can be seen in Figure 2. The normally distributed component, often met in high-dimensional settings, is always present, but it is preceded by a small number of "close" lines. It is those lines that we select as candidates to obtain meaningful sub-groupings for the point. For this, we sort the distances to the lines for each point, and we keep its m closest lines. Our experiments show that a very small m (around 5 or 10) suffices to yield an efficient clustering. More sophisticated selection techniques based on a thresholding of the distance histograms could be used. Note that at the end of this step, each line $L_k$ is associated with a set $I_k$ of points. We now have to identify the sub-groupings on each line. This is done by finding the modes of the projection of the points of $I_k$ on the line $L_k$. We do this the classical way: we use a Gaussian kernel density estimate $\hat{f}_k(t) = \sum_{x \in I_k} K\left(\frac{t - \mathrm{proj}_{L_k}(x)}{h}\right)$, where $K(\cdot)$ is a Gaussian kernel, to model the density on the line, and find the modes by identifying the valleys of this density, which is very fast (there are only $|I_k|$ points in each kernel estimate). Each mode of this density yields a sub-grouping $S \subset [1, N]$. The collection of all sub-groupings $S_1, \ldots, S_P$ obtained from the M lines is the representation forwarded to the hierarchical clustering step.
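A sketch of this mode-finding step: an unnormalized Gaussian kernel density estimate is evaluated on a grid, and the sub-groupings are read off between consecutive valleys. The bandwidth h and the grid resolution are assumptions.

```python
import numpy as np

def subgroupings_on_line(proj, point_ids, h=0.1, grid_points=200):
    # proj: 1-d array of projections of the points in point_ids on one line.
    # Unnormalized Gaussian KDE evaluated on a regular grid; sub-groupings
    # are the index sets falling between consecutive valleys of the density.
    grid = np.linspace(proj.min(), proj.max(), grid_points)
    dens = np.exp(-0.5 * ((grid[:, None] - proj[None, :]) / h) ** 2).sum(axis=1)
    # valleys = local minima of the density along the grid
    valleys = grid[1:-1][(dens[1:-1] < dens[:-2]) & (dens[1:-1] < dens[2:])]
    cuts = np.concatenate(([-np.inf], valleys, [np.inf]))
    ids = np.asarray(point_ids)
    return [set(ids[(proj > lo) & (proj <= hi)])
            for lo, hi in zip(cuts[:-1], cuts[1:])
            if ((proj > lo) & (proj <= hi)).any()]
```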
2.2 Hierarchical Clustering of Sub-Groupings

It now remains to cluster the sub-groupings. Each sub-grouping $S_j$ is a subset of $[1, N]$. As mentioned above, sub-groupings coming from the same cluster are likely to have points in common. Following this assumption, it is natural to use the Jaccard distance in the hierarchical merging procedure. The distance $d_{Jacc}(S, S') = 1 - \frac{|S \cap S'|}{|S \cup S'|}$ measures the affinity of two sets by means of the ratio of the points they have in common to the total number of points in the two sets. At the end of this clustering step the sub-groupings $S_k$ are clustered into K groups $C_1, \ldots, C_K$. The clustering of the original datapoints is directly deduced: the cluster of the point $x_i \in X$ is the $C_k$ in which $x_i$ is most present, "presence" being measured by the number of times $x_i$ appears in the sub-groupings of $C_k$: $\mathrm{clusterid}(x_i) = \arg\max_k \sum_{S \subset C_k} \mathbf{1}(x_i \in S)$.

3 COMPLEXITY

The complexity of the approach is linear in the dataset size. The projection/distance calculation step consists of M projections of the dataset, that is $O(NMd)$. Sorting the distances has worst-case complexity $O(NM \log M)$. Identifying the sub-groupings on the lines means finding the modes of the kernel estimates, that is $O(M\bar{N})$, where $\bar{N}$ is the mean size of the $I_k$'s. The hierarchical clustering step is $O(M^2 \log M)$, which yields an overall linear computational complexity of $O(NMd + (NM + M^2)\log M)$.

4 EMPIRICAL EVALUATION

We evaluate our approach on three popular datasets: USPS2 (handwritten digits 3, 5, 6 and 8), Coil203 and Umist4 (image datasets). The characteristics (N, d, c), c being the number of true classes, and the parameters (M, m) used by our method are shown in the table. We compare our method with k-means and Spectral Clustering [1]. In the results table we use two popular criteria: the first is the NMI criterion, which measures the structural agreement between the clustering and the classes, while the Purity criterion expresses the homogeneity of the clusters with respect to the classes. Both take values in [0, 1]. Results shown are an average over 10 runs for k-means and our method. Spectral Clustering has been tuned to give its best results.
                       USPS-3568       Coil20           Umist
(N, d, c)              4400, 256, 4    1440, 1024, 20   564, 10304, 20
(M, m)                 400, 4          500, 4           1000, 10
NMI     K-means        0.36            0.76             0.65
        Spectral       0.31            0.91             0.73
        Our method     0.59            0.87             0.83
Purity  K-means        0.52            0.65             0.50
        Spectral       0.52            0.87             0.61
        Our method     0.49            0.81             0.72
The results show that our method exhibits performance similar, and sometimes superior, to Spectral Clustering. These results encourage us to think that the method, though simple and of linear complexity, performs well; the locality of the approach seems to be appropriate. In future work, we will take a closer look at the selection of M and m. We will also examine whether or not the use of distances other than the Euclidean for constituting the local sub-groupings can give better performance (the cosine distance, for example).
REFERENCES [1] I. Fischer and J. Poland, ‘Amplifying the block matrix structure for spectral clustering’, in Proceedings of the 14th Annual Machine Learning Conference of Belgium and the Netherlands, pp. 21–28, (2005). [2] Wanli Min, Ke Lu, and Xiaofei He, ‘Locality pursuit embedding’, Pattern Recognition Journal, 37(4), 781–788, (2004). [3] Xu Rui and D. Wunsch, ‘Survey of clustering algorithms’, in IEEE Transactions on Neural Networks, volume 16, pp. 645–678, (2005). 2 3 4
2 http://cervisia.org/machine learning data.php
3 http://www1.cs.columbia.edu/CAVE/software/softlib/coil-20.php
4 http://www.cs.toronto.edu/~roweis/data.html
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-769
Gas Turbine Fault Diagnosis using Random Forests
Manolis Maragoudakis 1 and Euripides Loukis 1 and Panayotis-Prodromos Pantelides 1
Abstract. In the present paper, Random Forests are used in a critical and at the same time non-trivial problem concerning the diagnosis of Gas Turbine blading faults, showing promising results. Random forests-based fault diagnosis is treated as a Pattern Recognition problem, based on measurements and feature selection. Two different ways of inserting randomness into the trees are studied, based on different theoretical assumptions. The classifier is compared against other Machine Learning algorithms such as Neural Networks, Classification and Regression Trees, Naive Bayes and K-Nearest Neighbor. The performance of the prediction model reaches a level of 97% in terms of precision and recall, improving on the existing state-of-the-art levels achieved by Neural Networks by 1.5%-2%.
1 INTRODUCTION
Development of effective Gas Turbine Condition Monitoring and Fault Diagnosis methods has been the target of considerable research in recent years. This is due to the high cost, sensitivity and importance of these engines for most industrial companies. Most of this research is directed towards the diagnosis of Gas Turbine blading faults, because of the catastrophic consequences that these faults can have if they are not diagnosed in time. Even very small blading faults can grow very rapidly and result in huge damage ([1], [2], [3]). Blading fault diagnosis is regarded to be a very difficult problem, because of the high levels of noise in all relevant measurements and the high interaction between the numerous Gas Turbine blading rows. Therefore, it is very important to take advantage of the processing power of modern computers, in order to provide a fast and reliable engine condition diagnosis from the available measurements and to develop the highest possible level of intelligence and assistance for the operation and maintenance personnel. The Gas Turbine Blading Fault Diagnosis problem was originally addressed in [4] and [5], based on classical pattern recognition methods. Our contribution to the domain is the introduction of an ensemble classifier, namely Random Forests, for the first time for the task at hand, which outperforms all previous attempts at Gas Turbine Blading Fault Diagnosis. Furthermore, Random Forests can provide some insight into the interrelationships between input features, unlike Neural nets, thus directing domain experts in selecting which measurement tools to use in real-world applications.
2 PROBLEM & DATA DESCRIPTION

The present work is based on data acquired from dynamic measurements on an industrial Gas Turbine into which different faults were artificially introduced. During the experimental phase four categories of measurements were performed simultaneously:
1. Unsteady internal wall pressure (using fast response transducers P2 to P5).
2. Casing vibration (using accelerometers A1 to A6 mounted to the outside compressor casing).
3. Shaft displacement at compressor bearings (using transducer B).
4. Sound pressure levels (using double-layer microphone M).
Five experiments were performed, testing the datum healthy engine and a similar engine with the following four typical small (but quite rapidly growing, as mentioned in the introductory section) and also not straightforwardly diagnosable faults:
1. Fault-1: Rotor fouling.
2. Fault-2: Individual rotor blade fouling.
3. Fault-3: Individual rotor blade twisted (by approx. 8 degrees).
4. Fault-4: Stator blade restaggering.
Tests were performed at four different engine loads (full load, half load, quarter load and no load), both for the healthy engine and for the engine with each of the above four faults. At each load, four series of time-domain data were acquired for each instrument (two series in each of the two sampling frequencies, l = 13 kHz and m = 32 kHz). 12 different measuring instruments were used, and measurements were taken for every possible combination of the engine's 5 operational conditions (healthy engine and 4 faulty conditions), 4 engine loads (full load, half load, quarter load and no load) and 2 sampling frequencies (low and high). More precisely, for the engine's healthy condition, measurements were taken for every combination of engine load and sampling frequency (8 different combinations in total). For the faulty conditions, there was one additional measurement series for all the above combinations. Consequently, for every instrument we have 72 different measurements in total: 8 healthy-engine measurements and 64 faulty-engine measurements. For every instrument, each of the above measurements consists of 27 values that are forms of the spectral difference of the first 27 harmonics of the rotor shaft's rotational frequency. If we were to present the entirety of the data in a database, it would be composed of 864 instances described by 27 distinct attributes, corresponding to the 27 harmonics.
1 University of the Aegean, Department of Information and Communication Systems Engineering, Samos, Greece
3 RANDOM FORESTS
Despite the fact that Random Forests have been quite successful in classification and regression tasks, to the best of our knowledge there has been no research on using this algorithm for Gas
Turbine Fault Diagnosis. Random Forests are a combination of tree classifiers such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. A Random Forest multi-way classifier Θ(x) consists of a number of trees, each grown using some form of randomization, where x is an input instance [8]. The leaf nodes of each tree are labeled by estimates of the posterior distribution over the data class labels. Each internal node contains a test that best splits the space of data to be classified. A new, unseen instance is classified by sending it down every tree and aggregating the reached leaf distributions. To make the classification process more formal, suppose that the joint classifier Θ(x) contains K individual classifiers Θ1(x), Θ2(x), ..., ΘK(x). Let us also assume that each data instance is a pair (x, y), where x denotes the input attributes, taken from a set Ai, i = 1, ..., M, and y belongs to the set of class labels Lj, j = 1, ..., c (c is the number of class values). For simplicity, the correct class will be denoted as y, without any indices. Each discrete attribute Ai takes values vk from a set Vi, k = 1, ..., mi (mi is the number of values attribute Ai has). Finally, the probability that attribute Ai has value vk is denoted by p(vi,k), the probability of a class value yj is denoted by p(yj), and the probability of class yj given that attribute Ai has value vk is denoted by p(yj | vi,k). Each training example is picked from a set of N instances at random with replacement. Through this procedure, called bootstrap replication, about 36.8% of the training examples are not used in the construction of each tree. These out-of-bag (oob) instances allow for computing the degree of strength and correlation of the forest structure. Suppose that Ok(x) is the set of oob instances of classifier Θk(x). Furthermore, let Q(x, yj) denote the proportion of oob votes for class yj at input example x. An estimate of p(Θ(x) = yj) is given by the following equation:
$$Q(x, y_j) = \frac{\sum_{k=1}^{K} I\big(\Theta_k(x) = y_j;\ (x, y) \in O_k\big)}{\sum_{k=1}^{K} I\big(\Theta_k(x);\ (x, y) \in O_k\big)} \qquad (1)$$
where I(·) is the indicator function. The margin function, which measures the extent to which the average vote for the right class y exceeds the average vote for any other class label, is computed by:

$$\mathrm{margin}(x, y) = P(\Theta(x) = y) - \max_{j=1,\dots,c;\ y_j \neq y} P(\Theta(x) = y_j) \qquad (2)$$

Since strength is defined as the expected margin, it is computed as the average over the training set:

$$s = \frac{1}{n} \sum_{i=1}^{n} \Big( Q(x_i, y) - \max_{j=1,\dots,c;\ y_j \neq y} Q(x_i, y_j) \Big) \qquad (3)$$

The average correlation is given by the variance of the margin over the square of the standard deviation of the forest:

$$\rho = \frac{\mathrm{Var}(\mathrm{margin})}{\sigma(\Theta(\cdot))^{2}} \qquad (4)$$

where Q(x, y_j) is estimated for every input example x in the training set.
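To make these quantities concrete, the following is a minimal numpy sketch of Equations (1)-(3), assuming the per-tree votes and out-of-bag membership are available as arrays; the array layout and names are our own illustration, not the authors' implementation.

```python
import numpy as np

def oob_estimates(votes, oob_mask, y_true):
    """Sketch of Eqs. (1)-(3): oob class-vote shares Q(x, y_j),
    per-instance margins, and the forest strength s.

    votes    : (K, n) array, votes[k, i] = class voted by tree k for instance i
    oob_mask : (K, n) boolean array, True where instance i is oob for tree k
    y_true   : (n,) array of true class labels
    """
    K, n = votes.shape
    classes = np.unique(y_true)
    Q = np.zeros((n, len(classes)))
    den = np.maximum(oob_mask.sum(axis=0), 1)                 # oob votes per instance
    for j, c in enumerate(classes):
        Q[:, j] = ((votes == c) & oob_mask).sum(axis=0) / den  # Eq. (1)
    true_idx = np.searchsorted(classes, y_true)
    q_true = Q[np.arange(n), true_idx]
    Q_wrong = Q.copy()
    Q_wrong[np.arange(n), true_idx] = -np.inf                 # mask out the true class
    margin = q_true - Q_wrong.max(axis=1)                     # Eq. (2), estimated via Q
    strength = margin.mean()                                  # Eq. (3)
    return Q, margin, strength
```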
4 EXPERIMENTAL RESULTS
We applied two versions of Random Forests (Random Input (RI) Forests and Random Combination (RC) Forests) to the Gas Turbine data set, using oob estimates. As evaluation metrics, we considered per-class precision and recall. Accuracy is not actually a good metric in domains such as the one at hand, because a classifier may achieve high accuracy by simply always predicting the non-faulty class. This problem is particularly acute in the present task, where more than 2/5 of the data set belongs to that class. A set of well-known machine learning techniques constituted the benchmark to which our results were compared: Multi-layer Perceptron Neural Networks, Naive Bayes, Classification and Regression Trees (CART), and k-Nearest Neighbor (kNN) instance-based learning. Cross-validation was performed with kNN in order to determine the best k. Regarding the Random Forests implementation, the best results were obtained using 500 trees and 6 features. Due to lack of space, the evaluation outcome is depicted only in the following figure, for the precision metric (F1 to F4 denote the fault categories and OK denotes the non-faulty state).

Figure 1. Evaluation results in terms of precision for all methodologies.
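For reference, a forest with the settings reported above (500 trees, 6 features per split) and oob estimation can be reproduced in a few lines; this is a minimal sketch using scikit-learn on placeholder data shaped like the described data set (864 instances, 27 harmonic attributes, 5 condition labels), not the authors' original code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(864, 27))    # placeholder for the 27 harmonic attributes
y = rng.integers(0, 5, size=864)  # placeholder labels: OK, F1, F2, F3, F4

rf = RandomForestClassifier(n_estimators=500, max_features=6,
                            oob_score=True, random_state=0)
rf.fit(X, y)
print("oob accuracy:", rf.oob_score_)
# Unlike neural nets, the forest exposes feature interrelationships,
# e.g. which harmonics matter most:
print("feature importances:", rf.feature_importances_)
```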
REFERENCES

[1] E. Loukis, P. Wetta, K. Mathioudakis, A. Papathanasiou, K. Papailiou, Combination of Different Unsteady Quantity Measurements for Gas Turbine Blade Fault Diagnosis, 36th ASME International Gas Turbine and Aeroengine Congress, Orlando, 1991, ASME paper 91-GT-201.
[2] E. Loukis, Contribution to Gas Turbine Fault Diagnosis Using Methods of Fast Response Measurement Analysis, Doctoral Thesis, National Technical University of Athens, Athens, 1993.
[3] G. Merrington, O. K. Kwon, G. Godwin, B. Carlsson, Fault Detection and Diagnosis in Gas Turbines, ASME Journal of Engineering for Gas Turbines and Power, 113, 1991, 11-19.
[4] E. Loukis, K. Mathioudakis, K. Papailiou, A Procedure for Automated Gas Turbine Blade Fault Identification Based on Spectral Pattern Analysis, Journal of Engineering for Gas Turbines and Power, 114, 1992, 201-208.
[5] E. Loukis, K. Mathioudakis, K. Papailiou, Optimizing Automated Gas Turbine Fault Detection Using Statistical Pattern Recognition, Journal of Engineering for Gas Turbines and Power, 116, 1994, 165-171.
[6] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth Inc., Belmont, California, 1984.
[7] L. Breiman, Bagging Predictors, Machine Learning Journal, 26(2), 1996, 123-140.
[8] I. Kononenko, Estimating Attributes: Analysis and Extensions of Relief, in L. De Raedt and F. Bergadano (eds.), Machine Learning: ECML-94, pp. 171-182, Springer Verlag, Berlin, 1994.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-771
771
How Many Objects?: Determining the Number of Clusters with a Skewed Distribution

Satoshi Oyama¹ and Katsumi Tanaka²

Abstract. We propose a supervised approach to enable accurate determination of the number of clusters in object identification. We use the aggregated attribute values of the data set to be clustered as explanatory variables in the prediction model. Attribute aggregation can be done in linear time with respect to the number of data items, so our method can be used to predict the number of clusters with a low computational burden. To deal with skewed target values, we introduce a two-stage method as well as a method using a higher-order combination of explanatory variables. Experiments demonstrate that our methods enable more accurate prediction than existing methods.
1 INTRODUCTION

Object-identification problems, in which it is necessary to determine whether names appearing in documents or database records correspond to the same real-world object, are important in information retrieval and integration. Typical examples of object-identification problems include disambiguating namesakes in Web search results and establishing correspondence between an abbreviated author name in bibliographic databases and a particular person. Object-identification problems are generally solved by clustering data that contain an ambiguous name and by regarding data in the same cluster as corresponding to the same object. Among the various clustering algorithms, the most widely used are k-means and hierarchical algorithms including single-linkage. One problem in using the k-means clustering algorithm, though, is that a user has to specify the number of clusters as a parameter before starting the clustering procedure. If we use a hierarchical clustering algorithm for object identification, we must specify the number of clusters or a stopping condition so that the algorithm stops the clustering and outputs the results after a certain number of clusters have been found. Determining the number of clusters as a parameter in an object-identification problem is not easy. One reason for this difficulty is that the number of corresponding objects varies considerably from name to name. For example, in the DBLP computer science bibliography³, which is commonly used as a test collection for object identification, we observed that the number of corresponding full names (clusters) k and the frequency f of abbreviated names obey a power-law distribution: f(k) = αk^{-γ} (α and γ are parameters). In a power-law distribution, a very large number of data items with low values coexist with a few data items with very high values. Thus the average value of the data is meaningless, and there are no “typical” data values. For example, in the data set we used, the average number of full names per abbreviated name is 1.5, but setting the parameter of the number of clusters to 1 (which means doing no clustering) or 2 for all names is not meaningful, because that results in very poor performance for names with very many clusters. Therefore, we need to use a different number of clusters for each clustering problem with a distinct ambiguous name.

1 Kyoto University, Japan, email: oyama@i.kyoto-u.ac.jp
2 Kyoto University, Japan, email: ktanaka@i.kyoto-u.ac.jp
3 http://dblp.uni-trier.de/
2 SUPERVISED-LEARNING APPROACH

Previous methods to determine the number of clusters take an “unsupervised” approach and treat each clustering problem independently [1, 2, 3]. In contrast, we take a supervised approach that uses other clustering problems, for which we know the true numbers of clusters, to predict the number of clusters for an unknown problem. We think this is a reasonable approach for object identification, where we solve many similar clustering problems for different names in the same domain. Our approach avoids unnecessary clustering for data sets with one cluster, because model-based prediction of the number of clusters is used. This is especially effective for object identification when the numbers of clusters follow a power-law distribution and one-cluster problems (problems with no need for clustering) are a large proportion of the problems. Assume we have pairs of a data set S^j to cluster and the true number of clusters in it, y^j, where the pairs are denoted as T = {(S^1, y^1), (S^2, y^2), ..., (S^{|T|}, y^{|T|})}. Using T as training data, we construct a function f_T that gives a prediction y of the number of clusters for an unknown data set S. We can consider various forms of the function f_T. Among them, one of the simplest models is a linear model, y = Σ_i w_i x_i + b, where {x_i} are explanatory variables that characterize the data set to be clustered, and {w_i} and b are parameters determined from the training data T. The number of clusters should be predicted efficiently. The computational cost of k-means is O(kn) and that of a hierarchical clustering method is O(n²). Therefore, in practice, the prediction of the number of clusters should be done in linear time with respect to the number of data items. Our model should return the number of clusters for a given data set to be clustered, so we need explanatory variables that characterize the statistics of the set of data rather than each individual datum. In addition, the explanatory variables themselves must be computed efficiently. Aggregations of attribute values of the data items to be clustered are good candidates for such explanatory variables. We devised several types of variables that might be correlated with the number of clusters. The explanatory variables we will introduce can be computed in linear time with respect to the number of data items. We can easily compute the values of the aggregated variables by using aggregate functions such as count(), max(), min(), and avg(), which are available in most database systems.
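As a concrete illustration of such linear-time aggregation, the sketch below computes a feature vector for one clustering problem (one ambiguous author name); the record fields mirror some of the explanatory variables listed in the experiments section, but the function and field names are our own assumptions.

```python
from statistics import pstdev

def aggregate_features(records):
    """One pass over the bibliographic records of a single abbreviated name;
    each record is assumed to be a dict with keys 'coauthors' (list of str),
    'title' (str), 'venue' (str) and 'year' (int)."""
    coauthors, words, venues, years = set(), set(), set(), []
    for r in records:                      # single linear scan
        coauthors.update(r["coauthors"])
        words.update(r["title"].lower().split())
        venues.add(r["venue"])
        years.append(r["year"])
    return [
        len(records),             # (1) number of papers
        len(coauthors),           # (2) distinct coauthors
        len(words),               # (3) distinct title words
        len(venues),              # (4) distinct venues
        max(years) - min(years),  # (5) publication-year span
        pstdev(years),            # (6) std. deviation of publication years
    ]

# Example: a tiny data set for the ambiguous name "J. Smith".
papers = [
    {"coauthors": ["A. Jones"], "title": "Clustering names", "venue": "ICML", "year": 2001},
    {"coauthors": ["B. Kim"], "title": "Entity resolution", "venue": "KDD", "year": 2004},
]
print(aggregate_features(papers))
```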
We use support vector regression [4] to determine the parameters in the linear model. One difficulty in building a model to predict values from a skewed distribution like a power-law distribution is that there is a large imbalance in the numbers of available training data for different target values. A large portion of the training data consists of data items with a target value of 1, and there are relatively few data items with large target values. If we use such training data directly, there is a risk of obtaining a model that underestimates the target values. To overcome this imbalance between the numbers of training data for different target values, we introduce a method that successively applies two different models when predicting the number of clusters: (1) one model determines whether a given data set is composed of one cluster or multiple clusters; (2) the other model determines the number of clusters for a data set predicted to be composed of multiple clusters by the first model. In ecology, a similar two-stage method is used to build a model to predict the abundance of rare species [5], although the learning methods used in each stage are different from ours. Another extension is that we use a model that is nonlinear in the explanatory variables rather than a linear model. Specifically, we consider a model using combinations of the explanatory variables. Using a higher-order model with large expressive power helps avoid the risk of under-fitting the training data, which sometimes occurs when applying a simple linear model to skewed data. In our implementation, we adopt a kernel trick and use a quadratic polynomial kernel: k(x, z) = (⟨x, z⟩ + 1)². By using this kernel in support vector learning, we can virtually use the conjunctions of explanatory variables in the model without actually computing their values.
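The two-stage model with a quadratic polynomial kernel can be sketched as follows with scikit-learn's support vector implementations; the synthetic skewed data and all parameter choices are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))  # 8 aggregated explanatory variables per name
y = np.maximum(1, np.round(rng.pareto(2.0, 2000))).astype(int)  # skewed cluster counts

# Stage 1: does the data set contain one cluster or several?
stage1 = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1).fit(X, (y > 1).astype(int))
# Stage 2: regression on the multi-cluster problems only,
# with the quadratic kernel k(x, z) = (<x, z> + 1)^2.
multi = y > 1
stage2 = SVR(kernel="poly", degree=2, gamma=1.0, coef0=1).fit(X[multi], y[multi])

def predict_clusters(x):
    x = np.asarray(x).reshape(1, -1)
    if stage1.predict(x)[0] == 0:   # predicted single-cluster problem
        return 1
    return max(2, int(round(stage2.predict(x)[0])))

print(predict_clusters(X[0]))
```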
3 EXPERIMENTS

We took the disambiguation of abbreviated author names in a bibliographic database as an example task. From the DBLP data, we randomly selected 2,000 abbreviated names corresponding to more than one paper. We did not use abbreviated names that corresponded to only one paper, because there is obviously only one cluster (full name) for them. For each selected abbreviated name, we collected the bibliographic data containing the name as an author and computed the values of the following explanatory variables:

(1) number of papers with the target abbreviated author name;
(2) number of different coauthors in the data set;
(3) number of different words appearing in the paper titles;
(4) number of different journals or conference proceedings in which the papers are published;
(5) difference between the publication years of the newest and oldest papers;
(6) standard deviation of the publication years of the papers in the data set;
(7) frequency of the last names used in abbreviated names in the database;
(8) percentage of abbreviated names with a particular letter among the abbreviated names.

We applied 10-fold cross validation. We used SVMlight⁴, which implements support vector regression, to build the regression models, as well as the binary support vector machines used in building the two-stage models. As the metric, we used the root mean square error (RMSE) between the true number of clusters (full names) and the predicted number of clusters given by a model. We compared the Caliński and Harabasz (C&H) method [1], the Hartigan method [2], a method using an average threshold, x-means [3], the basic learning-based method (Linear (1 stage)), a two-stage method (Linear (2 stages)), nonlinear regression using a polynomial kernel (Polynomial (1 stage)), and a two-stage method using a polynomial kernel in each stage (Polynomial (2 stages)). For C&H, Hartigan, and x-means, we simply applied the methods to the clustering problems in the test sets and did not use the training sets. For the method using an average threshold, we applied the single-linkage method to each clustering problem in the training set and calculated the average of the thresholds that resulted in the true numbers of clusters. We then applied the single-linkage method to each clustering problem in the test sets and determined the number of clusters by using the average threshold as the clustering-stopping condition. The overall RMSE for each method is shown in Table 1. The four learning-based methods outperformed the other methods. Among the four learning-based methods, the two-stage model and the model with the polynomial kernel outperformed the basic model, and their combination gave the results with the smallest errors.

Table 1. RMSE for each method

C&H                    3.063
Hartigan               2.279
Threshold              2.231
X-means                2.585
Linear (1 stage)       1.819
Linear (2 stages)      1.490
Polynomial (1 stage)   1.145
Polynomial (2 stages)  1.114

4 http://svmlight.joachims.org/
4 CONCLUSION

We described a supervised, model-based approach to predicting the number of clusters in a data set, which is more efficient and accurate than existing approaches. In addition, it enables us to avoid unnecessary clustering for one-cluster problems, which are a large proportion of the problems. As explanatory variables used in the prediction model, we used aggregated attribute values of the data set to be clustered, which can be computed efficiently. We described a basic learning-based method using a linear model as well as two extended methods: a two-stage method and a method using combinations of explanatory variables. Experimental results in author disambiguation showed that our learning-based methods outperformed existing methods and that the two extensions improved the performance of the basic linear model.
ACKNOWLEDGMENTS

This work was supported in part by Grants-in-Aid for Scientific Research (Nos. 18049041 and 19700091) from MEXT of Japan, a MEXT project entitled “Software Technologies for Search and Integration across Heterogeneous-Media Archives,” a Kyoto University GCOE Program entitled “Informatics Education and Research for Knowledge-Circulating Society,” and a Microsoft IJARC CORE4 project entitled “Toward Spatio-Temporal Object Search from the Web.”
REFERENCES

[1] T. Caliński and J. Harabasz, ‘A dendrite method for cluster analysis’, Communications in Statistics, 3(1), 1–27, (1974).
[2] J. A. Hartigan, Clustering Algorithms, Wiley, 1975.
[3] D. Pelleg and A. Moore, ‘X-means: Extending K-means with efficient estimation of the number of clusters’, in Proceedings of ICML 2000, pp. 727–734, (2000).
[4] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.
[5] A. H. Welsh, R. B. Cunningham, C. F. Donnelly, and D. B. Lindenmayer, ‘Modelling the abundance of rare species: Statistical models for counts with extra zeros’, Ecological Modelling, 88, 297–308, (1996).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-773
773
Active Concept Learning For Ontology Evolution

Murat Şensoy¹ and Pınar Yolum²

Abstract. This paper proposes an approach that enables agents to teach each other concepts from their ontologies using examples. Unlike other concept learning approaches, our approach enables the learner to elicit the most informative examples interactively from the teacher. Hence, the learner participates in the learning process actively. We empirically compare the proposed approach with previous concept learning approaches. Our experiments show that, using the proposed approach, agents can learn new concepts successfully and with fewer examples.
1 Introduction
In current approaches to concept learning, the learner is passive. That is, the training examples are chosen solely by the teacher. However, this assumes that the teacher has an accurate view of what the learner knows, which concepts are confusing for it, and so on. We propose to involve the learner in the learning process by enabling it to interact with the teacher to elicit the most useful examples for its understanding of the concept to be learned. In our approach, each agent represents its domain knowledge using an ontology and manages this ontology using a network of experts. Each expert is a stand-alone learner composed of one or more classifiers. The main task of an expert is to learn how to discriminate between the sub-concepts of a specific concept. An agent learns a new concept from another agent using our approach as follows:

1. The learner agent asks for positive examples of the concept from the teacher agent.
2. After receiving the positive examples, the learner determines the new concept's parent in its ontology using those positive examples. Then, the expert related to the parent concept is entitled to learn the new concept.
3. This expert determines some negative examples of the new concept using the positive examples and a semi-supervised learning approach. Hence, it first learns the new concept roughly without receiving any negative examples from the teacher.
4. The expert iteratively enhances its knowledge of the new concept by eliciting the most useful negative examples from the teacher.
5. After the new concept has been learned sufficiently well, it is placed into the learner's ontology and the ontology is modified accordingly.
2 Representing Knowledge
In current instance-based concept learning approaches, one classifier is trained to learn each concept independently [3]. Although the concepts are related through parent-child relationships, their classifiers are regarded as independent of one another. Such approaches require each classifier to learn how to discriminate instances of one concept from those of every other concept in the ontology. Therefore, in order to learn a single concept, the agent uses the whole domain knowledge. In this paper, we envision that the domain knowledge related to an ontology is managed by a set of experts, each of which is knowledgeable in a certain concept. By knowledgeable in a concept, we mean that the expert can correctly report which of the concept's subclasses an instance belongs to. Hence, each expert is trained with examples of the concept and nothing else. For example, an expert on motorcycles can tell us correctly that a Burgman 400 is a scooter.

1 This research has been partially supported by Boğaziçi University Research Fund under grant BAP07A102 and by The Scientific and Technological Research Council of Turkey through a CAREER Award under grant 105E073.
2 Department of Computer Engineering, Boğaziçi University, Bebek, 34342, Istanbul, Turkey, email: {murat.sensoy,pinar.yolum}@boun.edu.tr
3 Actively Learning A Concept
While the teacher teaches a new concept to the learner, it first selects a set of positive examples of the concept. This is relatively easier than selecting negative examples, which are chosen among the instances of every other concept. Then, the teacher gives the selected positive examples to the learner. In our approach, negative examples are not directly given by the teacher, because the teacher cannot estimate which examples are more useful or informative for the learner. The given positive examples are classified using the experts of the learner, and the most specific concept in the learner's ontology that subsumes all of the positive examples is determined. Assume that the teacher wants to teach the Motorcycles concept to the learner, so it first provides examples of motorcycles. The learner realizes that all of the provided examples are instances of the Car&Motorsports concept in its ontology. Hence, the learning task is delegated to the expert for the Car&Motorsports concept. The expert examines the other instances of Car&Motorsports to differentiate the given motorcycles from them as far as possible. Motorcycle instances should have some features in common that set them apart from the other instances of the Car&Motorsports concept. In order to determine which features are more important for the Motorcycles concept, the differences of the feature distributions between the positive examples and the unlabeled examples can be used [2, 5]. We can estimate how significant an instance I is as a motorcycle example using the significance of its features. After computing the significance value for each known instance of Car&Motorsports, the obvious negative examples of Motorcycles are chosen among the instances that have the lowest significance values. Using these negative examples and the positive examples provided by the teacher, the expert tries to learn the new concept roughly. Note that until now, the teacher has not provided any negative examples. Using the positive examples of Motorcycles and the obvious negative examples, the expert trains a classifier. This classifier can
roughly discriminate instances of Motorcycles from the other instances of Car&Motorsports. However, the boundary between these two classes is not yet learned precisely, because only the obvious negative examples are used for training. Moreover, some of these negative examples may be wrongly chosen, which can seriously affect the performance of the trained classifier. Therefore, the expert iteratively elicits more useful negative examples from the teacher and learns this boundary more precisely and correctly. Specifically, at each iteration, the expert samples instances of Car&Motorsports and then, using the classifier, labels these sampled instances as instances of Motorcycles or not. Then, the teacher instructs the expert about the correct labels of these examples. The feedback from the teacher is used to refine and improve the knowledge of the expert about the new concept Motorcycles. This iterative active-learning phase continues until the teacher is satisfied that the learner has correctly learned the concept. Then, the new concept is placed into the learner's ontology as a new subconcept of Car&Motorsports. Lastly, we test whether the Motorcycles concept subsumes some subconcepts of Car&Motorsports; if this is the case, the concept-subconcept relationships are rearranged.
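A minimal sketch of this loop is given below; the significance score (a simple similarity to the positive centroid) and the classifier choice are stand-ins for the feature-distribution techniques of [2, 5] and the C4.5 trees used in the experiments, and `teacher_label` is a hypothetical callable representing the teacher's feedback.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def learn_concept(positives, unlabeled, teacher_label, n_rounds=5, n_query=20):
    """positives, unlabeled: (n, d) feature matrices (e.g. word counts);
    teacher_label(x) -> 1 if x is a positive example of the concept, else 0."""
    # Step 1: score unlabeled instances with a crude significance measure.
    significance = unlabeled @ positives.mean(axis=0)
    # Step 2: take the least significant instances as obvious negatives.
    negatives = unlabeled[np.argsort(significance)[:len(positives)]]
    X = np.vstack([positives, negatives])
    y = np.array([1] * len(positives) + [0] * len(negatives))
    clf = DecisionTreeClassifier(random_state=0).fit(X, y)  # rough concept boundary
    # Steps 3-4: iteratively elicit the teacher's labels for sampled instances.
    rng = np.random.default_rng(0)
    for _ in range(n_rounds):
        idx = rng.choice(len(unlabeled), size=n_query, replace=False)
        queried = unlabeled[idx]
        labels = np.array([teacher_label(x) for x in queried])  # teacher feedback
        X = np.vstack([X, queried])
        y = np.concatenate([y, labels])
        clf = DecisionTreeClassifier(random_state=0).fit(X, y)  # refined boundary
    return clf
```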
4 Evaluation
In order to evaluate our approach, we conduct several experiments in the online shopping domain. For this purpose, we derive domain knowledge from Epinions³. In our experiments, there is one teacher agent and one learner agent. In the implementation of the agents and the experts, we use Java and the C4.5 decision tree classifier of the WEKA data mining project [4]. In our experiments, an instance refers to a product item such as IBM ThinkPad T60, which is an instance of the PCLaptops concept. Each product item has a web page on the Epinions website, and this page contains the specification of the product item in English. We derive a core vocabulary from these specifications automatically, and each word in this vocabulary is used as a feature [2]. Figure 1 shows the performance of our approach at each iteration in terms of the probability of misclassification. After the first iteration, the expert has learned the new concept only roughly (with 12% error). This error rate is not acceptable for the teacher, so the expert continues with the next iteration. The second iteration results in considerable progress in the learning performance (the error drops to 4%). The classification error drops to zero at the fifth iteration, which means that the teacher and the learner have exactly the same understanding of this concept.

Figure 1. Probability of misclassification at different iterations.

We compare our approach with a teacher-driven concept learning approach. This approach represents the current concept learning approaches in the literature. Contrary to the proposed approach, in those approaches the learner is inactive during the selection of the negative examples [3, 1]. The teacher selects the negative examples using its own ontology and viewpoint. Then, the learner is given positive and negative examples of the concept to be taught. In order to measure how successful our approach is in learning a new concept for different numbers of negative examples, we set up experiments where the teacher is allowed to give or label only a predefined number of negative examples. Then, these examples are given to the learner (as feedback in our approach). After training the learner with these examples, the probability of misclassification is computed. Figure 2 compares the results for the teacher-driven approach and the proposed approach.

3 http://www.epinions.com
Figure 2. Probability of misclassification with different numbers of negative examples.
As seen in Figure 2, the teacher-driven approach requires more negative examples than the proposed approach in order to achieve acceptable performance. With only five negative examples, the learner that uses the proposed approach fails on only 12% of its classifications. In the same setting, the learner using the teacher-driven approach misclassifies an instance with a probability slightly higher than 0.4. Similarly, with only 35 negative examples on average, the proposed approach can learn a concept perfectly, while the teacher-driven approach requires approximately 150 negative examples for the same quality of learning.
5 Discussion
This paper develops a framework for instance-based concept learning in which a learner can estimate some negative examples of the concept to be learned and obtain feedback about these negative examples from the teacher, in order to learn the concept accurately. Our experiments show that our approach significantly outperforms a teacher-driven approach, representative of other instance-based concept learning approaches in the literature, by enabling learners to learn a concept with fewer examples.
REFERENCES
[1] A. Doan, J. Madhavan, R. Dhamankar, P. Domingos, and A. Halevy. Learning to match ontologies on the semantic web. VLDB Journal, pages 303–319, 2003.
[2] G. P. C. Fung, J. X. Yu, H. Lu, and P. S. Yu. Text classification without negative examples revisit. IEEE TKDE, 18(1):6–20, 2006.
[3] S. Sen and P. Kar. Sharing a concept. In Working Notes of the AAAI-02 Spring Symposium, pages 55–60, 2002.
[4] I. H. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, 2005.
[5] H. Yu, J. Han, and K. C.-C. Chang. PEBL: Web page classification without negative examples. IEEE TKDE, 16(1):70–81, 2004.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-775
775
Determining Automatically the Size of Learned Ontologies

Elias Zavitsanos¹,² and Sergios Petridis¹ and Georgios Paliouras¹ and George A. Vouros²

Abstract. Determining the size of an ontology that is automatically learned from texts is an open issue. In this paper, we study the similarity between ontology concepts at different levels of a taxonomy, quantifying in a natural manner the quality of the ontology attained. Our approach is integrated in a method for language-neutral learning of ontologies from texts, which relies on conditional independence tests over thematic topics that are discovered using LDA.
1 INTRODUCTION
Ontology learning is commonly viewed [1, 3] as the task of extending or enriching a seed ontology with new ontology elements mined from text corpora. While much work concentrates on enriching existing ontologies, in this paper we propose an automated statistical approach to ontology learning that does not presuppose the existence of a seed ontology. The proposed method tackles both the task of concept identification and that of taxonomy construction. Among the difficulties of such an endeavor is the determination of the appropriate depth of the subsumption hierarchy, given the text collection at hand. The benefit of being able to determine the depth of a taxonomy is that the hierarchy captures accurately the domain knowledge provided by the texts, reducing the extent of overlap among concepts and providing a coherent representation of the domain. In the proposed method, concepts are identified and represented as multinomial distributions over terms in documents, using the Markov Chain Monte Carlo (MCMC) process of Gibbs sampling [4], following the Latent Dirichlet Allocation (LDA) [2] model. To discover the subsumption relations between the identified concepts, conditional independence tests among these concepts are performed. Finally, statistical measures between the discovered concepts at different levels of the hierarchy are used to optimize the size of the ontology.
2 THE PROPOSED METHOD
Given a corpus of documents, treating each document as a bag of words, we remove the stop-words. The remaining words form the term space for the application of the topic generation model (LDA). The next step creates a Document-Term matrix, each entry of which records the frequency of each term in each document. This matrix is used as input to LDA. Next, the iterative task of the learning method is initiated. Sets of topics, which we call layers, are generated by the iterative application of LDA. Starting with one topic and incrementing the number of topics in each iteration, layers with more topics are generated. A layer comprising few topics attempts to capture all the knowledge of the corpus through generic topics. As the number of topics increases, the topics become more focused, capturing more detailed domain knowledge. Thus, the method starts from “general” topics, iterates, and converges to more “specific” ones. In each iteration, the method identifies the subsumption relations that hold between topics of different layers according to their conditional independencies. Since the generated topics are random variables, e.g. A and B, by measuring their mutual information we obtain an estimate of their mutual dependence. Given a third variable C that makes A and B conditionally independent, the mutual information of topics A and B is reduced and is captured by topic C, i.e., C is a broader topic than the others. Thus, we may safely assume that C subsumes both A and B, and the corresponding relations are added to the ontology. Moreover, C has been generated before A and B. Thus, it belongs in a layer that contains topics that are broader in meaning than the ones in the layer of A and B. A significant contribution is the determination of the appropriate depth of the hierarchy from the given corpus of documents. We use a criterion based on the similarity of topic distributions that indicates convergence towards the appropriate depth. We thus improve on our recent work [5] by fitting this criterion, which controls the iterative process of topic discovery. This stopping criterion is based on the symmetric KL divergence between concepts of different levels that participate in subsumption relations. The intuition is that the KL divergence between concepts that belong in the top levels of the hierarchy should be higher than the KL divergence between concepts that belong in the lower levels. This is because the top concepts are broader in scope than lower ones, and the “semantic distance” between them and their children is expected to be higher than that between more specific concepts and their children. To validate this assumption, we have experimented with the Genia³ and the Lonely Planet gold ontologies and the corresponding corpora⁴. In order to measure the similarity of the concepts in the ontologies using statistical measures, we represented the concepts of each gold-standard ontology as probability distributions over the term space of the corresponding corpus. To create such a representation, we have to measure the frequency of the terms that appear in the context of each concept. In both corpora, the concept instances are annotated in the texts, providing direct population of the concepts in the gold-standard ontologies with their instances. Therefore, it is possible to associate each document with the concept(s) that it refers to, by counting the concept instances that appear in the document.

1 Inst. of Informatics and Telecommunications, NCSR “Demokritos”, Greece, email: {izavits | petridis | paliourg}@iit.demokritos.gr
2 University of Aegean, Dpt. of Information and Communication Systems Engineering, Greece, email: georgev@aegean.gr
3 The GENIA project, http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA
4 The Lonely Planet travel advice and information, http://www.lonelyplanet.com/
Thus, we create feature vectors based on the documents in which each concept appears. These feature vectors form a two-dimensional matrix that records the frequency of each term in the context of each concept. That is, we have a representation of each concept as a distribution over the term space of the text collection. For each concept, the frequencies are normalized, giving a probability distribution over the term space. Figure 1 depicts the results obtained by measuring the similarity between concepts that participate in subsumption relations, in the case of the Genia and the Lonely Planet gold standard ontologies. Small values of KL divergence indicate high similarity between concepts. Figure 1 also confirms our assumption that concepts at the lower levels of the hierarchy are more similar to their children than concepts at higher levels of the hierarchy.

Figure 1. Average KL divergence of subsumed concepts in the Genia and the Lonely Planet gold standard ontologies.
Based on this approach, we define a relative criterion that indicates how deep the hierarchy should be according to the information provided by the corpus of documents. This criterion, which controls the iterative task of the proposed method, is defined as:

$$1 - \frac{KL_{\mathrm{bottom}}}{KL_{\mathrm{top}}} < \varepsilon$$

KL_top corresponds to the average symmetric KL divergence between the concepts of level l and the concepts of level l+1. KL_bottom is the average symmetric KL divergence between the concepts at level l+1 and the concepts at level l+2. Values close to 0 indicate that the newly added level of concepts does not differ much from the parent concepts. Thus we are reaching maximum “specificity” and therefore the optimal depth. In practice, the parameter ε is set to a very small value close to zero to avoid small rounding errors during the computations.
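A direct transcription of this stopping test, under our own assumptions about how the layers are stored, is sketched below.

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two term distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def avg_level_kl(child_level, parent_level):
    """Average symmetric KL between each child topic and its parent.
    A level is assumed to be a list of (distribution, parent_index) pairs;
    this data layout is our illustration, not the paper's."""
    return float(np.mean([sym_kl(parent_level[pa][0], dist)
                          for dist, pa in child_level]))

def should_stop(levels, epsilon=1e-3):
    """Stopping test 1 - KL_bottom / KL_top < epsilon over the three most
    recently generated levels l, l+1, l+2."""
    kl_top = avg_level_kl(levels[-2], levels[-3])
    kl_bottom = avg_level_kl(levels[-1], levels[-2])
    return 1.0 - kl_bottom / kl_top < epsilon
```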
3 EVALUATION
We have evaluated the proposed method on both corpora introduced in section 2. Our evaluation procedure uses the representation of the gold-standard concepts as probability distributions over the term space of the documents, as explained in section 2. In addition, the concepts of the produced hierarchy have exactly the same representation: they are probability distributions over the same term space. We can thus perform a one-to-one comparison of the gold-standard concepts and the produced topics. Specifically, a topic is matched to a concept if their corresponding distributions are the closest among all candidates and their KL divergence is below a fixed threshold th_KL. The quantitative results have been produced using the metrics of Precision and Recall. The choice of the threshold th_KL affects the quantitative results, since a strict choice would force few topics to be matched with gold-standard concepts, while a loose choice would cause many topics to be matched with gold-standard concepts. We chose a value of th_KL = 0.2 for the purposes of our evaluation, as we observed relative insensitivity of the results for values between 0.2 and 0.4, and we opted for the more conservative value in this plateau. Table 1 depicts the results.

Table 1. Evaluation results for the Genia and the Lonely Planet corpora.

                                                     Precision  Recall  F-measure
Concept Identification (Genia)                          94%       76%      84%
Subsumption Hierarchy Construction (Genia)              93%       75%      83%
Concept Identification (Lonely Planet)                  62%       36%      44%
Subsumption Hierarchy Construction (Lonely Planet)      53%       35%      42%

To obtain a more detailed picture of the performance of the method, we replaced the stopping criterion with predefined depths for the learned hierarchy and experimented on both corpora. Figure 2 presents the evaluation results in terms of the F-measure for various depths of the hierarchy, using the same evaluation style. Figure 2 shows that for a predefined depth of 8 levels in the case of Genia, or 10 levels in the case of Lonely Planet, the F-measure is maximized, reaching the values of Table 1. Therefore, the method determined correctly the appropriate depth in both corpora.
Figure 2. F-measures for Concept Identification and Subsumption Hierarchy Construction for the Genia (left) and the Lonely Planet (right) corpora.
ACKNOWLEDGEMENTS

The presented work was supported by the research and development project ONTOSUM⁵, funded by the Greek General Secretariat for Research and Technology.
REFERENCES

[1] E. Agirre, O. Ansa, E. Hovy, and D. Martinez, ‘Enriching very large ontologies using the www’, in Ontology Construction Workshop, (2000).
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, ‘Latent dirichlet allocation’, Journal of Machine Learning Research, (2003).
[3] A. Faatz and R. Steinmetz, ‘Ontology enrichment with texts from the www’, in Semantic Web Mining Workshop ECML/PKDD, (2002).
[4] T. Griffiths and M. Steyvers, ‘A probabilistic approach to semantic representation’, in Conference of the Cognitive Science Society, (2002).
[5] E. Zavitsanos, G. Paliouras, G. A. Vouros, and S. Petridis, ‘Discovering subsumption hierarchies of ontology concepts from text corpora’, in Proceedings of the International Conference on Web Intelligence, (2007).
5 See also http://www.ontosum.org/
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-777
777
Dynamic Multi-Armed Bandit with Covariates

Nicos G. Pavlidis¹, Dimitris K. Tasoulis¹, Niall M. Adams² and David J. Hand¹,²

1 Multi-Armed Bandit with Covariates
In numerous real-world problems an agent selects repeatedly among a number of actions whose rewards are uncertain, aiming to maximise the overall reward. Each action-selection generates a reward and provides new information that reduces the agent's uncertainty about the selected action. Ideally, one would like to maximise both immediate reward and the reduction of uncertainty, but the two objectives are often conflicting. The action with the highest expected reward is typically selected frequently, and therefore there is relatively little uncertainty about its reward. On the other hand, actions whose rewards are more uncertain are characterised by a low expected reward. In this class of problems, the central issue is how to resolve the dilemma between discovering new knowledge (exploration) and maximising reward based on existing knowledge (exploitation). The multi-armed bandit problem [3] is the simplest model to study this trade-off. In this problem an agent chooses repeatedly among n different actions, A = {α1, α2, ..., αn} [4]. Each action selection, called a play, generates a numerical reward derived from the probability distribution associated with the chosen action. The learning problem that the agent faces is that of sampling sequentially from n populations with different probability densities, in order to maximise the cumulative reward. We study an extension of the standard problem in which the agent at each play observes a covariate, x(t) ∼ N(μ, σ²), prior to selecting an action, and the reward is a linear function of the covariate:

$$r_{\alpha_i}(x(t)) = \beta_0^i + \beta_1^i x(t) + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_i^2). \qquad (1)$$
Such problems are encountered in real-world applications like scheduling, resource allocation and disaster management. A policy is a rule that associates states, x(t), with actions, αi. In a static environment, the learning problem that the agent faces is to identify the optimal policy, i.e. to select the action with the highest expected reward based on all possible realisations of x(t). Yang and Zhu [6] have shown that, for a variety of nonparametric regression estimators, the ε-greedy strategy [5] with ε decreasing towards zero converges asymptotically to the optimal policy. The performance of a number of action-selection methods for the static multi-armed bandit problem with covariates was evaluated in [2]. In this work we investigate a dynamic environment in which the coefficients of all the reward functions change over time, according to a random walk. The agent does not learn the underlying dynamics of the environment. In this setting there is no optimal policy, in the sense that there can be no fixed rule that associates states with optimal actions, since the best action for a given state changes over time. The agent instead attempts to formulate accurate estimates of the coefficients of the reward functions, which are used by the ε-greedy action-selection strategy. The estimation is performed using the Adaptive Recursive Least Squares algorithm (ARLS) [1]. This algorithm handles dynamics by incorporating a forgetting factor, λ(t) ∈ (0, 1], that is optimised at each iteration with respect to the estimation error using a stochastic gradient descent process. As λ tends to zero, the extent of forgetting increases, and vice versa as λ tends to unity.

1 Institute for Mathematical Sciences, 2 Department of Mathematics, Imperial College London, South Kensington Campus, London SW7 2AZ, United Kingdom, email: {n.pavlidis, d.tasoulis, n.adams, d.j.hand}@imperial.ac.uk
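As an illustration of this setup, here is a minimal simulation sketch. For simplicity it uses a fixed forgetting factor rather than the gradient-adapted λ(t) of ARLS, and all constants are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, lam, eps, sigma_rw = 10, 0.98, 0.1, 0.1    # lam fixed here; ARLS adapts it
true_beta = rng.normal(size=(n_arms, 2))           # drifting [beta0, beta1] per arm
beta_hat = np.zeros((n_arms, 2))                   # agent's estimates
P = np.stack([np.eye(2) * 100.0 for _ in range(n_arms)])  # RLS inverse covariances

for t in range(2000):
    x = rng.normal()                               # covariate x(t) ~ N(0, 1)
    phi = np.array([1.0, x])
    # epsilon-greedy selection on the predicted rewards beta_hat @ phi
    if rng.random() < eps:
        a = int(rng.integers(n_arms))
    else:
        a = int(np.argmax(beta_hat @ phi))
    r = true_beta[a] @ phi + rng.normal(0.0, 0.5)  # observed reward, Eq. (1)
    # recursive least squares update with forgetting factor lam
    k = P[a] @ phi / (lam + phi @ P[a] @ phi)      # gain vector
    beta_hat[a] += k * (r - beta_hat[a] @ phi)     # correct by prediction error
    P[a] = (P[a] - np.outer(k, phi @ P[a])) / lam
    # random-walk drift of the true coefficients
    true_beta += rng.normal(0.0, sigma_rw, size=true_beta.shape)
```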
2 Experiments and Results
We resort to simulations because theoretical guarantees of convergence for bandit algorithms in the static environment may not hold in dynamic environments. In all our numerical experiments we consider a 10-armed bandit problem with a one-dimensional covariate, x(t) ∼ N(0, 1). The actions are oriented so that each is optimal in a region of the domain of x with probability 1/10, and σi ∼ N(0, 0.5). The coefficients of all the arms change at each play by following a random walk with constant variance: β_j^i(t+1) = β_j^i(t) + ε, ε ∼ N(0, σ²rw), for all i = 1, 2, ..., 10, and j = 0, 1. The agent employs a separate ARLS algorithm for the estimation of the coefficients of each of the 10 actions. At each play, the update of the estimated coefficients of the selected action also updates the value of the forgetting factor. For simplicity, a common forgetting factor, λ(t), is used by all the ARLS algorithms. At present we consider only the ε-greedy action-selection strategy [5]: with probability (1 − ε) it selects the action with the highest expected reward based on the current parameter estimates, and with probability ε a random action is selected. In a static environment, ε determines the balance between exploration and exploitation. In a dynamic environment this distinction is not so clear. Since all the reward functions change at each play, selecting the greedy action based on current estimates can be seen as exploring the change of this action. We first investigate the relation between the variance of the random walk of the coefficients, σ²rw, and the evolution of the optimal forgetting factor for temporal prediction, λ(t). Fig. 1 illustrates the average λ(t) over 100 simulations using the 0.1-greedy strategy for values of σ²rw ∈ [0, 1] with stepsize 0.1. Since σ²rw is constant throughout a simulation, for a particular setting of σ²rw and ε, λ(t) tends to oscillate around a fixed value over a simulation. Increasing the volatility of the coefficients of each arm results in a decline of the forgetting factor, indicating that the estimation becomes progressively more sensitive to recent observations. However, the forgetting factor is related not only to the variance of the random walk, σ²rw, but also to the degree of exploration, ε. Fig. 2 illustrates the mean value of λ over an entire simulation with respect to σ²rw and ε. The reported results are averages over 100 simulations.
Figure 1. Evolution of λ(t) over 2000 plays using the 0.1-greedy strategy for different values of the variance.
Figure 2.
2 and ε. Mean value of λ(t) for different values of σrw
to zero the action-selection strategy becomes equivalent to greedy and typically chooses among very few (usually two) of the available actions. Increasing ε, on the other hand, tends to make all the available actions equally likely to be selected. Consider the case when action αi is one of the actions the agent chooses. If the probability of choosing αi is one half, then this action is selected on average once every two plays. Therefore, each time the estimated coefficients of this action are updated the random walk dynamics have been applied twice on average to the true coefficients. If on the contrary, the ten actions are equiprobable, then action αi will be chosen on average once every ten plays, and hence the random walk dynamics will have been applied ten times between two consecutive updates. Increasing the mean number of plays between two consecutive updates of the estimated regression coefficients for an action renders the estimates less accurate at the time of the update. Thus, increasing the degree of exploration has an impact similar to that of increasing the speed at which the underlying environment is changing, by decreasing the sampling frequency of the actions that are chosen. Next, we investigate the relationship between the degree of exploration and the variance of the random walk. For different values 2 of σrw ∈ [0, 1], 1000 simulations were performed for each value of ε ∈ [0, 0.5] with stepsize 0.01. Performance is measured in terms of the mean proportion of times the best action is selected over a simulation. We do not consider cumulative reward because the re-
Conclusions
We study a dynamic version of the multi-armed bandit problem with covariates, in which the coefficients of the reward functions follow a random walk. The agent employs the adaptive recursive least squares algorithm, which is capable of handling a changing environment by endogenously adapting the degree of forgetting. Hence the agent attempts to perform optimal temporal prediction and does not model explicitly the dynamics of the underlying environment. We consider the ε-greedy action-selection strategy. Experimental results indicate that the degree of forgetting is related not only to the magnitude of the variance of the random walk, but also to the extent of exploration. Indeed in this problem increasing exploration has the same impact on the forgetting factor as increasing the speed of change. This can be justified by the fact that more exploration decreases the sampling frequency of the actions that the agent actually performs. The results for different values of the variance of the random walk suggest that this type of dynamics always decreases the difficulty of the problem by making some actions globally suboptimal. This renders the optimal degree of exploration very small. It also suggests that challenging dynamic real-world problems that can be formulated as a multi-armed bandit problem with covariates are unlikely to be governed by this type of dynamics.
ACKNOWLEDGEMENTS This research was undertaken as part of the ALADDIN (Autonomous Learning Agents for Decentralised Data and Information Systems) project and is jointly funded by a BAE Systems and EPSRC (Engineering and Physical Research Council) strategic partnership, under EPSRC grant EP/C548051/1. David J. Hand was partially supported by a Royal Society Wolfson Research Merit Award.
REFERENCES [1] S. Haykin, Adaptive Filter Theory, Prentice-Hall International, 1996. [2] N. G. Pavlidis, D. K. Tasoulis, and D. J. Hand, ‘Simulation studies of multi-armed bandits with covariates’, in Proceedings of the EUROSIM/UKSim 2008. IEEE, (2008). [3] H. Robbins, ‘Some aspects of the sequential design of experiments’, Bulletin of the American Mathematical Society, 55, 527–535, (1952). [4] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998. [5] C. J. C. H. Watkins, Learning from Delayed Rewards, Ph.D. dissertation, Cambridge University, 1989. [6] Y. Yang and D. Zhu, ‘Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates’, The Annals of Statistics, 30(1), 100–121, (2002).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-779
779
Reinforcement Learning with the Use of Costly Features

Robby Goetschalckx², Scott Sanner¹ and Kurt Driessens²

Abstract. A common solution approach to reinforcement learning problems with large state spaces (where value functions cannot be represented exactly) is to compute an approximation of the value function in terms of state features. However, little attention has been paid to the cost of computing these state features (e.g., search-based features). To this end, we introduce a cost-sensitive sparse linear-value function approximation algorithm — FOVEA — and demonstrate its performance on an experimental domain with a range of feature costs.
1 Introduction
Reinforcement learning problems with large state spaces often preclude the representation of a fully enumerated value function. In this case, a common solution approach is to compute an approximation of the value function in terms of state features. While value function approximation is well-addressed in the reinforcement learning literature (cf. Chapter 8 of [4]), the cost of feature computation is often ignored. Yet in the presence of costly features, a feature's cost must be traded off against its impact on value-prediction accuracy. While reinforcement learning is often modelled as a Markov decision process (MDP) [3], one might consider modelling the function approximation setting with costly features as a partially observable MDP (POMDP) [2] by using information-gathering actions to represent the computation of costly features. In theory, an optimal policy for this POMDP would select the features to compute at any decision stage so as to optimally trade off feature cost w.r.t. its impact on future reward. However, such a framework requires embedding an already difficult-to-solve MDP inside a POMDP; in general, solutions to such a POMDP will not be feasible in practice. Here we propose a more pragmatic approach where we learn the relative value of features in an explicit way. To do this, we approximate the value function using cost-sensitive sparse linear regression techniques, directly trading off prediction errors with feature costs.
2 MDPs and Reinforcement Learning
We briefly review Markov decision processes (MDPs) [3] and reinforcement learning (RL) [4]. Formally, an MDP can be defined as a tuple ⟨S, A, T, R, γ⟩. S = {s1, ..., sn} is a finite set of fully observable states. A = {a1, ..., am} is a finite set of actions. T : S × A × S → [0, 1] is a stationary, Markovian transition function. A reward R : S × A → ℝ is associated with every state and action. γ is a discount factor s.t. 0 ≤ γ < 1, used to specify that a reward obtained t timesteps into the future is discounted by γ^t. γ = 1 is permitted if the total accumulated reward is finite.
1 National ICT Australia, email: Scott.Sanner@nicta.com.au
2 Declarative Languages and Artificial Intelligence, Katholieke Universiteit Leuven, Leuven, Belgium, email: {robby,kurtd}@cs.kuleuven.be
A policy π : S → A specifies the action a = π(s) to take in each state s. The value Q^π(s, a) of taking an action a in state s and then following the policy π thereafter can be defined using the infinite horizon, expected discounted reward criterion:

$$Q^{\pi}(s,a) = E_{\pi}\!\left[\left.\sum_{t=0}^{\infty} \gamma^{t} \cdot r_{t}\,\right|\, s_{0}=s,\ a_{0}=a\right] \qquad (1)$$
where r_t is the reward obtained at time t (assuming s_0 and a_0 respectively represent the state and action at t = 0). The objective in an MDP is to find a policy π* such that ∀π, s. Q^{π*}(s, π*(s)) ≥ Q^{π}(s, π(s)). An optimal policy π* is guaranteed to exist. In the RL setting, the transition and reward model may not be explicitly known to the agent although both can be sampled from experience. Here, we use the generalized policy iteration (GPI) framework known to capture most reinforcement learning algorithms [4]. GPI interleaves policy evaluation and update stages as follows:

Generalized Policy Iteration (GPI)
1. Start with an arbitrary initial policy π_0 and set i = 0.
2. Estimate Q^{π_i}(s, a) (e.g., from samples using Equation 1).
3. Let π_{i+1}(s) = arg max_{a ∈ A} Q^{π_i}(s, a).
4. If termination criteria are not met, let i = i + 1 and go to step 2.
Every RL algorithm that is an instance of GPI may prescribe its own method for performing each step, and many GPI instances guarantee convergence to π* or an approximation thereof. We keep our treatment of reinforcement learning with costly features as general as possible. Specifically, this means that in the context of GPI, we can restrict our discussion of RL with costly features to that of cost-efficient Q-value approximation in step 2 of GPI.
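To make the GPI loop concrete, the following is a minimal sketch (ours, not the authors' code) of step 2 as Monte-Carlo evaluation followed by the greedy update of step 3; the environment interface env.reset()/env.step(s, a) and all hyper-parameters are illustrative assumptions.

import random
from collections import defaultdict

def gpi(env, actions, gamma=0.9, episodes=200, horizon=50, iters=20):
    """Generalized Policy Iteration: Monte-Carlo estimation of Q^pi (step 2)
    alternated with greedy policy improvement (step 3)."""
    policy = defaultdict(lambda: random.choice(actions))   # arbitrary pi_0
    q = {}
    for _ in range(iters):
        returns = defaultdict(list)                        # (s, a) -> sampled returns
        for _ in range(episodes):
            s, a = env.reset(), random.choice(actions)     # exploring start
            trajectory = []
            for _ in range(horizon):
                s2, r = env.step(s, a)                     # assumed interface
                trajectory.append((s, a, r))
                s, a = s2, policy[s2]
            g = 0.0                                        # discounted return, Equation 1
            for s_t, a_t, r_t in reversed(trajectory):
                g = r_t + gamma * g
                returns[(s_t, a_t)].append(g)
        q = {sa: sum(v) / len(v) for sa, v in returns.items()}
        for (s, _) in q:                                   # step 3: greedy in Q^{pi_i}
            policy[s] = max(actions, key=lambda a: q.get((s, a), float("-inf")))
    return policy, q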
3 Cost-efficient Value-approximation
We represent a Q-value approximation Q̂^π_w(s, a) w.r.t. policy π as a linear combination of a feature set F = {f1, . . . , fk} with weights w = ⟨w0, . . . , wk⟩, where each fi : S × A → R and each wi ∈ R:

$$\hat{Q}^{\pi}_{w}(s,a) = w_{0} + \sum_{f_{i} \in F} w_{i}\, f_{i}(s,a) \qquad (2)$$
We assume each feature fi is associated with a cost c_{fi} ∈ R expressed in the same units as prediction error. Our task will be to find feature weights w that trade off Q-value accuracy with feature cost. At step 2 of GPI, we assume that we are given data D = {Q^π_{s,a}} consisting of sampled Q-values to approximate. Then we define the optimal cost-efficient value approximation w* as follows:

$$w^{*} = \operatorname*{argmin}_{w}\ \frac{1}{|D|}\sum_{Q^{\pi}_{s,a}\in D}\big[Q^{\pi}_{s,a} - \hat{Q}^{\pi}_{w}(s,a)\big]^{2} \;+\; \sum_{f_{i}\in F}\frac{c_{f_{i}}}{1-\gamma}\, \mathbb{I}[w_{i} \neq 0]$$

Here, I[·] is 1 when its argument is true and 0 otherwise. We see
that the optimal setting of w* directly trades off prediction error with feature cost (divided by (1 − γ) to account for the future discounted cost of feature evaluation at every time step). Unfortunately, this optimization objective is not convex due to step discontinuities where any wi = 0, and it is thus not easily amenable to finding a global optimum. However, observing that weight sparsity encourages low feature cost, we can modify sparse linear regression approaches to encourage sparsity for a feature weight in a manner proportional to its cost. To do this, we focus on a class of sparse linear regression techniques collectively referred to as least-angle regression (LAR) methods, such as the lasso and forward stepwise regression [1]. Fortunately, a simple modification of forward-stepwise regression provides us with an efficient algorithm, FOVEA, for approximating the solution to our optimization problem. We present FOVEA below and refer the reader to the detailed discussion in [1] for the original algorithm.
Figure 1. Prediction error (RMSE) vs. the feature cost of the prediction (RMSE and the paid prediction cost plotted against c2 over [0, 0.5]).
Forward-stepwise Value Approximation (FOVEA)
1. Input: Q-value samples D = {Q^π_{s,a}} for policy π (e.g., computed from sample trajectories for a policy π using Equation 1).
2. Initialize the active feature set F = ∅.
3. Initialize wi = 0 for i ≥ 1 and w0 with the average value of Q_{s,a} ∈ D (this gives the residuals a mean of 0).
4. Normalize all feature predictions to have 0 mean and a standard deviation of 1.
5. Initialize the step-size η to some small positive value.
6. Repeat the following:
(a) Compute a vector of residuals r and a vector of feature values fi with entries for each data sample Q^π_{s,a} ∈ D, where the residual is Q^π_{s,a} − Q̂^π_w(s, a) and the feature value is fi(s, a).
(b) Calculate the cost-penalized correlation score for all candidate features fi:

$$score_{i} = \frac{1}{\sqrt{|D|}}\,\big|f_{i}\cdot r\big| \;-\; \mathbb{I}[f_{i}\notin F]\, c_{f_{i}}$$

(c) Find the feature fi with the highest score_i ≥ η; if no such feature is found then halt and Output: w.
(d) If fi ∉ F, let F = F ∪ {fi}; wi = wi + sgn(fi · r)·√(c_{fi}).3
(e) Else let wi = wi + sgn(fi · r)·η.
It is important to note that the forward-stepwise approach is a greedy selection approach and thus the result obtained might not be the optimal one in all cases. However, we can still prove a form of local optimality during the progression of the FOVEA algorithm:
Theorem 1 Every feature fi which is introduced in step 6d of the FOVEA algorithm immediately reduces the mean squared error of the prediction by the value of its cost c_{fi}.
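Read alongside the numbered steps, the following compact sketch (ours) mirrors the loop; the data layout and the exact cost penalty in the score follow our reconstruction of the partially garbled formulas above, so treat those details as assumptions.

import numpy as np

def fovea(Phi, q, costs, eta=0.01, max_steps=100000):
    """Phi: (|D|, k) matrix of feature values f_i(s, a), assumed already
    normalized to zero mean and unit deviation (step 4); q: sampled Q-values;
    costs: per-feature costs c_{f_i}."""
    n, k = Phi.shape
    w0, w = q.mean(), np.zeros(k)          # step 3: intercept = average Q-value
    active = set()                         # step 2: no feature selected yet
    for _ in range(max_steps):
        r = q - (w0 + Phi @ w)             # step 6a: residuals
        corr = Phi.T @ r                   # f_i . r for every feature
        penalty = np.array([0.0 if i in active else costs[i] for i in range(k)])
        score = np.abs(corr) / np.sqrt(n) - penalty        # step 6b
        i = int(np.argmax(score))
        if score[i] < eta:                 # step 6c: halt and output w
            break
        if i not in active:                # step 6d: introduce feature, pay its cost
            active.add(i)
            w[i] += np.sign(corr[i]) * np.sqrt(costs[i])
        else:                              # step 6e: small step along the correlation
            w[i] += np.sign(corr[i]) * eta
    return w0, w, active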
4 Experiments
We evaluated GPI using FOVEA on a simple deterministic corridor domain. The state space consists of five rooms, labeled s1, . . . , s5. From each state two actions, +1 and −1, can be performed. Performing action +1 in state si for i < 5 leads to s_{i+1} and performing −1 in state si for i > 1 leads to s_{i−1}. All other actions take the agent to the center s3. A reward of 1 is assigned for taking +1 in s5 and −1 is assigned for taking action −1 in s1. All other rewards are equal to 0. We used a discount factor γ = 0.9.
3 sgn(·) is +1 if its argument is non-negative and −1 otherwise.
We provided seven state-action indicator features fi for 0 ≤ i ≤ 6 to the agent, where taking action a ∈ {+1, −1} in state si results in f_{i+a} = 1 with all remaining indicator features set to 0. f0, f2, f3, f5 and f6 are free and are assigned cost c1 = 0, while f1 and f4 have a cost c2. Furthermore, two random number generators were provided to the agent, one of which was free and another which had a cost c3 > c2. Finally, the state-action feature indicators f0, . . . , f6 were copied, but now with the higher cost c3. We used forward-stepwise value approximation to approximate Q-values using the state-action features defined above. We used 100 samples for each forward-stepwise update. All results shown are averages over 10 runs. We varied the value of c2 over a range of 0 to 0.5. If the agent does not pay c2 for f1 or f4, it cannot distinguish between s1 and s4 (if it pays the cost of only one of f1 or f4, it can still infer the other by absence). The results in Figure 1 demonstrate the effectiveness of FOVEA. Initially the agent pays 2c2 for both f1 and f4 (illustrating slight sub-optimality by paying for both features due to inherent statistical noise in the estimation process, but still avoiding the useless features that cost c3) until it realizes, for c2 > 0.05, that it can just pay c2 for one of these features and still obtain low prediction error. However, for c2 > 0.185, the agent refuses to pay the cost for either f1 or f4 since the cost exceeds the future expected reward. As such, there is a clear phase transition near c2 = 0.185, where the paid feature cost decreases rapidly while the prediction error likewise increases.
5 Future Work
Perhaps the most important area of future work is to explore efficient extensions to handle state- and action-dependent feature selection.
Acknowledgements This research was sponsored by the fund for scientific research (FWO) of Flanders, of which Kurt Driessens is a postdoctoral fellow, and by National ICT Australia.
REFERENCES [1] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, ‘Least angle regression’, Tech. report, Statistics Department, Stanford University, (2002). [2] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra, ‘Planning and acting in partially observable stochastic domains’, Artificial Intelligence, 101, 99–134, (1998). [3] Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley, New York, 1994. [4] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, The MIT Press, Cambridge, MA, 1998.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-781
Data-driven Induction of Functional Programs
Emanuel Kitzelmann1
Abstract. We present a new method and system, called IGOR2, for the induction of recursive functional programs from few non-recursive, possibly non-ground example equations describing a subset of the input-output behaviour of a function to be implemented.
1 Introduction
Classical attempts to construct functional LISP programs from input/output examples [10, 4] are analytical, i.e., a LISP program belonging to a strongly restricted program class is algorithmically derived from examples. This is done by identifying repetitive syntactical patterns in traces. More recent approaches, e.g. [3, 7], generate and test programs until a program consistent with the examples is found. Theoretically, large program classes can be induced generate-and-test based. Yet although these latter systems use type information, some of them higher-order functions, and further techniques for pruning the search space, they strongly suffer from combinatorial explosion. Also Inductive Logic Programming (ILP) [6] has originated some methods capable of inducing recursive programs on inductive types, though ILP in general has a focus on classification. General purpose systems capable of recursive program induction, like FOIL [9], are suitable to only a limited extent for program induction since they use greedy search methods with inappropriate heuristics. Special purpose systems [1] have problems similar to those described for functional approaches. The IGOR2 [5] method described here combines classical analytical methods with an enumerative approach in order to put their relative strengths into effect. Induction is based on search in order to avoid strong a priori restrictions as imposed by purely analytical methods. But in contrast to the generate-and-test approach, IGOR2 constructs successor programs during search using analytical methods. IGOR2 represents functional programs as sets of typed recursive first-order equations. The effect of constructing these equations analytically is that only equation sets entailing the example equations are enumerated. In contrast to greedy search methods, the search is complete: only programs known to be inconsistent are ruled out. Compared to purely analytical systems, IGOR2 is a substantial extension since the class of inducible programs is much larger. E.g., all sample programs from [4, page 448] can be induced by IGOR2, but only a fraction of the sample problems in [5, Sect. 5] can be induced by the system described in [4]. Compared to ILP systems capable of inducing recursive functions and recent enumerative functional methods like FOIL [9] and MAGICHASKELLER [3], IGOR2 mostly performs better regarding inducibility of programs and/or induction times [2].
1 University of Bamberg, Germany, email: emanuel.kitzelmann@uni-bamberg.de
2 General Method
Given a set E of example equations of the form F(a) = r, for any number of target functions F to be implemented as well as for already implemented background functions which may be used by the induced program, IGOR2 returns a set of recursive equations P constituting a functional program which is correct w.r.t. the example equations in that it evaluates the left-hand sides (lhss) of the example equations to their right-hand sides (rhss). Even if example equations may contain variables, we call lhs arguments example input and rhss example output in the following. There are infinitely many correct solutions P, one of them E itself. In order to select one, or at least a finite subset of the possible solutions, and a "good" solution in particular, IGOR2, like almost all inductive inference methods, is committed to a preference bias. IGOR2 prefers solutions P which partition the examples in fewer subsets, i.e., programs with fewer case distinctions. Case distinctions are realised by disjoint patterns in the equation lhss. This concept is known as pattern matching in functional programming. Additionally, simple forms of conditions to restrict the applicability of an equation, like equality of pattern variables, are used but not described in this paper. The search for solutions is complete, i.e., programs with the least number of case distinctions are found. This preference bias assures that the recursive structure in the examples, as well as the computability by predefined functions, is covered as well as possible.
Example. From appropriate type declarations and the examples2

Rev([ ]) = [ ], Rev([X]) = [X], Rev([X, Y]) = [Y, X],
Rev([X, Y, Z]) = [Z, Y, X], Rev([X, Y, Z, V]) = [V, Z, Y, X]    (1)

and the background equations

Last([X]) = X, Last([X, Y]) = Y, Last([X, Y, Z]) = Z, Last([X, Y, Z, V]) = V

IGOR2 induces the following equations for Rev and an auxiliary function Init:

Rev([ ]) = [ ]
Rev([X|Xs]) = [Last([X|Xs]) | Rev(Init([X|Xs]))]
Init([X]) = [ ]
Init([X1, X2|Xs]) = [X1 | Init([X2|Xs])]
The induction of a program is organised as a kind of best-first search. During search, a hypothesis is a set of equations entailing the example equations and constituting a terminating program, but potentially with unbound variables in the rhss, i.e., with variables in the rhss not occurring in the lhss. We call such equations, and hypotheses containing them, unfinished equations and hypotheses. A goal state is reached if at least one of the best hypotheses (according to the preference bias described above) is finished. Such a finished hypothesis is terminating by construction and, since its equations entail the example equations, it is also correct. The initial hypothesis is a program with one equation per target function, namely the least general generalisation [8] of the example equations. In most cases (e.g., for all recursive functions) one equation is not enough and the rhss remain unfinished. Then for one unfinished equation successors are computed, which leads to new hypotheses. Now repeatedly unfinished equations of currently best hypotheses are replaced until a currently best hypothesis is finished.

2 We use a syntax for lists as known from PROLOG.
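This search regime can be rendered as a generic best-first skeleton (a sketch; the cost function, finished-test and successor computation are left abstract, and all names are ours rather than IGOR2's internals).

import heapq
import itertools

def best_first_induction(initial_hypothesis, cost, is_finished, successors):
    """Best-first search over hypotheses (equation sets). cost encodes the
    preference bias (e.g. number of case distinctions); successors replaces
    one unfinished equation, yielding new hypotheses."""
    tie = itertools.count()                    # tie-breaker for equal costs
    frontier = [(cost(initial_hypothesis), next(tie), initial_hypothesis)]
    while frontier:
        _, _, hyp = heapq.heappop(frontier)    # currently best hypothesis
        if is_finished(hyp):                   # no unbound rhs variables left
            return hyp                         # terminating and correct by construction
        for succ in successors(hyp):           # analytically constructed successors
            heapq.heappush(frontier, (cost(succ), next(tie), succ))
    return None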
3 Computing Successor Sets of Equations
Three operations are applied to compute successor equations: (i) Partitioning of the inputs by replacing the pattern p of the equation by a set of disjoint more specific patterns; (ii) replacing the rhs by a (recursive) call of a defined function; and (iii) replacing the rhs subterms in which unbound variables occur by calls to new subprograms.
3.0.1 Refining a Pattern
Computing a set of more specific patterns, case (i), in order to introduce a case distinction, is done as follows: A position in the pattern p with a variable resulting from generalising the corresponding subterms in the subsumed example inputs is identified. The inputs are partitioned such that those with the same symbol at this position belong to the same subset. This yields a partition of the example equations. Now for each subset a new initial hypothesis is computed, leading to a set of successor equations. E.g., consider the examples (1) for Rev . The pattern of the initial equation is simply a single variable Q, since the example inputs have no common root symbol. The first example input consists of only the constant [ ]. All remaining example inputs have the list constructor cons as root. I.e., two subsets are induced, one containing the first example, the other containing the remaining examples. The lggs of the example inputs of these two subsets are [ ] and [Q|Qs] resp. which are the (more specific) patterns of the two successor equations.
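Both the initial hypotheses and the refined patterns rest on least general generalisations [8]; a minimal anti-unification sketch may help (ours, with terms encoded as nested ('functor', arg, ...) tuples, an assumed representation).

def lgg(s, t, table=None):
    """Least general generalisation (anti-unification) of two terms."""
    if table is None:
        table = {}                         # (subterm, subterm) -> shared variable
    if s == t:
        return s
    if (isinstance(s, tuple) and isinstance(t, tuple)
            and s[0] == t[0] and len(s) == len(t)):
        return (s[0],) + tuple(lgg(a, b, table) for a, b in zip(s[1:], t[1:]))
    if (s, t) not in table:                # same disagreement -> same variable
        table[(s, t)] = "Q%d" % len(table)
    return table[(s, t)]

# lgg of the list inputs [X] and [Y, Z]: yields ('cons', 'Q0', 'Q1'), i.e. [Q|Qs]
print(lgg(("cons", "X", "nil"), ("cons", "Y", ("cons", "Z", "nil"))))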
3.0.2 Introducing (Recursive) Function Calls and Help Functions
In cases (ii) and (iii) help functions are invented. This includes the generation of examples from which they are induced. For case (ii) this is done as follows: Function calls are introduced by matching the currently considered outputs, i.e., those outputs whose inputs match the pattern of the currently considered equation, with the outputs of any defined function. If all current outputs match, then the rhs of the current unfinished equation can be set to a call of the matched defined function. The argument of the call must map the currently considered inputs to the inputs of the matched defined function. For case (iii), the example inputs of the new defined function also equal the currently considered inputs. The outputs are the corresponding subterms of the currently considered outputs.
For an example of case (iii), consider the Rev examples except the first one, as they have been put into one subset in the previous section. The initial equation for these is:

Rev([Q|Qs]) = [Q2|Qs2]    (2)

It is unfinished due to the two unbound variables in the rhs. Now the two unfinished subterms (consisting of exactly the two variables) are taken as new subproblems. This leads to two new example sets for two new help functions Sub1 and Sub2: Sub1([X]) = X, Sub1([X, Y]) = Y, . . ., Sub2([X]) = [ ], Sub2([X, Y]) = [X], . . .. The successor equation-set for the unfinished equation contains three equations determined as follows: The original unfinished equation (2) is replaced by the finished equation

Rev([Q|Qs]) = [Sub1([Q|Qs]) | Sub2([Q|Qs])]

and from the new example sets initial equations are derived. Finally, as an example for case (ii), consider the examples for the help function Sub2 and the unfinished initial equation:

Sub2([Q|Qs]) = Qs2    (3)

The example outputs [ ], [X], . . . of Sub2 match the example outputs for Rev. That is, the unfinished rhs Qs2 can be replaced by a (recursive) call to the Rev function. The argument of the call must map the inputs [X], [X, Y], . . . of Sub2 to the corresponding inputs [ ], [X], . . . of Rev, i.e., a new help function, Sub3, is needed. This leads to the new example set Sub3([X]) = [ ], Sub3([X, Y]) = [X], . . .. The successor equation-set for the unfinished equation (3) contains the finished equation

Sub2([Q|Qs]) = Rev(Sub3([Q|Qs]))

and the initial equation for Sub3.
4 Conclusion and Future Research
IGOR2 integrates classical data-driven program induction techniques with search. Comparisons show that this approach is competitive with existing program induction methods regarding solvable problems and mostly solves problems faster [2]. In future work we will extend IGOR2 to higher-order functions such that well-known higher-order functions like Map can be used in induced programs.
REFERENCES
[1] P. Flener and S. Yilmaz, ‘Inductive synthesis of recursive logic programs: Achievements and prospects’, Journal of Logic Programming, 41(2–3), 141–195, (1999).
[2] Martin Hofmann, Emanuel Kitzelmann, and Ute Schmid, ‘Analysis and evaluation of inductive programming systems in a higher-order framework’. Submitted to ECML’08, http://www.cogsys.wiai.uni-bamberg.de/publications/ecml08submission.pdf, 2008.
[3] Susumu Katayama, ‘Systematic search for lambda expressions’, in Revised Selected Papers from the Sixth Symposium on Trends in Functional Programming, TFP 2005, ed., Marko C. J. D. van Eekelen, volume 6, pp. 111–126. Intellect, (2007).
[4] E. Kitzelmann and U. Schmid, ‘Inductive synthesis of functional programs: An explanation based generalization approach’, Journal of Machine Learning Research, 7, 429–454, (2006).
[5] Emanuel Kitzelmann, ‘Data-driven induction of recursive functions from input/output-examples’, in Proceedings of the ECML/PKDD 2007 Workshop on Approaches and Applications of Inductive Programming (AAIP’07), pp. 15–26, (2007).
[6] S. Muggleton and L. De Raedt, ‘Inductive logic programming: Theory and methods’, Journal of Logic Programming, Special Issue on 10 Years of Logic Programming, 19–20, 629–679, (1994).
[7] Roland Olsson, ‘Inductive functional programming using incremental program transformation’, Artificial Intelligence, 74(1), 55–83, (1995).
[8] G. D. Plotkin, ‘A note on inductive generalization’, in Machine Intelligence, volume 5, 153–163, Edinburgh University Press, (1969).
[9] J. R. Quinlan and R. M. Cameron-Jones, ‘FOIL: A midterm report’, in Proceedings of the 6th European Conference on Machine Learning, ed., P. Brazdil, LNCS, pp. 3–20, London, UK, (1993). Springer-Verlag.
[10] D. R. Smith, ‘The synthesis of LISP programs from examples: A survey’, in Automatic Program Construction Techniques, eds., A. W. Biermann, G. Guiho, and Y. Kodratoff, 307–324, Macmillan, (1984).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-783
CTRNN Parameter Learning using Differential Evolution Ivanoe De Falco1 and Antonio Della Cioppa2 and Francesco Donnarumma3 and Domenico Maisto1 and Roberto Prevete3 and Ernesto Tarantino1 Abstract. Target behaviours can be achieved by finding suitable parameters for Continuous Time Recurrent Neural Networks (CTRNNs) used as agent control systems. Differential Evolution (DE) has been deployed to search parameter space of CTRNNs and overcome granularity, boundedness and blocking limitations. In this paper we provide initial support for DE in the context of two sample learning problems. Key words: CTRNN, Differential Evolution, Dynamical Systems, Genetic Algorithms
1 INTRODUCTION
Insofar as Continuous Time Recurrent Neural Networks (CTRNNs) are universal dynamics approximators [1], the problem of achieving target agent behaviours can be redefined as the problem of identifying suitable network parameters. Although a variety of different learning algorithms exists, evolutionary approaches like Genetic Algorithms (GAs) are usually deployed to perform searches in the parameter space of CTRNNs [5]. However, GAs require some kind of network encoding, which may greatly influence parameter searches. In fact, the resolution of the parameters is limited by the bit resolution of the encoding (granularity), and the parameters cannot assume values falling outside an a priori fixed encoding interval (boundedness). Yamauchi and Beer [5] proposed a real-valued encoding for CTRNNs, which improves the learning process by allowing parameter values to be in R. However, problems arise that, in not rare cases, prevent real-valued GAs (rvGAs) from finding global optima (blocking) [2]. Here we propose an approach based on a Differential Evolution (DE) algorithm [4] which combines fast learning with the possibility of overcoming the limitations mentioned above. Section 2 introduces the DE algorithm. In Section 3 two sample CTRNN parameter search problems are solved with DE. Finally, in Section 4, the obtained results are discussed and future developments of this approach are proposed.
2 DIFFERENTIAL EVOLUTION
DE is a stochastic, population-based evolutionary algorithm [4] which addresses a generic optimization problem with m real parameters by starting with a randomly initialized population consisting of n individuals, each made up of m real values, and, subsequently, by updating the population from one generation to the next by means of many different transformation schemes commonly named strategies [4]. In all of these strategies DE generates new individuals by adding to an individual a number of weighted difference vectors made up of couples of population individuals. In the strategy chosen here, starting from xi, the i-th individual, a new trial individual x′i is generated by perturbing the best individual xbest by means of two difference vectors. The generic j-th candidate component is:

x′_{i,j} = x_{best,j} + F · [(x_{r1,j} − x_{r2,j}) + (x_{r3,j} − x_{r4,j})]

with four randomly generated integer numbers r1, r2, r3, r4 in {1, . . . , n}, differing from one another, and F the parameter which controls the magnitude of the differential variation. So in DE new candidate solutions are created by using vector differences, whereas traditional rvGAs rely on probabilistic selection, random perturbation (mutation) and on mixing (recombination) of individuals. The three phases of a standard rvGA (selection, recombination and mutation) are combined in DE in one operation which is carried out for each individual. According to this, in rvGAs not all the elements are involved in each phase of the generation of the new population, while, by contrast, the DE algorithm iterates through the entire population and generates a candidate for each individual.

1 ICAR-CNR, Naples, Italy – {ivanoe.defalco, domenico.maisto, ernesto.tarantino}@na.icar.cnr.it
2 DIIIE, Università di Salerno – adellacioppa@unisa.it
3 Università di Napoli Federico II – {donnarumma, prevete}@na.infn.it
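A minimal sketch of one generation under this strategy follows (ours; the greedy one-to-one replacement used as the combined selection step, and the minimisation convention, are assumptions about details not spelled out above).

import random

def de_generation(pop, fitness, F=0.5):
    """One DE generation with the chosen strategy:
    x'_{i,j} = x_{best,j} + F*((x_{r1,j} - x_{r2,j}) + (x_{r3,j} - x_{r4,j}))."""
    n, m = len(pop), len(pop[0])            # n individuals, m real parameters (n >= 5)
    best = min(pop, key=fitness)            # assumes a minimisation problem
    new_pop = []
    for i in range(n):
        r1, r2, r3, r4 = random.sample([j for j in range(n) if j != i], 4)
        trial = [best[j] + F * ((pop[r1][j] - pop[r2][j])
                                + (pop[r3][j] - pop[r4][j]))
                 for j in range(m)]
        # keep the better of trial and x_i: selection, recombination and
        # mutation combined in one operation, as described above
        new_pop.append(trial if fitness(trial) <= fitness(pop[i]) else pop[i])
    return new_pop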
3 EXPERIMENTS
We tested the efficacy of CTRNN training by DE on two sample experiments where the approach appears to overcome the problems outlined in Section 1. The parameters ruling the DE algorithm were assigned experimentally via a set of training trials.
3.1 Cusp point learning
Let us consider a CTRNN made up of a single self-connected neuron. The equation of the system is given by

$$\tau \cdot \dot{y} = -y + w\,\sigma(y + \theta) + I \qquad (1)$$
where for simplicity we set the time constant τ = 1 and the bias θ = 0. Notice that no elementary expression for the solution of (1) exists. Such a system has a cusp point, that is, the only bifurcation point in which the system undergoes a pitchfork bifurcation [3]. The goal of the experiment is to find this cusp point. To evaluate each network candidate (I, w) we let it evolve for a sufficiently long time T so that we can consider y(T) ≈ ȳ. Then we choose as fitness function F_CP(y(I, w)) = f_fixed + f_tan + f_cusp, with terms rewarding respectively the fixed point, non-hyperbolicity and cusp curve intersection conditions. Average and standard deviation values found for (I, w) in 10 runs using the DE algorithm are Ī = −2.00015 with a standard deviation equal to 1.6 · 10^−4 and w̄ = 4.0003 with standard deviation 3.1 · 10^−4. These values are remarkably close to the coordinates (Ĩ, w̃) = (−2, 4) of the cusp point, which can be formally inferred. Figure 1 shows the fitness trend as a function of the generation number for the average, best and worst case. The constant and smooth decrease suggests a gradual and continuous learning improvement as the generation number grows. In addition, the evident increasing resolution of the parameter values observable during DE runs demonstrates the possibility of tackling the granularity problem, theoretically having the machine precision as the only limit.
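For illustration, the fixed-point evaluation y(T) ≈ ȳ for a candidate (I, w) can be obtained by forward-Euler integration of equation (1), as sketched below (ours; step size, horizon and initial state are assumptions, and the three fitness terms are not reproduced).

import math

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def settle(I, w, tau=1.0, theta=0.0, y0=0.0, dt=0.01, T=100.0):
    """Euler integration of tau*dy/dt = -y + w*sigma(y + theta) + I."""
    y = y0
    for _ in range(int(T / dt)):
        y += (dt / tau) * (-y + w * sigma(y + theta) + I)
    return y                                # approximates a fixed point y-bar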
Figure 1. Cusp point learning: fitness plots of runs corresponding to the average, worst and best solutions as a function of the generation number.

3.2 Sequence generator task

The goal of this task is to train a control network able to switch between two different behaviours (fixed points 0 and 1) anytime a signal trigger is detected [5]. Focusing on a network of three neurons, we generate a random sequence for each generation, I = [bit1, . . . , bitM], where M is the length of the sequence and biti ∈ {0, 1} ∀i ∈ M. The length of every subsequence of 0 (no signal) or 1 (trigger) has been extracted from a Gaussian distribution. For every sequence generation we generate the desired target t = ⟨t1, . . . , tM⟩. We measure the output candidate y = ⟨ȳ3^1, . . . , ȳ3^M⟩ with a fitness function F_SG(y(w)) = F_HM(y(w)) + k · F_HD(y(w)), with the first term (the Hamming distance) and the second term respectively measuring how many times and how different the fixed point values are from the desired targets. We set k = 0.01 so as to weight the first contribution more than the second. In each of the 10 runs DE is able to find optimal solutions, even reaching the global minima. It is worth remarking that the weights found are very sparse (e.g., w ≈ 21.19 and w ≈ 1.86 · 10^18), so that by fixing a priori intervals many good solutions would become inaccessible. This sparseness suggests that DE is almost able to investigate the entire parameter space, allowing the surmounting of the boundedness problem. Moreover, each run passes through a different sequence of local minima, from which the DE algorithm has to escape. So the descent of the function towards the global minimum occurs in "steps" (see left of Figure 2). The right of Figure 2 illustrates how the search of the parameters continues even in the very proximity of optimal values, finding better and better solutions. Moving by vector differences in the parameter space, it is "as if" DE were capable of calibrating the magnitude and the direction of its steps towards reaching the minima. The result is that every run is able to overcome the blocking problem.

Figure 2. Sequence generator task. Left: fitness of three different runs plotted as a function of the generation number. Right: fitness of run 1 from the 1000-th to the 2500-th generation. The figures show DE avoiding blocking by escaping from local minima.

4 CONCLUSIONS
We showed two experiments solved by means of DE, which provides a simple and "physical" way to perform CTRNN parameter space search. The first experiment provides an example of how the granularity problem can be overcome. DE showed a high precision in determining the parameter values, which can still be improved by letting the execution run longer. The second experiment points to ways in which boundedness and blocking can be overcome, too, by a DE approach. Using only three neurons we solve the sequence generator task. The parameter values found are sparse, so fixing a priori intervals would have cut off many possible solutions. Furthermore, although each run passes through a sequence of local minima, the DE algorithm can escape from them, jumping step by step towards a better approximation of a global minimum. After these encouraging results, future studies will concern a direct comparison with rvGAs, particularly on the local minima trapping issue, and a deeper investigation of the theoretical details of the DE approach for CTRNN learning.
References
[1] Ken-ichi Funahashi and Yuichi Nakamura, ‘Approximation of dynamical systems by continuous time recurrent neural networks’, Neural Networks, 6(6), 801–806, (1993).
[2] David E. Goldberg, ‘Real-coded genetic algorithms, virtual alphabets, and blocking’, Complex Systems, 5, 139–167, (1991).
[3] J. K. Hale and H. Koçak, Dynamics and Bifurcations, Springer-Verlag, 1991.
[4] K. Price, R. Storn, and J. Lampinen, Differential Evolution: A Practical Approach to Global Optimization, Natural Computing Series, Springer-Verlag, 2005.
[5] Brian M. Yamauchi and Randall D. Beer, ‘Sequential behavior and learning in evolved dynamical neural networks’, Adaptive Behavior, 2(3), 219–246, (1994).
3. Model-Based Diagnosis and Reasoning
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-787
Incremental Diagnosis of DES by Satisfiability1
Alban Grastien and Anbulagan
NICTA and Australian National University
Abstract. We propose a SAT-based algorithm for incremental diagnosis of discrete-event systems. The monotonicity is ensured by a prediction window that uses the future observations to lead the current diagnosis. Experiments stress the impact of parameter tuning on the correctness and the efficiency of the approach.
1 Diagnosis by SAT
Diagnosis is the AI problem of determining whether a system is running correctly during a time window, and of identifying any failure otherwise. Consider a system which is completely modeled by a DES (basically a finite state machine) denoted Mod. This system is running and generates observations. The goal of the diagnosis is to determine from the model and the observations whether faulty events occurred on the system. The problem can be reduced to finding particular paths on the DES consistent with the observations [4]. Since failures are rare events, we can consider paths that minimize the number of faults. In [2], we proposed to solve the DES diagnosis problem with satisfiability (SAT) algorithms. SAT is the problem of finding an assignment of the variables of a given Boolean formula in such a way as to make the formula evaluate to true. Given an upper bound on the number of transitions in the paths that are considered, a diagnosis problem – finding a particular path – can be encoded as a SAT problem. The SAT-based algorithm then simply uses a SAT solver to look for a path with an increasing number of faults until a diagnosis is found.
2 Incremental Diagnosis by SAT
Incremental diagnosis (ID) consists in computing the diagnosis for a temporal window, and then updating this diagnosis to consider a larger temporal window. The incremental diagnosis can serve two purposes. First, it is used when the observations for the later temporal window are not immediately available: a diagnosis for the first temporal window is computed, and then must be updated as the other observations are provided. This is typically the case for on-line diagnosis, where the system is monitored while it is running. Second, an incremental approach can be used to simplify a non-ID problem. Given a diagnosis task on a large temporal window, the window is sliced into small windows to obtain simpler diagnosis problems. In both cases, the complexity of the ID must be independent of the previous diagnoses. This paper considers the second approach, where all the observations are available. The on-line problem contains additional issues mostly independent from ID.
1 This research was supported by NICTA in the framework of the SuperCom project. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
Consider that the observations on a window are denoted Obs, and denote ⊕ the concatenation of two windows. Note that the concatenation may be non-trivial in case of uncertain observations [3]. Given the diagnosis of Obs1 and the observations Obs2, the incremental diagnosis is the computation of the diagnosis of Obs1 ⊕ Obs2. The complexity of the incremental diagnosis of Obs1 ⊕ Obs2 given the diagnosis of Obs1 must not depend on the size of Obs1. It is wise to perform the diagnosis of Obs1 in such a way as to ease the ID of Obs1 ⊕ Obs2. In this case, the complexity of the diagnosis of Obs1 must not depend on the size of Obs2. Rather than diagnosing the whole period (t0, tn), we do an ID and diagnose n windows of size λ (ti+1 = ti + λ). The n diagnoses must be consistent with each other: the path computed for (t1, t2) must be a continuation of the path computed for (t0, t1). However, when the diagnosis of (t0, t1) is computed, we cannot be sure that the extracted path will be consistent with the next observations. In diagnosis, there is usually a delay between the occurrence of an event and the reception of observations proving this occurrence. However, this delay is generally bounded: it is unlikely that an observation will explain what happened several days or weeks ago. Thus, we ensure that the path of (t0, t1) is consistent not only with the observations (t0, t1) but also with the observations (t1, t1 + μ). This way, the diagnosis for this window should be globally consistent. The period of time (t1, t1 + μ) is called the prediction window of the diagnosis window (t0, t1). Note that the diagnosis is approximate, as the best global path may be lost if it includes the early occurrence of many faults.

Algorithm 1 Incremental Diagnosis(Mod, I, Obs, Que, λ, μ)
1: S(0) := I(0); // I represents the initial states
2: for i := 0; i < n; i++ do // diagnoses the window (ti, ti+1)
3:   while no solution found for (ti, ti + λ) do
4:     for (k := 0; k < K and no solution found; k++) do
5:       F := Mod(ti, ti + λ + μ) ∪ Obs(ti, ti + λ + μ) ∪ Que_k(ti, ti + λ) ∪ S(i);
6:       if SAT(F) is satisfiable then
7:         extract_path(SAT(F));
8:         S(i + 1) := extract_state(SAT(F));
9:   if no solution found for (ti, ti + λ) then // path reset
10:    S(i) = ∅

We propose Algorithm 1 for the ID of (t0, tn). Let K be the maximum number of faults that can occur during λ time steps. For each window (ti, ti + λ), the SAT solver tries to find a path starting from state S(i), consistent with the observations (ti, ti + λ + μ), by increasing the number of faults (lines 4–8). F is the CNF that models the set of constraints on the path we are looking for. When the path is found, the function extract_path extracts the path computed during (ti, ti + λ). The function extract_state computes S(i + 1)
in order to force the next path to be a prolongation of the current path. If no path is found starting from S(i) for (ti, ti + λ), the path for the previous window is not consistent with the new observations. For complexity reasons, backtracking is not allowed. The algorithm simply tries to find a new path that does not start from the previous path (line 10). We call this a path reset. When a path reset is performed, the path of (t0, tn) is not globally consistent. However, for most systems, it can be expected that the misinterpretation of the observations will be localised on only a small time frame.

3 Empirical Validation

The experiments are conducted on an Intel Pentium 4 PC running at 3 GHz CPU, under Linux, using MINISAT v2.0 [1]. For this study, we use the system presented in [2]. The maximum number of faults K is set to 1 + λ/2 in the experiment.

Table 1. Runtime in seconds of the MINISAT solver on nID satisfiable problem instances with n observations.

n  100  200  299  300  400  500  599  600  699  700  799  800  899  900  999  1000
t  116  19   >2d  >2d  268  127  >2d  153  832  669  185  574  151  370  >2d  3204

Table 1 shows the runtime required by MINISAT to find a scenario consistent with the n observations and containing k(n) ≈ n/8 faults. In this table, t > 2d means that the instance cannot be solved in 2 days. Note that this computation is not a diagnosis, in the sense that it should first be proved that there is no path with k faults where k < k(n), which is usually more expensive as these problems are unsatisfiable. Note also that the runtime does not increase linearly but in a chaotic way, witness the difference between n = 999 and n = 1000. We now run Algorithm 1 on the scenario of 1 000 observations, varying the parameter λ in the range {2, 5, 10, 20, 40} and the parameter μ in the range {0, 10, 20, 30, 40, 50, 100}.

Figure 1. Results of our incremental algorithm on nID problems: (a) number of path resets; (b) number of diagnosed faults; (c) number of calls to MINISAT; (d) total runtime of MINISAT; (e) total runtime; (f) runtime of ID solving. Panels (a)–(e) plot against the size μ of the prediction window for λ ∈ {2, 5, 10, 20, 40}; panel (f) plots runtime against the number of observations for pairs λ–μ in {2–100, 5–100, 10–100, 20–100, 40–100, 40–0}.

Quality of the diagnosis. Figure 1a presents the percentage of path resets, and Figure 1b gives the number of faults computed for each pair of parameters. These measure the quality of the diagnosis. An accurate diagnosis should have no reset and the smallest number d(0, 1000) of faults consistent with the observations (this value is unknown but less than 128). As expected, the number of resets decreases when the size of the prediction window increases. In this example, a value μ = 100 is sufficient to avoid any reset. Figure 1b also shows that a large diagnosis window partially avoids the bad-quality results of small prediction windows, though it generates a large number of path resets. This is simply because enlarging the size of the diagnosis windows makes the incremental diagnosis look more and more like non-incremental diagnosis.

Runtime. Figure 1c gives the number of calls to MINISAT, Figure 1d presents the total runtime of MINISAT, and Figure 1e presents the total runtime including the preprocessing time. All the computations are done in less than one hour, which is better than the incomplete computations of Table 1. The runtime generally increases when μ increases. Thus, a trade-off might be required here between quality and efficiency. Note that the tendency is inverted when λ is large because the number of path resets decreases; for large diagnosis windows, large prediction windows increase both quality and efficiency. Finally, note that the smallest runtime is not achieved with the smallest diagnosis windows but with medium-large diagnosis windows.

Incremental runtime. Figure 1f shows the evolution of the SAT runtime during the incremental diagnosis for some pairs λ, μ (other pairs lead to similar results). The experiments clearly show a linear runtime for most pairs of parameters. Note however that small prediction windows potentially generate peaks of computation. These results validate our approach. The incremental diagnosis of DES can be performed using SAT algorithms, and the runtime is lower than in a non-incremental approach. The results stress the importance of the parameters λ and μ, both for efficiency and for diagnosis correctness. These parameters should be tested off-line before running the diagnosis to address the quality of diagnosis required and the resources available.
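For clarity, the control flow of Algorithm 1 can also be sketched as follows (ours; encode, sat_solve, extract_path and extract_state are placeholders for the paper's CNF construction and the MiniSat calls, and a window that stays unsolvable even after a reset is not handled).

def incremental_diagnosis(encode, sat_solve, extract_path, extract_state,
                          initial_states, n, K):
    """Skeleton of Algorithm 1: n diagnosis windows, at most K faults each."""
    state = initial_states                  # S(0) := I(0)
    paths = []
    for i in range(n):                      # diagnose window (t_i, t_{i+1})
        model = None
        for start in (state, None):         # second pass = path reset, S(i) := {}
            for k in range(K):              # lines 4-8: increasing number of faults
                model = sat_solve(encode(i, k, start))
                if model is not None:
                    break
            if model is not None:
                break
        paths.append(extract_path(model))   # path over (t_i, t_i + lambda)
        state = extract_state(model)        # S(i+1): next path must continue this one
    return paths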
REFERENCES
[1] N. Eén and N. Sörensson, ‘An extensible SAT-solver’, in Sixth International Conference on Theory and Applications of Satisfiability Testing (SAT-03), (2003).
[2] A. Grastien, Anbulagan, J. Rintanen, and E. Kelareva, ‘Diagnosis of discrete-event systems using satisfiability algorithms’, in Proc. of 19th AAAI, pp. 305–310, (2007).
[3] A. Grastien, M.-O. Cordier, and Ch. Largouët, ‘Incremental diagnosis of discrete-event systems’, in Sixteenth International Workshop on Principles of Diagnosis (DX-05), pp. 119–124, (2005).
[4] G. Lamperti and M. Zanella, Diagnosis of Active Systems, Kluwer Academic Publishers, 2003.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-789
Characterizing and checking self-healability
Marie-Odile Cordier1 and Yannick Pencolé2 and Louise Travé-Massuyès2 and Thierry Vidal1
1 INTRODUCTION
Real-life complex systems are often required to offer high reliability and quality of service and must be provided with self-management abilities, even in faulty situations. They are expected to be self-aware of their current state and to survive autonomously the occurrence of faults, still managing to provide the desired functionality. In other words, such systems must be self-healing [2]. Designing self-healing systems requires being able to evaluate the joint degree of self-awareness and reactiveness. In the artificial intelligence community, these two properties are better known as diagnosability [3, 1], i.e. the capability of a system to exhibit different observables for different anticipated faulty situations, and repairability, i.e. the ability of a system and its repair actions to cope with any unexpected situation. Checking diagnosability and repairability separately leads to a conservative assessment of self-healability. In this paper, we show that neither standard diagnosability nor repairability of every anticipated fault is necessary to achieve self-healability. Our main contribution consists of defining self-healability as a joint property bridging diagnosability and repairability, which requires a new definition of diagnosability that allows diagnosable subsets of faults to overlap, as opposed to the standard definitions, which rely on a partition.
2 MAIN CONCEPTS
The presented framework, which is relevant for state-based or event-based systems, adopting the generic viewpoint defined in [1], is illustrated with discrete-event systems3, as our current objective is to apply it to service-oriented architectures like Web Services in the framework of the WS-DIAMOND European project [4].
Observations and Faults: The set of observable events is O = {o1, . . . , ono}. Complementing O with the set of unobservable events U = {u1, . . . , unu} determines the whole set of events of the system, E = O ∪ U. The occurrences of basic faults that might occur on the system are represented as specific unobservable events noted fi. In the following, we restrict ourselves to the single fault assumption (i.e. only one fault can be present in the system at a given time). The system can then be either in a nominal mode (absence of fault) or in one of the nf fault modes. The set of all possible system modes is hence given by F = {f0, f1, . . . , fnf}, where f0 = ok. T denotes the set of (infinite) possible trajectories (i.e. sequences of events) occurring in the system, while OBS is the set of all possible sequences of observable events. A trajectory τ ∈ T corresponds to only one observable σ, while one σ may correspond to several distinct trajectories.

1 IRISA/INRIA/Université de Rennes 1; Campus de Beaulieu, F-35042 Rennes cedex, France, email: marie-odile.cordier, thierry.vidal@irisa.fr
2 LAAS-CNRS; Université de Toulouse; 7, Avenue du Colonel Roche, F-31077 Toulouse cedex, France, email: yannick.pencole, louise@laas.fr
3 We assume the liveness of the observations [3].
(Figure: the global model of the discrete-event system used as a running example, an automaton over the observable events o1, . . . , o6 and the unobservable fault events f1, . . . , f4.)
The above figure represents the global model of a discrete-event system. The set of fault modes is F = {ok, f1, f2, f3, f4}. Fault events are not observable, the other events being observable. o1 o5^∞ is both a trajectory and the observable obtained over that trajectory, including an infinite sequence of o5. o2 o2 f2 o2^∞ is another trajectory, yielding the observable o2^∞. f1 o2^∞ is yet another trajectory which, interestingly enough, yields the same observable o2^∞, which means these two trajectories cannot be discriminated from the observations.
Macrofaults: It is not always possible to know with certainty in which mode a system is. It is often not even necessary with respect to repairability. This is why we define the concept of macrofault, which represents the belief state referring to the system mode. A macrofault can be seen as an abstraction of system modes. For instance, if a pipe can be in the two basic fault modes leaking or blocked, it can also be said to be in an abnormal macrofault mode, where abnormal corresponds to leaking or blocked. A macrofault Fj is described by a non-empty set of fault modes. With our single fault assumption, an 'occurrence' of Fj means that exactly one of the faults fi ∈ Fj has occurred in the system. For instance, the macrofault {f1, f2} represents the fact that either f1 or f2 has occurred. A macrofault may be a singleton (Fj = {fi}). If all basic faults appear in a set of macrofaults E(F) ⊆ 2^F, then it is called a covering set.
Repairs: A repair plan is defined in a simplified way as, for our purpose, only the existence of such repair plans and their matching to (basic) faults is relevant. The set of available repair plans is denoted R = {r1, . . . , rnr}. The predicate Repair relates repair plans to (macro)faults: Repair(rk, Fi) means that applying the repair plan rk brings back the system into a nominal state, under the condition that the system is in one of the modes described by the macrofault Fi.4
4 rok, the (void) repair plan such that Repair(rok, ok), is assumed to exist.
For instance, the repair plan r1 such that Repair(r1, {f1, f2}) can be executed only if either f1 or f2 has occurred. Having a repair plan for a macrofault is equivalent to having a repair plan for all the basic faults belonging to the macrofault, hence the following property: Repair(rk, Fj) ≡ ∀fi ∈ Fj, Repair(rk, {fi}).
3 SELF-HEALABILITY

Self-healability is intuitively defined by "A system is self-healing if, and only if, after the occurrence of any basic fault, a diagnosis is issued that automatically raises a repair plan fitted to the fault." Behind this intuitive definition, two properties of the system are hidden: diagnosability and repairability.
Diagnosability: Diagnosability relies on the notion of fault signatures [1]. Intuitively, a fault signature is the association between a fault and a set of possible observables. We use the following notations:
• The predicate yields(fi, σ) means that there exists at least one trajectory in which fi ∈ F is present and that yields the observable σ ∈ OBS. The predicate yields can be generalized to macrofaults: yields(Fj, σ) means that ∃fi ∈ Fj such that yields(fi, σ). σ is then called an elementary signature, or e-signature, of the fault Fj.
• MF(σ) is the (unique) macrofault containing all faults that may yield σ, i.e. MF(σ) = {fi such that yields(fi, σ)}. MF can be generalized to sets of e-signatures: MF(Σ) = ⋃_{σ∈Σ} MF(σ).
In this work, we are not interested in checking that any basic fault can be diagnosed, but we are interested in finding the level of diagnosability of a system. This is why the partition of faults classically used is replaced by a set of macrofaults possibly sharing common faults. Still, each macrofault must be associated to distinct observables, and the corresponding sets of observables need to form a partition. Hence the following new definition of diagnosability, which extends the classical definition and is suitable for self-healability.
Definition 1 (Diagnosability of a set of macrofaults) The covering set E(F) is diagnosable, noted Diagnosable(E(F)), iff there exists a partition π = {Σ1, . . . , Σm} of the observables OBS such that: E(F) = {MF(Σj), Σj ∈ π}.
Example: A first straightforward set of macrofaults is E(F) = {F} = {{ok, f1, f2, f3, f4}}, in which faults are indistinguishable: obviously it is diagnosable, the partition being π = {{o1 o5^∞, o1^∞, o2^∞, o3 o2^∞, o3^∞, o4^∞, o6^∞}} = {OBS}. The set of macrofaults E1(F) = {{ok}, {f1, f2}, {f1, f3}, {f4}} is diagnosable with π1 = {{o1 o5^∞}, {o2^∞, o3 o2^∞}, {o1^∞, o3^∞, o6^∞}, {o4^∞}}. Note that E1(F) also corresponds to another partition π2 = {{o1 o5^∞}, {o2^∞, o3 o2^∞, o6^∞}, {o1^∞, o3^∞}, {o4^∞}}. E2(F) = {{ok}, {f1}, {f2}, {f3}, {f4}} is not diagnosable because there are some cases in which f1 and f2 cannot be discriminated; there is no partition of observables associated with it (the same holds for f1 and f3).
Repairability: A macrofault Fj is repairable if and only if there exists a repair plan that repairs it: Repairable(Fj) ≡ ∃rk such that Repair(rk, Fj). The repairability of a set of macrofaults is then defined as the repairability of all the macrofaults in the set.
Definition 2 (Repairability) A set of macrofaults E(F) is repairable, noted Repairable(E(F)), iff ∀Fj ∈ E(F), Repairable(Fj).
Example: If the only repair plan is r, with Repair(r, {f1, f3}), we indeed get Repairable({f1, f3}), and also Repairable({f1}) and Repairable({f3}). However, the system is not repairable since the faults f2 and f4 are not repairable.
Self-healability: Our definition of self-healability directly derives from the definitions of diagnosability and repairability.
Definition 3 (Self-healing set of macrofaults) A set E(F) is self-healing iff it is diagnosable and repairable, i.e. SelfHealing(E(F)) ≡ Diagnosable(E(F)) and Repairable(E(F)).
Definition 4 (Self-healing system) A system is self-healing iff there exists a self-healing covering set E(F).
Example: If Repairable(ok), Repairable({f1, f3}), Repairable({f1, f2}) and Repairable({f4}), then the set E1(F) = {{ok}, {f1, f2}, {f1, f3}, {f4}} is diagnosable and repairable. The system is self-healing. If Repairable(ok), Repairable({f1, f3}), Repairable({f2}) and Repairable({f4}), then the system is not self-healing, as there does not exist a repair plan for {f1, f2}. Due to lack of space, the algorithm to check whether a system is self-healing is not given.
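On a finite abstraction the definitions suggest a simple check (a sketch under two assumptions: the dictionary encoding is hypothetical, and every basic fault yields some observable). Grouping observables with identical MF(σ) gives the finest diagnosable covering set, so the system is then self-healing exactly when each of these macrofaults is repairable.

def self_healing(yields, repair_plans):
    """yields: observable -> set of basic faults that may yield it, i.e. MF(sigma);
    repair_plans: the macrofaults Fj for which some Repair(r, Fj) holds."""
    macrofaults = {frozenset(mf) for mf in yields.values()}
    plans = [frozenset(p) for p in repair_plans]
    # Repair(r, Fj) repairs each f_i in Fj, hence any macrofault included in Fj
    return all(any(mf <= plan for plan in plans) for mf in macrofaults)

# Toy encoding consistent with the running example: E1(F) with plans
# for {ok}, {f1, f2}, {f1, f3} and {f4}
yields = {"o1o5^inf": {"ok"}, "o2^inf": {"f1", "f2"}, "o3o2^inf": {"f1", "f2"},
          "o1^inf": {"f1", "f3"}, "o3^inf": {"f1", "f3"}, "o6^inf": {"f1", "f3"},
          "o4^inf": {"f4"}}
print(self_healing(yields, [{"ok"}, {"f1", "f2"}, {"f1", "f3"}, {"f4"}]))  # True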
4 CONCLUSION AND PERSPECTIVES
The main contributions of this paper are, first, a new and original definition of diagnosability which allows diagnosing possibly overlapping sets of non-discriminated faults, and then, using that definition, a thorough and integrated definition of the self-healability of a dynamic system. Interestingly enough, diagnosability of each basic fault is not required; what is needed is a diagnosability level that can be matched to the existing repairs. As far as we know, it is the first time that such a definition has been issued. We are currently applying our work to web services in the framework of the WS-DIAMOND European project [4], in which we investigate a number of extensions to address more sophisticated and realistic cases, mostly in terms of the characterization of repair plans, their properties and conditions of applicability. One of the problems is how to deal with multiple faults that may appear sequentially. Another interesting issue refers to temporal conditions that may restrict the applicability of repairs and be in conflict with the time needed to diagnose a fault.
REFERENCES
[1] M.-O. Cordier, L. Travé-Massuyès, and X. Pucel, ‘Comparing diagnosability in continuous and discrete-event systems’, in 17th International Workshop on Principles of Diagnosis, eds., C. A. González, T. Escobet, and B. Pulido, pp. 55–60, (June 2006).
[2] D. Ghosh, R. Sharman, H. R. Rao, and S. Upadhyaya, ‘Self-healing systems – survey and synthesis’, Decision Support Systems, 42(4), 2164–2185, (2007).
[3] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis, ‘Diagnosability of discrete event systems’, IEEE Transactions on Automatic Control, 40(9), 1555–1575, (1995).
[4] The WS-DIAMOND team, ‘WS-DIAMOND: Web services DIAgnosability, MOnitoring and DIagnosis’, in 18th International Workshop on Principles of Diagnosis, DX’07, pp. 243–250, Nashville (TN, USA), (May 2007).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-791
Improving robustness in consistency-based diagnosis using possible conflicts
Belarmino Pulido, Anibal Bregon, Carlos Alonso-González
Intelligent Systems Group (GSI), Department of Computer Science, University of Valladolid, Spain. email: {belar,anibal,calonso}@infor.uva.es
Abstract. Behaviour simulation in Consistency-based Diagnosis requires knowing the initial value. This assumption is not easily fulfilled in real systems, even in the presence of measurements related to state variables, due to noise and parameter uncertainties. This work proposes the integration of state observers to estimate initial states for simulation in consistency-based diagnosis with possible conflicts, using the BRIDGE framework, and proposes an extension for a class of dynamic systems. Suitable state-observer structural models are obtained through the same algorithms used to find possible conflicts – minimal subsystems with analytical redundancy – without additional knowledge in the models.
1 Introduction
Two research communities have traditionally approached the problem of Model-Based Diagnosis (MBD) in different but complementary ways: the Fault Detection and Isolation (FDI) community, born in the Automatic Control world, and the Diagnosis (DX) community, rooted in the Artificial Intelligence field. The FDI community uses control and statistical decision theories to carry out the fault detection and isolation stages. The main issue in this approach is fault detection robustness. This field has solid theoretical results for linear systems [4, 1], while the analysis of non-linear systems is a major research issue. The DX approach has a solid theoretical foundation for static systems, fault localization and identification being its main research issues. Consistency-based diagnosis (CBD) is the most used approach, and GDE is its computational paradigm [5]. Recently, the BRIDGE community [3] established a common framework for sharing results and techniques; it is based on the comparison between CBD via conflicts [5] and FDI via analytical redundancy relations (ARRs) obtained through structural analysis [1]. This work is based on the Possible Conflicts approach [9], PCs for short, an off-line dependency compilation technique from the DX community. CBD using PCs is based on the on-line simulation of subsystems equivalent to conflicts. The approach needs the initial value of state variables to re-start the simulation. The main goal of this work is to improve the robustness of the method through a more precise estimation of the initial state, without modifying its fault isolation capabilities or its consistency-based approach. Based on the similarity between PCs and ARRs in the BRIDGE framework [9], this work uses PCs to design state observers, which are used to estimate the initial states for simulation. This article is organized as follows. First, assumptions, techniques, and working principles are shown. Second, a new way to derive the structure of state observers from a dynamic system using possible conflicts, and a method for integrating simulation and state observers, are introduced. Finally, results and a discussion of related works are provided.
2 PCs, ARRs, and conflicts in the BRIDGE framework
Possible conflicts are those sub-systems capable of becoming conflicts in CBD, i.e., minimal subsets of equations containing enough analytical redundancy to perform fault diagnosis. Computation of PCs is performed on an abstract model of the set of equations in the system description, and PCs are obtained off-line via two core concepts: minimal evaluation chains (MECs) and minimal evaluation models (MEMs). MECs are minimal over-constrained sets of relations, and they represent a necessary condition for a conflict to exist. MEMs are local propagation paths that describe how to use the relations of a MEC to predict behavior and to provide redundancy. Each MEM describes an executable model, which can be used to perform fault detection. PCs fit in the BRIDGE framework, which was defined for static systems [3]. It was demonstrated that PCs are equivalent to potential conflicts and to the support for minimal ARRs [9]. This work will provide a specific extension for a class of dynamic systems. First, the influence of temporal information on the calculation of PCs and ARRs must be analyzed. Concepts will be illustrated using the following system.
3 Case study
The system (Figure 1) is made up of a water tank, T, a valve, V, and a PID controller that acts on the valve through the command uc to keep the level of the tank, h, close to its reference, href. Other elements are the input flow sensor, Qi, and the output flow sensor, Qo.
[Figure 1. Our system is made up of a tank, a valve, and a controller (signals: href, Qi, h, uc, Qo).]
4 Using PCs to design and integrate state observers
While DX approaches have opted for simulation techniques – known as the integral approach to behavior estimation – relying mainly on qualitative models, the FDI community has traditionally opted for numerical models and has rejected simulation. Most FDI methods rely upon derivative estimation [1] – known as the derivative approach – which has problems related to disturbances and uncertainties. It is known that the integral and derivative approaches can provide equivalent results for behavior estimation with numerical models [2]. Moreover, in the FDI community, simulation, estimation, and state observers are equivalent for linear models [4]. In fact, parity- and observer-based approaches provide residuals with similar structures.
Comparing the structure of state observers and parity equations, several authors have already proved that they can be equivalent [6], according to the general system description, which can be seen as:

  dX̂(t)/dt = A · X̂(t) + B · U(t) + K · (Y(t) − Ŷ(t))    (1)
  Ŷ(t) = C · X̂(t)    (2)
Depending on the selected value of the gain K, we obtain a range of estimators: simulation for K = 0, and prediction for A = K · C. Other values of K provide a state observer.

Dependency compilation and state-observer design. Models are made up of instantaneous (static) and differential (dynamic) constraints. The algorithms used to find PCs provide an interesting side result if integral causality – the integral approach – is used: the minimal evaluable model can be implemented as a simulator or as a state observer. Proposition: Those MEMs containing a state variable can provide the minimal structural description for a state observer, if there exists one instantaneous constraint between the estimated state variable and its observed value.¹

Integration proposal: increasing robustness with state observers. State observers generate a state-variable estimation in the absence of faults, with noise in the sensors and small parameter disturbances, and this estimation can be compared against observations for fault detection. Their main drawbacks are the small persistence of activated residuals, short activation times (noise), and some fault masking. On the other hand, simulation over an interval, Δt, using a dissimilarity comparison (DTW) in the interval, has different detection capabilities, being less sensitive to noise in the measurements. Semi-closed loop simulation iteratively introduces observations as initial conditions when the simulation interval has elapsed. Our proposal is the integration of state observers within the CBD framework with possible conflicts, because observers will improve the estimations of the different states of the possible conflicts in the absence of faults, and they will not interfere with the behaviour of the possible conflicts in faulty situations. Running both MEMs in parallel, and assuming there is no fault detection, the state estimation given by the state observer can be used as the initial state for the possible-conflict simulation. This simple integration scheme shows the power of the proposal. The decision step in fault detection can be tuned, giving more weight to speed or to the false alarm ratio, the level of noise or parameter uncertainty, etc.
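To make this relationship concrete, the following minimal sketch (our illustration, not the authors' implementation) runs the estimator of Eqs. (1)–(2) in discrete time, assuming a forward-Euler discretisation with unit sampling period. Setting K to zero makes the correction term vanish, so the loop degenerates to pure open-loop simulation, as noted above.

```python
import numpy as np

def estimate_states(A, B, C, K, x0, inputs, outputs, dt=1.0):
    """State estimation following Eqs. (1)-(2), discretised with a
    forward-Euler step of length dt (an assumed discretisation).
    With K = 0 this is open-loop simulation of the model."""
    x = np.asarray(x0, dtype=float)
    history = []
    for u, y in zip(inputs, outputs):
        y_hat = C @ x                           # Eq. (2): predicted output
        dx = A @ x + B @ u + K @ (y - y_hat)    # Eq. (1): observer dynamics
        x = x + dt * dx
        history.append(x.copy())
    return history
```

In the integration scheme above, such an observer would run in parallel with the semi-closed loop simulation of a possible conflict and, as long as no fault is detected, its estimate provides the initial state for the next simulation interval.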
5 Results on the case study
The study was made on a data set comprising several simulation scenarios for each fault mode in the plant. We introduced noise in the measurements (5%) and model uncertainty (5%). Each simulation lasted 1000 seconds and contained several changes in the reference level of the tank. We randomly generated fault magnitudes at different time instants within the interval [420, 480]. We have reduced the mean and maximum values of the thresholds for fault detection by integrating state observers within the PCs computation in nominal situations (see Table 1, upper part). For faulty situations, detection times for fault occurrences at different times are shown for PCs, state observers, and the integration of both. Faults are pipe blockages of 10% and 30%, with 5% sensor noise and 5% parameter disturbances (see Table 1, middle and lower parts). Due to space limitations we do not provide results for other fault modes, which were similar.
1 Proof and examples of this proposition can be found in [8].
Nominal situation (detection thresholds):
            PC1+EST1        PC2+EST2        PC3+EST3
            Med.    Max.    Med.    Max.    Med.    Max.
Δ = 30      9.16    32.92   49.56   58.48   22.93   52.42
Δ = 60      5.82    23.78   40.15   55.99   21.54   51.82

Faulty situation: 10% pipe blockage
Fault arises at t =   420   430   440   450   460   470
PC1                   no    no    no    540   no    no
EST1                  fp    646   fp    fp    647   660
INT1                  540   540   480   540   540   540
PC2                   no    no    no    no    fp    540
EST2                  no    no    no    no    no    no
INT2                  480   540   540   fp    540   480

Faulty situation: 30% pipe blockage
Fault arises at t =   420   430   440   450   460   470
PC1                   480   480   480   540   540   540
EST1                  437   447   452   fp    457   572
INT1                  480   480   480   480   480   480
PC2                   540   480   540   540   480   480
EST2                  422   427   432   436   442   446
INT2                  480   480   480   480   480   480

Table 1. Results for the case study. fp is a false positive; no is a false negative. PC3 is not shown for the faulty situations because it is not affected by these faults.
6 Conclusions
Based on the identical fault isolation capabilities of FDI approaches and of PCs and ARRs in the BRIDGE framework, our proposal is to use the algorithms for computing PCs as a tool for state-observer design. These algorithms can provide the structure of MEMs, which can be implemented as state observers without including additional constraints in the model. Our work proposes a simple integration of those two expressions of the same MEM in a PC, where possible: use a state observer for initial state estimation, then use the estimation for a semi-closed loop simulation. The decision logic for fault detection can be tailored for each system to achieve the desired detection or false alarm rates. Results on a simulated plant are promising. We are testing on more demanding scenarios. The combination of state observers and CBD has been done before [7], but the integration – fault detection only with state observers – and the isolation stage – propagation back and forward in a temporal-causal graph – were different from those proposed in this work.

Acknowledgments: This work was supported by the Spanish Ministry of Education and Culture (MEC 2005-08498).

REFERENCES
[1] M. Blanke, M. Kinnaert, J. Lunze, and M. Staroswiecki, Diagnosis and Fault Tolerant Control, Springer, 2003.
[2] M.J. Chantler, T. Daus, S. Vikatos, and G.M. Coghill, 'The use of quantitative dynamic models and dependency recording engines', in Procs. of DX'96, pp. 59–68, Val Morin, Quebec, Canada, (1996).
[3] M.O. Cordier, P. Dague, F. Lévy, J. Montmain, M. Staroswiecki, and L. Travé-Massuyès, 'Conflicts versus analytical redundancy relations: a comparative analysis of the model-based diagnosis approach from the artificial intelligence and automatic control perspectives', IEEE Trans. Syst. Man Cy. B, 34(5), 2163–2177, (2004).
[4] J.J. Gertler, Fault Detection and Diagnosis in Engineering Systems, Marcel Dekker, Inc., Basel, 1998.
[5] W. Hamscher, L. Console, and J. de Kleer (Eds.), Readings in Model-Based Diagnosis, Morgan Kaufmann Pub., San Mateo, 1992.
[6] R. Isermann, Fault-Diagnosis Systems: An Introduction from Fault Detection to Fault Tolerance, Springer-Verlag, 2006.
[7] P. Mosterman and G. Biswas, 'Diagnosis of continuous valued systems in transient operating regions', IEEE Trans. Syst. Man Cy. B, 29(6), 554–565, (1999).
[8] B. Pulido, C. Alonso, A. Bregón, V. Puig, and T. Escobet, 'Analyzing the influence of temporal constraints in possible conflicts calculation for model-based diagnosis', in Procs. of DX'07, USA, (2007).
[9] B. Pulido and C. Alonso-González, 'Possible conflicts: a compilation technique for consistency-based diagnosis', IEEE Trans. Syst. Man Cy. B, 34(5), 2192–2206, (2004).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-793
Dependable Monitoring of Discrete-Event Systems with Uncertain Temporal Observations

Gianfranco Lamperti and Marina Zanella1

Abstract. In discrete-event system monitoring, a set of candidate diagnoses is output at the reception of each observation fragment. However, when the observation is uncertain, this result may not be dependable: the sets of diagnoses relevant to consecutive observation fragments may be unrelated to one another and, even worse, they may be unrelated to the actual diagnosis. To cope with this problem, the notion of monotonic monitoring is introduced, which is supported by specific constraints on the fragmentation of the uncertain observation, leading to the notion of stratification.
1 INTRODUCTION
Model-based diagnosis of discrete-event systems (DESs) has aroused great interest [3, 2, 8]. A DES consists of several components, where the behavior of each component is represented by an automaton. Interconnections between components can be modeled explicitly [1] and/or implicitly, that is, a communication buffer may be indistinguishable from a component [8]. Several state changes of distinct components can occur simultaneously [9, 10], or not [1, 7]. Two diagnostic tasks inherent to DESs can be singled out, a-posteriori diagnosis [1] and monitoring-based diagnosis [6], both requiring an observation as input. Therefore, observation features and models have been investigated [5, 4]. This paper defines monotonicity, a property that consists in producing as output, at each monitoring step, a set of diagnoses that includes the actual diagnosis, and discusses the granularity with which a temporally uncertain observation has to be processed by any sound and complete problem-solving method so that diagnostic results are monotonic, whatever the DES at hand.
2 MONITORING
Given a system Σ and an initial state Σ0, each evolution of Σ is confined within the behavior space, Bhv(Σ, Σ0). The latter is a directed graph rooted in Σ0, where each node is a state of Σ and each arc is a transition. Each (possibly empty) sequence of transitions rooted in Σ0 is a history h of Σ. Let T be the domain of transitions in Σ and Lo a domain of observable labels. A viewer V of Σ is a function from T to (Lo ∪ {ε}), where ε is the null label. If (T, ε) ∈ V then T is silent, else T is visible. The signature h[V] is the sequence of observable labels relevant to h, h[V] = ⟨ℓ | T ∈ h, (T, ℓ) ∈ V, ℓ ≠ ε⟩. Ideally, the signature should represent how h is observed outside Σ. However, what is actually perceived is the observation of Σ, O = (N, A), which is a directed acyclic graph (DAG) where N is the set of nodes and A the set of arcs, with the following uncertainty properties:

(Logical uncertainty) Each label ℓ in the signature corresponds to a node in O; such a label is perceived as a subset of (Lo ∪ {ε}) of candidate labels, necessarily including ℓ;

(Node uncertainty) Additional (spurious) nodes are possibly inserted into O, each of which is associated with a subset of candidate labels necessarily including ε;
1 Università di Brescia, Italy, e-mail: {lamperti,zanella}@ing.unibs.it
(Temporal uncertainty) The absolute temporal ordering of the signature is relaxed to a partial ordering (with the latter being consistent with the former).

The extension of a node N in N, written ‖N‖, is the set of labels embodied in N. A candidate signature of O is a sequence of labels obtained by first picking a label from each ‖N‖, N ∈ N, without violating the ordering imposed by A, and then removing the ε labels from the sequence. The extension of O, ‖O‖, is the whole set of candidate signatures of O.

Proposition 1. h[V] ∈ ‖O‖.

The ruler R of Σ is a mapping from T to (Lf ∪ {ε}), where Lf is a set of fault labels. If (T, ε) ∈ R then T is normal, else T is faulty. The diagnosis h ⊗ R is the set of fault labels {ℓ | T ∈ h, (T, ℓ) ∈ R, ℓ ≠ ε}. A diagnosis is empty when all transitions in h are normal. The uncertain observation taken as input by a monitoring task is not given as a whole but, rather, as a list of several fragments, where each fragment is composed of one or several nodes in N along with the relevant temporal constraints (arcs) in A. Formally, a fragmentation of O = (N, A) is a sequence 𝕆 = ⟨F1, …, Fn⟩ where each fragment Fi = (Ni, Ai), i ∈ [1..n], is such that {N1, …, Nn} and {A1, …, An} are partitions of N and A, respectively. Each fragment Fi represents a set of observable events received in the current time interval. Each node in Ni is an event, and Ai includes all and only the temporal relationships linking the nodes in Ni with their parent nodes. Without loss of generality, the parents of nodes in a new fragment are required to be in the fragments received up to now. Each nonempty prefix ⟨F1, …, Fi⟩ of 𝕆 corresponds to a sub-observation O[i] = (N[i], A[i]), where

  N[i] = ⋃_{j=1..i} Nj,   A[i] = ⋃_{j=1..i} Aj    (1)
The empty sub-observation is O[0] = (∅, ∅). If O is known, 𝕆 is univocally defined by the sequence of the Ni, as Ai necessarily includes all (and only) the arcs entering nodes in Ni. For each i ∈ [0..n], we can define a sub-problem ℘[i](Σ) = (Σ0, V, O[i], R), where the solution of ℘[i](Σ), written Δ(℘[i](Σ)), consists of a sound and complete set of candidate diagnoses, with each diagnosis being entailed by a history h whose signature conforms to O[i]. As such, Δ(℘[i](Σ)) = {δ | δ = h ⊗ R, h ∈ Bhv(Σ, Σ0), h[V] ∈ ‖O[i]‖}.

A monitoring problem ℳ(Σ) = (Σ0, V, 𝕆, R) is a 4-tuple involving an initial state, a viewer, a fragmented observation, and a ruler. Its solution, written Δ(ℳ(Σ)), is the sequence of the solutions of the diagnosis sub-problems ℘[i](Σ), i ∈ [0..n], that is, Δ(ℳ(Σ)) = ⟨Δ(℘[0](Σ)), …, Δ(℘[n](Σ))⟩.

Example 1. Let ℳ(Σ̄) = (Σ̄0, V̄, 𝕆̄, R̄) be a monitoring problem inherent to a DES called Σ̄ whose behavior space, rooted in Σ̄0 (node 0), is displayed in Fig. 1 along with its viewer V̄ (white matrix cells) and ruler R̄ (gray cells). Suppose that the actual history is h̄ = ⟨X1, X2, Y2, Z4, Z3, Y4, W4, Z2, X1⟩, and the relevant observation is Ō, depicted in Fig. 2. A possible fragmentation 𝕆̄ of Ō is defined by the following sequence of sets of nodes: ⟨{N1, N3}, {N2}, {N4}, {N5}⟩. Then, Δ(ℳ(Σ̄)) = ⟨Δ0, Δ1, Δ2, Δ3, Δ4⟩, where Δ0 = {∅}, Δ1 = {{w}}, Δ2 = {{w}, {x}, {x, y}}, Δ3 = {{w}, {x}, {x, y}, {x, y, z}}, and Δ4 = {{w}, {x, y, z}}.

[Figure 1. Behavior space (a), and viewer and ruler matrix (b) for Σ̄.]
[Figure 2. Observation Ō for Σ̄.]

Example 1 shows that the solution of a monitoring problem can be disappointing. In fact, at monitoring step 1, one is induced to believe that w is a quite certain fault but, from iteration 2 on, fault w is not certain any more. The rationale behind this deceitful behavior is that any sound and complete set of outputs complies with the whole observation received so far as if it were a complete observation, while it is not. Therefore, the extension of the observation may change non-monotonically from one step to another, thus producing the highlighted negative effect.

Let ℳ(Σ) = (Σ0, V, 𝕆, R), where 𝕆 = ⟨F1, …, Fn⟩. Let ⟨Δ0, Δ1, …, Δn⟩ be the solution of ℳ(Σ) and δ the actual (unknown) diagnosis of the actual (unknown) history of Σ. We say that ℳ(Σ) is monotonic iff ∀i ∈ [0..(n−1)] there exists δi ∈ Δi such that δ0 ⊆ δ1 ⊆ … ⊆ δn−1 ⊆ δ.

Example 2. The monitoring problem ℳ(Σ̄) in Example 1 is not monotonic: the actual diagnosis is δ̄ = {x, y, z}, for which the monotonicity condition does not hold, as Δ1 = {{w}} includes no diagnosis that is a subset of δ̄.

The monotonicity of a monitoring problem ℳ(Σ) depends on the nature of the fragmentation of O. The trivial fragmentation, involving the whole observation O as the unique fragment, supports monotonicity, but this is in fact a-posteriori diagnosis, not monitoring. Thus, we are interested in nontrivial fragmentations that guarantee monotonicity, independently of the specific system at hand, namely nontrivial stratified observations. A fragmentation 𝕆 = ⟨F1, …, Fn⟩ is stratified iff for each fragment Fi = (Ni, Ai), i ∈ [1..n], we have

  ∀N ∈ Ni (Unrl(N) ⊆ Ni)    (2)

where Unrl(N) is the set of all the nodes in N whose reciprocal emission order with respect to N is unknown. A stratified fragmentation is called a stratification and each fragment a stratum. Condition (2) requires that all nodes that are neither ancestors nor descendants (namely, unrelated) of nodes in the i-th stratum be in the i-th stratum themselves.

Proposition 2. Let 𝕆 = ⟨F1, …, Fn⟩ be a stratified observation. Then, for each i ∈ [1..n], ‖O[i]‖ is composed of all the signatures in ‖O[i−1]‖ (possibly) extended with further observable labels.

Proposition 3. A monitoring problem involving a stratified observation is monotonic.

Example 3. Consider a variant ℳ′(Σ̄) of the monitoring problem defined in Example 1, where 𝕆̄ is replaced by the stratification ⟨{N1}, {N2, N3}, {N4}, {N5}⟩. Then, Δ(ℳ′(Σ̄)) = ⟨Δ′0, Δ′1, Δ′2, Δ′3, Δ′4⟩, where Δ′0 = {∅}, Δ′1 = {∅, {w}, {x}, {y}}, Δ′2 = {{w}, {x}, {x, y}}, Δ′3 = {{w}, {x}, {x, y}, {x, y, z}}, and Δ′4 = {{w}, {x, y, z}}. As expected, ℳ′(Σ̄) is monotonic.

Property (2) is conserved when several contiguous strata are grouped together to form coarser-grained fragments. The contrary does not hold: when two or more contiguous fragments are obtained by splitting a single stratum, stratification may be lost. In the finest stratification, strata cannot be further split without losing stratification.

Proposition 4. The finest stratification is unique.

Proposition 5. The finest stratification of an observation represented by a disconnected DAG is the trivial fragmentation.

Proposition 6. Let Δi and Δi+1, i ∈ [0..(n−1)], be two consecutive elements in the solution of a monitoring problem involving a stratified observation. Then, ∀δ′ ∈ Δi+1 ∃δ ∈ Δi (δ′ ⊇ δ). In other words, monotonic monitoring is a shrink-and-expand operation, where Δi is first shrunk and then the remaining candidates are possibly extended with additional faults, to make up Δi+1.
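As an illustration of condition (2), here is a minimal sketch (ours, not the authors' code) that checks whether a given fragmentation of an observation DAG is stratified. Since Fig. 2 is not reproduced here, the arc structure of Ō used below is an assumption, chosen so that N2 and N3 are temporally unrelated, consistent with Examples 1 and 3.

```python
def is_stratified(nodes, arcs, fragments):
    """Condition (2): each fragment must contain Unrl(N) for every node N
    it holds, where Unrl(N) is the set of nodes that are neither ancestors
    nor descendants of N in the observation DAG."""
    succ = {n: set() for n in nodes}
    for u, v in arcs:
        succ[u].add(v)

    def descendants(start):
        seen, stack = set(), [start]
        while stack:
            for m in succ[stack.pop()]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return seen

    desc = {n: descendants(n) for n in nodes}
    unrl = {n: {m for m in nodes
                if m != n and m not in desc[n] and n not in desc[m]}
            for n in nodes}
    return all(unrl[n] <= set(f) for f in fragments for n in f)

nodes = ["N1", "N2", "N3", "N4", "N5"]
arcs = [("N1", "N2"), ("N1", "N3"), ("N2", "N4"), ("N3", "N4"), ("N4", "N5")]
print(is_stratified(nodes, arcs, [["N1", "N3"], ["N2"], ["N4"], ["N5"]]))  # False
print(is_stratified(nodes, arcs, [["N1"], ["N2", "N3"], ["N4"], ["N5"]]))  # True
```

Under this assumed DAG, the fragmentation of Example 1 fails the check because Unrl(N3) = {N2} is not contained in the first fragment, while the stratification of Example 3 passes.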
REFERENCES
[1] P. Baroni, G. Lamperti, P. Pogliano, and M. Zanella, 'Diagnosis of large active systems', Artificial Intelligence, 110(1), 135–183, (1999).
[2] L. Console, C. Picardi, and M. Ribaudo, 'Process algebras for systems diagnosis', Artificial Intelligence, 142(1), 19–51, (2002).
[3] R. Debouk, S. Lafortune, and D. Teneketzis, 'Coordinated decentralized protocols for failure diagnosis of discrete-event systems', Journal of Discrete Event Dynamic Systems: Theory and Application, 10, 33–86, (2000).
[4] A. Grastien, M.O. Cordier, and C. Largouët, 'Incremental diagnosis of discrete-event systems', in Sixteenth International Workshop on Principles of Diagnosis – DX'05, pp. 119–124, Monterey, CA, (2005).
[5] G. Lamperti and M. Zanella, 'Diagnosis of discrete-event systems from uncertain temporal observations', Artificial Intelligence, 137(1–2), 91–163, (2002).
[6] G. Lamperti and M. Zanella, 'A bridged diagnostic method for the monitoring of polymorphic discrete-event systems', IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics, 34(5), 2222–2244, (2004).
[7] G. Lamperti and M. Zanella, 'Flexible diagnosis of discrete-event systems by similarity-based reasoning techniques', Artificial Intelligence, 170(3), 232–297, (2006).
[8] Y. Pencolé and M.O. Cordier, 'A formal framework for the decentralized diagnosis of large scale discrete event systems and its application to telecommunication networks', Artificial Intelligence, 164, 121–170, (2005).
[9] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D.C. Teneketzis, 'Failure diagnosis using discrete-event models', IEEE Transactions on Control Systems Technology, 4(2), 105–124, (1996).
[10] R. Su and W.M. Wonham, 'Global and local consistencies in distributed fault diagnosis for discrete-event systems', IEEE Transactions on Automatic Control, 50(12), 1923–1935, (2005).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-795
Distributed Repair of Nondiagnosability

Anika Schumann and Wolfgang Mayer and Markus Stumptner1

1 Introduction
Automated fault diagnosis has significant practical impact by improving reliability and facilitating maintenance of systems [1]. Given a monitor continuously receiving observations from a dynamic event-driven system, diagnosis algorithms infer possible fault events that explain the observations. For many applications, it is not sufficient to identify what faults could have occurred; rather, one wishes to know what faults have definitely occurred. Computing the latter requires diagnosability of the system, that is, the guarantee that the occurrence of a fault can be detected with certainty after a finite number of subsequent observations [2]. This paper defines a distributed framework that assists in assessing and improving the diagnosability of discrete-event systems. In this context, a system is diagnosable iff the presence or absence of each unobservable fault event can always be deduced once sufficiently many subsequent observable events have occurred. Otherwise, the system must be altered, for example by adding sensors, to make the ambiguous system behaviours distinguishable. Several past approaches deal with the problem of selecting sensor placements to ensure diagnosability of a system. However, the problem of computing an optimal sensor set with minimal size has a complexity exponential in the number of possible sensor placements [6]. Existing sensor placement algorithms are based on a global representation of the system, which may not be computable for large systems. In this paper we address the diagnosability problem in a distributed way by identifying those system behaviours that require modification to restore diagnosability. In fact, we show how to determine those subsystems whose modification is guaranteed to make the entire system diagnosable.
2 Diagnosability of discrete event systems
As in [2], we consider a discrete-event system G composed of components G1, . . . , Gn that are each modelled as a finite state machine (FSM). Here the transitions are partitioned into fault transitions and other locally unobservable transitions, transitions representing shared events that occur simultaneously in all concerned components, and observable transitions. A fault of the system is diagnosable iff its (unobservable) occurrence can always be deduced after a finite delay [2]. To decide diagnosability we use the twin plant approach presented in [3]. It computes for each component the interactive diagnoser G̃i, which gives the set of faults that can possibly have occurred for each sequence of observable and shared events. A local twin plant Ĝi is obtained by synchronising two instances of the diagnoser based on the observable events. Each path represents two indistinguishable system behaviours (i.e., two behaviours that emit the same sequence of observations). The twin plant states are partitioned into diagnosable and possibly nondiagnosable states [3]. A subset of the latter are the nondiagnosable states. A fault F is diagnosable in system G iff its global twin plant (GTP) Sync(Ĝ1, . . . , Ĝn) has no path with a cycle containing at least one observable event and one F-nondiagnosable state.² Such a path is called a critical path. Unfortunately, computing the GTP is prohibitively expensive for large systems. Our algorithm avoids scalability issues by computing nondiagnosable behaviours iteratively in a distributed approach, such that the global model need not be derived in many cases. We start with a set of twin plants of individual subsystems that characterise all paths that (may possibly) admit nondiagnosable behaviour (i.e., paths with a (possibly) nondiagnosable state). By composing individual models, behaviours that are infeasible or distinguishable in a larger subsystem are eliminated incrementally until (non)diagnosability can be decided or resource limits are reached. This work is an extension of the one presented in [3]. However, the latter can only verify diagnosability in a distributed way. In contrast, our approach allows one to confirm diagnosability and nondiagnosability given partial models of a system.

1 University of South Australia, Adelaide, Australia. Email: {schumann,mayer,mst}@cs.unisa.edu.au. This work was partially supported by the Australian Research Council under Discovery Grant DP0560183.
3 Distributed (non)diagnosability assessment
Our framework is based on the two properties below. Assume a set of twin plants Ĝ is created from a partition of a discrete-event system G. Then,

1. G is diagnosable if Ĝ is free of cycles that include an observable transition and a possibly nondiagnosable state, or if there is a twin plant in Ĝ where all states are diagnosable.
2. G is nondiagnosable if each plant in Ĝ includes a path to a possibly nondiagnosable state that does not have events shared with any other plant, and at least one of these paths has a cycle with a possibly nondiagnosable state and an observable transition.

The first property is derived from previous results on diagnosability [3]; a sketch of checking its cycle condition is given below. The correctness of the second one follows directly from the Sync operation: the synchronisation of the above paths from all twin plants results in a set of paths in the GTP, each containing an observable cycle with a possibly nondiagnosable state. Since every possibly nondiagnosable state in the GTP is nondiagnosable [3], such a synchronisation must contain a critical path and thus establishes the nondiagnosability of F.²
2 The result of Sync is an FSM whose state space is the Cartesian product of the state spaces of the components, and whose transitions are synchronised in that any shared event always occurs simultaneously in all components that define it.
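The following minimal sketch (ours; the data structures are hypothetical) checks the cycle condition of property 1 on one twin plant. It relies on the fact that a closed walk containing both an observable transition and a possibly nondiagnosable state exists iff some strongly connected component (SCC) of the plant contains both, so SCCs (computed here with Kosaraju's algorithm) suffice for the test.

```python
def free_of_critical_cycles(states, edges, possibly_nondiag):
    """'edges' are triples (src, dst, observable). Returns True if no cycle
    contains an observable transition and a possibly nondiagnosable state."""
    adj = {s: [] for s in states}
    radj = {s: [] for s in states}
    for u, v, _ in edges:
        adj[u].append(v)
        radj[v].append(u)

    seen = set()

    def dfs(u, graph, out):
        # Iterative DFS appending nodes in post-order.
        stack = [u]
        seen.add(u)
        while stack:
            node = stack[-1]
            nxt = next((w for w in graph[node] if w not in seen), None)
            if nxt is None:
                stack.pop()
                out.append(node)
            else:
                seen.add(nxt)
                stack.append(nxt)

    order = []
    for s in states:
        if s not in seen:
            dfs(s, adj, order)
    seen = set()
    comp = {}
    for s in reversed(order):  # second pass on the reversed graph
        if s not in seen:
            members = []
            dfs(s, radj, members)
            for m in members:
                comp[m] = s
    bad_sccs = {comp[s] for s in possibly_nondiag}
    # A critical cycle needs an observable edge inside an SCC that also
    # contains a possibly nondiagnosable state.
    return not any(obs and comp[u] == comp[v] and comp[u] in bad_sccs
                   for u, v, obs in edges)
```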
4 Algorithm
We use the above results to decide whether a system is (non)diagnosable. Starting with twin plants representing individual subsystems, our algorithm iteratively removes locally nondiagnosable paths by synchronising twin plants to form larger subsystems. In case formerly indistinguishable system behaviours become discriminable through observable events of the larger subsystem, the path is removed. Otherwise, the path remains dependent (i.e., it has events shared with other subsystems), but may become independent after further synchronisation. The aggregation of subsystems continues until either condition (1) or (2) is met, or until resources are exhausted. In the latter case, paths exist for which it is not known whether the system is indeed (non)diagnosable, and we return the locally nondiagnosable paths (a superset of the truly nondiagnosable subsystem) as an approximation. Hence, our approach exhibits certain anytime characteristics. Since the diagnosability problem is NP-hard, some systems may require the computation of the GTP to assess diagnosability. While we cannot avoid this intrinsic complexity, we stop with an approximate solution in case resource limits are insufficient to obtain the exact solution.
5 Inferring repair alternatives
If a system cannot be proved diagnosable, an over-approximation of possibly nondiagnosable subsystems is obtained, represented by their twin plants that contain possibly nondiagnosable paths. To ensure the overall system is diagnosable, certain transitions must be modified such that the potentially nondiagnosable paths cannot manifest in the revised model. We identify the relevant transitions using the following labelling scheme: every twin plant transition t is labelled with the set of transition identifiers comprising all those transitions that have been synchronised to obtain t. Every component transition (except fault transitions) is assigned a unique identifier label; the identifier label is propagated to the corresponding transition of the interactive diagnoser G̃i and, subsequently, to the corresponding transition in the twin plant Ĝi.

The FSMs in Figure 1 illustrate the labelling. Every transition t of the interactive diagnoser G̃i is labelled with the set of transition identifiers obtained from the transitions in Gi represented by t. In the twin plant, every shared transition corresponds to exactly one transition in the interactive diagnoser, and every observable transition refers to two transitions (one from the left and one from the right diagnoser). For shared transitions, the labelling is kept. For observable transitions, the identifier labels are obtained from the union of the two corresponding diagnoser transition labels.

Since the algorithm described above requires the synchronisation of twin plants, the transition identifiers for every twin plant Ĝ = Sync(Ĝ′, Ĝ″) must be determined. This label propagation is similar to the propagation described previously: every transition labelled by an event that only occurs in one of the twin plants Ĝ* ∈ {Ĝ′, Ĝ″} carries the same identifier as the unique corresponding transition in Ĝ*. Otherwise, the identifier for a transition in Ĝ is obtained as the union of the identifiers of the two corresponding transitions.

Through transition labels, those components where a behavioural modification would remove a cause of nondiagnosability can be identified. For instance, the critical paths shown in Figure 1(c) can be eliminated by changing the transition from x5 to x7, which may be accomplished by modifying either component transition t3 or t5 in Figure 1(a). A system designer might choose to do this by replacing one of the sensors emitting event o1 by one emitting a different event, thus changing the component's behaviour. Then the behaviours represented by the two transition sequences from a0 to a4 become distinguishable.

[Figure 1. Assignment of transition identifiers: (a) labelled component model; (b) labelled diagnoser; (c) labelled twin plant, where grey states denote nondiagnosable ones and white states denote diagnosable ones. Solid, dashed, and dotted lines denote observable, shared, and failure transitions, respectively.]
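The union rule can be made concrete with a small sketch (ours; the one-transition-per-event encoding is a simplifying assumption). It is the same rule that yields the combined label {t3, t5} on the observable o1 transition of Figure 1(c).

```python
def sync_labels(labels1, labels2, shared_events):
    """Identifier-label propagation for the Sync of two labelled plants:
    a transition on an event private to one plant keeps that plant's
    label set; a transition on a shared event takes the union of the
    two synchronised label sets. Each map sends event -> identifier set."""
    out = {}
    for labels in (labels1, labels2):
        for event, ids in labels.items():
            if event not in shared_events:
                out[event] = set(ids)
    for event in shared_events:
        out[event] = set(labels1[event]) | set(labels2[event])
    return out

print(sync_labels({"o1": {"t3"}}, {"o1": {"t5"}}, {"o1"}))  # {'o1': {'t3', 't5'}}
```

The labels surviving on a critical path then point directly to the component transitions (here t3 or t5) that a designer may modify.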
6 Conclusion and future work
We have outlined a distributed algorithm that ascertains (non)diagnosability of distributed event-driven systems. We have shown how to identify component behaviours and transitions that, if modified, render a system diagnosable. Our approach has two distinct features: first, our algorithm can find solutions of a whole system by operating on partitions thereof, and, second, an approximation is returned if computational resources to construct the entire system are not available. Diagnosability assessment and repair can be used to analyse physical and abstract systems such as distributed computing processes. Our work is particularly relevant for the latter, since assessing and designing monitoring capabilities of a system that are sufficient to allow compensation and reconfiguration to take place are active areas of research [4, 5]. As part of future work we intend to extend our approach to incorporate the costs for modifying the system and to explore a richer model of possible transition modifications, tailored to the analysis of distributed software systems.
REFERENCES
[1] G. Lamperti and M. Zanella, Diagnosis of Active Systems, Kluwer Academic Publishers, 2003.
[2] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis, 'Diagnosability of discrete event systems', IEEE Transactions on Automatic Control, 40(9), 1555–1575, (1995).
[3] A. Schumann and Y. Pencolé, 'Scalable diagnosability checking of event-driven systems', in IJCAI-07, pp. 575–580, (2007).
[4] Rajesh Thiagarajan, Markus Stumptner, and Wolfgang Mayer, 'Semantic web service composition by consistency-based model refinement', in The 2nd IEEE Asia-Pacific Service Computing Conference (APSCC 2007), pp. 336–343, Tsukuba, Japan, (December 2007).
[5] WSDIAMOND, 'WS-Diamond deliverable D5.1: Characterization of diagnosability and repairability for self-healing web services', Technical Report IST-516933, University of Torino and others, (April 2007).
[6] T. Yoo and S. Lafortune, 'On the computational complexity of some problems arising in partially-observed discrete-event systems', in American Control Conference, volume 1, pp. 307–312, (2001).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-797
From constraint representations of sequential code and program annotations to their use in debugging1

Mihai Nica and Franz Wotawa2

1 Introduction
Debugging, i.e., the detection, localization, and correction of bugs, has long been considered an important task in software engineering. A lot of research has been devoted to debugging, but mainly to fault detection. In this paper we focus on fault localization based on the constraint representation of programs. For this purpose, programs are converted into their equivalent constraint satisfaction problem (CSP). A solution of the corresponding CSP is a diagnosis candidate. Besides the source code, a failure-revealing test case has to be given. For more information regarding CSPs we refer to [2]. The work described in this paper is most closely related to the work of Ceballos et al. [9], where constraint programming is used for fault localization. Their approach requires that the programmer provides contracts, i.e., pre- and post-conditions, for every function. However, the authors do not investigate the complexity of solving the resulting problem or the scalability to larger programs. In particular, they do not consider structural decomposition or other methods for improving constraint solving, which would make the approach feasible. In order to complement previous research, we investigate the complexity of solving the CSP corresponding to a debugging problem that comprises the source code and the test case. In the past, in order to find problem classes which are tractable, much work has been done on the structural decomposition of CSPs. Gottlob et al. proposed the hypertree decomposition and showed that this decomposition method generalizes other important methods [3, 4]. The hypertree width, a characteristic of the structure of the constraint system, is a measure of the complexity of solving a CSP and, therefore, a measure of the complexity of the debugging problem. In other words, by performing a hypertree decomposition we can obtain a metric for the complexity of debugging.
2 Example

   { x ≥ 0 ∧ y ≥ 0 }     // PRE-CONDITION
1. i = 0;
2. r = 0;
3. while (i < x) {
      { r == i · y }     // INVARIANT
4.    r = r + y;
5.    i = i + 1;
   }
   { r == x · y }        // POST-CONDITION

Figure 1. A program for computing the product of two natural numbers
In this section we use a small example program to motivate fault localization using constraint-based reasoning with integrated annotations. For the program in Figure 1, assume that Line 3 is changed to 'while (i <= x) {', which leads to an obviously wrong implementation. If we are only interested in finding single faults at the statement level, we use the following process. Statement by statement, we go through the program and assume the current statement to be faulty. All other statements are considered to work as expected.
1 This research has been funded in part by the Austrian Science Fund (FWF) under grant P20199-N15 and by the FIT-IT research project Self Properties in Autonomous Systems (SEPIAS), which is funded by BMVIT and the FFG.
2 Technische Universität Graz, Institute for Software Technology, 8010 Graz, Inffeldgasse 16b/2, Austria, {mihai.nica,wotawa}@ist.tugraz.at. Authors are listed in alphabetical order.
When assuming a statement to be faulty, we cannot derive a value for the variables defined in that statement. A variable is said to be defined within a statement if a value is assigned to the variable. Such a semantics for faulty statements is implemented in previous model-based diagnosis approaches to debugging, e.g., in [5]. We now assume that Line 1 of the multiplication program behaves faultily. In this case the variable i in Line 1 is assigned the undefined value ?, that is: 1. i = ?;. Because of this change, we are not able to decide whether the condition in Line 3 evaluates to true or false. Hence, no values for r or i can be determined and, finally, we cannot contradict the expected value. As a consequence, Line 1 is a valid diagnosis according to model-based diagnosis [6]. The same happens when assuming Line 2 to be faulty. In this case r has no value assigned. From the other information we know that the sub-block of the while is executed once. Hence, we obtain the following equation, where the available information is given in parentheses:

4.  { r = ? ∧ y = 2 }
    r = r + y;
    { r = 0 }
This equation can be solved by setting the value of r (before executing the statement) to -2, which does not contradict the value ?. A similar situation occurs for the other statements and, hence, there is no way of excluding even a single statement from the list of possible bug candidates. This inability to exclude statements is due to missing information. In order to overcome this problem, we have to combine verification information and debugging. For this purpose we consider program annotations, which can also be used for verification based on Hoare's calculus, like the ones given in Figure 1. When now using the same procedure for finding single faults, only Lines 1 and 3 remain as diagnosis results. We now prove that Line 2 can be excluded; it is easy to see that the same argument applies to Lines 4 and 5 as well. If assuming Line 2 to be faulty, we obtain the following equation:

4.  { r == i · y ∧ i == 1 ∧ x == 0 ∧ y == 2 }
    r = r + y;
    { r == 0 }
From i == 1 and, consequently, i == 0 before the increment, we derive r == 0 before executing the statement. Hence, we obtain r to be 2 after the execution, which contradicts the expected value of r. Statement 2 is no single-fault diagnosis anymore. This simple example shows that the integration of verification information based on program annotations really improves debugging. Hence, a representation of programs and their annotations as constraints, together with a constraint solver, can be used to check the correctness assumptions of program statements.
3 Debugging process
The whole conversion algorithm of programs into their equivalent CSP representation and its use in debugging is described in [10]. We only briefly discuss the overall diagnosis process, which comprises the following steps:

1. Remove loops: The first step is to remove all while statements and recursive function calls by 'unrolling'. For this purpose a while statement is converted into a nested if-statement. A similar procedure is applied to recursive functions. Since the maximum number of iterations is known for a given test case, the resulting loop-free program behaves in the same way as the original program.
2. SSA conversion: In the second step, the loop-free program is converted into its static single assignment (SSA) form, in which every variable is defined exactly once (see the sketch below). For more information regarding the SSA form we refer to [1]. In this step the assertions are also converted.
3. The CSP's hyper-tree: From the SSA form we build the constraint system and its corresponding hyper-tree. This is done by mapping every program variable to its corresponding constraint variable. Every assignment is mapped directly to a constraint. The behavior of the constraints is given by the semantics of the corresponding statements.
4. Diagnosis: In the diagnosis step, we use the resulting CSP and the given test case directly for solving the obtained debugging problem. For this purpose we use the TREE* algorithm [7]. The algorithm requires an acyclic CSP, which can be obtained by applying, for example, hyper-graph decomposition [3, 4] or other decomposition methods. The combination of TREE* and a decomposition method is described in [8].

When using this debugging process, the complexity of debugging is equivalent to the complexity of solving a CSP. [4] states that the complexity of solving a CSP is related to the hyper-tree width of the CSP as follows: the time needed to find a solution for a CSP with n variables as input and a corresponding hyper-tree width of k is in the worst case O(n^k log n). Hence, knowing the hyper-tree width of the CSPs of programs is important in practice. In Figure 2 we report first results regarding the hyper-tree width of some small programs comprising while- and if-statements. The figure lists the lines of code (LOC), the lines of code of the corresponding SSA form (LOC2), the number of while-statements (#W), the number of if-statements (#I), the number of considered iterations (#IS), and the hyper-tree width (HW) for each program. The hyper-tree width of the programs varies from 3 to more than 30, which indicates that computing diagnosis candidates is a complex task when relying on the CSP representation of programs. Another important issue is that the hyper-tree width increases when the number of considered iterations (during the unrolling step) increases. Whether there is an upper bound or not is still an open issue.
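As an illustration of step 2, here is a minimal sketch (ours; it handles straight-line code only, omitting the if-conversion needed for unrolled loops) that renames variables so that each is defined exactly once:

```python
def to_ssa(statements):
    """Convert straight-line assignments like ('r', 'r + y') into SSA form,
    where every variable is defined exactly once."""
    version = {}  # current SSA version of each program variable

    def rename(expr):
        # Replace each variable occurrence by its current SSA name.
        return " ".join(f"{tok}_{version[tok]}" if tok in version else tok
                        for tok in expr.split())

    ssa = []
    for var, expr in statements:
        rhs = rename(expr)                     # uses refer to old versions
        version[var] = version.get(var, -1) + 1
        ssa.append((f"{var}_{version[var]}", rhs))
    return ssa

# One unrolled iteration of the multiplication program:
for lhs, rhs in to_ssa([("i", "0"), ("r", "0"), ("r", "r + y"), ("i", "i + 1")]):
    print(f"{lhs} == {rhs}")
# i_0 == 0, r_0 == 0, r_1 == r_0 + y, i_1 == i_0 + 1
```

Each resulting equation then maps directly to one constraint in step 3.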
Name              LOC   LOC2   #W   #I   #IS    HW
BinSearch          27     40    1    3     1     3
BinSearch          27    112    1    3     4     8
Binomial           76     82    5    1     1     3
Binomial           76   1155    5    1    30   ≥30
Hamming            27     62    5    1     1     2
Hamming            27    989    5    1    10   ≥14
Huffman            64     78    4    1     1     2
Huffman            64    342    4    1    20   ≥12
whileTest          60     88    4    0     1     2
whileTest          60    376    4    0     9     8
Permutation        24     41    3    1     1     3
Permutation        24    119    3    1     7     6
Permutation        24   1231    3    1   100     6
Adder              63     70    0    5     –     3
SumPowers          21     33    2    1     1     2
SumPowers          21    173    2    1    15    10
SumPowers          21   1376    2    1   100    10
IscasC432         162    162    0    0     –     9
ComplexHypertree   12     30    1    0     1     3
ComplexHypertree   12    370    1    0    30    17
ComplexHypertree   12   1076    1    0   100    17

Figure 2. The hyper-tree width for different sequential programs
4 Conclusions
In this paper we discussed the compilation of programs into their equivalent CSP representation and its use for fault localization. Assertions like pre- and post-conditions or invariants can be easily integrated. Moreover, CSP solvers can be used directly for debugging. Solving CSPs also depends on their structural properties; the structural properties of the CSP corresponding to a given problem are an indicator of the complexity of program debugging. In this paper, we give first results on debugging complexity in terms of hyper-tree width. The results show that debugging requires a lot of computational resources.
REFERENCES
[1] Marc M. Brandis and H. Mössenböck. Single-pass generation of static single assignment form for structured languages. ACM TOPLAS, 16(6):1684–1698, 1994.
[2] Rina Dechter. Constraint Processing. Morgan Kaufmann, 2003.
[3] Georg Gottlob, Nicola Leone, and Francesco Scarcello. On tractable queries and constraints. In Proc. DEXA 1999, Florence, Italy, 1999.
[4] G. Gottlob, N. Leone, and F. Scarcello. A comparison of structural CSP decomposition methods. AI, 124(2):243–282, 2000.
[5] Wolfgang Mayer, Markus Stumptner, Dominik Wieland, and Franz Wotawa. Can AI help to improve debugging substantially? Debugging experiences with value-based models. In ECAI, pages 417–421, Lyon, France, 2002.
[6] Raymond Reiter. A theory of diagnosis from first principles. AI, 32(1):57–95, 1987.
[7] Markus Stumptner and Franz Wotawa. Diagnosing tree-structured systems. AI, 127(1):1–29, 2001.
[8] M. Stumptner and F. Wotawa. Coupling CSP decomposition methods and diagnosis algorithms for tree-structured systems. In Proc. 18th IJCAI, pages 388–393, Acapulco, Mexico, 2003.
[9] R. Ceballos, R. M. Gasca, C. Del Valle, and D. Borrego. Diagnosing errors in DbC programs using constraint programming. Lecture Notes in Computer Science, vol. 4177, pages 200–210, 2006.
[10] Paper waiting to be reviewed by Informatica. http://www.informatica.si/
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-799
Compressing Binary Decision Diagrams

Esben Rune Hansen1 and S. Srinivasa Rao2 and Peter Tiedemann3

Abstract. The paper introduces a new technique for compressing Binary Decision Diagrams in those cases where random access is not required. Using this technique, compression and decompression can be done in linear time in the size of the BDD, and compression will in many cases reduce the size of the BDD to 1–2 bits per node. Empirical results for our compression technique are presented, including comparisons with previously introduced techniques, showing that the new technique dominates on all tested instances.
1 Introduction
In this paper we introduce a technique for compressing binary decision diagrams for those cases where random access to the compressed representation is not needed. The two primary areas in which decision diagrams are used in practice are verification and configuration. In both of these areas it is sometimes important to store binary decision diagrams using as little space as possible, but without the need for random access. Primarily, the need for such compression arises when it is necessary to transmit binary decision diagrams across communication channels with limited bandwidth. In the area of verification this need arises, for example, when using a networked cluster of computers to perform a distributed compilation of a binary decision diagram [1]. A similar exchange of BDD data takes place in distributed configuration, as described in [11]. In such approaches the fact that the network bandwidth is much lower than the memory bandwidth can become a major bottleneck, as computers stall waiting to receive data to process. Transmitting the binary decision diagrams in a compressed representation can help alleviate this problem. A full version of this paper is available at [4].

Related work. The only previous work we are aware of for compressing BDDs for offline storage is the work by Starkey and Bryant [9] and by Mateu and Prades-Nebot [7], which describes techniques for image compression using BDDs. The latter includes a nontrivial encoding algorithm for storing the BDD. Kieffer et al. [5] give theoretical results for using BDDs for general data compression, including a technique for storing BDDs.

Preliminaries. For a definition of BDDs please see [2]. We denote a given BDD as G(V, E) and use Elow and Ehigh to denote the sets of low and high edges, respectively. We use l(u) to denote the layer in which a node u is located. An edge (u, v) such that l(u) + 1 < l(v) is called a long edge and is said to skip layers l(u)+1 to l(v)−1. The length of an edge (u, v) is defined as l(v) − l(u). A layer ordering idl : V → {1, . . . , |V|} of the nodes in a layered DAG G(V, E) rooted in r is the ordering of V layer by layer in increasing order of the layer.
1 IT-University of Copenhagen
2 MADALGO, Aarhus University, Denmark
3 IT-University of Copenhagen
Nodes at the same layer are ordered as they are visited by a DFS in the DAG starting at r and traversing left edges prior to right edges. We refer to idb(v) and idl(v) as "the BFS id of v" and "the layer id of v", respectively.

Lemma 1. Every binary tree can be unambiguously encoded using 2 bits per node.

To achieve such an encoding, each node v is encoded using two bits: the first bit is true iff v contains a left child, and the second bit is true iff v contains a right child. In order to make decoding possible, the order in which the children of already decoded nodes appear in the encoded data must be known.
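The following minimal sketch (ours) illustrates Lemma 1, fixing BFS order as the assumed emission order for the two-bit records; any agreed-upon order would do:

```python
from collections import deque

def encode_tree(root, left, right):
    """Lemma 1: two bits per node, emitted in BFS order;
    bit 1 = 'has left child', bit 2 = 'has right child'."""
    bits, q = [], deque([root])
    while q:
        v = q.popleft()
        for child in (left.get(v), right.get(v)):
            bits.append(child is not None)
            if child is not None:
                q.append(child)
    return bits

def decode_tree(bits):
    """Rebuild the tree shape; nodes are renumbered in BFS order."""
    left, right, q, nxt, i = {}, {}, deque([0]), 1, 0
    while q:
        v = q.popleft()
        if bits[i]:
            left[v] = nxt; q.append(nxt); nxt += 1
        if bits[i + 1]:
            right[v] = nxt; q.append(nxt); nxt += 1
        i += 2
    return left, right
```

Since the decoder consumes exactly two bits per dequeued node, the tree shape is rebuilt unambiguously.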
2 The compression technique
Our compression technique can be summarized by the following steps:

1. Build a spanning tree on the BDD (Section 2.1).
2. Encode the edges in the spanning tree, using Lemma 1.
3. Encode by one bit the order in which the two terminals appear in the spanning tree.
4. Encode the lengths of the edges in the spanning tree where necessary (Section 2.1).
5. Encode the edges that are not in the spanning tree (Section 2.2).
6. Compress the resulting data using standard compression techniques.
2.1 The spanning tree
We will construct a spanning tree with a minimum number of long edges. For each node v in the BDD with parents u1, . . . , uk, we add the edge (uj, v) that minimizes l(v) − l(uj) to the spanning tree; see the sketch below. This ensures a spanning tree with a minimal number of long edges. In the following, an edge is called a tree edge if it is contained in the spanning tree and a nontree edge otherwise.

Encoding the lengths of the tree edges. The spanning tree is stored as a binary tree in which all edges have the same length. Since some of the edges in the spanning tree may correspond to long edges in the BDD, the binary tree itself is not sufficient to reconstruct the layer information during decoding. We therefore encode the location and the length of each long edge that is included in the spanning tree. The location of a long edge (u, v) is uniquely specified by the BFS order of the end point of the edge, that is, idb(v). To encode the locations of the long edges (u1, v1), . . . , (uk, vk), we output a bitvector of length |V| in which entries idb(v1), . . . , idb(vk) are true and all other entries are false.
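A minimal sketch of the spanning-tree construction (ours; the parent lists and layer map are hypothetical inputs):

```python
def spanning_tree_min_long_edges(parents, layer):
    """For each non-root node v with parent list parents[v], keep the
    incoming edge minimising l(v) - l(u). Exactly one parent edge is
    kept per node, so the result is a spanning tree, and choosing the
    shortest edge per node minimises the number of long edges."""
    return {v: min(us, key=lambda u: layer[v] - layer[u])
            for v, us in parents.items()}

# Hypothetical 4-node example: layer of each node and its parent list.
layer = {"r": 1, "a": 2, "b": 3, "t": 4}
parents = {"a": ["r"], "b": ["r", "a"], "t": ["a", "b"]}
print(spanning_tree_min_long_edges(parents, layer))
# {'a': 'r', 'b': 'a', 't': 'b'} - the long edges r->b and a->t are left out
```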
2.2 Encoding nontree edges
When the spanning tree and the layer information are encoded, we only need to encode the nontree edges, that is, those edges in the BDD that are not contained in the spanning tree. It is easy to see that there are |E|/2 + 1 tree edges (when |V| > 3), leaving |E|/2 − 1 nontree edges. With access to the spanning tree with restored layer information, and given the fact that every BDD node except the terminals has two children, the starting points of the nontree edges are known. The end-point of a nontree edge is called an incomplete child. We define S as the sequence of incomplete children appearing in layer order of their parents, and idl(S) as the corresponding sequence of layer ids. Below we describe three encodings of nontree edges which combine to encode all the nontree edges.

Incomplete children with large in-degree. Standard compression techniques excel at compressing sequences with high redundancy. We note that nodes with in-degree d will appear d − 1 times in the sequence of nontree edges. Hence standard compression will efficiently compress those nontree children that have a high in-degree, if they are separated from the nodes that have a low in-degree. We split S into two disjoint subsequences, H and L, the first containing those incomplete children that have an in-degree larger than a specified threshold, the latter containing the rest. Based on H we construct the sequence of integers S^H over the sequence of nodes v1, . . . , v|V| in S by encoding vi ∈ H as idl(vi) and vi ∈ L as 0. By 0s we indicate the incomplete children that are not among the incomplete children with high in-degree. The remaining incomplete children, L, we encode separately, as described in the next two paragraphs.

Incomplete children with small in-degree. To encode L we will exploit the fact that the sequence of integers in idl(L) will in most instances tend to be increasing (this behavior is analysed in more detail in the full version [4]). We exploit this fact by encoding the sequence idl(L) using delta coding:

Definition 2 (Delta Coding). Consider any sequence of integers (i1, . . . , ik) ∈ ℤ^k for any k ∈ ℕ. We define the delta coding of (i1, . . . , ik) by Δ(i1, . . . , ik) = (i1, i2 − i1, i3 − i2, . . . , ik − ik−1).

Long forward edges. A nontree edge (u, v) is a forward edge if u is an ancestor of v in the spanning tree. Any forward edge (u, v) in the graph with length k can be unambiguously decoded from idl(v) and k. We label each node v with the number of long edges that end in v. We then write the lengths of the long edges, ordered by their end-points. We introduce a threshold on the number of long forward edges to control the use of this approach. If the threshold is not exceeded, all long forward edges are instead encoded as described above.
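A minimal round-trip sketch of Definition 2 (ours): the encoder stores the first value and then successive differences, which stay small for the nearly increasing layer-id sequences this encoding targets, making them easy prey for the final standard-compression pass.

```python
def delta_encode(seq):
    """Delta coding of Definition 2: first value, then differences."""
    return [seq[0]] + [b - a for a, b in zip(seq, seq[1:])] if seq else []

def delta_decode(deltas):
    """Inverse of delta_encode: running prefix sums."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

layer_ids = [3, 4, 4, 7, 9, 12]
assert delta_decode(delta_encode(layer_ids)) == layer_ids
print(delta_encode(layer_ids))  # [3, 1, 0, 3, 2, 3]
```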
3 Experiments
In this section we provide empirical results from compressing a large set of BDDs from various sources using the new encoder described in this paper and the encoders from [7] and [5]. We also provide results for a naive encoder, which outputs the size of each layer followed by a list of children. Many of the instances we show results for are taken from the configuration library CLib [10]. We apply LZMA [8] to the output of all encoders to produce the final encoding. The Java source code used for these experiments (including a command-line encoder and decoder for BDDs in the BuDDy [6] file format) will be made available, along with all instances used in these experiments, at [3].
Conclusion. From the empirical results (Figure 1) we can see that the naive encoding, being compressed only by LZMA, is outperformed by a factor of up to 20. We also note that the new encoder is consistently able to perform as well as or better than the other encoders on all tested instances. In particular, the largest BDD in our test ("complex-P3") required about twice as much space when using either of the two other dedicated encoders.
Name            |V|       this paper  [7]    [5]    Naive
Product Configuration
renault         455798    0,90        126%   103%    402%
renault-dir     1392863   0,23        198%   214%   1352%
pc-CP           16496     0,76        220%   209%    788%
pc              3467      2,19        224%   211%    436%
Big-PC          356696    0,38        334%   266%   1345%
Big-PC-dir      1291600   0,17        260%   260%   2035%
Power Supply Restoration
complex-P3      2812872   0,44        243%   202%    951%
complex-P2      163432    1,16        181%   167%    541%
1-6+22-32       20937     1,89        136%   154%    413%
1-6+22-32-dir   61944     0,99        135%   161%    606%
Fault Trees
isp9607         228706    0,63        389%   204%    873%
isp9605         4570      3,30        130%   145%    305%
chinese         3590      2,06        214%   160%    450%
Combinatorial
5x27queens      562764    4,33        108%   109%    204%
13x13rook       76808     3,56        210%   165%    311%
8x8rook         1339      6,03        140%   139%    277%
8x8queen-dir    2453      2,17        115%   178%    374%
8x8queen        879       4,29        114%   138%    332%
Multipliers
mult-mix-10     42468     9,92*       114%   107%    169%
mult-apart-10   31260     8,07*       120%   124%    202%

Figure 1. Above are shown the name and node count of each of the instances tested. The result of the new encoder, in bits per node, is then shown, followed by the relative results of the rest of the encoders. The * indicates that delta coding was not used.
References
[1] P. Arunachalam, C. Chase, and D. Moundanos, 'Distributed binary decision diagrams for verification of large circuits', ICCD, 00, 365, (1996).
[2] Randal E. Bryant, 'Graph-based algorithms for boolean function manipulation', IEEE Transactions on Computers, 35(8), 677–691, (1986).
[3] Esben Rune Hansen, Srinivasa Rao, and Peter Tiedemann, 'BDD compression'. http://bddcompression.sourceforge.net.
[4] Esben Rune Hansen, Srinivasa Rao, and Peter Tiedemann. Compressing binary decision diagrams, 2008. http://arxiv.org/abs/0805.3267v1.
[5] J. Kieffer, P. Flajolet, and E.-h. Yang, 'Universal lossless data compression via binary decision diagrams', in Proceedings of ISIT 2000, (2000).
[6] J. Lind-Nielsen, 'BuDDy – A Binary Decision Diagram Package'. http://sourceforge.net/projects/buddy, online.
[7] P. Mateu-Villarroya and J. Prades-Nebot, 'Lossless image compression using ordered binary-decision diagrams', Electronics Letters, 37, 162–163, (2001).
[8] Igor Pavlov. 7z LZMA SDK. http://www.7-zip.org/sdk.html.
[9] M. Starkey and R. Bryant. Using ordered binary-decision diagrams for compressing images and image sequences, 1995.
[10] Sathiamoorthy Subbarayan. CLib: configuration benchmarks library. http://www.itu.dk/research/cla/externals/clib.
[11] Peter Tiedemann, Tarik Hadzic, Stuart Henney, and Henrik Reif Andersen, 'Interactive distributed configuration', in Proceedings of CP 2006, pp. 761–765, Springer-Verlag Berlin Heidelberg, (2006).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-801
Dependent Failures in Consistency-based Diagnosis
Jörg Weber and Franz Wotawa
1 Introduction
Model-based diagnosis (MBD) approaches which follow the consistency-based diagnosis paradigm [3, 2] usually assume that components fail independently; i.e., that any abnormal behavior of a component is the consequence of an internal fault. Although some researchers have acknowledged that components may fail dependently, there are very few works which have addressed this issue. In [5] we presented an approach for diagnosing dependent failures in the hardware-software hybrid system of a mobile autonomous robot, and in [4] we described a formalization of this approach. This paper proposes an improved technique for modeling failure dependencies. We formalize the semantics of our model, and we propose a troubleshooting strategy for systems with dependent failures. With the term dependent failure we denote cascades of failures which happen when a component, the cause of the cascade (CoC), fails due to an internal fault and when this failure causes the failure of other components as well. That is, in systems with dependent failures it may happen that some components suffer from persistent damage after unexpected occurrences at their inputs. In physical systems, phenomena like overvoltages, high pressure, heat, etc., may harm those components which have not been designed to sustain such contingencies. In most existing MBD approaches the independence assumption is reflected in at least two ways: first, the applied focusing criteria rely on it; second, the resulting multiple-fault diagnoses do not indicate any dependencies between the failures. Many approaches compute the (subset-)minimal diagnoses or the minimal cardinality diagnoses. However, in case of dependent failures those focusing criteria may miss failed components. Even though we cannot expect to find all component failures, we should at least seek to determine all possible causes of a cascade of failures (i.e., the CoCs). Furthermore, the obtained results should state the dependencies between the failures, since this information is often essential for a successful recovery of the system.
2 Discussion: Dependent Failures
Figure 2 depicts a circuit with two inputs sc1 and sc2, which are either on or off. They control the state of the switches. The filaments of the bulbs are either ok or broken. The voltage magnitudes uxx are modelled in a qualitative way; in particular, ubi = high indicates that the bulb Bi is exposed to a voltage which exceeds the range it was designed for (e.g., > 230 V). The logical system description SD, which (as usual) captures the nominal behavior of components using the predicate AB, denoting "abnormal", is depicted in Fig. 1.
1 This research has been funded in part by the Austrian Science Fund (FWF) under grant P20199-N15.
2 Institute for Software Technology, Technische Universität Graz, Austria, email: {jweber,wotawa}@ist.tugraz.at
¬AB(V) → (uv = norm)
¬AB(R) ∧ (uv = x) → (us = x), x ∈ {zero, norm, high}
AB(R) → (us = low)
¬AB(Si) ∧ (sci = on) → (ubi = us)
¬AB(Si) ∧ (sci = off) → (ubi = zero)
¬AB(Bi) ↔ (fili = ok)   [fil ... "filament"]
(fili = ok) ∧ (ubi ≠ zero) ↔ (lighti = on)
(us = low) ∧ (ubi = low) → ⊥
(us = norm) ∧ (ubi = high) → ⊥

Figure 1. System Description (SD) for the system in Fig. 2
Figure 2. Circuit with a voltage source V , a resistor R, two switches S1 and S2 , and two bulbs B1 and B2 . The variables uxx denote voltages.
If every component works correctly and both system inputs sc1 and sc2 are on, then all voltages in the model have the value norm, and the two bulbs light. Now suppose that V fails in a way s.t. it produces a voltage significantly higher than expected, i.e., uv = high. Clearly, this will eventually destroy the bulbs, as the lifespan of a filament strongly decreases with higher voltage. Hence, a fault in V may be the cause of fili = broken, and consequently it may be the cause of AB(Bi). In such a case, V is the CoC, the cause of the cascade of failures. Note that, if the ultimate purpose of diagnosis is repair, then a bulb with a broken filament should always be regarded as abnormal. As usual, let SD be the logical system description, COMP the set of components, and OBS a set of observations [3]:

Definition 2.1 A diagnosis for (SD, COMP, OBS) is a set Δ ⊆ COMP s.t. SD ∪ OBS ∪ {AB(c) | c ∈ Δ} ∪ {¬AB(c) | c ∈ COMP \ Δ} is consistent. Δ is (subset-)minimal iff no proper subset of it is a diagnosis.

For OBS = {sc1 = sc2 = on, light1 = light2 = off, fil1 = fil2 = broken}, we obtain the minimal diagnosis Δ = {B1, B2}. If we attempt to repair the system by replacing both bulbs, the new bulbs will soon fail again, as the actual cause V remains faulty. Our approach generates the failure cascade hypothesis HV,Δ = {DF(V, B1), DF(V, B2)}, meaning that V is the CoC of the minimal diagnosis Δ (note that V ∉ Δ) and that B1 and B2 have failed in dependence of V; the DF predicate denotes "dependent failure".
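As an illustration of Definition 2.1, the following Python sketch enumerates the subset-minimal consistency-based diagnoses of the circuit by brute force. The encoding of SD as Python constraints is our own simplification (the bulb lights iff its filament is ok and its voltage is non-zero), not code from the paper.

```python
from itertools import product, combinations

COMPS = ["V", "R", "S1", "S2", "B1", "B2"]
VOLT = ["zero", "low", "norm", "high"]

def consistent(ab, obs):
    """Is SD ∪ OBS ∪ {AB(c) | c in ab} ∪ {¬AB(c) | c not in ab} satisfiable?
    Brute force over the voltages uv, us, ub1, ub2."""
    for uv, us, ub1, ub2 in product(VOLT, repeat=4):
        if "V" not in ab and uv != "norm":
            continue                                    # ¬AB(V) -> uv = norm
        if "R" not in ab and uv in ("zero", "norm", "high") and us != uv:
            continue                                    # ¬AB(R) ∧ uv = x -> us = x
        if "R" in ab and us != "low":
            continue                                    # AB(R) -> us = low
        ok = True
        for i, ub in ((1, ub1), (2, ub2)):
            sw, bulb = f"S{i}", f"B{i}"
            if sw not in ab and obs[f"sc{i}"] == "on" and ub != us:
                ok = False; break                       # ¬AB(S) ∧ on -> ub = us
            if sw not in ab and obs[f"sc{i}"] == "off" and ub != "zero":
                ok = False; break                       # ¬AB(S) ∧ off -> ub = zero
            fil = "ok" if bulb not in ab else "broken"  # ¬AB(B) <-> fil = ok
            light = "on" if fil == "ok" and ub != "zero" else "off"
            if obs.get(f"fil{i}", fil) != fil or obs.get(f"light{i}", light) != light:
                ok = False; break
            if (us == "low" and ub == "low") or (us == "norm" and ub == "high"):
                ok = False; break                       # the two ⊥ sentences
        if ok:
            return True
    return False

def minimal_diagnoses(obs):
    diags = []
    for k in range(len(COMPS) + 1):
        for delta in map(set, combinations(COMPS, k)):
            if not any(d <= delta for d in diags) and consistent(delta, obs):
                diags.append(delta)
    return diags

OBS = {"sc1": "on", "sc2": "on", "light1": "off", "light2": "off",
       "fil1": "broken", "fil2": "broken"}
print(minimal_diagnoses(OBS))   # -> [{'B1', 'B2'}]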
It can be seen that HV,Δ corresponds to a non-minimal diagnosis Δ′ = {V, B1, B2}, which is a superset of Δ. Moreover, HV,Δ also indicates the failure dependencies, i.e., the causal order of the cascade of failures. This is very important, as in many systems this causal order also influences the order in which the components must be repaired; here, V should be repaired before the bulbs are replaced. If we have full observability and OBS = {uv = high, fil1 = fil2 = broken, . . .}, we obtain {V, B1, B2} as the minimal diagnosis. Although this result comprises all components which have actually failed, it still does not indicate the failure dependencies. The discussions above show that, if knowledge about possible failure dependencies exists, the diagnosis can be improved by an approach which takes those dependencies into account and which is also able to provide results which state the causal order of failures. Moreover, it should be possible to logically refute those failure cascade hypotheses which are inconsistent with the observations.
3 Failure Cascade Hypotheses
We propose to capture the possible failure dependencies in a cascading failure graph (CFG), a model separate from the system description SD:

Definition 3.1 A system is a tuple (SD, CFG, COMP).

The intended usage of SD is as usual in MBD, and it may also specify faulty behavior [2]. CFG is a directed acyclic graph (DAG) whose nodes are conjunctions of literals. Each node contains at most one AB literal, and this literal must be positive. Each edge is labelled with an abstracted condition symbol α [1]. The CFG is a partial description of what happens in the course of a cascade of failures. It is a causal model whose edges represent MAY relationships, similar to those in [1]. The abstracted condition symbols abstract from the actual conditions, which may be very complex or even unknown. Figure 3(a) depicts a very abstract CFG for the circuit. Intuitively, the model in this figure indicates that AB(V) may cause AB(B1) and/or AB(B2). A refined model is depicted in Fig. 3(b).
Figure 3. Two cascading failure graphs (CFGs) for the system in Fig. 2: (a) a simple CFG; (b) a refined CFG.
A component ci,1 may directly cause the dependent failure of a component ci,k if there is a direct dependency path from ci,1 to ci,k:

Definition 3.2 A path Si,1 →(αi,1) Si,2 →(αi,2) . . . →(αi,k−1) Si,k in the CFG is a dependency path from ci,1 to ci,k (k > 1) iff Si,1 contains AB(ci,1) and Si,k contains AB(ci,k). Moreover, if no AB literal occurs in any node Si,j with 1 < j < k, then it is a direct dependency path.

Definition 3.3 For every pair (ci,1, ci,k) of components with a direct dependency path from ci,1 to ci,k there is a dependency assumption DF(ci,1, ci,k), denoting that the failure of ci,1 has directly led to the failure of ci,k. The set of all dependency assumptions is Φ.

We assume that there is at most one direct dependency path between two components. In our example we have: Φ = {DF(V, B1), DF(V, B2)}. The cascading failure model (CFM) captures the semantics of a CFG. It is automatically generated:

Definition 3.4 The cascading failure model (CFM) is a set of logical sentences. It is created as follows. For each edge Si →(α) Sj, add Si ∧ α → Sj to CFM. Moreover, for every pair of components (ci,1, ci,k) with a direct dependency path Si,1 →(αi,1) . . . →(αi,k−1) Si,k, add DF(ci,1, ci,k) → Si,1 ∧ αi,1 ∧ . . . ∧ αi,k−1 to CFM.

It follows that CFM ∪ {DF(ci,1, ci,k)} |= AB(ci,1) ∧ AB(ci,k). In our example, CFM contains the following sentences:

AB(V) ∧ α1 → (uv = high)
(uv = high) ∧ α2 → (ub1 = high)
...
DF(V, B1) → AB(V) ∧ α1 ∧ α2 ∧ α4
...

Definition 3.5 A failure cascade hypothesis3 H ∈ 2^Φ is a set of dependency assumptions s.t. the following holds: if DF(c′, c) ∈ H, then there is no other component c′′ with DF(c′′, c) ∈ H.

Definition 3.6 Given a hypothesis H, a component c is a cause of a cascade (CoC) in H iff there is a component c′ s.t. DF(c, c′) ∈ H and there is no component c′′ with DF(c′′, c) ∈ H.

E.g., in H = {DF(V, B1), DF(V, B2)} there is only one CoC, namely V. In general, a hypothesis may have multiple CoCs. We introduce the notation Γ(H) to denote the set Γ(H) = {c | DF(c, ·) ∈ H or DF(·, c) ∈ H}. E.g., for H = {DF(V, B1), DF(V, B2)} we obtain Γ(H) = {V, B1, B2}.

Definition 3.7 A hypothesis H is consistent iff SD ∪ CFM ∪ OBS ∪ H ∪ {¬AB(c) | c ∈ COMP \ Γ(H)} ⊭ ⊥

Proposition 3.1 If a hypothesis H is consistent, then Γ(H) is a diagnosis.

We propose to focus on hypotheses which have a single CoC; i.e., we assume that all multiple failures have a single cause:

Definition 3.8 A σ-hypothesis Hc,Δ, which relates to a component c and a (non-empty) minimal diagnosis Δ, is a hypothesis which has only one CoC, namely c, and Γ(Hc,Δ) ⊇ Δ.
We propose to compute only those minimal diagnoses which may have a single cause, to generate σ-hypotheses for these diagnoses, and to check the consistency of the hypotheses. Our strategy is to seek (at least) one consistent σ-hypothesis for each possible cause of a minimal diagnosis. The reason behind this strategy is the observation that finding the ultimate cause of a cascade of failures is, in many domains, crucial for a successful repair of the system.
REFERENCES
[1] Luca Console, Daniele Theseider Dupré, and Pietro Torasso, 'A theory of diagnosis for incomplete causal models', in Proc. IJCAI, pp. 1311–1317, Detroit, (August 1989). Morgan Kaufmann.
[2] J. de Kleer, A. K. Mackworth, and R. Reiter, 'Characterizing diagnoses and systems', Artificial Intelligence, 56(2–3), 197–222, (1992).
[3] Raymond Reiter, 'A theory of diagnosis from first principles', Artificial Intelligence, 32(1), 57–95, (1987).
[4] Jörg Weber and Franz Wotawa, 'Diagnosing dependent failures - an extension of consistency-based diagnosis', in 18th International Workshop on Principles of Diagnosis (DX-07), Nashville, USA, (2007).
[5] Jörg Weber and Franz Wotawa, 'Diagnosing dependent failures in the hardware and software of mobile autonomous robots', in Proceedings of IEA/AIE 2007, Kyoto, Japan, (June 2007).
3 For brevity we will often simply use the term "hypothesis".
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-803
Cost-sensitive Iterative Abductive Reasoning with Abstractions
Gianluca Torta1 and Daniele Theseider Dupré2 and Luca Anselma1

1 Introduction

Several explanation and interpretation tasks, such as diagnosis, plan recognition and image interpretation, can be formalized as abductive reasoning. A number of approaches, including recent ones [1, 4], address the problem based on a task-independent representation of a domain which includes an ontology or taxonomy of hypotheses. In this paper we adopt a similar representation, but we also deal with abduction as an iterative process where, like in model-based diagnosis, further observations are proposed to discriminate among candidate explanations; in addition, we take into account costs of observations and actions. In fact, discrimination also involves refining hypotheses, but this is performed down to an appropriate level which depends on the cost of actions (e.g. repair actions or therapy) to be taken based on the results of abduction, and on the cost of additional observations, which should be balanced with the benefits, in terms of more suitable actions, of better discrimination. The presence of a domain representation with abstractions has a significant impact on this trade-off. In general, a better assessment of the situation at hand, based on additional observations, leads to a more focused action. However, the cost of observing the same phenomenon at different levels of abstraction may vary significantly; in fact, it could involve more or less costly medical or technical tests, or computationally complex image processing, possibly with additional costs due to the delay before taking an action. Moreover, the knowledge base could have been designed independently of the explanation/action task (e.g. diagnosis and repair), and could therefore include a detailed description of the domain which is not necessary for the task; more generally, the convenience of a detailed discrimination may depend on the specific case at hand. By explicitly considering abstractions in the iterative abduction process, we can often reduce the observation costs significantly, while maintaining the ability to exploit detailed observations and knowledge when convenient (similar advantages have been shown in inductive classification with abstractions, e.g. [6]). In the following, we first describe the knowledge we expect to be available. We then describe a basic iterative abduction loop and, finally, we concentrate on the criterion for selecting the next step in the loop: either performing a next observation at some level of detail, or stopping because the estimated most convenient choice is performing the action(s) associated with the current hypotheses.
2 Domain Representation

The basic elements of the domain model are a set of abducibles (atomic assumptions) A = {A1, . . . , An} and a set of manifestations
1 Università di Torino, Italy, email: {torta,anselma}@di.unito.it
2 Università del Piemonte Orientale, Italy, email: dtd@mfn.unipmn.it
M = {M1, . . . , Mm}. Each abducible Ai is associated with an IS-A hierarchy Λ(Ai) containing abstract values of Ai as well as their refinements at multiple levels; similarly, each manifestation Mj is associated with an IS-A hierarchy Λ(Mj). We assume that the direct refinements v1, . . . , vq of a value V in a hierarchy (either Λ(Ai) or Λ(Mj)) are mutually exclusive, and at most one of the leaf values in a hierarchy is true in each situation, i.e. we allow at most one instance for each abducible and observation; moreover, for each leaf value v of an abducible an a-priori probability p(v) is given. The hypotheses space S(A) for the abduction task is the set of all of the combinations γ of values drawn from one or more distinct hierarchies Λ(Ai), while the manifestations space S(M) is the set of all of the combinations ω of values drawn from distinct hierarchies Λ(Mj). The relationships between the abducibles and the manifestations are defined by the domain knowledge K ⊆ S(A) × S(M). Given an instance of manifestations ω ∈ S(M) and an instance of abducibles γ ∈ S(A), (γ, ω) ∈ K means that ω is a possible observation set corresponding to the hypothesis set γ. We associate costs with values of both abducibles and manifestations. Let C ∈ Λ(Ai) be a value belonging to the IS-A hierarchy of Ai; its cost ac(C) is the cost of the action to be taken when Ai takes value C (e.g. a repair action if Ai represents a component and C denotes one of its fault modes). Let c1, . . . , cq be the children of C in Λ(Ai), i.e. the possible refinements of value C. We assume that:

max({ac(c1), . . . , ac(cq)}) ≤ ac(C) ≤ Σ_{k=1}^{q} ac(ck)
i.e. the action that we take for a value C of Ai costs no less than the most expensive action for its refinements and no more than taking the actions for all of such refinements. As for the manifestations, let O ∈ Λ(Mj) be a value belonging to the IS-A hierarchy of Mj; its cost oc(O) is the cost of making the observation which refines value O into one of its children o1, . . . , oq in Λ(Mj). We can associate an action cost also with any instance γ = {C1, . . . , Cr} ∈ S(A) of abducibles simply as ac(γ) = Σ_{i=1}^{r} ac(Ci), i.e. we assume that independent actions are taken for each of the abducible values that appear in γ. With a slightly more complex computation we can also associate an action cost with a set of instances Γ = {γ1, . . . , γs} representing the cumulative action cost if Γ is the final set of explanations. For each abducible Ai s.t. (a value of) Ai appears at least in one γ ∈ Γ, we compute a new hierarchy Λ(Ai, Γ) by considering the portion of Λ(Ai) up to the least upper bound LUB(Ai, Γ) that covers all of the values of Ai that appear in Γ and by further removing from such a sub-tree all of the values that do not appear in Γ. In this way, it may happen that the cost ac(C) of a value C ∈ Λ(Ai, Γ) is larger than the sum of the costs ac(ck) of its children,
since not all of the children of C defined in Λ(Ai) need to appear in Λ(Ai, Γ). We therefore update (with a bottom-up computation) the ac costs in Λ(Ai, Γ) to new costs ac* in order to reestablish this property. The action cost of Γ is then computed just as:

ac(Γ) = Σ_{Ai ∈ Γ} ac*(LUB(Ai, Γ))
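The bottom-up cost update can be sketched as follows; hierarchies are plain children maps, and the caller is assumed to supply, for each abducible, the LUB and the node set of Λ(Ai, Γ) (all names are illustrative).

```python
# Hierarchies as children maps plus action costs; all names are illustrative.
class Hierarchy:
    def __init__(self, children, ac):
        self.children = children   # node -> list of child nodes
        self.ac = ac               # node -> action cost

def ac_star(h, node, keep):
    """Cost of `node` in the Γ-restricted hierarchy Λ(Ai, Γ): cap ac(node)
    by the sum of the updated costs of its kept children, re-establishing
    bottom-up the property ac(C) <= Σ_k ac(c_k)."""
    kids = [c for c in h.children.get(node, []) if c in keep]
    if not kids:
        return h.ac[node]
    return min(h.ac[node], sum(ac_star(h, c, keep) for c in kids))

def ac_gamma(restricted):
    """ac(Γ) = Σ over abducibles of ac*(LUB(Ai, Γ)); `restricted` lists,
    per abducible, the hierarchy, its LUB and the node set of Λ(Ai, Γ)."""
    return sum(ac_star(h, lub, keep) for h, lub, keep in restricted)

# If Γ mentions only c1 and c2 (c3 was pruned), ac(C) = 5 exceeds 2 + 2 and
# is capped to 4 by the bottom-up update.
h = Hierarchy({"C": ["c1", "c2", "c3"]}, {"C": 5, "c1": 2, "c2": 2, "c3": 2})
print(ac_gamma([(h, "C", {"C", "c1", "c2"})]))   # -> 4
```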
3 Iterative Abduction

We rely on the following generic loop for iterative explanation. Input is a set of values ωI = {O1, . . . , Ot} representing the initial observations, i.e. the values of a set of manifestations {M1, . . . , Mt} ⊆ M. Generate a set Γ of candidates (i.e. explanations of ωI).

loop
  O := NextStep(Γ);
  if O = STOP then exit
  else perform observation to refine O into one of its children ok;
       Γ := Update(Γ, ok)
end

That is, we assume that one or more initial observations are given; that there is a way to generate candidate explanations based on them (see below), and to update candidates based on additional observations; and we proceed with selecting and performing one observation at a time, which, of course, is in general suboptimal, as in [3, 2]. In this paper we aim at providing a general approach to the selection of the next step; we do not provide a general approach to candidate generation and update, which could involve a mix of abduction and consistency reasoning; its formulation would depend on the way K is represented. With hierarchies of abducibles, moreover, abstract as well as detailed assumptions may take part in explanations; a general criterion which is suitable in this setting is the preference for least presumptive explanations [5], which generalize minimal (wrt set inclusion) explanations: an explanation that (also based on the IS-A hierarchy) implies another explanation is not least presumptive. In the following we assume that the candidates computed at each iteration represent the least presumptive explanations of the observations collected so far.
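A minimal Python rendering of this loop, with candidate generation, next-step selection, update and observation supplied as callbacks (the callback API is an assumption of this sketch):

```python
def iterative_abduction(initial_obs, generate, next_step, update, refine):
    """Generic iterative-explanation loop; `generate`, `next_step`, `update`
    and `refine` are callbacks supplied by the concrete abduction engine."""
    candidates = generate(initial_obs)       # explanations of the initial obs
    while True:
        o = next_step(candidates)            # an observation to refine, or "STOP"
        if o == "STOP":
            return candidates                # act on the current hypotheses
        ok = refine(o)                       # perform it: o refined into child ok
        candidates = update(candidates, ok)
```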
4 Choosing the Next Step

Let Γ be the current candidate set and let OBS be the set of possible observations (including refinements of previous observations). In order to decide whether to stop or to proceed with a new observation O ∈ OBS, we select the minimum among:

• the action cost ac(Γ) associated with Γ
• for each O ∈ OBS, the estimated cost c(O), which is the sum of the cost oc(O) of observing O and the expected cost of the candidate set after observing O, i.e.:

c(O) = oc(O) + Σ_{k=1}^{q} p(ok) · c(Γk)

where Γ1, . . . , Γq are the possible candidate sets that would result by observing O and getting values o1, . . . , oq respectively, p(ok) is the probability of getting value ok (computed based on the current candidates Γ as in [3, 2]) and c(Γk) is the estimated cost of Γk as detailed in the following.

If ac(Γ) is the minimum among the costs, we stop; otherwise we observe the O with the smallest c(O). Let Γk = {γ1, . . . , γs} be one of the candidate sets involved in the above formula (note that each candidate γi may contain ground as well as abstract causes) and ac(Γk) be its action cost, i.e. the cost of stopping at Γk, which must be compared with the estimated cost of acting after a further discrimination and refinement. In principle, this estimation step would require simulating all the possible observation sequences and outcomes and, for each of them, assessing the point where it is convenient, on average, to stop and perform the actions; in order to avoid such an intractable search, we assume that the abductive process will continue as follows: first, one of the γi ∈ Γk will be isolated; then, γi is refined level by level, up to a point where performing an action is estimated to be convenient. Therefore the estimated cost of Γk is:

c(Γk) = min(ac(Γk), ic(Γk) + rac(Γk))

where ic(Γk) is the estimated cost of isolating a single γi ∈ Γk and rac(Γk) is the estimated additional refinement and action cost once some γi has been isolated. In this proposal, we estimate the cost ic(Γk) as follows:

ic(Γk) = Σ_{i=1}^{s} −p(γi) · log(p(γi)) · oc(γi)
where −log(p(γi)) is the estimated number of observations needed for isolating γi [3] and oc(γi) is an estimate of the cost of a single observation3. The cost rac(Γk) of refining its members γi = {Ci,1, . . . , Ci,ri} until an action is taken is estimated by:

rac(Γk) = Σ_{i=1}^{s} p(γi) · Σ_{j=1}^{ri} c(Ci,j)
where c(Ci,j ) is the estimated cost associated with Ci,j . In case action costs do not depend on the current context, each cost c(Ci,j ) can be pre-computed offline with a bottom-up visit of the taxonomies of the causes. In this proposal we have adopted a formula similar to the one for c(Γk ), i.e.: c(Ci,j ) = min(ac(Ci,j ), ic(Ci,j ) + rac(Ci,j )) where ic(Ci,j ) is the estimated cost of isolating a single child of Ci,j in the hierarchy and rac(Ci,j ) is the estimated additional refinement and action cost once some child of Ci,j has been isolated4 .
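Under the stated context-independence assumption, the bottom-up pre-computation of c(Ci,j) can be sketched as a simple recursion; the concrete instantiation of ic and rac below mirrors the Γk formulas (with p taken as the probability of a child given its parent) and is our reading, not a formula stated in the paper.

```python
import math

def c_cost(node, children, ac, oc, p):
    """Offline bottom-up estimate c(C) = min(ac(C), ic(C) + rac(C)).
    `p[k]` is the probability of child k given its parent (normalized over
    siblings); the ic/rac instantiation mirrors the Γk formulas."""
    kids = children.get(node, [])
    if not kids:                # leaf: c(C) is just the action cost (footnote 4)
        return ac[node]
    ic = sum(-p[k] * math.log(p[k]) * oc[node] for k in kids if p[k] > 0)
    rac = sum(p[k] * c_cost(k, children, ac, oc, p) for k in kids)
    return min(ac[node], ic + rac)

# Refining "fault" (entropy-weighted observation cost plus expected refined
# action cost, about 4.19) beats acting abstractly at cost 10.
children = {"fault": ["electrical", "mechanical"]}
ac = {"fault": 10.0, "electrical": 3.0, "mechanical": 4.0}
oc = {"fault": 1.0}
p = {"electrical": 0.5, "mechanical": 0.5}
print(c_cost("fault", children, ac, oc, p))
```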
REFERENCES
[1] Ph. Besnard, M.-O. Cordier, and Y. Moinard, 'Ontology-based inference for causal explanation', in Knowledge Science, Engineering and Management, 2nd Int. Conf., LNCS 4798, pp. 153–164, (2007).
[2] L. Console, D. Theseider Dupré, and P. Torasso, 'Introducing test theory into abductive diagnosis', in Proc. 10th Int. Work. on Expert Systems and Their Applications, pp. 111–124, Avignon, (1990).
[3] J. de Kleer and B.C. Williams, 'Diagnosing multiple faults', Artificial Intelligence, 32(1), 97–130, (1987).
[4] B. Neumann and R. Möller, 'On scene interpretation with description logics', in Cognitive Vision Systems, 247–275, Springer, (2006).
[5] D. Poole, 'Explanation, prediction: an architecture for default, abductive reasoning', Computational Intelligence, 5, 97–110, (1989).
[6] J. Zhang, A. Silvescu, and V. Honavar, 'Ontology-driven induction of decision trees at multiple levels of abstraction', LNCS, 2371, 316–323, (2002).
3 We have defined oc as a function of γi to possibly take into account the level of detail of observations related with γi.
4 Note that when Ci,j is a leaf of the hierarchy, c(Ci,j) is the action cost ac(Ci,j).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-805
Computation of Minimal Sensor Sets for Conditional Testability Requirements
Gianluca Torta1 and Pietro Torasso1

1 Introduction

The problem of computing a minimal set of sensors (MSS) that guarantees a desired level of diagnosability of a given system is well known in Model-Based Diagnosis (e.g. [4], [3]). Unfortunately, for many real-world systems, guaranteeing the testability of a given fault in every situation is impossible, no matter how many sensors we place and how many test vectors we apply for identifying the fault. This impossibility is mainly the consequence of masking effects induced by the presence of other faults (e.g. we can't tell whether a bulb is working properly if the power is down). While it is possible to partially address this problem by putting restrictions on the number of faults (in particular, the single-fault assumption as in [3]), this unnecessarily limits the applicability of MSS computation. To overcome these limitations, this paper introduces conditional testability, which requires the testability of a fault to be guaranteed only under some conditions given by the user. In order to be useful, such conditions must be easy to detect, so that during the on-line testing phase it is always known whether the testability guarantee applies or not. Therefore, the conditions are expressed in terms of endogenous variables: this makes it possible to express testability conditions for a fault of component c directly on the endogenous variables local to c (making the specification task significantly easier). Moreover, if we assume that endogenous variables are potentially observable, it is always possible to find at least one Sensor Set (SS) that guarantees detectability of the conditions. In this paper we sketch how to compute MSSs for conditional testability starting from discriminability relations that are parsimoniously encoded using a symbolic representation. Such relations are built from an extended model of the system to be diagnosed which includes a set of switches modeling the inclusion/exclusion of observable variables into the set of actual observations. Minimization of the SSs can be further constrained either by positive information (e.g. we already have some sensors) or negative information (e.g. some endogenous variables can't be sensorized); despite this flexibility, the minimization can be done in linear time w.r.t. the size of the symbolic representation of the set of SSs satisfying the given conditional testability requirements.
2 Conditional Testability and Minimal Sensor Sets

In this section we formally characterize conditional testability and Minimal Sensor Sets, starting with the definition of the system model.

Definition 1 A System Description is a pair SD = (SV, DT) where:

- SV is the set of discrete system variables, partitioned into I (inputs and commands), C (components) and E (endogenous variables). We will denote by dom(v) the finite domain of v ∈ SV; in particular, for each c ∈ C, dom(c) consists of the list of possible behavioral modes for c (an ok mode and one or more fault modes)
- DT (Domain Theory) is a relation over the variables in SV2

The notion of testability we consider in this paper assumes that the system can be tested under different contextual conditions in order to find the set of consistency-based diagnoses (see e.g. [1], [3]). In general we are interested in verifying whether a mode m of a component c is testable and what degree of observability guarantees such testability independently of the status of the other components. Unfortunately, this intuitive notion may be too strong because the status of the other components can mask the effects of the presence of c(m); for this reason we introduce a weaker notion of testability which is required to hold only on a restriction of the possible system behaviors. This weaker notion is useful only when we have the guarantee of recognizing (through suitable observations) whether the behavior of the system to be diagnosed complies with the restriction or not, so that it is always possible to tell whether the conditional testability of c(m) applies. First, we introduce the notion of discriminability between two hypotheses given a specific degree of observability3.

Definition 2 A hypothesis H is any relation obtained by constraining DT through the application of the relational algebra operators σ and ⋈. We say that two hypotheses Hi, Hj are discriminable w.r.t. observability O ⊆ E iff ΠO(Hi) ∩ ΠO(Hj) = ∅.

In the above definition Hi and Hj are any two restrictions of the system model involving the status of the system or the values of E variables, or a combination of both. The possible values of the observable variables O under hypotheses Hi and Hj must be disjoint. We are now ready to formalize the notion of testability of a behavioral mode m of component c, given a specific level of observability, when the global behavior of the system satisfies certain conditions.

Definition 3 Let c ∈ C, m ∈ dom(c), ϕE be a formula over E variables and SE = {C : DT ∧ C ∧ ϕE ⊭ ⊥}, where C denotes an instance of the C variables. We say that c(m) is conditionally testable under conditions ϕE w.r.t. observability O ⊆ E iff:

- there exists an instance X of I s.t. hypotheses Hi = (DT ⋈ X ⋈ SE) and Hj = (DT ⋈ X ⋈ S̄E) are discriminable w.r.t. O
1 Università di Torino, Italy, email: {torta,torasso}@di.unito.it
2 In component-based systems, relation DT is obtained by joining a set of relations DT1, . . ., DTn, each one modeling the behavior of a component.
3 In the following, we use Π, σ and ⋈ to denote the project, select and join operations defined in the relational algebra.
- for each X, hypotheses Hi = σϕE(DT ⋈ X ⋈ SE) and Hj = σ¬ϕE(DT ⋈ X ⋈ SE) are discriminable w.r.t. O
- for each X, hypotheses Hi = σϕE∧c(m)(DT ⋈ X ⋈ SE) and Hj = σϕE∧¬c(m)(DT ⋈ X ⋈ SE) are discriminable w.r.t. O

Set SE represents all of the possible assignments to component variables C (i.e. system states) consistent with ϕE. The first discriminability condition requires that we can find an instance X of the inputs such that, given observability O, it is possible to tell whether the status of the system is in SE or in S̄E. Since ϕE never holds in states C ∈ S̄E, only states in SE must be further considered. The second and third conditions are strongly related. Indeed, even if the system status is in SE, it is possible that ϕE does not hold given the current input vector X; the second condition requires that, provided the system status is in SE, observability O allows us to detect whether ϕE holds or not, regardless of the current input X. Finally, if ϕE holds it must be possible (third condition) to tell which of c(m) and ¬c(m) holds, i.e. c(m) is discriminable from ¬c(m). Conditional testability represents the formal basis for defining the notion of Minimal Sensor Set.

Definition 4 A conditional testability requirement λ is a pair (c(m), ϕE) where c ∈ C, mode m ∈ dom(c) and ϕE is a formula over E variables.

Definition 5 Given SD = (SV, DT), an observability O and λ = (c(m), ϕE), we say that O satisfies λ if c(m) is conditionally testable under ϕE w.r.t. O. A Minimal Sensor Set is an observability O* satisfying λ such that no other O′ with |O′| < |O*| satisfies λ.

From the definition above it is apparent that the preference criterion chosen for selecting MSSs is based on minimum cardinality. The definition of MSS can be straightforwardly extended to apply to a set Λ = {λ1, . . . , λm} of conditional testability requirements.
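Definition 2 reduces discriminability to a projection test, which is easy to render with relations as sets of variable assignments (a plain-Python stand-in for the relational-algebra machinery; variable names are illustrative):

```python
def project(rel, O):
    """Π_O: project a relation, given as a set of rows (dicts), onto O."""
    return {tuple(sorted((v, row[v]) for v in O)) for row in rel}

def discriminable(Hi, Hj, O):
    """Definition 2: Hi, Hj are discriminable w.r.t. O iff
    Π_O(Hi) ∩ Π_O(Hj) = ∅."""
    return not (project(Hi, O) & project(Hj, O))

Hi = [{"e1": 0, "e2": 1}, {"e1": 1, "e2": 1}]
Hj = [{"e1": 0, "e2": 0}]
print(discriminable(Hi, Hj, {"e2"}))   # True: observing e2 separates Hi from Hj
print(discriminable(Hi, Hj, {"e1"}))   # False: e1 = 0 is possible under both
```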
3 Computing the Sensor Sets

The computation of MSSs involves a number of steps. The starting point is represented by the System Description SD and a set of user-provided conditional testability requirements Λ = {λ1, . . . , λm}. Given a requirement λ = (c(m), ϕE), the system computes the set SSλ of all the Sensor Sets that satisfy λ. In particular, according to Definition 5, SSλ includes all the observabilities that make c(m) conditionally testable under ϕE; note that SSλ is empty only when ϕE is too weak and c(m) is not testable even under full observability. The computation of SSλ requires a way for easily representing and manipulating degrees of observability. To this end, we extend SD by adding a set of observation switches which model the inclusion/exclusion of potentially observable variables into the set of actual observations and by adding formulas relating the influence of the switches on the observable variables as described in [3]. Once the sets SSi, i = 1, . . . , m have been computed (one for each conditional testability requirement), the intersection of SSi, i = 1, . . . , m yields a set SS containing all the Sensor Sets (observabilities) that simultaneously satisfy the requirements in Λ. Note that SS is ∅ only if at least one of the SSi is ∅; indeed, if each SSi ≠ ∅, then SS contains at least the sensor set corresponding to full observability.

Minimizing Sensor Sets. At this point, the process executes a minimization step in order to build a set MSS containing all of the Minimal Sensor Sets in SS. During the minimization step, the user has the possibility of specifying a set of constraints Ω on the sensors. In particular, the user may want to constrain some observables o to be
available (e.g. because they are already sensorized) or to be excluded (e.g. because it is impossible to measure o with a sensor). In principle, the minimization step could be very expensive from a computational point of view, but the efficient approach developed for the computation of minimum cardinality diagnoses in [2] can be adopted for the computation of MSSs. The basic idea is to precompute a set of filters CSSi, where CSSi represents all of the possible observabilities involving exactly i observable variables as assignments to the switch variables. In order to compute MSS we intersect SS with the filters CSSi, starting from CSS0 (the lowest level of observability) and stopping as soon as the intersection is not empty.

On-line Use of MSS for Testing. Let us consider the diagnosis of the system after it has been sensorized according to one of the Minimum Sensor Sets in MSS; the possibility of discriminating among different diagnoses depends on the set of requirements used for computing MSS. In particular, according to the first condition of Definition 3, for each λ = (c(m), ϕE) it is possible to apply a single input vector4 in order to figure out whether the actual status C of the system is inconsistent with conditions ϕE; therefore, we have a cheap way of determining which are the modes of each component which are guaranteed to be discriminable with further tests. If it turns out that C is consistent with ϕE, we perform additional tests until we apply an input vector XC that, together with C, induces a behavior that satisfies ϕE. Thanks to the properties of the MSS computed by our approach, we have the guarantee that such an XC exists and that, after we update the set of candidate diagnoses with the readings of the sensors induced by XC, all of the candidate diagnoses either agree on c(m) or on ¬c(m).

Symbolic Implementation. Since the size of the relations involved in the computation of MSS can in general be huge, we have adopted OBDDs (Ordered Binary Decision Diagrams) for encoding and manipulating all of such relations (including the Domain Theory DT). For the minimization step, we also have a theoretical result which guarantees that the potentially very expensive task of computing MSS can be done in linear time with respect to the size of the OBDD OSS encoding the set SS.

Property 1 Let OSS be an OBDD encoding the Sensor Sets for the set of requirements Λ; then, the OBDD OMSS encoding the Minimal Sensor Sets can be computed in time O(|E|^3 · |OSS|).

The OBDD implementation has proved to be effective when applied to the model of a hydraulic system involving 4 commands, 10 components, and 40 endogenous multi-valued variables (similar to the one in [3]). The total CPU time taken by the computation of MSS is very small when we require that three behavioural modes of the 5 pipes are conditionally testable: the total CPU time is about 900 msec on a PC with a CPU at 1.4 GHz and 512 MB RAM. In particular, the time taken by the minimization step is almost negligible.
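The filter-based minimization step described above can be sketched with explicit sets standing in for the OBDD encodings (so this ignores the symbolic representation that makes the real computation feasible):

```python
from itertools import combinations

def minimal_sensor_sets(ss, observables):
    """Intersect SS with the cardinality filters CSS_0, CSS_1, ... and stop
    at the first non-empty intersection; the result is exactly the MSSs."""
    for i in range(len(observables) + 1):
        css_i = {frozenset(c) for c in combinations(observables, i)}
        mss = ss & css_i
        if mss:
            return mss
    return set()

E = ["e1", "e2", "e3"]
SS = {frozenset(s) for s in [("e1",), ("e1", "e2"), ("e2", "e3")]}
print(minimal_sensor_sets(SS, E))   # -> {frozenset({'e1'})}
```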
REFERENCES
[1] M. Esser and P. Struss, 'Fault-model-based test generation for embedded software', in Proc. IJCAI, pp. 342–347, (2007).
[2] P. Torasso and G. Torta, 'Model-based diagnosis through obdd compilation: a complexity analysis', LNCS, 4155, 287–305, (2006).
[3] G. Torta and P. Torasso, 'Computation of minimal sensor sets from precompiled discriminability relations', in Proc. DX, pp. 202–209, (2007).
[4] L. Travé-Massuyès, T. Escobet, and X. Olive, 'Diagnosability analysis based on component-supported analytical redundancy relations', IEEE Trans. on Systems, Man and Cybernetics A, 36(6), 1146–1160, (2006).
4 Such an input vector X is computed during the off-line check of the discriminability condition and it can be saved and associated with λ.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-807
Combining Abduction with Conflict-based Diagnosis
Ildikó Flesch1 and Peter J.F. Lucas2

Abstract. Conflict-based diagnosis is a recently proposed probabilistic method for model-based diagnosis, inspired by consistency-based diagnosis, that uses a measure of data conflict, called the diagnostic conflict measure, to rank diagnoses. In this paper, this method is refined using an abductive method that reuses part of the computation of the diagnostic conflict measure.
1 INTRODUCTION

Conflict-based diagnosis is a recently proposed probabilistic method for model-based diagnosis that is inspired by consistency-based diagnosis, and uses a measure of data conflict, called the diagnostic conflict measure, to rank diagnoses. The probabilistic information that is required to compute the diagnostic conflict measure is represented by means of a Bayesian network. This Bayesian network contains sufficient information to compute abductive diagnoses as well. In this paper, conflict-based diagnosis is augmented with an abductive method, similar in spirit to the probabilistic method employed by GDE [2]. The method reuses part of the computation of the diagnostic conflict measure. In essence, abductive diagnosis is used to rank conflict-based diagnoses with equal conflict-based rankings.
2 PRELIMINARIES

2.1 Model-based Diagnosis

In model-based diagnosis, the structure and behaviour of a system is represented by a logical diagnostic system SL = (SD, COMPS), where (i) SD denotes the system description, which is a finite set of logical formulae, specifying structure and behaviour, and (ii) COMPS is a finite set of constants, corresponding to the components of the system; these components can be faulty. The system description consists of behaviour descriptions specifying normal and abnormal (faulty) functionalities of the components, and of connections of inputs and outputs of components. A logical diagnostic problem is defined as a pair PL = (SL, OBS), where SL is a logical diagnostic system and OBS is a finite set of logical formulae, representing observations. Two types of model-based diagnosis are distinguished: (i) consistency-based diagnosis [2, 6], and (ii) abductive diagnosis [1]. Let ΔC consist of the assignment of abnormal behaviour to the set of components C ⊆ COMPS and normal behaviour to the remaining components COMPS − C; then, adopting the definition from [3], ΔC is a consistency-based diagnosis of the logical diagnostic problem PL iff the observations are consistent with both the system description and the diagnosis; formally: SD ∪ ΔC ∪ OBS ⊭ ⊥.
1 Department of Computer Science, Maastricht University, email: ildiko@micc.unimaas.nl
2 Institute for Computing and Information Sciences, Radboud University Nijmegen, email: peterl@cs.ru.nl
Figure 1. The graphical representation of a Bayesian diagnostic system corresponding to the full-adder in [6] (with input vertices I1–I3, abnormality vertices AbX1, AbX2, AbA1, AbA2, AbR1, and output vertices OX1, OX2, OA1, OA2, OR1).
In the abductive approach, the behavioural assumptions ΔC are called an abductive diagnosis if the system description SD and the behavioural assumptions ΔC imply the set of observations OBS; formally: SD ∪ ΔC ⊨ OBS.
2.2 Bayesian Diagnostic Problems

Let P(X) be a joint probability distribution of the set of discrete binary random variables X, where, for a single variable, x and x̄ denote the values 'true' and 'false', respectively. A Bayesian network B is then defined as a pair B = (G, P), where the acyclic directed graph G = (V, E) represents the relations between the random variables defined in P(X), where each random variable corresponds to a unique vertex. A Bayesian diagnostic system is denoted by SB = (G, P), where P is a joint probability distribution of the vertices of G, interpreted as random variables, and G is obtained by mapping a logical diagnostic system SL to a Bayesian diagnostic system as follows: (i) component c is represented by its input Ic and output Oc, where each arc points from input to output, (ii) to each component c there belongs an abnormality vertex Abc. An example is given in Figure 1. Let the set of values of the abnormality variables Abc, with c ∈ COMPS, be denoted by δC = {abc | c ∈ C} ∪ {ab̄c | c ∈ COMPS − C}, which establishes a link to ΔC in logical diagnostic systems. In this paper, the set of observed input and output variables are referred to as Iω and Oω, whereas the unobserved input and output variables will be referred to as Iu and Ou respectively. Let iω denote the values of the observed inputs, and oω the observed output values.
The set of observations is then denoted as ω = iω ∪ oω . The following assumptions are used in the remainder of this paper: (i) the probabilistic behaviour of a component that is faulty is independent of its inputs, and (ii) normal components behave deterministically. These are realistic assumptions, as it is unlikely that detailed functional behaviour is known for a component that is faulty, whereas when the component is not faulty, it is certain it behaves as intended. A Bayesian diagnostic problem, denoted by PB = (SB , ω), consists of (i) a Bayesian diagnostic system and (ii) a set of observations ω [5, 4].
2.3 Conflict-based Diagnosis

The theory of conflict-based diagnosis uses the diagnostic conflict measure to solve Bayesian diagnostic problems [4], where a numeric value is assigned to each diagnosis to order them. Define ω = iω ∪ oω as the observations; then the diagnostic conflict measure (DCM), denoted by conf_δC(ω), is defined as

conf_δC(ω) = log [ P(iω | δC) P(oω | δC) / P(iω, oω | δC) ].   (1)
Using the independence properties of Bayesian diagnostic problems we obtain:

conf_δC(ω) = log [ Σ_i P(i) Σ_{ou} Π_c P(Oc | π(Oc)) / ( Σ_{iu} P(iu) Σ_{ou} Π_c P(Oc | π(Oc)) ) ].   (2)

Intuitively, if the probability of the individual occurrence of the observations is smaller than that of the joint occurrence (if the numerator is smaller than the denominator), then the observations do 'like' or support each other. Thus, a smaller value of the DCM indicates a better fit between observations and component behaviours. Therefore, the DCM imposes an ordering on diagnoses, where the lower the DCM for a diagnosis is, the better the diagnosis fits the diagnostic problem. A diagnosis is a conflict-based diagnosis if its DCM is non-positive, and it is also called minimal if it has the least DCM value in comparison to the other conflict-based diagnoses.
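A brute-force rendering of Equation (1) for a toy one-component system (a buffer with one input and one output; the numbers and the fault model are illustrative assumptions):

```python
import math
from itertools import product

# Toy system: one buffer with input I and output O; when not abnormal it
# copies its input, when abnormal its output is an unbiased coin flip.
p_i = {True: 0.5, False: 0.5}                    # prior on the input
def p_o(o, i, ab):                               # P(O | I, Ab)
    return (1.0 if o == i else 0.0) if not ab else 0.5

def dcm(i_obs, o_obs, ab):
    """Equation (1): conf_δ(ω) = log [ P(iω|δ) P(oω|δ) / P(iω, oω|δ) ]."""
    def prob(fix_i=None, fix_o=None):
        return sum(p_i[i] * p_o(o, i, ab)
                   for i, o in product([True, False], repeat=2)
                   if (fix_i is None or i == fix_i)
                   and (fix_o is None or o == fix_o))
    return math.log(prob(fix_i=i_obs) * prob(fix_o=o_obs) / prob(i_obs, o_obs))

print(dcm(True, True, ab=False))  # log 0.5 < 0: the observations support each other
print(dcm(True, True, ab=True))   # 0: under the fault model they are unrelated
```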
3 ABDUCTIVE CONFLICT-BASED DIAGNOSIS

In the ranking obtained by conflict-based diagnosis there may be cases where the diagnoses have the same DCM. This has motivated us to develop a method which offers a way to distinguish such diagnoses. This method makes use of abductive computations, for which parts of the computation of the DCM are reused.
3.1 The Relation between Abductive and Consistency-based Reasoning

In our probabilistic setting, the consistency condition requires that the probability of the occurrence of the observations given the diagnosis is non-zero. Formally, in consistency-based reasoning, we are searching for diagnoses δC with P(iω, oω | δC) > 0. Note that the set of abnormality assumptions δC is given knowledge. In abductive reasoning, on the other hand, the observations have to be implied by the system description and the abnormality assumptions δC. This means that we are looking for abnormality assumptions δC that explain the observations; formally: P(δC | iω, oω). Using Bayes' rule the following relationship between consistency-based and abductive reasoning can be established:

P(δC | iω, oω) = P(iω, oω | δC) P(δC) / P(iω, oω),   (3)
where 1/P(iω, oω) is a normalisation constant. The maximum a-posteriori assignment (MAP) diagnosis, defined as δ*C = argmax_δC P(δC | iω, oω), is the natural probabilistic analogue of the concept of subset-minimal abductive diagnosis [7]. According to Equation (3), computation of abductive diagnoses requires the computation of consistency-based diagnoses.
3.2 Abductive Probabilistic Computations

Next, a formula to compute abductive diagnoses of Bayesian diagnostic problems is derived, which is used to distinguish between equally ranked conflict-based diagnoses. Note that the numerator P(iω, oω | δC) in Equation (3) is also the denominator of the DCM in equations (1) and (2); according to [4]:

P(iω, oω | δC) = P(iω) Σ_{iu} P(iu) Σ_{ou} Π_c P(Oc | π(Oc)).
In contrast to Equation (2), the factor P(iω) is not divided out. The denominator of the abductive formula is computed as:

P(iω, oω) = P(iω) Σ_{δc} P(δc) Σ_{iu} P(iu) Σ_{ou} Π_c P(oc | π(Oc)).
It is now possible to derive the abductive computational form:

P(δc | iω, oω) = P(iω, oω | δc) P(δc) / P(iω, oω)
= [ P(iω) Σ_{iu} P(iu) Σ_{ou} Π_c P(oc | π(Oc)) · P(δc) ] / [ P(iω) Σ_{δc} P(δc) Σ_{iu} P(iu) Σ_{ou} Π_c P(oc | π(Oc)) ]
= [ P(δc) Σ_{iu} P(iu) Σ_{ou} Π_c P(oc | π(Oc)) ] / [ Σ_{δc} P(δc) Σ_{iu} P(iu) Σ_{ou} Π_c P(oc | π(Oc)) ].   (4)
At first sight, it seems computationally infeasible to compute P(δc | iω, oω) in this manner. However, the computation can be simplified, as P(δC | ω) is only used to rank diagnoses and thus the denominator need not be used, as it is the same for all diagnoses; only the numerator has to be computed. The computation of the numerator is easy, since the second term Σ_{iu} P(iu) · · · is already computed as part of the denominator of the DCM (see Equation (2)). Only the probability P(δc) needs to be computed, which is a product of the individual probabilities for (ab)normal behaviours of the components.
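The resulting tie-breaking step can be sketched as follows; `shared_term` stands for the cached factor Σ_{iu} P(iu) Σ_{ou} Π_c P(oc | π(Oc)) from the DCM computation, and all names and numbers are illustrative:

```python
def rank_equal_dcm(diagnoses, shared_term, p_ab):
    """Order diagnoses with equal DCM by the numerator of Equation (4):
    P(δ) times the cached `shared_term`, where P(δ) is the product of the
    per-component (ab)normality probabilities."""
    def p_delta(delta):
        prob = 1.0
        for comp, ab in delta.items():
            prob *= p_ab[comp] if ab else 1.0 - p_ab[comp]
        return prob
    score = lambda d: p_delta(d) * shared_term[frozenset(d.items())]
    return sorted(diagnoses, key=score, reverse=True)

# Two diagnoses whose DCMs tie; the abductive score prefers the one whose
# abnormality assumptions are a priori more probable.
p_ab = {"X1": 0.01, "A1": 0.05}
d1, d2 = {"X1": True, "A1": False}, {"X1": False, "A1": True}
shared = {frozenset(d1.items()): 0.2, frozenset(d2.items()): 0.2}
print(rank_equal_dcm([d1, d2], shared, p_ab))   # d2 ranked first
```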
4 CONCLUSIONS

In this paper a method was described to augment conflict-based diagnosis with probabilistic abductive diagnosis. The refinement of conflict-based diagnosis by abduction has the virtue that it reuses part of the computation required for finding conflict-based diagnoses.
REFERENCES
[1] L. Console and P. Torasso. A Spectrum of Logical Definitions of Model-based Diagnosis, Computational Intelligence, 7:133–141, 1991.
[2] J. de Kleer and B. C. Williams. Diagnosing multiple faults, AIJ, 32:97–130, 1987.
[3] J. de Kleer, A. K. Mackworth, and R. Reiter. Characterizing diagnoses and systems. AIJ, 56:197–222, 1992.
[4] I. Flesch, P.J.F. Lucas and Th. van der Weide. Conflict-based diagnosis: Adding uncertainty to model-based diagnosis. IJCAI, 380–388, 2007.
[5] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, CA, 1988.
[6] R. Reiter. A theory of diagnosis from first principles. AIJ, 32:57–95, 1987.
[7] S.E. Shimony and E. Charniak. A new algorithm for finding MAP assignments to belief networks. AIJ, Volume 6, pp. 185–193, 1991.
4. Cognitive Modeling and Interaction
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-811
An Activity Recognition Model for Alzheimer’s Patients: Extension of the COACH Task Guidance System B. Bouchard1 and P. Roy2 and A. Bouzouane1 and S. Giroux2 and A. Mihailidis3 Abstract. This paper presents a hybrid plan recognition model, based on probabilistic description logic, which addresses the issue of recognizing the activities and the errors of Alzheimer’s patients at an early stage of the disease. This model has been implemented to be a new extension of the COACH system, an emerging prototype of cognitive device for persons with Alzheimer’s disease that offers assistance in task completion. We present an initial experimentation done on this new recognition module for COACH, which is based on the results of two sets of clinical trials.
1 Introduction
For several years, the IATSL laboratory4 has been exploring the process by which cognitive assistance, inside a smart home, can be provided to an occupant suffering from Alzheimer's disease, in the performance of his Activities of Daily Living (ADL). This widespread form of dementia causes a progressive deterioration of thinking (cognitive impairment) and memory, leading to incoherent behavior and limiting the patient's capacity to perform his tasks of everyday life (washing his hands, cooking a meal, etc.) [4]. In this context, the research team led by Mihailidis [7] has developed COACH (Cognitive Orthosis for Assisting aCtivities in the Home), a prototype aiming to actively monitor an Alzheimer's patient attempting a specific task, for instance handwashing, and to offer assistance in the form of guidance (e.g., prompts or reminders) when it is most appropriate. A major limitation of the current COACH prototype is that it presumes the system already knows which activity is in progress, and thus supposes that there can be only one on-going task at a time. In this paper, we begin to address the problem of recognizing an on-going ADL from observed basic actions, which constitutes a key issue inherent in cognitive assistance [3]. The complexity of this recognition process is increased because a memory lapse can lead a patient to perform actions in the wrong order, to skip steps of his activity, or to perform actions that are not even related to his original goal. However, the person is not always making errors; he can simply temporarily stop the execution of a plan to begin another one in the middle of an activity. This way, the patient deviates from the activity originally planned by carrying out multiple interleaved plans. This raises a new recognition dilemma that must be taken into account.
1 Université du Québec à Chicoutimi, (QC) Canada, email: {Bruno.Bouchard, Abdenour.Bouzouane}@uqac.ca
2 Université de Sherbrooke, (QC) Canada, email: {Patrice.C.Roy, Sylvain.Giroux}@usherbrooke.ca
3 University of Toronto, (ON) Canada, email: Alex.Mihailidis@utoronto.ca
4 The IATSL lab is sponsored by the Alzheimer Society of Canada, the American Association of Alzheimer, Intel, the Natural Sciences and Engineering Research Council of Canada (NSERC), and several other partners.
Our contribution follows the traces of hybrid approaches to plan recognition, an emerging alternative that combines two different avenues of research, logical and probabilistic. We distinguish Geib's model [5], which is based on abductive probabilistic logic in order to deduce hypotheses quantified by probabilities, thereby explaining multiple-plan behavior while taking into account uncertainty due to the loss of observations. Another model, developed by Avrahami-Zilberbrand et al. [1], considers that plan recognition is the result of a probabilistic quantification on the hypotheses obtained from a symbolic algorithm. It exploits a decision-tree system based on the properties of the observed actions, similar to the one used in the learning program C4.5, in order to efficiently identify possible hypotheses for interpreting simultaneous plans. These two models suppose that the observed agent is coherent and, consequently, the proposed solutions cannot be applied to the recognition problem that we raised. On the other hand, the team of the Research Center on Intelligent Habitats (CRHI) of the University of Sherbrooke recently proposed a hybrid recognition model, based on lattice theory and a probabilistic action description logic [6, 2], allowing us to formalize the plausible incoherent intentions of the patient resulting from the symptoms of his cognitive impairment. This model corresponds quite well to our needs by focusing on the recognition of erroneous/interleaved activities of Alzheimer's patients.
2 Hybrid recognition model
A plan recognition process consists of interpreting the set of observed actions performed by an agent (patient) with the aim of predicting his future actions that explain his behavior. Let A = {a, b, . . .} be the set of actions that an observed agent is able to perform and let P = {α, β, . . .} be the set of known plans of the observer. Let O be the set of observations such that O = {o | ∃a ∈ A, a(o)}. The assertion a(o) means that observation o corresponds to an instance of action concept a. The set of possible plans that would explain the set of observations O, according to the agent's knowledge, is expressed by Pso = {α ∈ P | ∃(a, o) ∈ α × O, a(o)}. However, his intentions can go beyond the set of possible plans. In order to generate all of the agent's intentions, we enhance Pso by dynamically generating extra-plans (hypotheses) based on the composition operation α ⊕ β between each pair of incomparable possible plans (α, β) ∈ Pso. We define this enhanced set of plans Pho as the union of the composition pairs of possible plans. This set is an interpretation model for O if it forms a lattice structure < Pho, ≺p, Δ, ∇ > ordered by the subsumption relation of plans ≺p and each couple of plans admits an upper bound ∇, corresponding to their least common partial subsumer, which is minimally composed of the observed actions. Also, each couple of plans admits a lower bound Δ, corre-
812
B. Bouchard et al. / An Activity Recognition Model for Alzheimer’s Patients: Extension of the COACH Task Guidance System
sponding to a hypothesis schema (a plan containing action variables) obtained by disunifying the incomparable possible plans using the first-order logic disunification operation. The interest of this schema is to synthesize the predictions concerning future actions. A plan αΔβ, defined as a sequence of actions a1, . . . , x, . . . , an, denoted αΔβ(an ◦ an−1 ◦ · · · ◦ x ◦ · · · ◦ a1) where ◦ is a sequence operator and x is an action variable, is a hypothesis schema if and only if there exists a substitution σ(x) ∈ A+ such that each new extra-plan π(an ◦ · · · ◦ σ(x) ◦ · · · ◦ a1) satisfies the two following properties. The first one is the ⊕-stability of π, which means that each hypothesis plan π ∈ α ⊕ β is formed by: (i) a set of partial plans included in the knowledge base P of the observer, (ii) at least one action common to plan α and to plan β, and (iii) a composition of actions that are components of α or of β. The second criterion is the ⊕-closure property, which expresses that hypothesis π must admit an upper bound α∇β and a lower bound αΔβ. Hence, it must be included in the lattice bounded by αΔβ and α∇β. This algebraic space is not sufficient to disambiguate the relevant hypotheses. Therefore, the addition of a probabilistic quantification on the lattice structure is an interesting alternative. The symbolic recognition agent filters the hypotheses by passing only a bounded space to the probabilistic inference engine. Our proposal consists of characterizing through an interval of probabilities the relative influence (partial subsumption) of a plan on another one. Let α, β be two hypotheses interpreting the observed actions. The plan β partially subsumes the plan α with an interval of probabilities [pmin, pmax], if there exists a supremum plan α∇β such that α ≺p α∇β and β ≺p α∇β, where pmin = 1/Pmin(α|α∇β) · max(0, Pmin(α|α∇β) + Pmin(β|α∇β) − 1), and pmax = min(Pmax(β|α∇β)/Pmin(α|α∇β), 1). For instance, the term Pmin(α|α∇β) corresponds to the minimal conditional probability of observing the realization of a particular plan α, knowing that the sequence of actions α∇β has been observed. This estimation is based on a database given as input, composed of samples of observation frequencies concerning the realization of activities, which are obtained at the end of a training period while the system learns the usual routines of the patient. In other words, it models the minimal probability of implementation of an erroneous/interleaved plan by the patient.
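The interval itself is a small closed-form computation; the following is a direct transcription of the two formulas above (argument names and the example numbers are ours):

```python
def subsumption_interval(pmin_alpha, pmin_beta, pmax_beta):
    """[pmin, pmax] with which plan β partially subsumes plan α, given the
    conditional probabilities w.r.t. the supremum α∇β:
    pmin_alpha = Pmin(α|α∇β), pmin_beta = Pmin(β|α∇β),
    pmax_beta = Pmax(β|α∇β)."""
    pmin = max(0.0, pmin_alpha + pmin_beta - 1.0) / pmin_alpha
    pmax = min(pmax_beta / pmin_alpha, 1.0)
    return pmin, pmax

# E.g., Pmin(α|α∇β) = 0.7, Pmin(β|α∇β) = 0.6, Pmax(β|α∇β) = 0.8:
print(subsumption_interval(0.7, 0.6, 0.8))   # -> (0.428..., 1.0)
```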
3 Validation
The COACH infrastructure consists of an intelligent environment taking the form of a common washroom, as shown in Figure 1. It is equipped with a video camcorder mounted on the wall to record trials, and the researchers can change the faucet for usability studies. The extended new architecture of COACH is divided into three layers. First, inputs are received from hardware sensors (the camera) and are sent to the low-level (first) recognition layer. This layer uses a vision algorithm based on a Bayesian sequential estimation method using flocks of color features, which allows one to identify basic events (observations), such as the patient's hand location, the tap position (open or closed), etc. Thereafter, refined observations are sent to the plan recognition (second) layer, which uses our hybrid recognition model to construct a recognition space (a lattice structure), aiming to identify the possible on-going activities and to anticipate the possible future erroneous deviations of the patient. Finally, the assistance (third) layer receives this structured space and uses it to compute a correct prompting solution. The initial experimentation that we conducted is based on two studies carried out over the last two years in Toronto, at the Lakeside Long Term Care Centre (Toronto Rehabilitation Institute) and at the Harold & Grace Baker Centre, with 20 patients. Both studies lasted approximately 3 months. In these studies, each patient was asked to perform, once a day (for 50 to 60 days), the same HandWashing activity. These trials allowed us to create a database of real case scenarios concerning common erroneous behavior of patients. Based on these data, we selected 30 representative scenarios that cover each type of error and we simulated them, step by step. The objective was to evaluate in what proportion the new module was able to recognize patients' erroneous/interleaved activities. The results of this initial experiment show that the module was able to recognize almost all interleaved multiple activities, 80% of omission type errors, 60% of the substitution errors, and 50% of sequence errors. These results are promising, as all the recognized deviations were dynamically generated according to the initially identified set of possible plans. However, the module is limited by the fact that the first observed action is assumed to be correct (no errors). Also, some unrecognized errors are due to foreign actions that had nothing to do with the on-going activity, for instance a patient washing his face instead of his hands.
Figure 1. Set-up for two activities: toothbrushing and handwashing.
4
Conclusion
The extension of COACH that we proposed allows us to address a major limitation of the former prototype, which presumed that the system had already identified the on-going activity. Therefore, it should be seen as a first step toward the deployment of a complete prototype that could provide assistance for multiple different tasks to a patient at home. An interesting enrichment of the model consists in recognizing repetitive actions induced by the patient's erratic behavior, and in taking into account the temporal relations between actions.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-813
813
Not so new: overblown claims for ‘new’ approaches to emotion Dylan Evans1 Abstract. The non-classical thesis of emotion (NCE) states that the conceptual resources of classical cognitive science cannot adequately account for certain important features of emotion. It also states that these features can be adequately accounted for by employing the conceptual resources of non-classical forms of cognitive science. There is a general problem with all forms of NCE, since they all assume that classical cognitive science is too restrictive, when if anything the reverse is true. In fact, the relationship between classical and nonclassical approaches to the study of emotion is much more fuzzy than NCE suggests. Though the two approaches use different terms of art, this does not grant one group privileged access to cognitive resources inaccessible to the other, but merely directs their attention to different features of the phenomena being studied. Thus the real contribution of non-classical models of emotion is to draw our attention to certain key aspects of emotion requiring explanation that had perhaps been somewhat neglected by classical models.
1
THE NON-CLASSICAL THESIS OF EMOTION (NCE)
During the past decade, some philosophers and psychologists have argued that the conceptual resources of classical cognitive science cannot adequately account for certain important features of emotion. They have further argued that these features can be captured by non-classical forms of cognitive science (e.g. [6]). I will refer to such claims as the non-classical thesis of emotion (NCE). In this paper, I examine one particular form of NCE — namely, that put forward by Giovanna Colombetti. Before proceeding to examine Colombetti's arguments in detail, however, let's spell out the main components of NCE in a bit more detail. NCE may be characterised in terms of three lists and three claims, as follows:
Three lists:
1. A list of the conceptual resources of classical cognitive science
2. A list of the conceptual resources of (some form of) non-classical cognitive science, which includes at least some elements not found in list 1
3. A list of the key aspects of emotion that require explanation
Three claims:
1. The newness of non-classical cognitive science (NNC): Members of list 2 that are not in list 1 cannot be reduced to any combination of the members of list 1
2. The explanatory weakness of classical cognitive science: Some members of list 3 cannot adequately be accounted for by the members of list 1.
Cork Constraint Computation Centre, Department of Computer Science, University College Cork, Ireland, email: devans@4c.ucc.ie
3. The explanatory strength of non-classical cognitive science: The same members of list 3 enumerated in claim 2 can be adequately accounted for by the members of list 2 (either on their own, or in addition to the members of list 1).
The three claims are common to all forms of NCE. Differences between the various forms of NCE lie entirely in the different ways that the three lists are populated. None of the proponents of NCE goes so far as to provide exhaustive specifications of all three lists. Not only are their specifications partial, but they are not usually provided in the form of lists at all. Instead, their lists must be reconstructed from hints and ellipses, which makes criticism difficult. Nevertheless, this is what I will attempt to do with Colombetti's version of NCE in section 2.
2
COLOMBETTI’S SPECIFICATION OF CLASSICAL COGNITIVE SCIENCE
Giovanna Colombetti provides a fairly representative example of NCE in a 2003 paper entitled 'Complexity as a new framework for emotion theories' [2]. Colombetti does not talk explicitly about 'classical cognitive science', but she does set out to criticise 'good old fashioned frameworks'. The latter phrase clearly echoes the term GOFAI ('good old fashioned Artificial Intelligence') popularised by the philosopher John Haugeland, and which is synonymous with classical cognitive science [4]. According to Colombetti, the 'good old fashioned frameworks [are] based on modular and hierarchical perspectives of the mind, which try to explain the elicitation of emotion by positing a strictly sequential causal chain of mental and/or physical events' ([2]; emphasis in original). So, here, at least, is a partial specification of list 1 according to Colombetti:
1. Modular processes
2. Hierarchical processes
3. Strictly sequential causal chains of events
A deep ambiguity affects Colombetti's use of all three terms. In a longer version of this paper, I explain in detail what this ambiguity is in each case [3]. I leave this analysis aside here for reasons of space.
3
COLOMBETTI’S SPECIFICATION OF NON-CLASSICAL COGNITIVE SCIENCE
Colombetti does not use the term ‘non-classical cognitive science’, but she does talk about ‘the dynamical systems approach in cognitive science’ (or the ‘dynamical perspective’), and explicitly contrasts this approach with ‘the good old fashioned frameworks’ that we have already claimed to be coterminous with classical cognitive science.
This is broadly in line with most of the claims made on behalf of non-classical cognitive science, which tend to focus on the 'dynamical hypothesis in cognitive science' [7], or on strong claims about the embodiment and/or situatedness of cognition [1], or both. Thus we may take her characterisation of the dynamical systems approach to be her specification of non-classical cognitive science. Colombetti does not provide a formal list of the conceptual resources of the dynamical systems approach, but she does mention the following two ideas as being distinctive features of this approach:
1. Collective action of micro-components
2. Circular causation
The first of these is explicitly contrasted with the 'hierarchical' processes supposedly invoked by classical cognitive science, and the second with the 'strictly sequential causal chains' that classical cognitive science is supposedly restricted to. Since these terms are problematic, it is hardly surprising that their supposed opposites are similarly problematic.
4
CONCEPTUAL RESOURCES, COGNITIVE ARCHITECTURES, AND COGNITIVE MODELS
It is important to note that NCE is not a claim about the existence of new models or theories of emotion, but a claim about conceptual resources (or, more precisely, a claim about the relationship between models and conceptual resources). If a new model or theory of emotion accounts for hitherto refractory aspects of emotional phenomena, but can be entirely explicated by recourse to the conceptual resources of classical cognitive science, then the existence of the new model provides no support to NCE. This is often overlooked by proponents of NCE. Typically, supporters of NCE get excited about a new model of emotion that is expressed in terms that are not part of the conceptual toolbox of classical cognitive science. The fact that the new model accounts for aspects of emotion that have previously been neglected by models developed in the classical idiom is then taken to show that the classical idiom is incomplete. But this is a non-sequitur, for it neglects the possibility that the new model can also be expressed in terms that are drawn entirely from the classical idiom. This general point undermines all the various forms of NCE. All forms of NCE require the combined conceptual resources of classical cognitive science and non-classical cognitive science to be greater than the conceptual resources of classical cognitive science alone. More formally, if C is the set of conceptual resources of classical cognitive science, and N is the set of resources of non-classical cognitive science, then if NCE is true, the relative complement of C in N (the members of N that are not in C) must be non-empty. However, the problem with classical cognitive science, if there is one, is that its conceptual resources are all-encompassing. As a theory, it is not constrained enough. Take the Soar cognitive architecture developed by John Laird, Allen Newell and Paul Rosenbloom, for example [5]. Soar is about as good an example of classical cognitive science as anyone could hope for. It was designed as a common format for expressing a whole variety of cognitive models. Yet Soar is Turing-complete, so it can be programmed to represent any kind of computational cognitive model at all. So, for most of its critics, the problem with Soar is not that it is too constrained, but that it does not embody enough constraints to act as a good psychological theory.
The reference to Soar is particularly apt, since the current discussion would be better understood by cognitive scientists themselves (rather than by the philosophers of cognitive science who tend to dominate the discussion) if it were couched in the terminology of 'cognitive architectures' rather than that of 'conceptual resources'. A cognitive architecture is, in fact, a specification of the kind of conceptual resources that may be used to construct a set of consistent cognitive models. Classical cognitive science is perhaps best seen as a set of cognitive architectures (comprising Soar, ACT-R, and others), while non-classical cognitive science is a different set (comprising subsumption architectures, neural networks, dynamical models, among others). For any pair of architectures, and any cognitive model, the cognitive model can always be programmed in both, or just the classical architecture — but never in the non-classical architecture alone.
5
CONCLUSION
The classical and non-classical forms of cognitive science certainly sound different. The key terms of the latter are rarely, if ever, to be found in the former. However, these terminological differences do not reflect any deep conceptual rift, since there is nothing in non-classical explanations that cannot be translated into the terms of classical cognitive science. Yet the different terminology employed by classical and non-classical forms of cognitive science does make a difference to the way that the proponents of each go about their research. Terms like 'circular causation' summon up in the researchers' mind a set of studies that have put special emphasis on feedback loops (even though the researcher might explicitly state that they are concerned with something 'more' than mere feedback) and so perhaps lead the researcher who 'feels at home with' this terminology to discover feedback loops that he or she might otherwise have missed. What is needed here is a theory of pragmatics, rather than a theory of deep conceptual structure. The use of different terms by different groups of cognitive scientists does not grant one group privileged access to cognitive resources inaccessible to the other, but rather serves as a heuristic that directs their attention to different features of the phenomena being studied.
ACKNOWLEDGEMENTS The research for this paper was supported by Marie Curie Transfer of Knowledge Action no. MTKD-CT-2006-042563. Thanks are also due to Ric Wallace for his comments on an earlier draft of this paper.
REFERENCES
[1] A. Clark, Being There: Putting Brain, Body, and World Together Again, MIT Press, Cambridge, Mass., 1997.
[2] G. Colombetti, 'Complexity as a new framework for emotion theories', Logic and Philosophy of Science, 1, (2003).
[3] D. Evans, 'The non-classical thesis of emotion', unpublished manuscript.
[4] J. Haugeland, 'What is mind design?', in Mind Design II: Philosophy, Psychology, Artificial Intelligence, ed., J. Haugeland, MIT Press, Cambridge, Mass. and London, England, (1996).
[5] John E. Laird, Allen Newell, and Paul S. Rosenbloom, 'Soar: an architecture for general intelligence', Artif. Intell., 33(1), 1–64, (September 1987).
[6] M. D. Lewis, 'Bridging emotion theory and neurobiology through dynamic systems modeling (with commentary)', Behav Brain Sci, 28(2), (April 2005).
[7] T. van Gelder, 'The dynamical hypothesis in cognitive science', Behav Brain Sci, 21(5), (October 1998).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-815
815
Emergence of Rules in Cell Assemblies of fLIF Neurons Roman V. Belavkin and Christian R. Huyck 1 Abstract. Inspired by biological cognition, the CABOT project explores the ways symbolic processing can emerge in a system of neural cell assemblies (CAs). Here we show how a stochastic meta-control process can regulate learning of associations between the CAs, the neural basis of symbols. An experiment illustrates the learning between CAs representing condition–action pairs, which leads to CA-based representations of 'if–then' rules.
1
INTRODUCTION
Previously, the authors have demonstrated how states in a cell assembly (CA) neural system can be controlled and used to perform a typical symbolic task (counting) [5]. This work has developed into a much more ambitious project called CABOT, where the same principles are applied in a system, based entirely on CAs, that integrates elements of vision, categorisation, natural language processing and learning in a virtual environment. This paper presents a part of this project — learning the connections between different CAs — that combines symbolic representations into logical rules.
2 OVERVIEW OF THE ARCHITECTURE
Our system uses fatiguing, leaky, integrate and fire (fLIF) neurons [4], an extension of LIF neurons [6]:
Integrate and fire — the neuron 'fires' if its action potential, A, exceeds threshold θ, where A = (w, x) = Σ_{i=1}^{k} w_i x_i (integrator), and w, x ∈ R^k are the weights and the stimuli vectors. The weights w_t adapt according to the compensatory learning rule [4], which is an implementation of Hebbian learning [3].
Leak and accumulation of potential — A_{t+1} = A_t / d_t + (w_t, x_t), where d_t = ∞ if the neuron fired at t; d_t ≥ 1 otherwise.
Fatigue — makes the threshold dynamic, θ_{t+1} = θ_t + F_t, where F_t = F+ ≥ 0 if the neuron fired (fatigue); F_t = F− < 0 otherwise (recovery).
Cell assemblies are reverberating groups of neurons [3], and they are believed to be the neural basis of symbols in the human mind. Our system is based on networks of sparsely connected neurons. The topology of the networks is pre-defined by some random pattern, and it can be highly recurrent. When enough neurons fire to start the reverberating circuit, the CA ignites, and its persistence is an important property of CA dynamics. The fatigue and recovery rate parameters affect the persistence. A CA can be extinguished by another CA, which can ignite due to the change of the external pattern. A network with several CAs encoding a set of external patterns is referred to as a module. Several modules can be interconnected to create more complex systems. For example, a system of 7 modules and 40 CAs was used to implement a simple counting task [5]. More
Middlesex University, London NW4 4BT, UK
complex systems have been used to parse natural language and implement finite state automata. The next stage in the development of the project is the ability to learn the connections between different modules, the focus of this paper.
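As an illustration of the update rules above, the following is a minimal Python sketch of a single fLIF neuron. The class name, parameter values and the exact fatigue bounds are our assumptions for illustration, not the CABot implementation.

```python
class FLIFNeuron:
    """Minimal fatiguing, leaky, integrate-and-fire neuron sketch."""

    def __init__(self, theta=1.0, d=2.0, f_plus=0.1, f_minus=-0.05):
        self.A = 0.0            # accumulated action potential
        self.theta = theta      # dynamic firing threshold
        self.d = d              # leak divisor (d >= 1)
        self.f_plus = f_plus    # fatigue increment when firing
        self.f_minus = f_minus  # recovery decrement otherwise
        self.fired = False

    def step(self, w, x):
        # Leak: the potential is divided by d, or reset entirely if the
        # neuron fired on the previous cycle (d_t = infinity).
        leaked = 0.0 if self.fired else self.A / self.d
        # Integrate: add the weighted input (w, x).
        self.A = leaked + sum(wi * xi for wi, xi in zip(w, x))
        # Fire if the potential exceeds the (dynamic) threshold.
        self.fired = self.A > self.theta
        # Fatigue/recovery: raise the threshold after firing, lower it
        # (not below zero) otherwise.
        change = self.f_plus if self.fired else self.f_minus
        self.theta = max(0.0, self.theta + change)
        return self.fired

neuron = FLIFNeuron()
for t in range(5):
    print(t, neuron.step([0.5, 0.7], [1, 1]), round(neuron.A, 2))
```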
3 STOCHASTIC META–CONTROL
Although the connections between the correlated cells are strengthened via Hebbian learning, it is the meta-process that controls which neurons fire and thus which connections are supported. The meta-process is based on stochastic control of action-selection algorithms, implemented earlier by the authors in cognitive architectures [1], and which are based on the following result of information theory. Given utility function u : Ω → R, the goal is to find a probability distribution p on Ω that maximises the expected utility E_p{u} = (p, u) = Σ_ω p(ω) u(ω) under additional constraints. This distribution is

p(ω) = q(ω) e^{βu(ω) − Γ(β)}    (1)
where q(ω) is the reference (prior) distribution, Γ(β) = ln Σ_Ω q(ω) e^{βu(ω)}, and β is the Lagrange multiplier, defined from constraints on information (I(p, q) ≤ I < ∞) or on the expected utility (E_p{u} ≥ U > −∞):

β(U) = dI(U)/dU,    I(U) = sup_β [Uβ − Γ(β)]    (2)
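Computationally, distribution (1) is a Boltzmann-like reweighting of the prior by the utilities. A small sketch (function and variable names are ours):

```python
import math

def optimal_distribution(q, u, beta):
    """p(w) = q(w) * exp(beta*u(w) - Gamma(beta)), i.e. equation (1).
    q: prior probabilities, u: utilities, beta: inverse temperature."""
    weights = [qi * math.exp(beta * ui) for qi, ui in zip(q, u)]
    z = sum(weights)  # e^{Gamma(beta)} = sum over Omega of q * e^{beta u}
    return [w / z for w in weights]

q = [0.25, 0.25, 0.25, 0.25]
u = [0.0, 1.0, 2.0, 3.0]
for beta in (0.0, 1.0, 5.0):  # higher beta -> more deterministic choice
    print(beta, [round(p, 3) for p in optimal_distribution(q, u, beta)])
```

At β = 0 the prior is returned unchanged; as β grows, probability mass concentrates on the highest-utility elements, which is the transition from stochastic to almost deterministic behaviour discussed below.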
Function β(U) is strictly increasing, and for β > 0 the optimal distribution (1) has non-zero values (p(ω) > 0) for all ω ∈ Ω such that u(ω) > −∞. Thus, the optimal distribution describes a stochastic process, where all ω are randomised by the control parameter β > 0, or its inverse T = β^{−1}, called the temperature.
Value–Explore Topology. Problems of optimal control often involve maximisation of utility over a set Ω = X × Y, where X is the set of observations (e.g. goals), and Y is the set of controls (e.g. actions). In our system, these sets are represented by two modules, Goals and Actions, where CAs represent conditions and actions respectively. Thus, ω ∈ Ω are condition–action pairs (x, y) ∈ X × Y.
[Diagram: Goal 1 … Goal m and Act 1 … Act n modules, with the Value and Explore modules mediating their connections.]
Initially, the modules are set up with excitatory connections from every x ∈ X to all y ∈ Y . Thus, given some goal, any action can be triggered. Due to the Hebbian learning, the connections x → y between CAs that have fired together are reinforced, giving the pair a
higher chance to ignite in the future. Thus, due to Hebbian learning, the system can learn some random relation R ⊂ X ×Y (set of rules), which may not be optimal. Learning of only a particular (optimal) relation is supported by the meta–process that involves two additional modules: Value and Explore. The activity of the Value module represents the values of utility (higher activity corresponds to higher utility). The average activity of the module corresponds to constraint U in equation (2). The input of the module can be configured according to the application. For example, it may receive inputs from the sensory system representing agent’s preference on the states of the environment. The purpose of the Explore module is to randomise the activity of the Action module. Cells in this module are spontaneously firing, and the module sends excitatory connections to all CAs in the Action net. Thus, the Explore module can trigger randomly any Action CA, and this process has no memory. The module implements the effect of parameter β > 0 in equation (1) (or the temperature T = β −1 ). The Value module sends inhibitory connections to Explore, so that high activity of Value inhibits the activity of Explore. This implements the monotonic relation between constraint U and β in equation (2), and it allows for a very simple yet effective learning scheme. If a particular goal–action pair (x, y) results in a high utility, then the Value module inhibits Explore, and the (x, y) pair is allowed to persist longer. Since high utility pairs (x, y) on average co–fire longer than low utility pairs, their connections increase relative to others due to the compensatory Hebbian learning rule. This way, the meta–process supports learning of the optimal relation R ⊂ X × Y . As a result, the average activity of the Value module (U ) increases with time, while the activity of the Explore module (T = β −1 ) decreases. The system makes a transition from stochastic to an almost deterministic rule–based system. The biological plausibility of this topology is supported by studies of the reward path and tonically active cholinergic neurons in the basal ganglia and striatal complex [2]. These neurons account for a small proportion of the connections, and they are quite uniform and nontopographic. These neurons may play the role of stochastic noise, and their activation is reduced when the reward path is activated.
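The interplay of Value and Explore can be mimicked at an abstract level. The following toy loop is our simplification for illustration, not the CA-level implementation: inhibiting exploration as average utility rises concentrates Hebbian-style reinforcement on high-utility pairs.

```python
import random

goals, acts = range(2), range(2)
w = {(g, a): 1.0 for g in goals for a in acts}   # connection strengths
utility = lambda g, a: 1.0 if g == a else 0.0    # target relation R

value = 0.0                                      # running average utility
for step in range(3000):
    g = random.choice(goals)
    explore = 1.0 - value                        # Value inhibits Explore
    if random.random() < explore:                # random action (Explore)
        a = random.choice(acts)
    else:                                        # strongest connection wins
        a = max(acts, key=lambda act: w[(g, act)])
    u = utility(g, a)
    w[(g, a)] += 0.01 * u                        # Hebbian-style reinforcement
    value = 0.99 * value + 0.01 * u              # high utility raises Value

print({k: round(v, 2) for k, v in w.items()})    # (0,0) and (1,1) dominate
```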
4 EXPERIMENT: LEARNING DICHOTOMIES
The code of the system and the experiment described is available at http://www.cwa.mdx.ac.uk/CABot/CANT.html. In this simple experiment, there are two CAs in the Goal and two CAs in the Action modules. Each module consisted of 800 cells, with 400 cells in each CA. The modules were set up with low weight excitatory connections from every goal CA to all action CAs, shown by dashed arrows on the left diagram below. The task was to learn two rules, shown by two solid arrows on the right diagram.
[Left diagram: dashed arrows from Goal 1 and Goal 2 to both Act 1 and Act 2. Right diagram: solid arrows Goal 1 → Act 1 and Goal 2 → Act 2.]
The training procedure consisted of a random presentation of an input pattern activating one of the goal CAs every 100 cycles. Figure 1 shows the proportion of the correct actions selected (ordinate) as a function of cycles (abscissa). The chart shows the results of five simulations. Initially the system makes only half of the choices correctly. After 3000 cycles, the proportion of correct choices increases to 70–90%. Figure 2 shows the percentage of neurons firing per cycle
Figure 1. The proportion of correct action choices (ordinate) as a function of cycles (abscissa). The curves represent results of different trials.
Figure 2. Activities of the Value and Explore modules in one experiment.
in the Value and the Explore modules in one of the experiments. As desired, an increase of the Value activity coincides with a decrease of the Explore activity. The implementation of the meta-process for rule acquisition in our system is an important step in its evolution, creating new opportunities and improving our understanding of biological cognition.
ACKNOWLEDGEMENTS This work was supported by EPSRC grant EP/DO59720.
REFERENCES
[1] R. V. Belavkin, 'Acting irrationally to improve performance in stochastic worlds', in Proceedings of AI-2005, the 25th SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence, eds., M. Bramer, F. Coenen, and T. Allen, pp. 305–316, Cambridge, (December 2005). Springer.
[2] R. Granger, 'Engines of the brain: The computational instruction set of human cognition', AI Magazine, 27(2), 15–32, (July 2006).
[3] D. O. Hebb, The Organization of Behavior, John Wiley & Sons, New York, 1949.
[4] C. Huyck, 'Hierarchical cell assemblies', Connection Science, (2007).
[5] C. Huyck and R. V. Belavkin, 'Counting with neurons, rule application with nets of fatiguing leaky integrate and fire neurons', in Proceedings of the Seventh International Conference on Cognitive Modeling, eds., D. Fum, F. D. Missier, and A. Stocco, Trieste, Italy, (April 2006). Edizioni Goliardiche.
[6] W. Maas and C. Bishop, Pulsed Neural Networks, MIT Press, 2001.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-817
817
ERS: Evaluating Reputations of Scientific Journals Émilie Samuel and Colin de la Higuera1 Abstract. Current methods for evaluating research are based on counting the number of citations received by publications: the more an article is cited, the more important its impact is considered to be. In this article, we propose a new method for assessing the reputation of scientific journals, based on a Web application that gathers the votes of expert researchers. The voting results indicate degrees of preference for one journal over another. Our system uses, in addition, the publications of an expert in order to quantify his expertise in specific fields. These values are coupled with those of the votes to determine the relevance of each journal in each topic. An iterative process has been implemented that transfers values given to journals by experts to values of the experts themselves, given their publications.
1 Key concepts The system ERS manages bibliographic data formed by journals, researchers and themes. The journals and themes are bound by a relation called relevance. Each journal publishes articles more or less relevant to some research topic. Journals and researchers are directly connected by publications (relation has published). From these we hope to measure the expertise of each researcher in relationship with each theme. Finally, the votes of the researchers, depending on the expertise of the latter, will influence the relevance of the themes for journals. The phenomenon is recursively cyclic: the relevance influences the expertises, which in turn affect the relevances through the votes. The influence of researchers grows with their global confidence. We summarise these relationships through the diagram represented in Figure 1. Entities Experts, Themes and Journals are connected by relations relevance, expertise, has published and vote. Associated attributes are not represented.
1.1 Relevance of a journal for a theme
Each journal is more or less relevant to each theme. The relevance reflects the reputation, for this topic, of the articles published by the journal. Its value can be interpreted as the probability that the community of researchers in the field would advise this journal to someone searching for literature in the given area.
1.2 Global confidence and expertise of a voter in a topic
The computation of the expertise in a theme depends directly on the relevance, for that theme, of the journals in which the expert has published. If, for example, the expert has published several times in journals recognised by the system itself as being relevant to the theme
Laboratoire Hubert Curien - UMR CNRS 5516 - 18 rue Benoˆıt Lauras, ´ 42000 Saint-Etienne - email: {emilie.samuel, cdlh}@univ-st-etienne.fr
Figure 1. The general model: entities Experts, Themes and Journals connected by the relations relevance, expertise, has published and vote.
databases, then the expert will be deemed to be an expert in this area. Thus, the calculation of expertise in each subject, which is based on the journals in which the author has published, depends, for each of them:
• on the number of publications;
• on the sum of all relevances of the journal, which reflects its importance;
• on the likelihood of the theme, given the journal;
• on the belief in the relevance of the theme for this journal.
However, comparisons between different researchers should be avoided. One can, for example, consider that a group of individuals with similar profiles have interests in similar research fields. By contrast, a researcher with an expertise of 10% in information retrieval cannot be considered twice as recognised in this area as a researcher with an expertise of 5%. It may, in fact, be the case that the publications of the first are less diversified than those of the second, which would then generate higher expertise, but in fewer topics.
1.3 Interrogating the experts
For each expert, a list of journals to be evaluated is automatically defined. This list consists of journals in which he has published, and of journals that are judged by the system to be close to his expertise. It can also include journals in which his co-authors have published or journals on which the system has little information. The method of paired comparisons is used, whose application to ranking has been addressed since [2]. This method is intended to indicate a degree of preference, and lets one obtain a partial order by comparing journals two by two. It is then possible, from several partial orders resulting from expert opinions, to establish a total order of all the journals in each theme. Our approach is related to that shown in [3], where the authors propose to build clusters of total orders corresponding to the opinions collected about movies. The expert must answer questions such as 'If you were to choose an article from one of these two journals, which would you choose?'. We call this process between two journals a match. A series of matches (until interruption by the expert) is organised, each match
being randomly drawn, where the journals in which the expert has published have a higher probability to appear. The results of the matches are then analysed following the methodology employed by the Elo classification, used to rank chess players [1]. This classification assigns each player a rating based on his performance in competition. The rating of a player evolves over time with his results. When two players meet, a predicted result for each is calculated, the highest ranked player being supposed to beat his weaker opponent; the greater the difference in rating between the two players, the higher the probability that the better player wins. Following the match between the two players, their ratings are updated according to the following principle: if a player has achieved a better result than expected, it means that he was underestimated, and his rating is therefore increased, and vice versa. A rating can therefore rise or diminish, and the adjustment takes place proportionally to the difference between the true outcome and the presumed outcome.
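A minimal version of the Elo update just described, applied to two journals, might look as follows. The K-factor and function names are illustrative assumptions; the paper does not specify the constants used in ERS.

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """Update the ratings of journals a and b after one 'match'.
    score_a is 1.0 if a was preferred, 0.0 otherwise (0.5 for a tie)."""
    # Predicted outcome for a, from the rating gap (logistic curve).
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    # Adjust proportionally to the gap between true and expected outcome.
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# An underdog journal preferred over a higher-rated one gains a lot:
print(elo_update(1500.0, 1600.0, 1.0))
```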
2 Operational aspects
A beta version of the system ERS has been running since July 2007.2 Its ergonomics and aesthetics are subject to change: we are seeking a more attractive, user-friendly and interactive platform while retaining its ease of use. Initially we used a limited list of 16 themes, to which was added one smaller theme (grammatical inference) for testing purposes. The operationalisation required an initialisation phase, each journal being allocated an initial relevance in each theme. To do this, we chose an initial set of themes, and associated with each theme a list of keywords. For example, words like pattern recognition, classification, or reinforcement can be associated with the theme machine learning. We then computed a frequency (term frequency) for each keyword appearing in the titles of journal articles. Thus, the more a journal publishes articles with these words in their titles, the more its relevance to the corresponding theme increases. The confidence in the relevance is obtained as a computation of the inverse document frequency, which is a function increasing with the specificity of the keywords.
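The initialisation can be sketched as a term-frequency score per (journal, theme) pair, with inverse document frequency as the confidence. The code below is our reading of that description, not the ERS source; the tokenisation and the guard against zero document frequency are simplifying assumptions.

```python
import math
from collections import Counter

def init_relevance(titles_by_journal, keywords_by_theme):
    """Initial relevance(journal, theme) from keyword frequencies in titles."""
    relevance = {}
    all_titles = [t for ts in titles_by_journal.values() for t in ts]
    for journal, titles in titles_by_journal.items():
        words = Counter(w for t in titles for w in t.lower().split())
        total = sum(words.values()) or 1
        for theme, keys in keywords_by_theme.items():
            # Term frequency of the theme's keywords in this journal.
            relevance[(journal, theme)] = sum(words[k] for k in keys) / total
    # Confidence: inverse document frequency of each keyword.
    idf = {}
    for keys in keywords_by_theme.values():
        for k in keys:
            df = sum(1 for t in all_titles if k in t.lower()) or 1
            idf[k] = math.log(len(all_titles) / df)
    return relevance, idf

rel, idf = init_relevance(
    {"JMLR": ["pattern recognition advances", "classification trees"]},
    {"machine learning": ["pattern", "classification", "reinforcement"]})
print(rel, idf)
```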
3 Convergence of the system
The update of the system is done daily, in batch mode. Thus, the data of the system, constituted by relevance, global confidence and expertise, are changing continuously and are recomputed iteratively. The convergence of these values occurs as soon as the relevances remain stable from one iteration to the next. The first phase of the experimental validation of the convergence of relevances consisted in the (random) initialisation and normalisation of the relevances for 570 journals and for 17 themes. In order to constitute a panel of experts, 2000 researchers were then randomly selected from those identified in DBLP. Their global confidence and expertise in each subject were computed according to their publications in journals. Thereafter, a simulation of votes by these experts took place. This consisted of randomly generating 28500 votes, so as to reach an average of 50 per journal. The algorithm was finally run on this repeatedly until convergence of the values of relevance, global confidence and expertise. The convergence results of three experiments respecting this protocol are shown in Figure 2. The variation distance L1 was used to
http://labh-curien.univ-st-etienne.fr/ERS/
measure the difference between the relevances of one iteration and the next. As can be seen, the computation converges in a small number of iterations, each carried out in an average of 2 seconds.
Figure 2. Convergence of the relevances during the batch computations (L1 distance, ordinate, against iterations, abscissa).
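The convergence test amounts to computing the L1 (variation) distance between successive relevance vectors and stopping once it falls below a threshold. A sketch, with a toy update standing in for the actual batch recomputation:

```python
def l1_distance(prev, curr):
    """Variation distance between two relevance assignments (dicts)."""
    keys = prev.keys() | curr.keys()
    return sum(abs(prev.get(k, 0.0) - curr.get(k, 0.0)) for k in keys)

def iterate_until_stable(step, relevance, eps=1e-3, max_iter=50):
    """Repeat the batch update `step` until relevances stop moving."""
    for i in range(max_iter):
        new = step(relevance)
        if l1_distance(relevance, new) < eps:
            return new, i + 1
        relevance = new
    return relevance, max_iter

# Toy update that halves the distance to a fixed point each iteration:
target = {"j1": 0.7, "j2": 0.3}
step = lambda r: {k: (r[k] + target[k]) / 2 for k in r}
print(iterate_until_stable(step, {"j1": 0.0, "j2": 1.0}))
```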
4 Conclusion and perspectives
System ERS permits a different evaluation of scientific journals, directly based on the opinions of scholars. This system, which we hope to render attractive, simple and efficient, offers an assessment protocol for comparing journals two by two. Following the processing of votes, the results indicate, according to the subject, which are the journals best recognised by the community of the area. A number of perspectives are being looked into. In addition to those cited in this article, the first is to work on the currently very reduced list of themes: ideally the list should be dynamic; new communities or sub-communities should be detected by the system, and the corresponding keywords should be automatically computed. The computation of the expertise and confidence of the researchers could involve a more complex analysis, taking into account (again in an automatic way) the date on which articles were published, or other information beyond DBLP obtained by Web mining techniques. The interrogation scenario should also be considered as improvable. Making better use of the results is another possible task: a profile for a journal (as a vector of quantities over themes) can easily be computed, and a similar profile can be computed for a researcher. One can therefore query the system with questions like 'which journal is the closest to my way of doing research?'. In addition, the identification of researchers at registration remains an important point on which further work is necessary. Finally, the evaluation of conferences is a logical evolution of the system, which requires additional attention, as does the even more ambitious task of adapting the system to other fields of research.
Acknowledgements
System ERS was developed in the laboratory Hubert Curien with the help of Fabrice Muhlenbach, Baptiste Jeudy and François Jacquenet. The expertise of Thierry Murgue has solved many engineering problems. Students from the department of computer science at Saint-Étienne have developed, and continue to develop, different modules for the system.
REFERENCES
[1] A. E. Elo, The Rating of Chessplayers, Past and Present, Arco, 1978.
[2] H. Joe, 'Rating systems based on paired comparison models', Statistics & Probability Letters, 11, 343–347, (1991).
[3] A. Ukkonen and H. Mannila, 'Finding outlying items in sets of partial rankings', in Knowledge Discovery in Databases: PKDD 2007, volume 4702 of LNCS, pp. 265–276. Springer, (2007).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-819
819
Personal Experience Acquisition Support from Blogs using Event-Depicting Images Keita SATO1 and Yoko NISHIHARA2 and Wataru SUNAYAMA1 Abstract. Internet users write blogs related to their personal experience, daily news, and so on. We can obtain blogs about personal experience using search engines on the Web. However, the search engines also output blogs about other topics unrelated to personal experience. Therefore, it is necessary for us to read all blogs to obtain those about personal experiences. It takes too much time. This paper proposes a support system for obtaining blogs about personal experiences efficiently. The system extracts three keywords that denote place, object, and action from a blog. The three keywords describe an event that leads a person to write a blog about personal experience. The system expresses the event with three pictures depicting the extracted keywords. The pictures help users to judge whether personal experience is written about in the blog. We experimented with the system, and verified that it supports users in obtaining personal experiences efficiently.
The system extracts events that lead people to write blogs about personal experiences. The system does not need training data for event extraction, and the system can help users to obtain blogs about personal experiences.
1
INTRODUCTION
Internet users write blogs related to their personal experience, daily news, and so on. We can obtain blogs about personal experience using search engines on the Web. However, the search engines also output blogs about other topics. Users need to read all blogs to obtain those about personal experiences, which takes too much time. This paper proposes a support system for obtaining blogs about personal experiences efficiently. The system extracts three keywords from a blog. The keywords express an event related to personal experiences. The system expresses the event using three pictures that depict the extracted keywords. The pictures help users to judge whether personal experience is written about in the blog. Showing images reduces the time needed to understand a blog's contents [4, 5]. Therefore, seeing images requires less time to choose blogs about personal experience than reading blog texts. We define an event as three keywords: a place keyword, an object keyword, and an action keyword. Many studies about information extraction from the Web have been conducted [1]. In the case of extracting information noticed by many people, Glance et al. have proposed a method to extract noticed persons, topic keywords, and topic sentences from blogs [3]. The proposed system also extracts information from the Web; however, we aim to extract information noticed by one person. Blogs about personal experience are reviews posted by Internet users, and review extraction methods from the Web have been studied: [2, 6] separate reviews of commercial items into positive/negative classes to extract characteristic keywords by machine learning. These methods, unlike the proposed system, do not extract blogs about personal experiences.
Hiroshima City University, Japan, email: keita@sys.im.hiroshimacu.ac.jp,sunayama@sys.im.hiroshima-cu.ac.jp The University of Tokyo, Japan, email: nishihara@sys.t.u-tokyo.ac.jp
2 PROPOSED SYSTEM In the proposed system, a user inputs a query related to a personal experience about which the user wants to know. Blogs are downloaded from a blog site using the query. The system chooses blogs with events from those downloaded. The system then separates the chosen blog texts into several blocks. Three keywords are extracted from each block. Then the system sets out three images depicting the extracted keywords and finally outputs the images.
2.1 Blog selection
The system chooses blogs including sentences in the past tense from the downloaded blogs. This is because a sentence is almost always written in the past tense when an event is described in it.
2.2 Blog text separation
A blog is considered to have descriptions about certain places that are different from each other. Therefore, the system separates blog texts into several blocks, where one block has one place keyword. We define that a place keyword is a noun. The system extracts keywords that appear after a preposition often used for a place expression. The prepositions are as follows: at, in, to, for, and so on.
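A rough sketch of this separation step in Python; the preposition list and the word-pair tokenisation are simplified assumptions for illustration, not the paper's implementation.

```python
PLACE_PREPOSITIONS = {"at", "in", "to", "for"}

def split_into_blocks(sentences):
    """Start a new block whenever a place keyword (a noun following a
    place preposition) is found; each block keeps one place keyword."""
    blocks, current, place = [], [], None
    for sentence in sentences:
        words = sentence.rstrip(".").split()
        for prep, noun in zip(words, words[1:]):
            if prep.lower() in PLACE_PREPOSITIONS:
                if current:
                    blocks.append((place, current))
                current, place = [], noun
                break
        current.append(sentence)
    if current:
        blocks.append((place, current))
    return blocks

print(split_into_blocks(["We arrived at Okinawa yesterday.",
                         "The beach was beautiful.",
                         "Then we went to Naha for dinner."]))
```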
2.3 Keyword extraction
Three keywords, related to place, object and action, are extracted from each block. We explain how to extract an object keyword and an action keyword in the following sections.
2.3.1 Object keyword extraction
We define that an object keyword is a noun. If several object keywords are in a sentence, the relation between an object keyword and the extracted place keyword is evaluated using Eq. (1).

relation(p, o) = (hit(p ∧ o) / hit(p)) × (hit(p ∧ o) / hit(o))    (1)
In Eq. (1), o denotes an object keyword and p denotes a place keyword. Eq. (1) calculates the proportion of the number of Web pages in which both keywords are included to the number of Web pages
in which each keyword is included. If the value of Eq. (1) is high, the relation between the keywords is strong. The system extracts a keyword with the highest value of Eq. (1) as an object keyword.
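Equation (1) can be read as the product of two co-occurrence ratios estimated from Web hit counts. A sketch follows; in practice the hit counts would come from a search engine, so here they are simply passed in, and the names are ours.

```python
def relation(hit_p, hit_o, hit_po):
    """relation(p, o) = hit(p and o)/hit(p) * hit(p and o)/hit(o), Eq. (1)."""
    if hit_p == 0 or hit_o == 0:
        return 0.0
    return (hit_po / hit_p) * (hit_po / hit_o)

def best_object(place_hits, pair_hits, object_hits):
    """Pick the object keyword most related to the place keyword."""
    return max(object_hits,
               key=lambda o: relation(place_hits, object_hits[o], pair_hits[o]))

# Hypothetical counts for place "Okinawa" and two candidate objects:
print(best_object(1_000_000,
                  {"beach": 50_000, "snow": 100},
                  {"beach": 400_000, "snow": 90_000}))  # -> "beach"
```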
2.3.2 Action keyword extraction
It is defined that an action keyword is a verb appearing in a sentence where the object keyword has been extracted. This is because an action keyword usually appears near an object keyword.
2.4 Image setting out
The system sets out three images depicting the extracted keywords (Fig. 1). The images are set out transversely (place, object, and action) from left to right. If a blog is divided into several blocks, the system chooses the block in which three keywords are first extracted. The system uses an image database made by the authors for setting out the three images. The database has 1,000 place images, 700 object images, and 200 action images. If there is no image depicting the extracted keyword, the system shows a blank space.
3.1 Experimental Results
Table 1 shows the averages of text-extracted blogs. The averages using the Image System were higher than the averages using the Text System (P<.05). This result means that most of the blogs chosen using the Image System have events that lead Internet users to write blogs about personal experiences. Table 2 shows the proportions of read blogs to text-extracted blogs. Except for Hiroshima, the proportions using the Image System were higher than the proportions using the Text System (P<.05). In the case of Hiroshima, the Image System did not show many images depicting places; therefore, the participants often chose blogs that did not have events. However, for the other queries, the proportions using the Image System were higher than those using the Text System. From these results, it was verified that the Image System helps users to obtain more blogs about personal experiences efficiently.
Table 1. Averages of text-extracted blogs

         Okinawa  Tokyo  Hiroshima  Nigata  Hokkaido  School festival
Image     2.2      3.1     2.9       1.8      3.2          3.4
Text      1.4      2.6     2.6       1.3      2.5          2.1

Table 2. Proportions of read blogs to text-extracted blogs

         Okinawa  Tokyo  Hiroshima  Nigata  Hokkaido  School festival
Image     0.38     0.63    0.57      0.38     0.54         0.63
Text      0.29     0.55    0.60      0.28     0.53         0.45
4 CONCLUSION
This paper proposes a support system for obtaining blogs about personal experiences. The system expresses an event using three images, which depict place, object, and action. We verified that the system helps users to obtain blogs about personal experiences efficiently.
Figure 1. Output of the proposed system: blogs with three images.

3 EXPERIMENT
We experimented with the proposed system (Image System). We asked participants to extract texts about personal experiences from blogs. We used 100 blogs that were downloaded from a blog site and chosen by the Image System. We considered that a system user wants to know about personal experiences that he/she may also experience in the future. Therefore, we used the following six queries: "{Okinawa, Tokyo, Hiroshima, Nigata, and Hokkaido} AND sightseeing," and "school festival AND refreshment shop". Okinawa, Tokyo, Hiroshima, Nigata, and Hokkaido are the names of sightseeing areas in Japan. We prepared another system that shows blog summaries (Text System); the summaries were also shown in the Image System. The 100 blogs were divided into four sets of 25 blogs. Both systems used a web browser whose window size was 1,200 pixels × 1,920 pixels. We instructed participants as follows:
1. Image System: Look at images for each blog. Text System: Read blog summaries for each blog.
2. If you think a blog has events, choose the blog and read it.
3. Extract texts about personal experience from the read blog.
The number of participants was 36. The participants were undergraduate/graduate students majoring in information science. 18 participants were assigned to a set of one query and one system. The time for one set was five minutes; we considered that most people spend about five minutes doing research on a personal experience on the Internet. We compared the number of text-extracted blogs using the Image System and the number using the Text System.
REFERENCES
[1] C.H. Chang, M. Kayed, M.R. Girgis, and K. Shaalan, 'A survey of web information extraction systems', IEEE Transactions on Knowledge and Data Engineering, 18(10), 1411–1428, (2006).
[2] K. Dave, S. Lawrence, and D.M. Pennock, 'Mining the peanut gallery: Opinion extraction and semantic classification of product reviews', in Proc. of the 12th International World Wide Web Conference, 519–528, (2003).
[3] N.S. Glance, M. Hurst, and T. Tomokiyo, 'Blogpulse: Automated trend discovery for weblogs', in WWW2004 Workshop on the Weblogging Ecosystem, (2004).
[4] S. Hulbert, J. Beers, and P. Fowler, 'Motorists' understanding of traffic control devices', AAA Foundation for Traffic Safety, (1979).
[5] M. Pietrucha and R. Knoblauch, 'Motorists' comprehension of regulatory, warning and symbol signs', Technical Report Contract DTFH6183-C-00136, FHWA, 2, (1985).
[6] P.D. Turney, 'Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews', in Proc. of the 40th Annual Meeting of the Association for Computational Linguistics, 417–424, (2002).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-821
821
Object Configuration Reconstruction from Descriptions using Relative and Intrinsic Reference Frames H. Joe Steinhauer 1 Abstract. We provide a technique for reconstructing, into an absolute frame of reference as seen from the survey perspective, an object configuration that has been described on site using only intrinsic and relative frames of reference.
1
In the present work, we assume descriptions consisting of a combination of a route tour [9, 6] and several embedded gaze tours [9, 6], one for each step within the route tour. Furthermore, a momentarily applied absolute frame of reference is used in addition to each gaze tour to report the relative positions between the objects, seen from a particular viewpoint. An activity diagram illustrating the route tour is presented in figure 2. All three description types use on of the two rectangular frame of reference, shown in figure 1. The frame of reference in figure 1a) is similar to many current approaches to qualitative reasoning about orientation. See for example [8, 3, 7, 4, 1]. Considering that people in way-finding or route-description tasks usually distinguish between eight direction classes [5], the eight directions right back (rb), right neutral (rn), right front (rf ), straight front (sf ), left front (lf ), left neutral (ln), and left back (lb) shall be used as possible object orientations. The first time a frame of reference is used during the route tour, it automatically sets a corresponding projection-based global frame of reference that captures the concept of a representation of position in latitude and longitude [2]. 1
LN
RN
LB
LF
SF
RF
Motivation
You probably recognize the following scenario: You invited a friend to your house and gave him extraordinary directions to easily find the place. However, your friend suddenly phones you and tells you that he ’somehow’ cannot find your house. Furthermore, he even had lost his orientation completely. Now it is your task (because you are supposed to know the area) to figure out where he is and guide him from there to your house. You probably let him describe the objects he sees around him and you try to match the described to some mental or external map of the suspected area. Having a survey perspective of his mental or external reconstruction, you (the listener) needs to translate all relative object relationships that your friend (the observer) provides into the global frame of reference. The observer is therefore encouraged to produce an object configuration description that contains the information needed to recognize the object configuration from a survey perspective.
2
RF
Department of Computer and Information Science, Link¨oping University, Sweden, email: joest@ida.liu.se
LB
SB
RB
SB
RB
RN
b)
a)
Figure 1. The frames of reference used.
As described in [9] people are able to change perspectives during a task. Further they are often willing to accept a higher cognitive load if they feel that this may alleviate the cognitive load for their communication partners. Therefore we ask the observer to switch between both reference frames in the continuation of the description process.
3
Reconstruction
Assuming (for readability reasons) that the listener uses the terms north, northeast, east, southeast, south, southwest, west, and northwest as global directions in the reconstruction, he may choose the orientation north for object 1. Accordingly, the other relationships are translated. The information in which direction the observer moved enables the listener to follow the angle that the applied frame of reference has in relation to the underlying global frame of reference. For a smooth reconstruction, it is advantageous if the description is sorted. The position of an object is only described in relation to objects that have been mentioned within the description before. Incorporating an additional object into a configuration is done as follows. In the order the relationships of this object are given to other objects that already are reconstructed, the area of the new object is calculated by intersecting all the qualitative regions of the new object to all its reference objects. For instance is the estimated region for object 5 in figure 3a) the intersection of the regions north 1, northeast 2, northwest 3, and northwest 4 (printed in grey). Sometimes space has to be made between some already placed objects, for instance when the new object happens to be ’in the middle’ of them. For instance consider to insert object 8 into the configuration shown in figure 3b) using the relationships (8 southeast 5), (8 southwest 7), and (8 northeast 1). The intersection of the regions southeast 5, southwest 7 and northeast 1 contains no space. We can solve this problem by dividing all objects in the reconstruction in two groups,
822
H.J. Steinhauer / Object Configuration Reconstruction from Descriptions Using Relative and Intrinsic Reference Frames
or northeast of the moved object and are not moved, the object is southwest, south, or southeast of each of them. These regions are infinite to the south and the object will never leave them by moving southwards. All other objects are moved in the same way as the object itself and therefore its relationships to these objects does not change. Figure 3d) presents the result where object 8 has been inserted into the new obtained space. The procedure to obtain space in the horizontal dimension works accordingly. An object’s orientation is given by an arrow pointing in the object’s front edge or front corner. The representation of all objects aligned with the underlying global frame of reference allows the listener to draw objects into the reconstruction, whose orientation is unknown and to add the orientation later without need to redraw the object, or to change its frame of reference. Furthermore, it is necessary to apply the described reconstruction procedure.
move to next object
object inherits observer’s orientation
[first object]
[not first object] [moving direction rf, lf, rb, lb]
[moving direction sf, sb, ln, rn] establish underlying global FoR
change FoR type
4 build OCD by gaze tour
build OCD using momentarily absolute FoR
[unvisited objects left]
[all objects visited]
Figure 2. The route tour process.
1
4
1
4
6
3
b)
a)
1
2
5
7
5
4
1
3
4
4
Summary
We provide a technique to reconstruct, into an absolute frame of reference as seen from the survey perspective, an object configuration that has been described on site using only intrinsic and relative frames of reference. A set of eight basic relations is sufficient to describe eight positional object relations and allows for eight object orientations. On the one hand, the use of eight orientation classes seems natural for people; on the other hand, the use of eight orientation classes (as opposed to, for instance, four orientation classes) adds a higher cognitive load to the description process by making it necessary for the observer to switch between two different types of frame of reference. Decisions had to be made to what extent to manufacture an easy reconstruction process and to what extent to be responsive to psychological results of typical human behavior in object configuration description. Both components are important in order to develop a representation scheme that is usable by a person on each side of the process. Nevertheless, these two aims are conflicting. However, Tversky et al. [9] experienced that people accommodate the acceptable amount of inconvenience according to the cognitive load that the task requires of their communication partners. Therefore, it seems reasonable to balance the effort on both sides.
REFERENCES
[1] J. Fernyhough, A. G. Cohn, and D. C. Hogg, 'Constructing qualitative event models automatically from video input', in Image and Vision Computing, volume 18, pp. 81–103, (2000).
[2] A. U. Frank, 'Qualitative spatial reasoning with cardinal directions', in Seventh Austrian Conference on Artificial Intelligence, ed., H. Kaindl, Informatik Fachberichte, pp. 157–167, Wien, Austria, (September 1991).
[3] C. Freksa, 'Temporal reasoning based on semi-intervals', in Artificial Intelligence, volume 54, pp. 199–227, (1992).
[4] R. K. Goyal and M. J. Egenhofer, 'Similarity of cardinal directions', in SSTD 2001, ed., C. S. Jensen, LNCS 2121, pp. 36–55. Springer-Verlag Berlin Heidelberg, (2001).
[5] Alexander Klippel, 'Wayfinding choremes', in Spatial Information Theory: Foundations of Geographic Information Science, eds., W. Kuhn, M. F. Worboys, and S. Timpf, pp. 320–334, Berlin, (2003). Springer.
[6] Willem J. M. Levelt, Speech, Place and Action, chapter Cognitive Styles in the Use of Spatial Direction Terms, 251–268, John Wiley & Sons Ltd., 1982.
[7] Gerard Ligozat, 'Reasoning about cardinal directions', Journal of Visual Languages and Computing, 9(1), 23–44, (1998).
[8] A. Mukerjee and G. Joe, 'A qualitative model for space', in Proceedings of the AAAI, pp. 721–727, Boston, (1990).
[9] Barbara Tversky, Paul Lee, and Scott Mainwaring, 'Why do speakers mix perspectives?', Spatial Cognition and Computation, 1, 399–412, (1999).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-823
Probabilistic Reinforcement Rules for Item-Based Recommender Systems
Sylvain Castagnos and Armelle Brun and Anne Boyer1
Abstract. The Internet is constantly growing, proposing more and more services and sources of information. Modeling personal preferences enables recommender systems to identify relevant subsets of items. These systems often rely on filtering techniques based on symbolic or numerical approaches in a stochastic context. In this paper, we focus on item-based collaborative filtering (CF) techniques. We propose a new approach combining a classical CF algorithm with a reinforcement model to achieve better accuracy. We deal with this issue by exploiting probabilistic skewnesses in triplets of items.
1
INTRODUCTION
This paper focuses on recommender systems based on collaborative filtering (CF) techniques. CF algorithms provide personalization by exploiting the knowledge of a similar population and predicting the future interests of a given user (called the "active user") with regard to his/her known preferences. In practical terms, this kind of algorithm is broken down into three parts. Firstly, the system needs to collect data about all users in the form of explicit and/or implicit ratings. Secondly, this data is used to infer predictions, that is to say, to estimate the votes that the active user would have assigned to unrated items. Finally, the recommender system suggests to the active user the items with the highest estimated values. As the highest prediction values are the only ones of interest, we propose a new model that focuses on the prediction of high values, to improve accuracy. As the error on these values may be significant with a usual item-based CF algorithm, we propose to re-evaluate them using reinforcement rules. The latter are automatically inferred by selecting triplets of items in the dataset according to their joint probabilities. After a short state of the art, we propose a model combining a Classical Item-Based Algorithm (CIBA) with reinforcement rules. We call it the "Reinforced Item-Based Algorithm" (RIBA).
2
RELATED WORK
2.1
Notations
To help the reader, we introduce the following notations:
• U = {u1, u2, ..., un} is the set of the n users;
• I = {i1, i2, ..., im} is the set of the m items;
• Uk refers to the set of users who have rated the item ik;
• Ia is the list of items rated by the active user ua;
• v(j, k) is the vote of the user uj on the item ik;
• vmin and vmax are respectively the minimum and maximum values on the rating scale;
• vl and vd are the thresholds for liked and disliked items;
• īk is the average of all users' ratings on ik;
• s(k, t) is the similarity measure between ik and it;
• p(a, k) is the prediction of ua for item ik;
• pr(a, k) is the prediction of ua for ik with reinforcement rules.

1 LORIA - University Nancy 2, email: {sylvain.castagnos, armelle.brun, anne.boyer}@loria.fr
2.2
Classical Item-Based Algorithm
To supply the active user with information that is relevant to his/her concerns, the system first builds his/her profile in the form of a vector of item ratings. Profiles of all users are then aggregated in a user-item rating matrix, where each line corresponds to a user and each column to an item. Item-based CF is based on the observation that the consultation of a given item often leads to the consultation of another one [4]. To translate this idea, the system builds a model that computes the relationships between items. Most of the time, the model is generated by transforming the user-item matrix into an item-item matrix. This conversion requires the computation of similarities between items (i.e. columns of the user-item rating matrix). The active user's predictions are then computed by taking into account his/her known ratings, and the similarities between the rated items and the unrated ones. In this paper, we propose a model that can be plugged into an item-based collaborative filtering algorithm in order to refine some predictions. In this subsection, we present the Classical Item-Based Algorithm (CIBA) used as a base for our model. When implementing an item-based CF algorithm, the designer has to choose a pairwise similarity metric and a prediction formula. We decided to use the Pearson correlation coefficient, as the literature shows this similarity metric works better [4]. Consequently, we fill the item-item similarity matrix by applying equation 1 for each pair of items:

$$s(k,t) = \frac{\sum_{u_j \in U_k \cap U_t} (v(u_j, i_k) - \bar{i}_k)(v(u_j, i_t) - \bar{i}_t)}{\sqrt{\sum_{u_j \in U_k \cap U_t} (v(u_j, i_k) - \bar{i}_k)^2} \sqrt{\sum_{u_j \in U_k \cap U_t} (v(u_j, i_t) - \bar{i}_t)^2}} \quad (1)$$

We also compared different prediction formulas [2, 3]. We chose to adapt the weighted sum of the deviation from the mean, usually used in the user-based framework, to an item-based context (cf. formula 2). This formula leads to the highest accuracy:

$$p(a,k) = \max\left(v_{min},\; \min\left(\bar{i}_k + \frac{\sum_{i_t \in I_a} s(k,t) \times (v(a,t) - \bar{i}_t)}{\sum_{i_t \in I_a} |s(k,t)|},\; v_{max}\right)\right) \quad (2)$$
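As an illustration of equations 1 and 2, the following sketch computes the item means, the item-item Pearson similarities over co-raters, and the clamped prediction on a dense numpy rating matrix in which 0 encodes a missing vote. All names and the co-rater convention in the denominator sums are illustrative assumptions, not part of the original implementation.

import numpy as np

def item_means(R):
    """Mean rating of each item; R is (users x items), 0 = missing vote."""
    counts = np.maximum((R > 0).sum(axis=0), 1)
    return R.sum(axis=0) / counts

def similarity(R, means, k, t):
    """Pearson similarity s(k, t) over the co-raters U_k ∩ U_t (equation 1)."""
    both = (R[:, k] > 0) & (R[:, t] > 0)
    dk, dt = R[both, k] - means[k], R[both, t] - means[t]
    denom = np.sqrt((dk ** 2).sum() * (dt ** 2).sum())
    return float(dk @ dt / denom) if denom > 0 else 0.0

def predict(R, S, means, a, k, v_min=1, v_max=5):
    """Clamped prediction p(a, k) from equation 2."""
    I_a = np.flatnonzero(R[a] > 0)              # items rated by the active user
    den = np.abs(S[k, I_a]).sum()
    dev = (S[k, I_a] * (R[a, I_a] - means[I_a])).sum() / den if den > 0 else 0.0
    return float(max(v_min, min(means[k] + dev, v_max)))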
3
REINFORCED ITEM-BASED ALGORITHM
Our model, called the "Reinforced Item-Based Algorithm" (RIBA), is a combination of the Classical Item-Based Algorithm (CIBA) and probabilistic association rules that reinforce some predictions. This section describes how these two approaches are combined.
3.1
Probabilistic Reinforcement Rules
In standard item-based CF algorithms, similarities are computed between each neighbor item and the target item. We argue that, in some cases, pairwise similarities may be insufficient to explain the interest of a user in an item. We propose here to evaluate similarities of triplets, rather than pairs of items, before the prediction phase. A triplet is an association rule whose premise is made up of two terms. The conclusion is the reinforced item. To illustrate this statement, we can consider three items ik = "Cinderella", it = "Scary Movie", and iw = "Shrek". A user may have liked ik, which is a fairy tale, without appreciating iw. At the same time, a user who enjoys the horror film parody it would probably give iw a low rating. However, a film goer who likes both fairy tales and parodies will have fun watching Shrek. Let us introduce the following additional notations:
• Ik denotes the fact of liking ik, i.e. when v(j, k) ≥ vl;
• Īk is the fact of disliking ik, i.e. when v(j, k) ≤ vd;
• Ïk when ik has not been rated (by convention, the vote is equal to 0 in this case);
• Ĭk when ik has been rated (the vote is between vmin and vmax);
• P(Ik, It, Iw) is the probability of liking the three items ik, it, and iw;
• P(Ik, It | Ïw) is the probability of liking ik and it for users who have not rated iw;
• N(Ik, It, Ïw) is the number of users who have liked ik and it, and not rated iw.
Then a rule <Ik, It> ⇒ Iw means that Ik alone does not explain Iw, It alone does not explain Iw, but <Ik, It> together explain Iw. Let us note that 3 items can lead to up to 8 reinforcement rules, depending on whether each item appears positively or negatively, such as <Ik, It> ⇒ Iw or <Īk, It> ⇒ Īw.
3.2
Determination of the reinforcement rules
A triplet <ik, it, iw> is a candidate to be a reinforcement rule <Ik, It> ⇒ Iw if the similarities between each pair of its items are around the mean similarity. In that case, the resulting reinforcement rule could accurately impact Iw. Thus a triplet is a candidate if the following constraints are satisfied:

0 < tmin ≤ |s(k, t)| ≤ tmax < 1   (3)
0 < tmin ≤ |s(k, w)| ≤ tmax < 1   (4)
0 < tmin ≤ |s(t, w)| ≤ tmax < 1   (5)
where tmin and tmax respectively refer to the minimum and maximum similarity thresholds that will be set experimentally. For each reinforcement rule candidate, we compute the probability of the corresponding triplet. Thus, for each triplet <ik, it, iw>, we compute the joint probabilities P(Ik, It, Iw), P(Ik, Iw | Ït), and P(It, Iw | Ïk):

$$P(I_k, I_t, I_w) = \frac{N(I_k, I_t, I_w)}{N(\breve{I}_k, \breve{I}_t, \breve{I}_w)} \quad (6)$$

$$P(I_k, I_w \mid \ddot{I}_t) = \frac{N(I_k, \ddot{I}_t, I_w)}{N(\breve{I}_k, \ddot{I}_t, \breve{I}_w)} \quad (7)$$
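A minimal sketch of the candidate filtering (constraints 3-5) and of the probability computation (equations 6-7) is given below for the all-positive rule <Ik, It> ⇒ Iw; the seven signed variants are analogous. The `margin` factor, which operationalizes the "significantly higher" requirement of the selection conditions that follow (equations 8-9), is our assumption.

import numpy as np
from itertools import permutations

def mine_rules(R, S, v_l, t_min, t_max, margin=2.0):
    """R: (users x items) rating matrix with 0 = unrated; S: similarities."""
    rated = R > 0                    # Ĭ: the item has been rated
    liked = (R >= v_l) & rated       # I: the item is liked
    rules = []
    for k, t, w in permutations(range(R.shape[1]), 3):
        # Constraints (3)-(5): all pairwise similarities around the mean.
        if not all(t_min <= abs(S[i, j]) <= t_max
                   for i, j in ((k, t), (k, w), (t, w))):
            continue
        n_all = (rated[:, k] & rated[:, t] & rated[:, w]).sum()
        n_no_t = (rated[:, k] & ~rated[:, t] & rated[:, w]).sum()
        n_no_k = (~rated[:, k] & rated[:, t] & rated[:, w]).sum()
        if min(n_all, n_no_t, n_no_k) == 0:
            continue
        # Equations (6)-(7): joint probabilities of the triplet.
        p_ktw = (liked[:, k] & liked[:, t] & liked[:, w]).sum() / n_all
        p_kw = (liked[:, k] & ~rated[:, t] & liked[:, w]).sum() / n_no_t
        p_tw = (~rated[:, k] & liked[:, t] & liked[:, w]).sum() / n_no_k
        # Selection conditions (8)-(9), with `margin` as the cutoff.
        if p_ktw > margin * p_kw and p_ktw > margin * p_tw:
            rules.append(((k, t), w))
    return rules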
If this probability is significantly higher than the probability of each pair of its items, then this triplet is selected as a reinforcement rule. The reinforcement rule <Ik, It> ⇒ Iw is then generated when the following conditions are fulfilled:

$$P(I_k, I_t, I_w) \gg P(I_k, I_w \mid \ddot{I}_t) \quad (8)$$

$$P(I_k, I_t, I_w) \gg P(I_t, I_w \mid \ddot{I}_k) \quad (9)$$

3.3
Rating Refining Process
The generated reinforcement rules allow us to refine some predictions. For each prediction p(a, k), a rule is applicable if ik corresponds to the item in the conclusion and if the premises are valid. Each applicable rule associated with p(a, k) is assigned a weight w(r, a, k). This weight is equal to 1 when the conclusion of the rule is Ik, and w(r, a, k) = -1 if the conclusion of the rule is Īk. We call ARa,k the set of rules that can be applied for the computation of the prediction p(a, k). We refine the vote with the following equation:

$$pr(a,k) = p(a,k) + coef \times \frac{\sum_{r \in AR_{a,k}} w(r,a,k)}{\sum_{r \in AR_{a,k}} |w(r,a,k)|} \quad (10)$$

"coef" is the coefficient of refinement. The greater this coefficient is, the more important the refinement will be.
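Equation 10 reduces to a few lines of code; the sketch below assumes the applicable rules are already given as their weights w(r, a, k).

def refine(p_ak, weights, coef=0.5):
    """pr(a, k) from equation 10; `weights` are w(r, a, k) for r in AR_{a,k}:
    +1 for rules concluding I_k, -1 for rules concluding its negation."""
    if not weights:
        return p_ak
    return p_ak + coef * sum(weights) / sum(abs(w) for w in weights)

# e.g. two rules pushing towards "like", one towards "dislike":
# refine(3.4, [+1, +1, -1], coef=0.9) == 3.4 + 0.9 * (1 / 3) == 3.7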
4
CONCLUSION
In order to increase the quality of suggestions in recommender systems, we proposed a new approach combining an item-based collaborative filtering model with reinforcement rules. These rules are generated automatically by analyzing joint probabilities in triplets, and allow us to refine predictions of items for which pairwise similarities are not sufficient. The experiments show that this approach significantly improves the accuracy of high predictions. We validated our model using the MovieLens dataset (http://www.movielens.org/) and obtained an improvement of 6 to 8% on the High MAE measure [1].
REFERENCES
[1] Linas Baltrunas and Francesco Ricci, 'Dynamic item weighting and selection for collaborative filtering', in Workshop PriCKL07, in conjunction with the 18th European Conference on Machine Learning (ECML) and the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Warsaw, Poland, (September 2007).
[2] Sylvain Castagnos and Anne Boyer, 'A client/server user-based collaborative filtering algorithm: Model and implementation', in 4th Prestigious Applications of Intelligent Systems special section (PAIS 2006), in conjunction with the European Conference on Artificial Intelligence (ECAI 2006), Riva del Garda, Italy, (August 2006).
[3] Bradley N. Miller, Joseph A. Konstan, and John Riedl, 'Pocketlens: Toward a personal recommender system', in ACM Transactions on Information Systems, volume 22, pp. 437-476, (July 2004).
[4] Badrul M. Sarwar, George Karypis, Joseph A. Konstan, and John Riedl, 'Item-based collaborative filtering recommendation algorithms', in World Wide Web, pp. 285-295, (2001).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-825
An Efficient Behavior Classifier based on Distributions of Relevant Events Jose Antonio Iglesias and Agapito Ledezma and Araceli Sanchis1 and Gal Kaminka2 1
Introduction
Recognizing the behavior of others is a significant aspect of many different human tasks. In order to make a good decision, humans usually try to predict the behavior of others. We present an approach for automatically creating a model of the behavior of agents (software agents, robots or humans). Because sequence learning is a common form of human and animal learning, the observations of an agent are transformed into a sequence of atomic behaviors, which is statistically analyzed to find its corresponding behavior model. Before a behavior can be recognized, it needs to be modeled. Different techniques have been used for agent modeling in different areas: opponent modeling in soccer simulation [6], intelligent user interfaces [7], and virtual environments for training [8]. However, although much research focuses on agent modeling in a specific environment, it is not clear that these techniques can be used in other environments. The aim of this research is to provide a general framework which can represent and classify different agent behaviors in a wide range of domains. Also, as the actions performed by an agent are usually influenced by his past experiences, automated sequence learning is used for behavior classification.
2
ABCD: Agent Behavior Classifier based on Distributions of relevant events
Any behavior has a sequential aspect and this sequentiality should be considered in the modeling process. Our approach classifies an observed agent behavior into the classes (behaviors) stored previously in a library. Therefore, this process is divided into the following two parts:
2.1
Construction of Behavior Models
1. Obtaining Atomic Behavior Sequences: Useful features are extracted from the stream of observations of the environment and an ordered sequence of events is obtained. An event is an atomic behavior that occurs during a particular interval of time and defines a specific agent act. The type of events is domain-dependent.
2. Creating the behavior model: The temporal dependencies are very significant and, to get the most representative set of sequential events (subsequences) from the acquired sequence, the trie data structure [2] is used as in [3, 4]. The construction of a trie from a single sequence of events proceeds in three steps:
a) Segmentation of the sequence: This segmentation can be done by using some environment characteristic that separates the sequence into several subsequences of uninterrupted events, or by obtaining every possible ordered subsequence of a defined length.
b) Storage of the subsequences in a trie: The subsequences of events are stored in a trie, in which every node represents an event, and the node's children represent the events that have appeared following this event. Each node keeps track of the number of times an event has been inserted into it. The subsequence suffixes (subsequences that extend to the end of the sequence) are also inserted.
c) Creation of the behavior model: The trie is traversed to calculate the relevance of each subsequence. For this purpose, frequency-based methods are used and the relative frequency or support of a subsequence is calculated. Then, an agent behavior model is represented by the distribution of its subsequences.
3. Storing the model in the Library: Once a behavior model (distribution of relevant subsequences) is created, it is stored in the Library of Behavior Models (LibBM) (similar to the plan libraries used in plan recognition). This model is stored (with an identification name) as a trie for good and effective handling (Figure 1a).
1 Carlos III University of Madrid, Spain, {jiglesia, ledezma, masm}@inf.uc3m.es
2 Bar-Ilan University, Israel, galk@cs.biu.ac.il
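The following sketch illustrates steps a)-c) on a plain Python trie: sliding windows of a fixed length are taken from the event sequence, each window and its suffixes are inserted with counts, and the support distribution is read off the trie. Normalizing each count by the total count is our assumption about how the support is computed.

class TrieNode:
    """Trie node; `count` tracks how often its event was inserted here."""
    def __init__(self):
        self.count = 0
        self.children = {}

def insert_subsequences(root, events, length):
    """Steps a) and b): segment `events` into windows of `length` and store
    each window together with its suffixes in the trie."""
    for i in range(len(events) - length + 1):
        window = events[i:i + length]
        for j in range(length):                 # the window and its suffixes
            node = root
            for event in window[j:]:
                node = node.children.setdefault(event, TrieNode())
                node.count += 1

def support_distribution(root):
    """Step c): relative frequency (support) of every stored subsequence."""
    counts, stack = {}, [(root, ())]
    while stack:
        node, prefix = stack.pop()
        for event, child in node.children.items():
            seq = prefix + (event,)
            counts[seq] = child.count
            stack.append((child, seq))
    total = sum(counts.values()) or 1
    return {seq: c / total for seq, c in counts.items()}

# e.g. a UNIX command stream:
# root = TrieNode()
# insert_subsequences(root, ["ls", "cd", "ls", "vi", "ls", "cd"], 3)
# model = support_distribution(root)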
Figure 1. Agent Behavior Classification Process

2.2
Behavior Classification
The observations of the agent to classify are collected and the corresponding behavior model (represented by a distribution of events) is created. Then, it is matched against all the behavior models stored in LibBM. As both models are represented by a distribution of events, a statistical test is applied for matching these distributions. The proposed non-parametric test applied for matching two behaviors is a modification of the Chi-Square Test for two samples. The behavior model to classify is considered as an observed sample and all the behavior models stored in LibBM are considered as expected
samples. This test compares the observed distribution with all the expected distributions objectively and evaluates whether a deviation appears. The proposed test is the comparison of two sets of support values, in which the Chi-Square statistic is the sum of the terms (Exp - Obs)²/Obs (Figure 1b). With this comparison, a value (the comparing value) that indicates the difference (deviation) between the two distributions is obtained. The lower the value, the closer the similarity between the two behaviors. This comparison test is applied once for each behavior model stored in LibBM. The model which obtains the lowest deviation is considered the most similar one. An advantage of the proposed test is its speed, because only the observed subsequences are evaluated. However, there is no penalty for the expected relevant subsequences which do not appear in the observed distribution.
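A sketch of this matching step, assuming behavior models are dictionaries mapping subsequences to support values; the smoothing value used for expected subsequences that are missing is our assumption.

def comparing_value(observed, expected, eps=1e-6):
    """Modified two-sample Chi-Square: sum of (Exp - Obs)^2 / Obs over the
    observed subsequences only; `eps` replaces missing expected supports."""
    return sum((expected.get(seq, eps) - obs) ** 2 / obs
               for seq, obs in observed.items() if obs > 0)

def classify(observed, library):
    """Return the LibBM entry with the lowest deviation (closest behavior)."""
    return min(library, key=lambda name: comparing_value(observed, library[name]))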
3
Experiments
3.1
UNIX User Classification
In this domain, the behavior of a user is represented by the sequence of UNIX commands he/she typed during a period of time. We use 9 sets of preprocessed user data drawn from the command histories of 9 UNIX computer users [1]. Each UNIX user file is divided into: 1. Training Files: created with a small and random part of consecutive commands (100, 250, 500 and 850 commands) taken from the corresponding user file, creating 4 different LibBMs. These results are calculated using subsequences of size 6. 2. Testing Files: obtained from the other part of each given user file. 20 testing files with different amounts of commands (from 15 to 35) are evaluated. To evaluate the results, a value (the Classification Result Value) is calculated from the ranking list obtained for each classification. If the classification is done correctly, this value is the (positive) difference between the lowest and the second lowest value. If the classification is done incorrectly, to evaluate how far the obtained result is from the correct one, this value is calculated by comparing the lowest value with the obtained value (obtaining a negative value).
3.2
RoboCup Soccer Coach Simulation
The goal in this domain is to observe a game and recognize the behavior models (previously analyzed and stored in LibBM) followed by the opponent team members. For these experiments, we have used the rules from the RoboCup 2006 Coach Competition. The construction of models is done considering only the behavior followed by a few players (player behavior). However, the behavior to classify is the sum of several player behaviors (team behavior). The construction of models is done by analyzing several game log files (training files) in which different player behaviors are activated. The procedure to identify high-level events in a soccer game described by Kuhlmann et al. [5] is used. Then, a new game in which several player behaviors are activated at the same time (team behavior) is observed, and the activated player behaviors must be recognized. In these experiments, 17 player behaviors are analyzed (downloaded from the RoboCup 2006 Coach Competition web page) and stored in LibBM. The ranking list obtained (with the most likely player behaviors) is evaluated. Table 1 shows the first 10 elements of the ranking lists obtained for the 3 iterations of the first round. The number of player behaviors activated in each iteration is indicated in square brackets. The player behaviors are identified with a number (from 00 to 16) and the activated player behaviors are marked with an asterisk.
Table 1. Results for the RoboCup Coach Competition, Round 1

Iteration   Ranking list reported (most likely player behaviors)
Iter1 [4]   04(*), 16, 00(*), 12, 15(*), 03, 09, 05, 01, 06
Iter2 [5]   16(*), 01(*), 00, 13(*), 05, 09, 07(*), 03, 10, 08(*)
Iter3 [5]   04(*), 02(*), 13, 05, 12, 00(*), 01, 06(*), 03, 10

4
Conclusions and Future Works
A general approach which can represent and handle different behaviors in a wide range of domains is provided; it generalizes to any behavior that can be represented as a sequence of events. The experiments show that a system based on ABCD is very effective for classifying a UNIX user. For areas such as computer intrusion detection, these results are very encouraging. In the real-time and multi-agent domain, the results depend on the kind of behavior to recognize; however, the obtained results are satisfactory. As many agents change their behavior and their preferences over time, their models should be frequently revised to keep them up to date. This aspect could be addressed by using Evolving Systems. Also, the use of the classification results for carrying out effective actions in the environment is considered in our future work3.
Figure 2. Classification Results - User 5
Figure 2 shows the classification results of 20 different command sequences of a UNIX user. X-axis: length of the sequence to classify (from 15 to 35 commands). Y-axis: classification result value obtained by applying ABCD. The 4 lines show the results obtained using 4 different sizes of training files to create the tries of the LibBM: 100, 250, 500 and 850 commands. Each graph point is the average value of 25 different tests conducted. Although this average indicates that a sequence is correctly classified in most of the tests, the classification of the 25 tests is not always correct. The percentages of the 25 tests correctly classified using testing files of 20 commands are shown in Figure 2.
REFERENCES
[1] C.L. Blake, D.J. Newman, S. Hettich and C.J. Merz. UCI repository of machine learning databases, 1998.
[2] E. Fredkin, 'Trie memory', Comm. A.C.M., 3(9), 490-499, (1960).
[3] José Antonio Iglesias, Agapito Ledezma, and Araceli Sanchis, 'A comparing method of two team behaviours in the simulation coach competition', in MDAI, volume 3885 of LNCS, pp. 117-128. Springer, (2006).
[4] Jose Antonio Iglesias, Agapito Ledezma, and Araceli Sanchis, 'Sequence classification using statistical pattern recognition', in IDA, pp. 207-218, (2007).
[5] Gregory Kuhlmann, Peter Stone, and Justin Lallinger, 'UT Austin Villa 2003 simulator online coach team', in RoboCup 2003, (2004).
[6] Agapito Ledezma, Ricardo Aler, Araceli Sanchis, and Daniel Borrajo, 'Predicting opponent actions by observation', in RoboCup, (2004).
[7] Neal Lesh, Charles Rich, and Candace L. Sidner, 'Using plan recognition in human-computer collaboration', in UM99, pp. 23-32, (1999).
[8] M. Tambe and P. S. Rosenbloom, 'Resc: An approach for dynamic, real-time agent tracking', in IJCAI-95, Montreal, Canada, (1995).
Acknowledgments. This work has been supported by the Spanish Ministry of Education and Science under project TRA-2007-67374-C02-02.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-827
ContextAggregator: A heuristic-based approach for automated feature construction and selection
Robert Lokaiczyk and Manuel Goertz1
Abstract. Our research goal is to work towards personal context-aware assistance and the retrieval of relevant resources for computer users during a certain work task. This paper presents a general-purpose, algorithmic approach for automated context aggregation by heuristic-based feature construction. Our implementation of the context reasoning layer combines lower-level context features into new, aggregated higher-level context features. Our approach allows – in contrast to most other approaches – an automated feature combination to achieve a high prediction accuracy of the user's work task.
Introduction Recent work in personal information and knowledge management systems often focuses on context awareness [5] and task orientation [3, 9]. Which, given a determined work task, provide the user with suitable learning resources relevant for the current learning need in the current work task. A crucial factor for fulfilling the vision of in-place and in-time e-learning systems is the user’s context. Taking the user’s current task into account the systems are able to provide adaptive assistance and learning resources. Forms of workplaceintegrated learning support might be displaying a list of task-relevant documents in an enterprise environment. In [7] and [3] resources are determined by querying the (pre-modelled) semantic network given a description of the current work task. Consequently, our goal should be to determine the current work task of the user automatically only by means of available context information on the desktop and not by manual input by the user.
1
Context
We focus on knowledge-intensive work on the desktop of the computer worker. Therefore, we define Desktop Context – in accordance with [1] – as all measurable environmental settings that surround the user's desktop work. Technically, these settings are monitored by desktop context sensors that collect system events and user interaction with the workbench. The context sensors are implemented as software hooks that operate at operating-system level and log the data continuously. Thereby, the layer of context elicitation is completely transparent and unobtrusive to the user. The collected context events are encoded in a data stream which can be used as a feature stream for further processing. Whole tasks can be seen as slices of the event stream consisting of typical events correlating with a certain work task. The sequence of context events reflects the user's actions during the work process. Context events include keystrokes, application launches, the full text of
SAP Research, Darmstadt, Germany, email: firstname.lastname@sap.com
documents etc. Based on the user’s context information it is possible to predict the user’s work task.
2
Approach
The basic approach of treating the problem of task detection as a machine learning classification task is shown in [6]. Consequently, we only briefly summarize the key idea. First, a reasonable amount of training data is acquired by manual annotation of the work task by the user during his work process. The user selects from a limited set of tasks which are pre-modeled and typical for the work process within the involved organization. The selected task is annotated to the collected training material of work streams recorded with the context monitor. The task prediction algorithm, based on the learned model, automatically classifies the active tasks using continuously recorded event streams. Whenever the classifier detects a change in the user's work task, a new retrieval of task-relevant resources is triggered and our personal information assistant displays a new list of associated learning resources.
2.1
ContextAggregator Algorithm
This paper presents the idea of unsupervised context aggregation. Until now, most approaches to aggregating desktop events into more complex, meaningful units are manually handled by the user or previously modeled by domain experts. We differ by providing an unsupervised algorithm for context aggregation that takes the user out of the loop and does not depend on domain-specific knowledge. The fundamental idea is to combine desktop events into new events that are potentially more valuable features for work task prediction. Thereby, the mutual correlation between features is taken into account to increase the information gain for the prediction.
2.2
Aggregation Functions
The idea of the aggregation functions is basically to build predicates on new combinations of features that are considered potentially more valuable features. As a measurement of the impact, we use information gain [8], a common feature relevance measure from the data mining area. We propose an algorithm and a set of combination functions that appear to be very promising for our particular context aggregation problem. For combining features, we use an extensive set of functions that map a number of features (n) to a new feature (see equation 1):

$$f_i : F^n \rightarrow F \quad (1)$$

For our experiments, the set of functions used already turns out to deliver good results. But the extensibility of the algorithm with more
specific aggregation functions is definitely an advantage in order to receive even better results with domain-specific mapping functions.
2.3
Heuristic
To reduce both the computational complexity and the memory requirements, we apply some heuristic rules that prefer certain feature combinations and reject others. In particular, we use the following heuristics to reduce the complexity of context aggregation (a sketch of the resulting loop follows the list):
I) Filter ill-defined mappings of events. As an example, consider the function max(date, windowname), which is not defined.
II) Keep statistics on transformation functions that usually lead to increased information gain. Thus, the algorithm can prefer rules that are already known to improve the result on the particular domain.
III) Skip feature duplicates. We avoid those features by checking for duplicates within the already existing feature vectors. As an example, consider max(max(event)), which always reduces to max(event).
IV) Limit the stored feature set to a small subset of possible features. We keep only the topmost n features (ranked by information gain).
V) Skip feature combinations with low impact. For a potential improvement, the information gain of the feature combination should be at least above the maximum of the information gains of the involved features.
With this set of rules, the algorithm is quickly able to determine the most valuable feature combinations and will not take unimportant combinations into account.
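The following sketch applies heuristics I and III-V in one aggregation iteration over all feature pairs (the function statistics of heuristic II are omitted); the scoring and function interfaces are illustrative assumptions.

from itertools import combinations

def aggregate_once(features, info_gain, functions, top_n=50, seen=None):
    """features: name -> value vector; functions: name -> (applicable, apply);
    info_gain scores a value vector against the task labels."""
    seen = set() if seen is None else seen
    candidates = []
    for (na, fa), (nb, fb) in combinations(features.items(), 2):
        for fname, (applicable, apply_fn) in functions.items():
            if not applicable(fa, fb):          # I) filter ill-defined mappings
                continue
            name = f"{fname}({na},{nb})"
            if name in seen:                    # III) skip duplicates
                continue
            seen.add(name)
            new = apply_fn(fa, fb)
            gain = info_gain(new)
            # V) require the combination to beat both involved features
            if gain > max(info_gain(fa), info_gain(fb)):
                candidates.append((gain, name, new))
    candidates.sort(key=lambda c: c[0], reverse=True)
    for gain, name, values in candidates[:top_n]:   # IV) keep the top-n only
        features[name] = values
    return features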
3
Analysis of the Algorithms
First, we analyse the convergence of the ContextAggregator algorithm. As shown in Figure 1(a), the ContextAggregator algorithm usually converges very fast, after only a few iterations. In our experiments, there was no further strong improvement after about 5 iterations.
(a) Convergence of the Maximum Information Gain Growth (x-axis: number of iterations; y-axis: maximum IG growth)
(b) Boosted Task Prediction Accuracy (x-axis: number of iterations; y-axis: accuracy)
Figure 1. Evaluation of the Proposed Algorithm
For evaluation purposes (see Section 1), we collected context data together with annotated task labels during a work process (14 unique
users; 18 work hours). To measure the improvement brought by aggregating the context of the collected training material, we apply an n-fold cross-validation where n is the number of distinct users. We calculate the averaged performance metrics from the individual data segmentations. The separation of training data per user is necessary in order to prove that the knowledge learned from the training data is transferable to the held-out user, whose own training material is not in the particular training set. As classification algorithm we use Naive Bayes, since it has the theoretical minimum error rate in comparison to all other classifiers [4] and practical experiments indicate a good accuracy even if the independence assumption is violated [2]. In order to demonstrate the boosted accuracy with automatically derived higher-level context information, we compare the accuracy values of the prediction algorithm with context aggregation to those without. The context aggregation yields an increase in prediction accuracy, as can be seen in Figure 1(b). This result is significant at a confidence of 99%.
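The described per-user separation corresponds to leave-one-group-out cross-validation; a sketch with scikit-learn (a Gaussian Naive Bayes variant is assumed here, and data loading is left out):

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def per_user_accuracy(X, y, user_ids):
    """Leave-one-user-out evaluation: the evaluated user's own material is
    never part of the training set, as required above."""
    scores = cross_val_score(GaussianNB(), X, y,
                             groups=user_ids, cv=LeaveOneGroupOut())
    return scores.mean()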
4
Summary
In this paper we propose a multi-purpose context aggregation algorithm based on heuristic rules that is able to construct more relevant features out of the large number of possible context events. Furthermore, we evaluate the algorithm on the data of a user study for the purpose of user task prediction and show a significant improvement over the basic non-aggregated version. By using a number of simple heuristics we are able to reduce the computational complexity and memory requirements of the aggregation algorithm.
REFERENCES
[1] Anind K. Dey. Understanding and using context, 2001.
[2] Pedro Domingos and Michael J. Pazzani, 'Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier', in International Conference on Machine Learning, pp. 105-112, (1996).
[3] Olaf Grebner, Uwe V. Riss, Ernie Ong, Marko Brunzel, Thomas Roth-Berghofer, and Ansgar Bernardi. Task management for the Nepomuk social semantic desktop (poster). 4th Conference on Professional Knowledge Management - Experiences and Visions, March 2007.
[4] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[5] Angela Kessell and Christopher Chan, 'Castaway: a context-aware task management system', in CHI '06 extended abstracts on Human Factors in Computing Systems, pp. 941-946, New York, NY, USA, (2006). ACM.
[6] Robert Lokaiczyk, Andreas Faatz, Arne Beckhaus, and Manuel Görtz, 'Enhancing just-in-time e-learning through machine learning on desktop context sensors', in CONTEXT, eds., Boicho N. Kokinov, Daniel C. Richardson, Thomas Roth-Berghofer, and Laure Vieu, volume 4635 of Lecture Notes in Computer Science, pp. 330-341. Springer, (August 2007).
[7] H. Mayer, W. Haas, G. Thallinger, S. Lindstaedt, and K. Tochtermann. APOSDLE - Advanced Process-oriented Self-directed Learning Environment. Poster presented at the 2nd European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies, 30 November - 1 December 2005.
[8] Thomas M. Mitchell, Machine Learning, McGraw-Hill Higher Education, 1997.
[9] Jianqiang Shen, Lida Li, Thomas G. Dietterich, and Jonathan L. Herlocker, 'A hybrid learning system for recognizing user tasks from desktop activities and email messages', in IUI '06: Proceedings of the 11th International Conference on Intelligent User Interfaces, pp. 86-92, New York, NY, USA, (2006). ACM Press.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-829
A pervasive assistant for nursing and doctoral staff
Alexiei Dingli and Charlie Abela1
Abstract. The goal of health-care institutions is to provide patient-centric health care services. Unfortunately, this goal is frequently undermined due to human-related aspects. The PervasIve Nursing And docToral Assistant (PINATA) provides a patient-centric system powered with Ambient Intelligence techniques and Semantic Web technologies. Through PINATA, the movement of patients and medical staff is tracked via RFID sensors while an automated camera system monitors the interaction of people within their environment. The system reacts to particular situations autonomously by directing medical staff towards emergencies in a timely manner and providing them with just the information they require on their handheld devices. This ensures that patients are given the best care possible on a 24/7 basis, especially when the medical staff is not around.
1
INTRODUCTION
One of the main challenges faced by healthcare institutions is to maximize the available time that doctors and nurses spend with patients and to reduce mundane tasks such as form filling which, though important, inhibit the health worker's efficiency and effectiveness. Ambient Assisted Living (AAL) systems which make use of Ambient Intelligence (AmI) technologies can help to solve these problems and to provide personalized solutions such as in [2] and [4]. These systems can be used for various tasks such as monitoring the patient's stay in a hospital, tracking down medical records, monitoring diet, tracking movement and detecting incidents (such as falls). Back-end intelligent systems are required to analyse the feedback obtained through the different sensors located around the hospital and recommend a plausible course of action for the medical staff.
2
STATE OF THE ART
Ambient Intelligence (AmI) builds on three key technologies: ubiquitous computing, ubiquitous communication and intelligent user interfaces [2]. Ubiquitous computing means the integration of microprocessors into everyday objects like furniture, clothing, white goods, toys and even paint. Ubiquitous communication enables these objects to communicate with each other and with the user by means of ad-hoc wireless networking. Intelligent user interfaces enable people in the AmI environment to control and interact with the environment in a natural (voice, gestures) and personalised way (preferences, context) [1]. In AmI, people are empowered through a context-aware environment that is sensitive, adaptive and responsive to their needs, habits,
Department of Artificial Intelligence, Faculty of ICT, University of Malta, Malta, email: alexiei.dingli@um.edu.mt, charlie.abela@um.edu.mt
gestures and emotions. It is expected that by providing intelligent environments, quality and cost control can be improved and innovative intelligent personal health services can be developed. The five rights of patient care are often given as: right patient, right drug, right dose, right route and right time [8]. Through technologies such as RFID (Radio-Frequency Identification), it is possible to further integrate the digital and healthcare worlds to maintain those five rights and to join up care and processes. In [2], this technology was used to provide personalised visualisation of patients' information (including images) to doctors during a clinical session. In [4], there is an outline of an RFID model for designing a real-time hospital-patient management system. A pilot implementation was done in [3], which consisted of monitoring person and patient logistics in operating theatres, tracking and tracing of operating theatre materials, and tracking and tracing of blood products. In [9], it was predicted that RFID technology would play a very important role in the healthcare sector.
3
METHODOLOGY
PINATA is based upon a Service Oriented Architecture (SOA), similar to [7], and is composed of two main components (as can be seen in Figure 1): a Knowledge Brokering module (KBr) and a Device Manager (DM). The KBr consists of two main components, a Knowledge Base (KB) and an AmI module. The role of the AmI is to integrate the patients' information obtained through various sensors (after storing it inside the KB), analyse it and recommend a way forward. This module makes use of a number of domain-specific ontologies which have been crafted in consultation with various medical entities. The Patient Ontology is one such ontology. It is an electronic representation of the patients' records and describes patients' profiles in terms of various health-related information. The Medical Ontology is based on [5] and [6] and represents conceptual knowledge about clinical situations from three perspectives: clinical problems, investigations and recommendations. A set of rules is used to represent the decision-making logic of PINATA. The SOA approach was adopted to facilitate the integration of the patient-related data which typically resides in different hospitals or clinics. This approach allows the system to query the different organizations, get the data and collate it together, thus providing a unified view of the information for the KBr. Once all the information is inside the KB, the AmI infers new knowledge from the available information and sends it to the medical staff for immediate action. The DM handles the various devices connected to the system. It also serves as a communication gateway between the AmI and the medical staff. In the present hospital scenario, the patient has an
Figure 1. The PINATA Architecture

RFID tag embedded inside the wrist band. The various RFID readers around the hospital detect the movement of the patient and send the information to the DM and eventually to the KBr. This ensures that the patient's whereabouts are continuously known by the medical staff. Handheld devices are used to provide the staff with various types of information, including alerts (related to patients' medication schedules). These alerts are described in the Medical Ontology, and the web service responsible for keeping track of the patient's medications makes use of this knowledge when sending out the alert to the nurse's device. When a nurse is in the proximity of a patient, the handheld device reads the RFID tag and can automatically display the patient's information, again via the appropriate set of web services. PINATA makes use of a camera-based monitoring system similar to [10], which tracks the movement of patients through image processing and, in case of an emergency, alerts the nurse. To ensure that this system in no way presents a threat to the patient's privacy, images are not recorded by the cameras. A typical situation in which this system becomes important is that in which a patient faints and falls in his room. Information captured through the camera is collated and analysed by the KBr, which triggers an alert via the DM that is sent to the nurse. The RFID system is used to track the nurse who is in the closest vicinity to the patient in distress. The system also automatically uploads onto the nurse's handheld device all the information required for that particular context. In a typical situation, such as that in which the patient is suffering from anaphylactic shock due to some allergic reaction, the system is able to recommend to the nurse the best course of action. If the situation is deemed critical by the system (based upon various cues extracted from the environment and upon knowledge accumulated during past events), it will automatically escalate the problem and request reinforcements. Through the DM, PINATA can also interact with the surrounding environment and influence it. The KBr module constantly collates the various inputs from the sensors (obtained through the DM) and manages the status of the environment. This involves switching electrical equipment on/off autonomously or alerting the person about possibly dangerous situations. A typical situation is that in which a patient wakes up in the middle of the night to go to the bathroom. The KBr can distinguish between a movement in the bed (while the person is sleeping) and the actual action of getting out of the bed. In the latter case, the system can switch on the lights of the bathroom automatically and switch them off once the person returns to his/her own bed. When patients return to their homes, a basic version of PINATA can be installed in their homes. This is feasible due to the fact that PINATA is based around a SOA architecture. Thus it is possible to have cameras and sensors installed in the households while the processing and interpretation of the captured data is sent to the main hospital servers for continual monitoring. By doing so, the care provided by the hospitals can be extended to the community, thus making it possible for more patients to spend less time in hospitals and more time recovering in their homes. Once in their homes, PINATA can be further extended to handle other aspects of health care and safety in order to improve the quality of life.

4
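The reactive behavior described above can be pictured as a small condition-action loop inside the KBr. The sketch below is purely illustrative (the real system reasons over ontologies and web services); all names and interfaces are assumptions.

from dataclasses import dataclass
from typing import Callable

@dataclass
class AlertRule:
    condition: Callable      # does the incoming sensor event match?
    action: str              # recommendation pushed to the handheld device
    critical: Callable       # should the problem be escalated?

def on_event(event, rules, nearest_nurse, push):
    """Dispatch one sensor event; `nearest_nurse` stands in for the RFID
    lookup and `push` for the Device Manager gateway."""
    for rule in rules:
        if rule.condition(event):
            push(nearest_nurse(event["location"]), rule.action)
            if rule.critical(event):
                push("all_staff", "reinforcements at " + event["location"])

# e.g. a fall detected by the camera monitoring system:
rules = [AlertRule(condition=lambda e: e["type"] == "fall",
                   action="patient fell: check vitals",
                   critical=lambda e: e.get("unconscious", False))]
on_event({"type": "fall", "location": "room 12", "unconscious": True},
         rules, nearest_nurse=lambda loc: "nurse nearest to " + loc,
         push=lambda who, msg: print(who, "<-", msg))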
CONCLUSION
Even though PINATA is still a prototypical system and more work needs to be done, the results obtained from the system are encouraging. Patients quickly got used to it, and the medical staff understood its potential and are now exploring new possibilities with our help. The beauty of the whole system is that it makes use of rather cheap technology which is readily available, but which is controlled by a powerful brain. The KBr module is capable of integrating information obtained from various sources, reasoning over it and deciding on the best strategy. This has shown us that the time is ripe to fuse intelligent systems with the real world, and this fusion is unleashing possibilities never thought of before in the field of personal health care and safety.
ACKNOWLEDGEMENTS This work was carried out within the PINATA project, funded by the Malta Council for Science and Technology (http://www.mcst.org.mt) and done in collaboration with St.James Hospital Malta (http://stjameshospital.com). The project was also supported by the Ministry of Technology (http://www.miti.gov.mt).
REFERENCES
[1] J. Ahola, 'Ambient intelligence: Plenty of challenges by 2010', in EDBT '02: Proceedings of the 8th International Conference on Extending Database Technology, p. 14, London, UK, (2002). Springer-Verlag.
[2] J. Bravo, R. Hervas, G. Chavira, and S. Nava, 'RFID-sensor fusion: An experience at clinical sessions', PTA2006 Workshop, (2006).
[3] Capgemini, 'Gaining solid results with RFID in healthcare', (2007).
[4] B. Chowdhury and R. Khosla, 'RFID-based hospital real-time patient management system', in ACIS-ICIS, pp. 363-368, (2007).
[5] M. J. Field and K. N. Lohr, 'Clinical practice guidelines: directions for a new program', National Academy Press, Institute of Medicine, Washington, DC, (1990).
[6] S. Hussain and S. Abidi, 'Ontology driven CPG authoring and execution via a semantic web framework', HICSS-40, Hawaii, (2007).
[7] V. Issarny, D. Sacchetti, F. Tartanoglu, F. Sailhan, R. Chibout, N. Levy, and A. Talamona, Developing Ambient Intelligence Systems: A Solution based on Web Services, Springer Netherlands, 2005.
[8] C. Jervis, 'Tag team care: RFID could transform healthcare', e-Health Insider, (2005).
[9] J. Reiner and M. Sullivan, 'RFID in healthcare: a panacea for the regulations and issues affecting the industry?', (2005).
[10] L. Snidaro, C. Micheloni, and C. Chiavedale, 'Video security for ambient intelligence', IEEE Transactions on Systems, Man and Cybernetics, Part A, (2005).
5. Natural Language Processing
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-833
Author Identification Using a Tensor Space Representation
Spyridon Plakias and Efstathios Stamatatos1
Abstract. Author identification is a text categorization task with applications in intelligence, criminal law, computer forensics, etc. Usually, in such cases there is a shortage of training texts. In this paper, we propose the use of second order tensors for representing texts for this problem, in contrast to the traditional vector space model. Based on a generalization of the SVM algorithm that can handle tensors, we explore various methods for filling the matrix of features, taking into account that similar features should be placed in the same neighborhood. To this end, we propose a frequency-based metric. Experiments on a corpus controlled for genre and topic, with a variable amount of training texts, show that the proposed approach is more effective than the traditional vector-based SVM when only a limited amount of training texts is used.
1
INTRODUCTION
Author identification deals with the assignment of a text of unknown authorship to one author, given a set of candidate authors for whom text samples of undisputed authorship are available. The plethora of available electronic texts (e.g., e-mail messages, online forum messages, blogs, source code, etc.) indicates a wide variety of applications in areas such as intelligence, criminal law, computer forensics, etc. [1] From a machine learning point of view, author identification can be viewed as a multi-class single-label text categorization (TC) task. Actually, several studies on TC use this problem as one more testing ground, together with other tasks such as topic identification, language identification, genre detection, etc. [6] However, there are some important characteristics of author identification that distinguish it from other TC tasks. In particular, in style-based TC the most important factor for selecting features is their frequency [4]. On the contrary, in topic-based TC the most frequent words are excluded since they carry no semantic information. Moreover, in the typical applications of author identification there is usually a shortage of training texts for the candidate authors. This holds for both the amount and the length of the training texts. Therefore, it is crucial for author identification methods to be able to handle limited training texts effectively. The vast majority of TC methods use a vector-based representation of texts. Traditionally, a bag-of-words approach provides several thousands of lexical features. Alternatively, character-based features (character n-grams) can be used. The latter have provided very good results in author identification experiments despite the fact that they considerably increase the dimensionality of the representation [5]. Especially in the case of short texts, such a representation will produce very sparse data. Powerful machine learning algorithms such as support vector machines (SVM) can effectively handle such high-dimensional and
Dept. of Information and Communication Systems Eng., University of the Aegean, 83200 – Karlovassi, Greece, email: stamatatos@aegean.gr
sparse data. However, when only a few training instances are available, such algorithms are less effective. In this paper, we propose the use of a tensor space representation for author identification tasks in order to cope with the problem of limited training texts. That is, instead of representing a text as a vector, we represent it as a matrix. Using a tensor of second order, the dimensionality of the text representation remains high, but the classification algorithm has to learn far fewer parameters. As a result, it can better handle cases with very limited training instances. To this end, we use a generalization of the SVM algorithm that can handle tensors instead of vectors [3]. In contrast to the vector model, the position of each feature within the matrix is important, since relevant features should be placed in the same row or column. Therefore, we examine several techniques for filling the representation matrix so that relevant features are placed in the same neighbourhood. A set of experiments on a corpus controlled for genre and topic shows that when multiple short training texts are available the SVM model is the most effective. However, when only a limited amount of short training texts is available, the tensor model produces better results.
2
THE TENSOR-BASED MODEL
In a vector space model, a text is considered as a vector in R^n, where n is the number of features. A second order tensor model considers a text as a matrix in R^(x×y), where x and y are the dimensions of the matrix. A vector x ∈ R^n can be transformed into a second order tensor X ∈ R^(x×y) provided n ≤ x·y. A linear classifier in R^n (e.g., SVM) can be represented as a^T x + b, that is, there are n+1 parameters to be learnt (b, a_i, i=1,...,n). Similarly, a linear classifier in R^(x×y) can be represented as u^T X v + b, that is, there are x+y+1 parameters to be learnt (b, u_i, i=1,...,y, v_j, j=1,...,x). Consequently, the number of parameters is minimized when x=y, and this is much lower than n. Therefore, the tensor space representation is more suitable in cases with limited training sets. To be able to handle tensors instead of vectors, we use a generalization of SVM, called support tensor machines (STM) [3]. This algorithm works iteratively. First, it sets u=(1,...,1)^T. Then, it solves a standard SVM optimization problem to compute an estimation of v. Once v is estimated, it solves another standard SVM optimization problem to estimate a new u. The procedure of calculating new values for u and v is repeated until they converge. It is obvious that the tensor-based model takes into account associations between the features. Each feature is strongly associated with the features that are in the same row and column. It is, therefore, crucial to place relevant features in the same neighbourhood. In conclusion, to suitably transform a vector representation into a second order tensor representation, one has to define what features are considered relevant and how relevant features are placed in the same neighbourhood.
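A sketch of this alternating procedure: with u fixed, u^T X v + b is linear in v, so each half-step is a standard linear SVM fit (and vice versa). The per-iteration rescaling of the regularization parameter used in [3] is omitted here, and the scikit-learn solver is our choice, not the original implementation.

import numpy as np
from sklearn.svm import LinearSVC

def train_stm(X, y, iterations=10, C=0.1):
    """X has shape (n_samples, x, y); labels y are binary."""
    n, x_dim, y_dim = X.shape
    u = np.ones(x_dim)                                       # u = (1, ..., 1)^T
    for _ in range(iterations):
        svm_v = LinearSVC(C=C).fit(X.transpose(0, 2, 1) @ u, y)  # solve for v
        v = svm_v.coef_.ravel()
        svm_u = LinearSVC(C=C).fit(X @ v, y)                     # solve for u
        u = svm_u.coef_.ravel()
    return u, v, svm_u.intercept_[0]

def decision(X_i, u, v, b):
    """Classify one text tensor by the sign of u^T X v + b."""
    return u @ X_i @ v + b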
In this paper, we consider the frequency of occurrence as the factor that determines relevance among features [4]. In a binary classification case, where we want to discriminate author A from author B, the relevance r(x_i) of a feature x_i is:

$$r(x_i) = \frac{f_A(x_i) - f_B(x_i)}{f_A(x_i) + f_B(x_i) + b}$$
where f_A(x_i) and f_B(x_i) are the relative frequencies of occurrence of feature x_i in the texts of author A and B, respectively, and b is a smoothing factor. The higher the r(x_i), the more important the feature x_i is for author A. Similarly, the lower the r(x_i), the more important the feature x_i is for author B. In order to fill the matrix with the features, taking into account the just-defined relevance of features, we examined three techniques (an example of each case is shown in figure 1):
Vertical: the columns of the matrix are filled with decreasing relevance values. Hence, the first columns of the tensor will be strongly associated with author A and the last columns with author B. On the other hand, the rows of the matrix contain features of mixed importance for the two authors.
Diagonal: we start from the upper left corner of the matrix and fill diagonals with decreasing relevance values. Hence, the upper left part of the matrix will be strongly associated with author A and the lower right part with author B. That way, the first rows and columns are mainly associated with author A while the last rows and columns are associated with author B.
Hilbert: we use the Hilbert space filling curve [2]. Examples of such curves are shown in figure 2. This technique produces small neighbourhoods of relevant features, but any row or column contains features of mixed importance.

Figure 1. Three different techniques to transform a vector to a second order tensor. The vector features are sorted with decreasing relevance r. For the vector (1, ..., 9), Vertical yields [1 4 7; 2 5 8; 3 6 9], Diagonal yields [1 3 6; 2 5 8; 4 7 9], and Hilbert yields [4 3 2; 5 6 1; 8 7 9].

Figure 2. Examples of the Hilbert space filling curve.
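A sketch of the relevance metric and of the vertical filling technique; the diagonal and Hilbert variants differ only in the mapping from rank position to matrix cell. Names and the zero-padding of unused cells are assumptions.

import numpy as np

def relevance(freq_a, freq_b, b=1.0):
    """r(x_i) for every feature: positive values favor author A,
    negative values favor author B."""
    return (freq_a - freq_b) / (freq_a + freq_b + b)

def vertical_fill(vector, r, x=50, y=50):
    """Sort features by decreasing relevance and fill the matrix column by
    column, so the first columns are associated with author A and the last
    columns with author B; unused cells stay zero."""
    order = np.argsort(-r)                      # indices by decreasing r
    M = np.zeros((x, y))
    for pos, feat in enumerate(order[: x * y]):
        M[pos % x, pos // x] = vector[feat]     # fill columns top-down
    return M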
4
EXPERIMENTS
The corpora used for evaluation in this study consist of newswire stories in English taken from the publicly available Reuters Corpus Volume 1 (RCV1). The top 10 authors with respect to the amount of texts belonging to the topic class CCAT (about corporate and industrial news) were selected. Therefore, this corpus of short texts is controlled for genre and topic, in the hope that the main factor distinguishing the texts will be authorship. Three versions of this corpus were formed using 50, 10 or 5 training texts per author, respectively. In all cases, the test corpus comprises 50 texts per author, not overlapping with the training texts. To represent the texts we used a character n-gram approach. Thus, the feature set consists of the 2,500 most frequent 3-grams of the training corpus. A standard SVM model was built using the vector of 2,500 features. Moreover, the tensor model was based on a 50x50 matrix. For each space filling technique (vertical, diagonal, and Hilbert) we built a STM model. Note that since we deal with a multi-class author identification task, we followed a one vs. one approach, that is, for each pair of authors a STM model was built and the space filling technique was based on the feature relevance for that pair of authors. Based on preliminary experiments, we set the C parameter of SVM to 1, the corresponding parameter for STM models to 0.1, and the smoothing parameter b equal to 1. The comparison of the performance of the SVM and STM models can be seen in Table 1. Although SVM is superior when multiple training texts are available, the STM model based on vertical space filling provides better results when the training corpus is limited.

Table 1. Performance of SVM and STM models (training texts per author).

Method          50      10      5
SVM             80.8%   64.4%   48.2%
STM-Vertical    78.0%   68.0%   51.2%
STM-Diagonal    75.6%   60.8%   47.6%
STM-Hilbert     76.6%   66.6%   46.0%

5
CONCLUSION
In this paper, we presented a tensor-based model for the author identification problem. The proposed approach is more effective than SVM when only a limited amount of training texts is available. We used frequency as the criterion of feature relevance and examined several space filling techniques to form the feature matrix so that relevant features are placed in the same neighbourhood. The vertical method seems to provide the best results for limited training corpora. This technique produces some subsets of features (columns of the matrix) that are strongly associated with the authors, as well as other subsets (rows) that contain features of mixed importance for the authors. Further experiments should be conducted to verify this promising result. Moreover, more complex space filling techniques can be tested to provide even better results.

REFERENCES
[1] A. Abbasi and H. Chen, 'Applying Authorship Analysis to Extremist-Group Web Forum Messages', IEEE Intelligent Systems, 20(5), 67-75, (2005).
[2] A.R. Butz, 'Alternative Algorithm for Hilbert's Space Filling Curve', IEEE Trans. on Computers, 20, 424-42, (1971).
[3] D. Cai, X. He, J.R. Wen, J. Han, and W.Y. Ma, Support Tensor Machines for Text Categorization, Technical report UIUCDCS-R2006-2714, University of Illinois at Urbana-Champaign, (2006).
[4] M. Koppel, N. Akiva, and I. Dagan, 'Feature Instability as a Criterion for Selecting Potential Style Markers', Journal of the American Society for Information Science and Technology, 57(11), 1519-1525, (2006).
[5] E. Stamatatos, 'Ensemble-based Author Identification Using Character n-grams', Proc. of the 3rd International Workshop on Text-based Information Retrieval, 41-46, (2006).
[6] D. Zhang and W.S. Lee, 'Extracting Key-substring-group Features for Text Classification', Proc. of the 12th Annual SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 474-483, (2006).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-835
835
Categorizing Opinion in Discourse
Nicholas Asher, Farah Benamara1 and Yvette Yannick Mathieu2
1 IRIT-CNRS, France, email: {asher, benamara}@irit.fr
2 LLF-CNRS, France, email: yannick.mathieu@linguist.jussieu.fr

1 Categorizing Opinions
While research in the field of opinion analysis has focused on determining the orientation of opinion words in various lexical categories, almost no work to date has investigated the effects of rhetorical relations on the expression of opinion. We present a preliminary study for a discourse-based opinion categorization and propose a new annotation scheme for a fine-grained contextual opinion analysis using discourse relations. This study uses a lexical semantic analysis of opinion-conveying expressions, based on the research of Wierzbicka [1], Levin [3] and Mathieu [4], coupled with an analysis of how clauses involving these expressions are related to each other within a discourse. Rather than providing a definition of opinion, we study how affective content is explicitly and lexically expressed in written texts. An opinion expression belongs to one of our top-level categories: REPORTING, JUDGEMENT, ADVISE and SENTIMENT.

In the REPORTING group, opinions are often expressed as the objects of verbs used to report the speech and opinions of others. These verbs convey the degree of the holder's commitment to the opinion being presented, and some provide, at least indirectly, a judgment by the author on the opinion expressed. The opinion polarity is given by the verbs' complements. This category contains three subgroups according to the degree of commitment and the degree of veracity concerning the information in their complements. In the first subgroup, we find verbs that introduce information that (a) the author takes as established (the INFORM group) or that (b) the holder is strongly committed to (the ASSERT group). The second subgroup contains (c) the TELL group. Unlike ASSERT verbs, TELL verbs do not convey strong commitments of the subject to the embedded content; unlike INFORM verbs, they do not convey anything about the author's view of the embedded content. Finally, the last subgroup introduces an opinion with a certain degree of subjectivity. It contains (d) the THINK group verbs, which express the fact that the subject has a strong commitment to the complement of the verb, and (e) the GUESS group verbs, which express a weaker commitment on the part of the agent. The veracity of the information from (d) is stronger than that from (e).

The JUDGEMENT group involves words that express a positive or negative assessment of something or someone. It includes verbs, nouns and adjectives. We consider two subgroups: judgments referring to a system of social norms - (f) the BLAME group and (g) the PRAISE group - and judgments referring to personal norms - (h) the APPRECIATION group.

ADVISE expressions urge the reader to adopt a certain course of action or opinion. We find here (i) the RECOMMEND group, which expresses a good/bad opinion and a stronger push for some course of action, (j) the SUGGEST group, used to say what the writer suggests or speculates on without being absolutely certain, and finally (k) the HOPE group, which expresses the wish that some desire will be fulfilled. Expressions in (i) are stronger than those in (j) and (k), and expressions in (k) are the weakest.

Finally, words in the SENTIMENT group express an attitude toward something usually based on feeling or emotion rather than reasoning. They have a polarity as well as a strength. We distinguish here between positive sentiments, expressed by words in the CALM DOWN, ENTERTAIN, JOY, LOVE and FASCINATE groups, and negative sentiments, expressed by words in the ANGER, BORE, OFFENSE, SADNESS, FEAR, HATE and DISAPPOINT groups. Some groups, such as ASTONISHMENT and TOUCH, generally express a neutral polarity, although the polarity and the strength are given by the context.
2 Rhetorical relations between clauses containing opinion expressions
The rhetorical structure (RS) is an important element in understanding the opinions conveyed by a text. Our four opinion categories are used to label opinion expressions within a discourse segment. Using the discourse theory SDRT [2] as our formal framework, we define a basic segment as a clause containing an opinion expression, or a sequence of clauses that together bear a rhetorical relation to a segment expressing an opinion. We have segmented conjoined NPs or APs into separate clauses: for instance, "the film is beautiful and powerful" is taken to express two segments, "the film is beautiful" and "the film is powerful". Segments are then connected to each other using a small subset of "veridical" discourse relations. For example, there are three opinion segments in the following sentence S: [Even if the product is excellent]a, [the design is very basic]b, [which is disappointing in this brand]c. There is a CONTRAST relation between a and b that reinforces the sentiment expressed in segment c.

We use five types of rhetorical relations. CONTRAST and CORRECTION indicate a difference of opinion. CONTRAST(a, b) implies that a and b are both true but there is some defeasible implication of one that is contradicted by the other, whereas CORRECTION(a, b) involves a stronger opposition and implies that b is true while a is false. To find these relations in text, we use specific discourse markers, such as although, but, etc. for CONTRAST and protest, deny, etc. for CORRECTION. EXPLANATION(a, b), marked by because, indicates that b offers a (typically sufficient) reason for a. ELABORATION(a, b), marked by for example or in particular, implies that b gives more details on what was expressed within a. We have merged EXPLANATION and ELABORATION into a single relation called SUPPORT, as both of these relations are used to support opinions. RESULT(a, b), indicated by markers like so and as a result, indicates that b is a consequence or result of a. Finally, CONTINUATION(a, b) means that a and b form part of a larger thematic whole. For example, the RS of S is RESULT(CONTRAST(a, b), c). We also took account of disjunctions, conditionals and negations in evaluating opinions.
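As a rough illustration of the marker-based detection step, the sketch below maps cue words to the relation types used above (with EXPLANATION and ELABORATION already merged into SUPPORT). The cue lists beyond the examples given in the text, the default to CONTINUATION, and the naive substring matching are all assumptions; a real detector would also need syntactic context.

```python
CUE_WORDS = {
    "CONTRAST":   ["although", "but", "even if"],
    "CORRECTION": ["protest", "deny"],
    "SUPPORT":    ["because", "for example", "in particular"],
    "RESULT":     ["as a result", " so "],
}

def detect_relation(segment):
    """Return the rhetorical relation signalled by a discourse marker in
    the segment, defaulting to CONTINUATION when no cue word is found
    (an assumed fallback, not a rule from the paper)."""
    text = segment.lower()
    for relation, cues in CUE_WORDS.items():
        if any(cue in text for cue in cues):
            return relation
    return "CONTINUATION"
```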
3 A Semantic Representation
We represent each opinion word that belongs to a category with a shallow semantic feature structure (FS) that associates with a segment: the category it belongs to, the opinion holder, the opinion topic, the opinion expressions that enable us to identify the segment, and the associated modality. A modality is defined as a degree of preference (Pref) for expressions in the ADVISE category, as a combination of a degree of commitment (C) and a strength for expressions in the REPORTING category, or as a combination of a polarity and a strength for expressions from the JUDGMENT and SENTIMENT categories. For example, the groups (a) and (b) are associated with the modality C1, the group (c) with C2, and the groups (d) and (e) with C3, such that C1 ≥ C2 ≥ C3. Simple scalar dimensions are used to represent strength: the values 2, 1 and 0 mean that the expression has a strong, a medium or a low strength, respectively. When verb arguments contain an opinion expression, we add an attribute to the FS describing the content of the opinion expressions introduced by the verb. This attribute is mainly used for verbs in the REPORTING group. For example, the segment [The French presidency confirmed congratulations sent to Vladimir Putin] is represented as:
[ Category:     [reporting: Assert]
  Modality:     [commitment: C1, strength: 1]
  Holder (1):   The French presidency
  Opinion word: confirmed
  Content (2):  [ Category:     [judgment: praise]
                  Modality:     [polarity: positive, strength: 1]
                  Holder:       (1)
                  Topic:        Vladimir Putin
                  Opinion word: congratulations ]
  Topic:        (2) ]
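For illustration only, the same feature structure can be written as a nested map; the field names and the flattening of the coindexation tags (1) and (2) are our own choices, not the authors' representation:

```python
segment_fs = {
    "category": ("reporting", "assert"),
    "modality": {"commitment": "C1", "strength": 1},
    "holder": "The French presidency",               # tag (1)
    "opinion_word": "confirmed",
    "content": {                                     # tag (2)
        "category": ("judgment", "praise"),
        "modality": {"polarity": "positive", "strength": 1},
        "holder": "The French presidency",           # coindexed with (1)
        "topic": "Vladimir Putin",
        "opinion_word": "congratulations",
    },
    "topic": "content",  # the verb's topic is the embedded opinion (2)
}
```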
Discourse relations tell us how to combine the various opinions, using a set of dedicated combination rules. SUPPORT strengthens the opinion in the first constituent. CONTINUATION strengthens the polarity of the common opinion. RESULT strengthens the polarity of the opinion in the second argument. For CONTRAST, we distinguish two cases. If the two arguments are opinion segments, then the CONTRAST weakens the polarity of the first argument. If one of the arguments bears a rhetorical relation with the other argument, then the CONTRAST strengthens the opinion polarity, as in: [[I am an atheist], but [I totally agree with the priest]].
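A minimal sketch of these combination rules, treating an opinion as a (polarity, strength) pair on the paper's 0-2 strength scale; the clipping behaviour, the unit increments, and the handling of the second CONTRAST case are assumptions:

```python
def strengthen(op):
    polarity, strength = op
    return polarity, min(strength + 1, 2)

def weaken(op):
    polarity, strength = op
    return polarity, max(strength - 1, 0)

def combine(relation, a, b, b_subordinate=False):
    """Apply the combination rule for a rhetorical relation R(a, b),
    returning the adjusted opinions of both arguments."""
    if relation == "SUPPORT":        # strengthens the first constituent
        return strengthen(a), b
    if relation == "CONTINUATION":   # strengthens the common opinion
        return strengthen(a), strengthen(b)
    if relation == "RESULT":         # strengthens the second argument
        return a, strengthen(b)
    if relation == "CONTRAST":
        # Two plain opinion segments: weaken the first argument;
        # otherwise the contrast strengthens the opinion polarity.
        return (a, strengthen(b)) if b_subordinate else (weaken(a), b)
    return a, b
```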
4 Annotation Methodology and Preliminary Results
We annotated three different types of online corpora, written in French and English: movie reviews (M), letters to the editor (L) and news reports (N). M were taken from Telerama, AlloCine and movies.go.com, L from La Depeche du Midi and The San Francisco Chronicle, and N from Le Monde, 20 Minutes and the MUC 6 news corpus. We randomly selected 150 articles for the French corpora (around 50 articles for each genre). Two native French speakers annotated around 546 and 589 segments, respectively. To check the cross-linguistic feasibility of generalisations made about the French data, we also annotated opinion categories for English: around 30 articles from M and L. For N, the annotation in English was considerably helped by using texts from the MUC 6 corpus (186 articles), which were annotated independently with discourse structures by three annotators in the University of Texas's DISCOR project (NSF grant IIS-0535154); the annotation of our opinion expressions involved a collapsing of the structure proposed in DISCOR.
Our lexicon is then extended during the annotation process. So far, we have categorized 200 verbs, 160 nouns and 195 adjectives for French, and 187 verbs, 150 nouns and 170 adjectives for English. For each corpus, annotators annotate elementary discourse segments, define their shallow semantic representation, and then connect the discourse segments using the set of rhetorical relations we have identified. The average distribution of opinion expressions in our corpus across our categories, for French and English, is shown in Table 1.

Table 1. Distribution of categories by each annotator (French / English).

Groups      Movie (%)      Letters (%)    News (%)
Reporting    2.67 /  2.12  14.80 / 13.34  43.91 / 42.85
Judgment    60.53 / 40.52  52.50 / 73.34  39.23 / 33.34
Advise       6.92 / 10.63  10.05 / 13.34   7.27 /  9.52
Sentiment   27.30 / 34.04  33.08 /  2.67  11.35 / 16.67
Opinions in N principally involve reported speech. As we only annotated segments that clearly expressed opinions or were related via one of our rhetorical relations to a segment expressing an opinion, our annotations typically covered only a fraction of each document. The press articles were the hardest to annotate and generally contained many embedded structures introduced by REPORTING-type verbs, as well as negations. To compute the inter-annotator agreement (IAG), we chose to focus, as a first step, only on agreement on opinion categorization, segment identification and rhetorical relations. We computed the IAG only on the French corpus; we obtained a kappa of 95% on opinion categorization.
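For reference, a kappa figure of this kind can be computed as follows (standard Cohen's kappa over two annotators' category labels; whether the authors used exactly this variant is not stated in the paper):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the agreement expected by chance given each
    annotator's label distribution."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c]
              for c in count_a.keys() | count_b.keys()) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```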
5 Conclusions and Future Work
We think that refined categories are needed to build a more nuanced appraisal of opinion expressions in discourse. The preliminary evaluations of our annotations have shown the validity of the categorization of opinions we proposed. We are able to calculate an overall global opinion on a topic in a principled way, by taking account of logical and discourse structure. In future research, we plan to (1) extend our annotation scheme to other types of corpora and deepen our opinion typology, (2) compute IAG on the opinion holder, topics, modality and polarity, (3) characterize each discourse segment with a deep semantic representation, and (4) compare our annotation scheme to the MPQA one. In terms of automation, we plan first to exploit a syntactic parser to get the argument structure of verbs, and then to use a discourse segmenter like the one developed in the DISCOR project, followed by the detection of discourse relations using cue words. This will allow us to use the deep semantic analysis to provide a classification of texts according to their opinions on various topics, and to compare this approach to the bag-of-words approach.
REFERENCES
[1] A. Wierzbicka, Speech Act Verbs, Academic Press, Sydney, 1987.
[2] N. Asher and A. Lascarides, Logics of Conversation, Cambridge University Press, 2003.
[3] B. Levin, English Verb Classes and Alternations: A Preliminary Investigation, University of Chicago Press, 1993.
[4] Y. Y. Mathieu, 'A Computational Semantic Lexicon of French Verbs of Emotion', in Shanahan, G., Qu, Y., Wiebe, J. (eds.): Computing Attitude and Affect in Text, Springer, Dordrecht, The Netherlands, 2004.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-837
837
A Dynamic Approach for Automatic Error Detection in Generation Grammars
Tim vor der Brück1 and Holger Stenzhorn2
1 FernUniversität in Hagen, Hagen, Germany, tim.vorderbrueck@fernuni-hagen.de
2 Institute for Medical Biometry and Medical Informatics, University Medical Center Freiburg, Freiburg, Germany, holger.stenzhorn@uniklinik-freiburg.de

1 Introduction
In any real-world application scenario, natural language generation (NLG) systems have to employ grammars consisting of very large numbers of rules. Detecting and fixing errors in such grammars is therefore a highly tedious task. In this work we present a data mining algorithm which identifies incorrect grammar rules by abductive reasoning from positive and negative training examples. More specifically, the constituency trees belonging to successful generation processes and the incomplete trees of failed ones are analyzed. From these, a quality score is derived for each grammar rule by analyzing the occurrences of the rules in the trees and by spotting the exact error locations in the incomplete trees. In prior work on automatic error detection, vor der Brück and Busemann [5] proposed a static error detection algorithm for generation grammars. The approach of Cussens and Pulman [1] creates missing grammar rules for parsing using abduction. Zeller [6] introduced a dynamic approach in the related area of detecting errors in computer programs.
2 Error Detection
The basic purpose of NLG, as considered here (we follow the TG/2 formalism [3]), is to convert an input structure, given as feature-value pairs, by means of grammar rules into a constituency tree from which the surface text can be read off as the terminal yield. Each non-leaf node in this tree is associated with a particular input substructure, a category and the applied grammar rule, while the leaf nodes are associated with text segments. The final surface text is created by concatenating the text segments of the leaf nodes. In case of success, the generation system returns not only the generated surface text (or texts, if multiple possible solutions have been found) but also the associated constituency tree (or trees). In the case of failure, however, no surface text is generated and no associated constituency tree exists. Yet it is obvious that, in order to detect the specific spot where the generation process fails, it is highly advantageous to have partial constituency trees for the failed generation attempts as well. For this reason, the employed generation engine has been extended to provide two types of partial trees in case of generation failures: The tree of the first type is the largest tree to result from the generation process; we call it the maximum tree. The other alternative tree, representing a non-successful generation, is the one having the smallest total difference to a complete tree; we call it the minimum tree. Usually both types of trees are incomplete and hence can have non-terminal categories at their leaf nodes. In the following, a complete tree resulting from a successful generation is called a positive tree, while an incomplete tree (either maximum or minimum) is called a negative tree.

The detection of incorrect rules is done in several consecutive steps:
1. First, a global (i.e., independent of any specific input structure) rule quality score (gqs) is derived for each rule.
2. For each input structure leading to a generation failure, the most probable error location in the associated constituency tree is detected.
3. Both pieces of information are put together to derive a local rule quality score (lqs), which is associated with a certain input structure. The rules with the lowest lqs (and a gqs below a given threshold) are considered potentially erroneous.

1. Deriving a Global Rule Quality Score: If a certain rule appears in a positive generation tree, this generally indicates that the rule is correct. Conversely, the fact that a rule appears in a negative tree, or in no constituency tree at all, is an indicator of an incorrect rule. Using this information, a gqs is defined for each rule. The gqs of a rule reflects the probability that a generation fails if this rule appears in the associated constituency tree. More specifically, the gqs is defined as the negated probability that a tree is negative if the rule r occurs in that tree:

gqs := −P(t ∈ T⁻ | (lhs(r), r) ∈ t)

where
• T⁻ is the set of negative trees, and
• lhs(r) is the left-hand side (LHS) category of rule r (see [3]).

As usual, the probability is estimated by the relative frequency of a tree being negative given that a certain rule appears in it. If a rule never appears in either the positive or the negative trees, its gqs is set to −1, since this is a strong indication of a potential error. A rule is assumed to be correct if its gqs exceeds a given threshold h. To account for the fact that the probabilities of rules leading to negative trees (or appearing in no tree at all) are not independent of each other (i.e., a rule might be assigned a low score because of an error in an ancestor rule), a small portion of the gqs is propagated upwards and added to the gqs of each rule which could, according to its LHS, possibly be applied at a superior node in a constituency tree. Note that only scores assigned to rules that are not assumed to be correct (gqs < h) are modified or propagated.
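A minimal sketch of the gqs estimation, representing each constituency tree as the set of (LHS category, rule) pairs occurring in it; the −1 convention for unseen rules follows the text, while the data layout and names are assumptions:

```python
def gqs(rule_occurrence, pos_trees, neg_trees):
    """rule_occurrence is an (lhs_category, rule) pair; each tree is the
    set of such pairs it contains.  Returns the negated relative
    frequency of a tree being negative given that the rule occurs."""
    neg = sum(rule_occurrence in tree for tree in neg_trees)
    pos = sum(rule_occurrence in tree for tree in pos_trees)
    if neg + pos == 0:
        return -1.0   # rule never applied: strong hint of an error
    return -neg / (neg + pos)
```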
2. Spotting the Error in the Generation Tree: The gqs already yields a good approximation for identifying an incorrect rule. However, this method has the drawback that the identified rules are not related to any input structure. This is obviously important information for grammar developers who want to know why no output was generated for a specific input structure; furthermore, this information can be necessary to correct the error automatically (which is planned for future work). Hence, we additionally try to determine, for each input structure leading to a generation failure, the most probable location (node) in the constituency tree where the error occurred, and use this information to calculate a local rule quality score (i.e., a score which relates to a certain input structure). The identified node is supposed to be associated with the LHS category of the erroneous rule.³ Naturally, since positive trees do not lead to a generation error, only the negative trees have to be examined to spot the erroneous nodes. An error is defined for each negative tree separately, which means that different errors can relate to different constituency trees. Analogously to the determination of the gqs, there is again the possibility to employ either the maximum or the minimum trees; both methods have been evaluated.

To spot the error location, each node in a negative tree is assigned a node quality score (nqs). The calculation of the nqs takes into account the following two aspects, relevant to many machine learning approaches:
1. How do the negative examples (i.e., negative trees) differ from the positive ones?
2. What do all negative examples (i.e., negative trees) have in common?

To account for the first aspect, we estimate the probability that a tree is negative if it contains a given node (pair of rule and category): q1 = P(t ∈ T⁻ | (r, c) ∈ t), where the probability is estimated by the relative frequency. A node is assigned the maximum value of 1 if it occurs only in negative and never in positive trees. To account for the second aspect, we estimate the probability that a negative tree contains a given pair of rule and category: q2 = P((r, c) ∈ t | t ∈ T⁻). A node is assigned the highest possible value of 1 if it occurs in all negative trees. The nqs for a tree node (r, c) is then given as nqs(r, c) = −q1·q2. A node is considered to appear in a constituency tree if that tree contains a node associated with an identical category and rule. Note that a leaf node of an incomplete constituency tree might not be associated with any rule; such a node matches all nodes with an identical category. The nodes with the lowest nqs are considered potentially erroneous, i.e., one of them is assumed to carry the LHS category of the erroneous rule.

3. Putting Both Types of Information Together: Finally, the gqs and the expected error location(s) are combined into the lqs. Even if the error location in the constituency tree is not correctly determined by this algorithm, the actual error location is often a child, parent or sibling of one of the indicated locations. Thus, to determine the lqs of a rule, its gqs is weighted depending on the minimum possible distance in a constituency tree between that rule's LHS category and any node representing one of the indicated error locations, using an exponential decay. If this distance cannot be determined because the rule's LHS category is not reachable at all, the distance is set to some large value (e.g., the number of categories in the grammar).

³ Note that this approach is not suitable for detecting rules with an incorrect LHS. In such cases, only the gqs should be used.
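The nqs and the final lqs weighting can be sketched in the same style as the gqs above; the decay constant lam is an assumption, since the paper does not give one:

```python
import math

def nqs(node, pos_trees, neg_trees):
    """node is a (rule, category) pair; each tree is the set of its
    nodes.  nqs = -q1*q2 with q1 = P(t negative | node in t) and
    q2 = P(node in t | t negative), both as relative frequencies."""
    in_neg = sum(node in tree for tree in neg_trees)
    in_pos = sum(node in tree for tree in pos_trees)
    q1 = in_neg / (in_neg + in_pos) if in_neg + in_pos else 0.0
    q2 = in_neg / len(neg_trees) if neg_trees else 0.0
    return -q1 * q2

def lqs(rule_gqs, tree_distance, lam=1.0):
    """Weight a rule's gqs by an exponential decay in the minimum tree
    distance between its LHS category and an indicated error location;
    distant rules drift towards 0, i.e. become less suspicious."""
    return rule_gqs * math.exp(-lam * tree_distance)
```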
3 Evaluation and Conclusion
Table 1. Erroneous rule is among the top 5/3/2 suggestions; for all cases / cases with both positive and negative trees, in percent (%).

Type        Top 5   Top 3   Top 2
Max. tree   54/94   48/82   48/82
Min. tree   44/86   38/74   38/73
For the evaluation, we randomly changed a path expression [5] on a rule's right-hand side in the evaluation grammar and determined how often the erroneous rule appeared among the top five/three/two rules with the lowest lqs. The evaluation shows that the accuracy rises significantly if at least one positive constituency tree exists (see Table 1). The described algorithm has been implemented in a plugin for the grammar workbench eGram [3], which supports the GUI-based development of grammar rules for the grammar formalisms of the TG/2 [2] and XtraGen [4] NLG systems. The automatic detection and correction of grammar errors remains a very difficult task, but it is an important and necessary step towards creating NLG systems that are easy to deploy in real-world application scenarios with large numbers of rules.
ACKNOWLEDGEMENTS We are especially obliged to Stephan Busemann for providing one of the authors with a research license of eGram and XtraGen. Furthermore we thank all members of our departments who contributed to this work.
REFERENCES
[1] J. Cussens and S. Pulman, 'Incorporating linguistics constraints into ILP', in Proc. of CoNLL, Lisbon, Portugal, (2000).
[2] S. Busemann, 'Best-first surface realization', in Proc. of INLG, Herstmonceux, UK, (1996).
[3] S. Busemann, 'eGram — a grammar development environment and its usage for language generation', in Proc. of LREC, Lisbon, Portugal, (2004).
[4] H. Stenzhorn, 'XtraGen: A NLG system using Java and XML technologies', in Proc. of NLPXML, Taipei, Taiwan, (2002).
[5] T. vor der Brück and S. Busemann, 'Suggesting error corrections of path expressions and categories for tree-mapping grammars', Zeitschrift für Sprachwissenschaft, 26(2), (2007).
[6] A. Zeller, 'Locating causes of program failures', in Proc. of ICSE, Saint Louis, Missouri, USA, (2005).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-839
839
Answering Definition Question: Ranking for Top-k
Chao Shen, Xipeng Qiu, Xuanjing Huang and Lide Wu1
1 Fudan University, China, email: {shenchao,xpqiu,xjhuang,ldwu}@fudan.edu.cn

Abstract. As an important form of complex question, the definition question attracts much attention from QA researchers. In many definition question answering systems, a core step is to rank the candidate answer sentences so that the top-k of the ranked list can be extracted. We integrate several sources of evidence as features into a single framework and propose a novel method for learning the weights of these features to rank the candidate answer sentences.
1 Introduction

Definition question answering [10], as an important form of complex question answering, has been attracting more attention recently. A definition question can be interpreted as "Tell me interesting things about X", where "X" is usually called the "target". Most definition question answering systems have a pipeline structure:
Step-1 Extracting the candidate answer sentences from the corpus.
Step-2 Ranking the candidate answer sentences.
Step-3 Removing redundant answer sentences.
Step-1 is IR on the sentence or sub-sentence level: for a target, we obtain a list of sentences through this step. Step-2 is the core step, which ranks the output of Step-1; much research on definition questions focuses on this step, and various methods have been developed. Simple methods, such as checking the overlap of words between two sentences in the answer, are often used in Step-3.
To answer definition questions, pattern-based methods [3] and centroid-vector-based methods [1, 5] are popular for ranking the answer sentences, and various resources, including lexico-syntactic patterns and external resources such as Google, Wikipedia and encyclopedias, have been used as evidence to judge whether a sentence is a definition sentence about a target. However, in previous systems, if multiple resources are used, the importance of each resource in the definition question answering system is fixed manually. Since different patterns and centroid vectors may play different roles, there should be a way to identify their weights automatically. Our work proposes a learning method which 1) yields the optimal top-k sentences instead of the optimal ranking of the whole list, and 2) explicitly slackens the condition that definition sentences should be ranked ahead of the others. Using such a learning method for ranking, we integrate the evidence for a sentence being a definition as features into a single framework and achieve better results.
2 Learning to Rank for Top-k

In this section, we introduce how the weights of the resources are learned. Specifically, we use a modified version of the online learning algorithm MIRA [2] for the task of sentence ranking in definition question answering.

In training, a set of targets X = {x^1, x^2, ..., x^T} is given. Each target x^t is associated with a set of nugget sentences y^t = {y^t_1, y^t_2, ..., y^t_{n_t}}, where y^t_j denotes the j-th sentence and n_t denotes the size of y^t. Each target is also associated with a list of sentences s^t = {s^t_1, s^t_2, ..., s^t_{m_t}}, which is the output of the first step of the pipeline system and is to be ranked. From s^t, we select k sentences as the input of the Step-3 module, or directly as the answer for the target. An arbitrary subset of s^t with size k is denoted s^t(k). To evaluate these sets of sentences, we define score(x^t, s^t(k)) = w · Ψ(x^t, s^t(k)), where Ψ(x^t, s^t(k)) is the feature vector of the target/k-sentences pair <x^t, s^t(k)>, and ŷ^t = argmax_{s^t(k)} score(x^t, s^t(k)) is extracted. We learn w with the goal that as many elements of ŷ^t as possible are in y^t. If we assume each sentence is independent of the others, the score of a set decomposes over its sentences:

score(x^t, s^t(k)) = Σ_{j=1}^{k} score(x^t, s^t_j)

where score(x^t, s^t_j) = w · ψ(x^t, s^t_j). Thus ŷ^t consists of the top k sentences of s^t in the list ranked by decreasing score(x^t, s^t_j).

Algorithm 1 Modified Version of Online MIRA
Training Data: Γ = {(x^t, y^t)}_{t=1}^{T}
1: w^0 = 0; v = 0; i = 0
2: for n : 1 . . . N do
3:   for t : 1 . . . T do
4:     min ||w^{i+1} − w^i||
5:     s.t. score(x^t, s^t_i) − score(x^t, s^t_j) ≥ 1
6:     ∀ s^t_j ∈ Q, ∀ s^t_i ∈ y*^t = (ŷ^t \ Q) ∪ P
7:     v = v + w^{i+1}
8:     i = i + 1
9:   end for
10: end for
11: w = v / (N · T)

MIRA was first proposed for multiclass classification; in [8, 7] it was successfully used for structured learning. The difference between the MIRA of [8] and our version in Algorithm 1 lies in the constraints (lines 5-6 of Algorithm 1) used to update w^i. To circumvent the problems of ranking for definition question answering mentioned in Section 1, we first introduce y*^t, obtained by adding nugget sentences in s^t \ ŷ^t to ŷ^t and excluding non-nugget sentences from ŷ^t, and take it as a slackened supervisor for the learning.
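For concreteness, the scoring and top-k extraction step can be sketched as follows, with the candidate feature vectors stacked into a matrix; the data layout and names are ours, not the authors':

```python
import numpy as np

def top_k(w, feature_matrix, k):
    """Rank candidate sentences by score(x, s_j) = w . psi(x, s_j) and
    return the indices of the k best, i.e. the predicted answer set."""
    scores = feature_matrix @ w
    order = np.argsort(-scores)      # decreasing score
    return order[:k], scores
```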
We define θ^t_0 = min{|ŷ^t \ y^t|, |(s^t \ ŷ^t) ∩ y^t|}, i.e. the minimum of the number of non-nugget sentences in the top-k and the number of nugget sentences outside the top-k. In each iteration of updating w with the input (x^t, y^t), we build y*^t by inserting θ^t = min{θ^t_0, θ} nugget sentences from outside the top-k (the set P) into the top-k sentences, and excluding the same number of non-nugget sentences (the set Q). Then y*^t = (ŷ^t \ Q) ∪ P is a better answer, containing (if possible) θ^t more nugget sentences than ŷ^t, where P and Q are defined as follows:
P: the top-θ^t nugget sentences of s^t \ ŷ^t
Q: the bottom-θ^t non-nugget sentences in ŷ^t
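The construction of the slackened target y*^t can be sketched as below; here scores is a numpy array of model scores over s^t, nuggets is the set of indices of nugget sentences, and all names are our own:

```python
import numpy as np

def slackened_target(scores, nuggets, k, theta):
    """Swap up to theta pairs: the best-scoring nuggets outside the
    top-k (P) replace the worst-scoring non-nuggets inside it (Q)."""
    order = list(np.argsort(-scores))                # decreasing score
    topk = order[:k]
    P = [j for j in order[k:] if j in nuggets]       # missed nuggets
    Q = [j for j in reversed(topk) if j not in nuggets]  # worst non-nuggets
    t = min(theta, len(P), len(Q))                   # theta^t = min(theta0, theta)
    P, Q = P[:t], Q[:t]
    return (set(topk) - set(Q)) | set(P)             # y*^t
```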
3 Experiments

We conduct two experiments on 65 TREC 2004 targets, 75 TREC 2005 targets and 75 TREC 2006 targets to validate our method. The same sentence extraction module as in [9] is used to extract the candidate answer sentences from the corpus, and no redundancy removal module is used. The features are also the same as in [9], including 4 based on language models, 1 based on document retrieval, and several based on syntactic patterns. To build the training corpus, we collect the judgements of TREC on all the answers submitted by participants. If a [string, docid] pair is judged to cover a certain nugget of a target x^t, we extract the original sentence from AQUAINT according to the [string, docid] pair and add it to the set y^t for target x^t.

3.1 Ranking Comparison

To show the effectiveness of our ranking method, we compare our results with those of the following methods.
RankSVM RankSVM has been used to rank definition sentences [11]. As in [11], we only use a linear kernel.
Han-Model If we fix the weights of the 4 features based on language models, we can regard our system as a simple version of the statistical model proposed by [4].
Exact-Answer In our proposed method, we do not require all nugget sentences to be ranked higher than non-nugget sentences. In this baseline, we construct stricter constraints: all nugget sentences of a target should be ranked higher than the current non-nugget sentences in the top-k:

s.t. score(x^t, y^t_i) − score(x^t, s^t_j) ≥ 1, ∀ s^t_j ∈ ŷ^t \ y^t, ∀ y^t_i ∈ y^t    (1)

The comparison is on the TREC 2006 targets, with the TREC 2005 targets used for training. This is because the target sets of TREC 2005 and 2006 both include PERSON, ORGANIZATION, THING and EVENT targets, while TREC 2004 does not contain EVENT targets. θ is decided by 5-fold cross validation on the TREC 2005 targets. Table 1 shows the F3-score for each method. Though RankSVM and Exact-Answer use more features, they still fail to outperform Han-Model. This underlines the importance of the ranking method: if the weights of the features cannot be decided properly, the extra features will not help improve the performance. We can see that our method has an advantage, especially when k is relatively small.

Table 1. Comparison in terms of ranking (F3) on the TREC 2006 question set.

k    Our Method  RankSVM  Han-Model  Exact-Answer
10   0.2401      0.1697   0.2282     0.1842
15   0.2725      0.2068   0.2382     0.2100
20   0.2859      0.2186   0.2592     0.2737
25   0.2801      0.2225   0.2610     0.2643
30   0.2579      0.1944   0.2557     0.2449
35   0.2338      0.1916   0.2502     0.2153

3.2 Comparison with Other Systems

In [5], two state-of-the-art systems, the Soft Pattern model (SP) and the Human Interest Model (HIM), are evaluated on the TREC 2005 targets with the automatic evaluation tool Pourpre v1.0c [6]; [5] gives the results of their experiment with TREC 2005 as test data and TREC 2004 as training data. Following the setting of [5], we select the top 12 highest-ranked sentences (k = 12) as answers. Based on the analysis of the parameter θ, we let θ = 2. From Table 2, we can see that our method clearly outperforms SP and obtains a result comparable to HIM.

Table 2. Performance on the TREC 2005 question set.

System                      F3-Score
Soft-Pattern (SP)           0.2872
Human Interest Model (HIM)  0.3031
Our Method                  0.3095

4 Conclusion

In this paper, we integrate multiple resources to rank candidate answer sentences for definition question answering. Specifically, we propose a learning-to-rank method for this task. Instead of hoping that all definition sentences are at the top of the list of candidate answer sentences, we use a slack parameter θ to let the top-k sentences contain as many definition sentences as possible. Experimental results indicate that our proposed method performs better than several other ranking methods used in definition question answering, and that our system integrating multiple resources obtains results comparable to the state of the art.
REFERENCES
[1] Y. Chen, M. Zhou, and S. Wang, 'Reranking answers for definitional QA using language modeling', Proc. of ACL, (2006).
[2] K. Crammer and Y. Singer, 'Ultraconservative online algorithms for multiclass problems', Journal of Machine Learning Research, 3, 951–991, (2003).
[3] H. Cui and M.Y. Kan, 'Generic soft pattern models for definitional question answering', Proc. of ACL, (2005).
[4] K.S. Han, Y.I. Song, and H.C. Rim, 'Probabilistic model for definitional question answering', Proc. of SIGIR, (2006).
[5] K.W. Kor and T.S. Chua, 'Interesting nuggets and their impact on definitional question answering', Proc. of SIGIR, (2007).
[6] J. Lin and D. Demner-Fushman, 'Automatically evaluating answers to definition questions', Proc. of HLT-EMNLP, (2005).
[7] R. McDonald, 'Discriminative Sentence Compression with Soft Syntactic Evidence', Proc. of EACL, (2006).
[8] R. McDonald, K. Crammer, and F. Pereira, 'Online Large-Margin Training of Dependency Parsers', Proc. of ACL, (2005).
[9] X. Qiu, B. Li, C. Shen, L. Wu, X. Huang, and Y. Zhou, 'FDUQA on TREC2007 QA Track', Proc. of TREC, (2007).
[10] E.M. Voorhees, 'Overview of the TREC 2003 Question Answering Track', Proc. of TREC, (2003).
[11] J. Xu, Y. Cao, H. Li, and M. Zhao, 'Ranking definitions with supervised learning methods', Proc. of WWW, (2005).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-841
841
Ontology-Driven Human Language Technology for Semantic-Based Business Intelligence
Thierry Declerck1, Hans-Ulrich Krieger2, Horacio Saggion3 and Marcus Spies4
1 DFKI GmbH, Germany, email: declerck@dfki.de
2 DFKI GmbH, Germany, email: krieger@dfki.de
3 University of Sheffield, UK, email: H.Saggion@dcs.shef.ac.uk
4 Semantics Technology Institute, Austria, email: marcus.spies@sti2.at

Abstract. In this poster submission, we describe the current state of development of textual analysis and ontology-based information extraction in real-world applications, as defined in the context of the European R&D project MUSING, which deals with Business Intelligence. We present in some detail the current state of ontology development, including the time and domain ontologies that guide information extraction in an ontology population task.
1 INTRODUCTION

MUSING is a European R&D project dedicated to the development of Business Intelligence (BI) tools and modules founded on semantic-based knowledge and content systems. MUSING integrates Semantic Web and Human Language technologies to enhance the technological foundations of knowledge acquisition and reasoning in BI applications. The impact of MUSING on semantic-based BI is being measured in three strategic domains:
• Financial Risk Management (FRM), providing services for the supply of information to build a creditworthiness profile of a subject, from the collection and extraction of data from public and private sources up to the enrichment of these data with (semantic) indices, scores and ratings;
• Internationalization (INT), providing an innovative platform which an enterprise may use to support foreign market access and to benefit from resources originating in other markets;
• IT Operational Risk & Business Continuity (ITOpR), providing services to assess IT operational risks that are central for financial institutions, as a consequence of the Basel II Accord, and to assess risks arising specifically from an enterprise's IT systems, such as software, hardware, telecommunications, or utility outage/disruption.
Across these development streams of MUSING there are some common tasks, such as extracting relevant information from annual reports of companies and mapping this information into XBRL (the eXtensible Business Reporting Language). XBRL is a standardized way of encoding financial information of companies, but also the management structure, location, number of employees, etc. (see www.xbrl.org). This is mostly "quantitative" information, which is typically encoded in structured documents, like financial tables or company profiles. But for many Business Intelligence applications there is also a need to consider "qualitative" information, which is most of the time delivered in the form of unstructured text, as found in textual annexes to the balance sheets in annual reports or in news articles. The problem here is how to accurately integrate information extracted from structured sources, like the periodic reports of companies, with the day-to-day information provided by news agencies, mostly in unstructured text form. The detection and interpretation of temporal information in structured and unstructured documents is also a central focus of our attention in MUSING. We describe in the following the current state of development of the MUSING ontologies, including our proposal for temporal representation. Due to lack of space, we cannot show here examples of the kinds of temporal expressions we encounter in MUSING applications, nor how our IE and ontology population tools deal with those expressions in the light of our representation of temporal information, which also aims at supporting temporal reasoning in various applications. Those examples will, however, be available on the poster.
2 STATE OF MUSING ONTOLOGIES

In MUSING we decided to use the PROTON ontology (http://proton.semanticweb.org) as the upper-level ontology, on the basis of which domain-specific extensions can easily be defined. The species of the model of the PROTON Upper module is OWL Full; the version used in MUSING contains mostly the same information as the original one but is slightly changed to fulfill the OWL Lite criteria. The System module of PROTON, http://proton.semanticweb.org/2005/04/protons, provides a set of high-level system- or meta-primitives. It is the only component in PROTON that is not to be changed for the purposes of ontology extension. The Top-Level classes of PROTON, http://proton.semanticweb.org/2005/04/protont, represent the most common world knowledge concepts. These can directly be used for knowledge discovery, metadata generation and for interfacing intelligent knowledge access tools. PROTON also has an Upper module, http://proton.semanticweb.org/2005/04/protonu, which adds sub-classes and properties of the Top-module super-classes for the concepts other than "Abstract", "Happening" and "Object" from the original PROTON Top ontology. The "Extension" ontology in MUSING has been designed as a single contact point between the upper and the MUSING application-specific ontologies. In MUSING we also developed a general time ontology, which is likewise added to the upper module. Besides the time ontology, there are currently five domain ontologies, which are not assigned to any particular application. They cover the following areas: Company, Industry sector, BACH (a standard for harmonizing the accounts of companies across countries), XBRL (the standard language for business reporting) and Risk. In the time ontology of MUSING, temporally-enriched facts are represented through time slices,
four-dimensional slices of what Sider (1997) calls a spacetime worm (we only focus on the temporal dimension in MUSING). These worms, often referred to as perdurants, are the objects we are talking about. The time ontology itself contains the conceptualization of the temporal objects that are relevant in MUSING; in fact, any time ontology can be combined with the "4D" ontology. The other ontologies are domain- and application-specific. As a concluding remark about the ontologies, we would like to mention that they have been built by hand, most of them on the basis of "competency questions" provided by domain experts. It is also planned in MUSING to investigate (semi-)automatic ontology learning or creation, on the basis of information and knowledge extracted from the analyzed data. The poster presentation will mainly visualize the interconnections of the ontologies, and the integrated reasoning component that has been designed to act on the ontologies and the knowledge bases of MUSING.
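As an illustration only (the MUSING ontologies are expressed in OWL, not code), a time slice of a perdurant can be thought of as a relation instance with possibly underspecified temporal bounds; all names below are our own:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TimeSlice:
    """A temporal slice of a perdurant: a relation instance holding
    over an interval whose bounds may be underspecified (cf. the
    yearDate class for year-granularity dates)."""
    holder: str
    relation: str
    value: str
    start_year: Optional[int] = None   # None = underspecified
    end_year: Optional[int] = None
```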
3 ONTOLOGY-BASED INFORMATION EXTRACTION IN MUSING

In the previous section we presented in some detail the different types of MUSING ontologies and the way they interact (mainly via the "Extension" ontology). This model of the concepts relevant to a set of Business Intelligence applications has to be filled (or populated) with real data, so that the applications can make use of the semantic capabilities of such an ontology infrastructure. We call this task "ontology population"; it is in a sense Information Extraction (IE) guided by ontologies, with the results of IE displayed not in the form of templates but in knowledge representation languages, e.g. OWL in the case of MUSING. The information stored in this way is considered as "instances" of the concepts and relations introduced in the ontology. The set of instances builds the knowledge base for the applications, and this knowledge base supports, for example, credit institutions in their decision-making procedures on credit issuing. As mentioned in the introduction, a substantial amount of the information needed for the development of semantic business intelligence applications is to be found in unstructured textual documents, so that the automatic ontology population task relies on natural language processing in general and Information Extraction in particular.

It is important to note here that all the instances of the ontologies populated by means of the IE tools are automatically "enveloped" within temporal information, which turns every entity or event into a perdurant. In case temporal information is not available, or has not been found, it can be left underspecified in the representation of the instances and filled with information generated from other resources, or by the temporal reasoning engine also implemented in MUSING. As an example we can look at the following sentence, taken from a newspaper: "Ermotti arbeitete frueher kurz fuer den weltgroessten Finanzkonzern Citigroup und danach 17 Jahre lang bis 2004 fuer die Investmentbank Merrill Lynch." (Earlier, Ermotti worked for a short time for the world's largest financial concern, Citigroup, and afterwards for 17 years, until 2004, for the investment bank Merrill Lynch.) This is a quite interesting sentence, since it contains many temporal expressions (actually quite normal in news articles). The first two expressions ("before" and "a short time") are again very vague, so here we assume that the "before" actually means "before the publication date". The next temporal expressions are "for 17 years" and "till 2004". From these two expressions we now get more precise information: the relation "Ermotti works at Merrill Lynch" is first associated with a duration of 17 years, and in a second step we can calculate the starting point of this relationship, since an ending point is given: 2004 (we allow for such underspecification in the time ontology, having introduced a class called "yearDate"). In order to extract this information and to populate the ontology, we need a deeper linguistic analysis. With the help of syntactic analysis (more specifically, dependency analysis) we extract that there is a working relationship between Ermotti (the subject of the first clause of the sentence) and Merrill Lynch. We can associate the time code with this relationship on the basis of the dependency analysis of the two temporal expressions, as linguistic expressions that "modify" the main verb "arbeitete" (worked). The name of the company for which Ermotti is working is included in a prepositional phrase (PP). The linguistic pattern "[NP-SUBJ X] works [PP for [NP-IOBJ Y]]" is a very good candidate for a mapping onto a relation <X is employed by Y>, with the clear constraints that "X" is an instance of a person and "Y" an instance of a company (the domain and range of the relation). In this example, the reader can see how the constituent analysis of text, coupled with named entity detection, some lexical semantics and dependency relations, guides the ontology population. We can also see that there are at least three syntactic ways to express temporal information: as an adverb, an NP or a PP. First the textual analysis gives a linguistic structure to the unstructured text, on the basis of which we define a mapping that associates the name of the person with the person ontology and the name of the company with the company ontology. The relationship <Ermotti, is employed by, Merrill Lynch> can then be associated with the time slice "1987-2004". From the individual news article under consideration we cannot extract information about the activities of Ermotti between 2004 and 2005-12-16 (the publication date), but we assume that he had an activity in the banking domain. We can thus automatically query for documents telling us something about "Ermotti" and the year 2005, in order to "fill the temporal gap" in the information card about Ermotti. The already extracted information and the temporal ontology of MUSING structure the semantic content of the query. On this basis we found, for example, an article published one year later, on 2006-12-06. The poster presentation will visualize in detail the interconnection of the ontologies and the NLP and IE tools used to populate the ontologies.
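The start-point calculation used in the Ermotti example (end point 2004, duration 17 years, hence start 1987) amounts to the following trivial rule; this is a self-contained sketch with our own names, not MUSING's reasoning engine:

```python
def complete_interval(start, end, duration):
    """Derive a missing interval bound from a duration, leaving bounds
    underspecified (None) when they cannot be computed."""
    if start is None and end is not None and duration is not None:
        start = end - duration
    elif end is None and start is not None and duration is not None:
        end = start + duration
    return start, end

# "17 Jahre lang bis 2004" -> the works-for relation holds 1987-2004
print(complete_interval(None, 2004, 17))   # (1987, 2004)
```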
4 Conclusion

In this poster, we show how we combine Semantic Web resources and tools with Language Technologies in order to help create knowledge bases in the field of Business Intelligence applications, thus upgrading the current strategies implemented in this field. Building on quantitative and qualitative information automatically extracted from various types of documents, this work moves towards a new generation of semantically driven Business Intelligence methods and tools.
ACKNOWLEDGEMENTS The research described in this paper has been partially financed by the European Integrated Project MUSING, with contract number FP6-027097.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-843
843
Evaluation Evaluation
David M. W. Powers1
1 AILab, CSEM, Flinders University of South Australia, email: David.Powers@flinders.edu.au

Abstract. Over the last decade there has been increasing concern about the biases embodied in traditional evaluation methods for Natural Language Processing/Learning, particularly methods borrowed from Information Retrieval. Without knowledge of the Bias and Prevalence of the contingency being tested, or equivalently the expectation due to chance, the simple conditional probabilities Recall, Precision and Accuracy are not meaningful as evaluation measures, either individually or in combinations such as F-factor. The existence of bias in NLP measures leads to the 'improvement' of systems by increasing their bias, such as the practice of improving tagging and parsing scores by using the most common value (e.g. water is always a Noun) rather than attempting to discover the correct one. In this paper, we analyze both biased and unbiased measures theoretically, characterizing the precise relationship between all these measures.
1 INTRODUCTION

A common but poorly motivated way of evaluating the results of Language and Learning experiments is to use Recall, Precision and F-factor. These measures are named for their origin in Information Retrieval and embody specific biases: they ignore performance in correctly handling negative examples, they propagate the underlying marginal Prevalences and Biases, and they fail to take account of chance-level performance. In the Medical Sciences, Receiver Operating Characteristics (ROC) analysis has been borrowed from Signal Processing and has become a standard for evaluation and standard setting, comparing the Recall-like True Positive Rate and the False Positive Rate. In the Behavioural Sciences, the related concepts of Specificity and Sensitivity are commonly used. Alternative techniques, such as Rand Accuracy, have some advantages but are nonetheless still biased measures unless explicitly debiased.
2 THE BINARY CASE

It is common to introduce the various measures in the context of a dichotomous binary classification problem, where the labels are by convention + and − and the predictions of a classifier are summarized in a four-cell contingency table. This contingency table may be expressed using raw counts of the number of times each predicted label is associated with each real class, A, B, C, D, summing to N, or we may use acronyms for the generic terms for True and False, Real and Predicted Positives and Negatives, or else relative versions of these: tp, fp, fn, tn and rp, rn, pp, pn refer to the joint and marginal probabilities, and the four contingency cells and the two pairs of marginal probabilities each sum to 1. Both systems are illustrated in Table 1.
We thus make the specific assumptions that we are predicting and assessing a single condition that is either positive or negative (dichotomous), that we have one predicting model, and one gold standard labelling.
2.1 Recall & Precision, Sensitivity & Specificity
Recall or Sensitivity (as it is called in Psychology) is the proportion of Real Positive cases that are correctly Predicted Positive. This measures the Coverage of the Real Positive cases by the +P (Predicted Positive) rule. Its desirable feature is that it reflects how many of the relevant cases the +P rule picks up. It tends not to be very highly valued in Information Retrieval (on the assumptions that there are many relevant documents, that it doesn't really matter which subset we find, and that we can't know anything about the relevance of documents that aren't returned), and tends to be neglected or averaged away in Machine Learning and Computational Linguistics (where the focus is on how confident we can be in the rule or classifier). However, Recall has been shown to have a major weight in predicting success in several contexts, including these areas; in a Medical context Recall is primary, though it is referred to as the True Positive Rate (tpr). Recall is defined, with its various common appellations, by equation (1):

Recall = Sensitivity = tpr = tp/rp    (1)
Conversely, Precision or Confidence (as it is called in Data Mining) denotes the proportion of Predicted Positive cases that are correctly Real Positives. It can also be called True Positive Accuracy (tpa), as a measure of the accuracy of Predicted Positives, in contrast with the rate of discovery of Real Positives (tpr). Precision is defined in (2):

Precision = Confidence = tpa = tp/pp    (2)
These two measures and their combinations focus only on the positive examples and predictions, although between them they capture some information about the rates and kinds of errors made. However, neither of them captures any information about how well the model handles negative cases. Recall relates only to the +R column and Precision only to the +P row. Neither of these takes into account the number of True Negatives. This also applies to their Arithmetic, Geometric and Harmonic Means: A, G and F = G²/A (the F-factor or F-measure).
Table 1. Systematic and traditional notations in a contingency table.

Probabilities (systematic):      Counts (traditional):

        +R    −R    |                  +R     −R    |
  +P    tp    fp    | pp         +P    A      B     | A+B
  −P    fn    tn    | pn         −P    C      D     | C+D
        rp    rn    | 1                A+C    B+D   | N
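To make the relationships concrete, the sketch below computes the measures discussed in this paper from the raw counts of Table 1 (A = TP, B = FP, C = FN, D = TN); Informedness and Markedness anticipate the definitions given in Section 2.4:

```python
def contingency_measures(A, B, C, D):
    """All quantities follow Table 1; the divisions assume
    non-degenerate margins (no empty row or column)."""
    N = A + B + C + D
    recall = A / (A + C)            # tpr / Sensitivity, eq. (1)
    precision = A / (A + B)         # tpa / Confidence, eq. (2)
    inv_recall = D / (B + D)        # tnr / Specificity
    inv_precision = D / (C + D)
    return {
        "recall": recall,
        "precision": precision,
        "inverse_recall": inv_recall,
        "inverse_precision": inv_precision,
        "rand_accuracy": (A + D) / N,
        "informedness": recall + inv_recall - 1,      # tpr - fpr
        "markedness": precision + inv_precision - 1,
    }
```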
Usually there is in principle nothing special about the Positive case, and we can define Inverse statistics in terms of the Inverse problem, in which we interchange positive and negative and predict the opposite case. Inverse Recall or Specificity is thus the proportion of Real Negative cases that are correctly Predicted Negative (Inverse Recall = tnr = tn/rn), and is also known as the True Negative Rate. Rand Accuracy explicitly takes into account the classification of negatives, and is expressible both as a weighted average of Precision and Inverse Precision and as a weighted average of Recall and Inverse Recall. Conversely, the Jaccard or Tanimoto similarity coefficient explicitly ignores the correctly classified negatives (TN). Each of these measures also has a complementary form defining an error rate, of which some have specific names and importance: Fallout or False Positive Rate (fpr) is the proportion of Real Negatives that occur as Predicted Positive (ring-ins); Miss Rate or False Negative Rate (fnr) is the proportion of Real Positives that are Predicted Negative (false-drops).
2.2 Prevalence, Bias, Cost & Skew
We now turn our attention to various forms of bias or skew that detract from the utility of all of the above surface measures [1,2]. We will first note that rp represents the Prevalence of positive cases, RP/N – it is not usually under the control of the experimenter. By contrast, pp represents the (label) Bias of the model [1], the tendency of the model to output positive labels, PP/N, and is directly under the control of the experimenter, who can change the model by changing the theory or algorithm, or some parameter or threshold. A common rule of thumb, and a characteristic of some algorithms, is to parameterize a model so that Prevalence = Bias, viz. rp = pp. Corollaries of this setting are Recall = Precision (= A = G = F), Inverse Recall = Inverse Precision and Fallout = Miss Rate.
2.3 ROC and PN Analyses
Flach [4] has highlighted the utility of ROC analysis to the Machine Learning community, and characterized the skew sensitivity of many measures in that context, utilizing the ROC format to give geometric insights into the nature of the measures and their sensitivity to skew. ROC analysis plots the rate tpr against the rate fpr. A common criterion is to maximize the area under the curve (AUC), which for a single parameterization of a model is defined by a single point and the segments connecting it to (0,0) and (1,1). A particular cost model and/or accuracy measure defines an isocost gradient, which for a skew- and cost-insensitive model will be c = 1, and hence another common approach is to choose a tangent point on the highest isocost line that touches the curve. The area under the simple trapezoid is: AUC = 1 − (fpr + fnr)/2.
2.4 DeltaP, Informedness and Markedness

Powers [2] also derived an unbiased accuracy measure to avoid the bias of Recall, Precision and Accuracy due to population Prevalence and label bias. The Bookmaker algorithm costs wins and losses in the same way a fair bookmaker would set prices based on the odds. Powers then defines the concept of Informedness, which represents the 'edge' a punter has in making his bet, as evidenced and quantified by his winnings. Fair pricing based on correct odds should be zero sum – that is, guessing will leave you with nothing in the long run, whilst a punter with certain knowledge will win every time. Informedness is the probability that a punter is making an informed bet and is explained in terms of the proportion of the time the edge works out versus ends up being pure guesswork. Powers defined 'Bookmaker Informedness' for the general, K-label, case, but we present only the dichotomous formulation of Powers Informedness, as well as the complementary concept of Markedness. In fact, Bookmaker Informedness-based formulae may be averaged over all labels according to the label bias, and Markedness-based formulae over all classes by prevalence.

Definition 1 Informedness quantifies how informed a predictor is for the specified condition, and specifies the probability that a prediction is informed in relation to the condition (versus chance).

Informedness = Recall + Inverse Recall − 1 = tpr − fpr = 1 − fnr − fpr = 2·AUC − 1 = (Recall − Bias) / (1 − Prevalence)   (3)

Definition 2 Markedness quantifies how marked a condition is for the specified predictor, and specifies the probability that a condition is marked by the predictor (versus chance).

Markedness = Precision + Inverse Precision − 1 = tpa − fna = 1 − fpa − fna = (Precision − Prevalence) / (1 − Bias)   (4)

These definitions are aligned with the psychological and linguistic uses of the terms condition and marker. The condition represents the experimental outcome we are trying to determine by indirect means. A marker or predictor (cf. biomarker or neuromarker) represents the indicator we are using to determine the outcome. There is no implication of causality; however, there are two possible directions of implication. Detection of the predictor may reliably predict the outcome, with or without the occurrence of a specific outcome condition reliably evincing the predictor. In the Psychology literature, Markedness is known as DeltaP and is empirically a good (normative) predictor of human associative judgements – that is, it seems we develop associative relationships between a predictor and an outcome when DeltaP is high, and this is true even when multiple predictors are in competition. Conversely, a complementary, backward measure of strength of association, DeltaP′, aka Informedness, has been proposed [5]. Note that we can also estimate significance and confidence [3]:

χ² = N · Informedness · Markedness   (5)
CI = 1 − |Informedness| / √[N−1];   CM = 1 − |Markedness| / √[N−1]
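To make the dichotomous definitions above concrete, the following minimal Python sketch (ours, not the paper's) computes the surface measures and the chance-corrected Informedness and Markedness from the contingency counts A, B, C, D of the table at the start of this section; the example counts are arbitrary and all margins are assumed non-zero.

def evaluation_measures(A, B, C, D):
    # A = tp, B = fp, C = fn, D = tn (see contingency table above)
    N = A + B + C + D
    recall = A / (A + C)               # true positive rate, tpr
    inv_recall = D / (B + D)           # specificity, true negative rate
    precision = A / (A + B)            # tpa
    inv_precision = D / (C + D)
    fallout = B / (B + D)              # false positive rate, fpr
    miss_rate = C / (A + C)            # false negative rate, fnr

    informedness = recall + inv_recall - 1       # = tpr - fpr = 2*AUC - 1  (3)
    markedness = precision + inv_precision - 1   # = DeltaP                 (4)
    auc = 1 - (fallout + miss_rate) / 2          # single-point trapezoid
    chi2 = N * informedness * markedness         # significance estimate    (5)
    return informedness, markedness, auc, chi2

# e.g. a predictor with A=70, B=10, C=30, D=90:
print(evaluation_measures(70, 10, 30, 90))
# informedness = 0.6, markedness = 0.625, auc = 0.8, chi2 = 75.0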
REFERENCES
[1] Lafferty, J., McCallum, A. & Pereira, F. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the 18th International Conference on Machine Learning (ICML-2001), CA: Morgan Kaufmann, pp. 282-289.
[2] Powers, David M. W. (2003). Recall and Precision versus the Bookmaker. Proceedings of the International Conference on Cognitive Science (ICSC-2003), Sydney, Australia, pp. 529-534. http://david.wardpowers.info/BM/index.htm accessed 22 December 2007.
[3] Powers, David M. W. (2007). Evaluation. Flinders InfoEng Tech Rept SIE07001. http://www.infoeng.flinders.edu.au/research/techreps/SIE07001.pdf
[4] Flach, P.A. (2003). The Geometry of ROC Space: Understanding Machine Learning Metrics through ROC Isometrics. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, pp. 226-233.
[5] Perruchet, Pierre and Peereman, R. (2004). The exploitation of distributional information in syllable processing. Journal of Neurolinguistics 17:97-119.
6. Uncertainty and AI
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-847
Using Decision Trees as the Answer Networks in Temporal Difference-Networks
Laura-Andreea Antanas1, Kurt Driessens1, Jan Ramon1 and Tom Croonenborghs2

1 Introduction
State representation for intelligent agents is a continuous challenge as the need for abstraction is unavoidable in large state spaces. Predictive representations offer one way to obtain state abstraction by replacing a state with a set of predictions about future interactions with the world. One such formalism is the Temporal-Difference Networks framework [2]. It splits the representation of knowledge into the question network and the answer network. The question network defines which questions (interactions) about future experience are of interest. It contains nodes, each corresponding to a single scalar prediction about a future observation given a certain sequence of interactions with the environment. The nodes are connected by links, annotated with action-labels, which represent temporal relationships between the predictions made by the nodes, conditioned on the action-labels on the links (more details in [2]). The answer network provides the predictive models to update the answers to the defined questions, which are expected values of the scalar quantities in the nodes. These values can be seen as estimates of probabilities. With each executed action of the agent, the predictions are updated using the answer network models to obtain a description of the new state. In classical TD-networks, logistic regression models are used, whose weight vector is obtained using a gradient learning approach. We propose the use of probability-valued decision trees [1] in the answer network of TD-Nets. We believe that decision trees are a particularly good choice to investigate, as they offer a different yet powerful form of generalization. Moreover, this aids in a better understanding of the strengths and weaknesses of TD-Nets and represents an important first step towards using them in worlds with more extensive observations. Furthermore, decision tree induction can be regarded as a prototypical example of a non-gradient learning approach.
2 Decision Trees as Answer Networks in TD-Nets
The (abstracted) state representation in a TD-Net consists of (1) the predictions made by the TD-Net in the previous timestep, yt−1 = [y(1)t−1, . . . , y(n)t−1] (one prediction for each node), with n the number of nodes in the question network, (2) the action executed during the last time frame, at−1, and (3) the current observation ot, i.e., xt = [yt−1, at−1, ot]. From this vector, the answer network will compute new predictions yt = f(xt). In the original implementation of TD-Nets, this answer function is represented by a logistic regression function, i.e., yt = fW(xt) = σ(W xt).
1 Declarative Languages and Artificial Intelligence, Katholieke Universiteit Leuven, Leuven, Belgium, email: {laura,kurtd,janr}@cs.kuleuven.be
2 Biosciences and Technology Department, KH Kempen University College, Geel, Belgium, email: tom.croonenborghs@khk.be
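For concreteness, here is a minimal Python sketch of such a logistic-regression answer network; the simple squared-error gradient step is our assumption for illustration, not the exact TD(λ) update of [3], and the class and parameter names are ours.

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

class LogisticAnswerNetwork:
    """y_t = sigma(W x_t): one prediction per question-network node."""
    def __init__(self, n_nodes, n_features, alpha=0.5):
        self.W = np.zeros((n_nodes, n_features))
        self.alpha = alpha  # learning rate (the paper uses alpha = 0.5)

    def predict(self, x):
        return sigmoid(self.W @ x)

    def update(self, x, z):
        """Move predictions toward targets z (squared-error gradient step)."""
        y = self.predict(x)
        grad = (z - y) * y * (1 - y)          # sigmoid chain rule
        self.W += self.alpha * np.outer(grad, x)

# x_t = [y_{t-1}, a_{t-1}, o_t] flattened into one feature vector
net = LogisticAnswerNetwork(n_nodes=3, n_features=5)
x = np.array([0.5, 0.5, 0.5, 1.0, 0.0])
print(net.predict(x))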
We investigate the use of probability trees [1] as an alternative to logistic regression for the answer network. The modification of the original TD-Nets framework consists solely in the introduction of a probability tree function fT as the answer network, instead of the logistic regression function fW. Both the semantics of TD-Nets and the temporal learning principle remain unchanged. The difference between the two approaches is that we choose to represent the predictive model with one tree for each node. In the original TD-Nets implementation it is common to learn a set of weight vectors (aggregated in matrices), with one matrix for each action.
Figure 1. Dependencies between the different values used for the generation of learning examples. The structure of a learning example is shown by the shaded box, where zt is the target.
For example, if the actions taken at time t and t+1 are at and at+1 respectively, our version of TD(1) will generate the learning examples as a result of these interactions. If a node n′ in the network is conditioned by actions at and at+1, following this order in time, TD(1) will use the observation ot+2 as the target for the input vector [yt−1, at−1, ot]. For another node n″, conditioned only by action at, it will use [yt−1, at−1, ot] as the input vector and ot+1 as the target. In practice, we implement this approach by using a history of chosen actions and observations with a length equal to the maximum depth of the question network. We choose to use a TD(1) approach because it generates the most informative learning examples for the probability tree. At the start, the predictions y will be mostly noise. This means that both the input
vector as well as the target, when not based directly on an observation, will contain noise. TD(1) will avoid the second source of noise when possible. Experiments reported in [3] show that TD(1) gives the best learning performance for a logistic regression approach too.

Incremental Tree Learning. As stated before, we learn a single probability tree for each node in the question network. We employ binary decision trees, where internal nodes in the tree test attribute-value combinations. The available decision tests are identified before the induction of the probability trees begins, using a language bias defined by the user of the system. It specifies the possible actions, ranges for prediction values and observations for the world that can be considered when building the tree. Since predictions from the question network nodes are needed while still learning the answer network, we need an incremental tree learning algorithm. The incremental tree induction algorithm we used is described in Algorithm 1.

Algorithm 1 Incremental Tree Induction
1: initialize by creating a tree with a single (empty) leaf
2: for each learning example do
3:   sort the example down the tree until it reaches a leaf
4:   update the statistics in the leaf and store the example
5:   if number of examples in leaf > window size then
6:     remove oldest example in the leaf and update the statistics
7:   end if
8:   if a split is needed and # examples in leaf > min ex size then
9:     generate an internal node using the indicated test
10:    grow 2 new empty leaves
11:  end if
12: end for

Each leaf in the tree stores statistical information about the examples it contains. This allows the algorithm to compute standard deviations of the target value for all subsets created by all the available tests. The splitting criterion checks whether the examples in one leaf are sufficiently coherent with respect to their target value. A leaf is split when it contains enough examples for the statistics to be reliable: min ex size = 30. This algorithm is greedy and has no mechanism to undo early mistakes. Since in the TD-Network learning setting both the inputs and the targets of early learning examples can be noisy, we employ a sliding window approach to forget early learning examples. In the experiments, we use a window size = 50.
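The following Python sketch illustrates the leaf bookkeeping behind Algorithm 1 (sliding window plus a variance-based coherence test); the concrete split score below is our assumption, consistent with the description above but not necessarily the authors' exact criterion.

from collections import deque

MIN_EX_SIZE = 30   # examples required before a leaf may split
WINDOW_SIZE = 50   # sliding window: forget the oldest examples

class Leaf:
    def __init__(self, tests):
        self.examples = deque()   # (x, target) pairs, oldest first
        self.tests = tests        # candidate attribute-value tests

    def add(self, x, target):
        self.examples.append((x, target))
        if len(self.examples) > WINDOW_SIZE:
            self.examples.popleft()          # drop the oldest example

    def best_split(self):
        # pick the test whose two subsets have the lowest summed
        # target variance; None means no test improves coherence
        def var(ts):
            if not ts:
                return 0.0
            m = sum(ts) / len(ts)
            return sum((t - m) ** 2 for t in ts)
        best, best_score = None, var([t for _, t in self.examples])
        for test in self.tests:
            left = [t for x, t in self.examples if test(x)]
            right = [t for x, t in self.examples if not test(x)]
            if var(left) + var(right) < best_score:
                best, best_score = test, var(left) + var(right)
        return best

    def maybe_split(self):
        if len(self.examples) < MIN_EX_SIZE:
            return None
        test = self.best_split()
        if test is None:
            return None
        return test, Leaf(self.tests), Leaf(self.tests)  # node + 2 leaves

leaf = Leaf(tests=[lambda x: x[0] > 0.5])
for i in range(40):
    leaf.add([i % 2], float(i % 2))
print(leaf.maybe_split() is not None)   # True: a coherent split exists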
3 Empirical Evaluation
We compare the original logistic regression approach with the probability trees approach. To this end, we implemented our own version of the original TD(λ) learning algorithm as described in [3]. We performed experiments in two different environments: a ring world and a simple grid world. Experimental results are presented only for the 5-state deterministic ring world, as also used in [3]. Our ring world contains 5 interconnected circles. A circle indicates a state in the world. The agent has two different actions, A = {N, P}. N moves the agent to the adjacent state in clockwise rotation; P moves the agent in the counter-clockwise direction. The agent can only observe whether it is in state 1 or not, i.e. the observation bit is on (1) if the agent is in state 1 and off (0) otherwise. As in [3], we used symmetric action-conditional networks of depth 1, 2 and 3 as question networks. For the experiments with the classical TD-networks we used a learning rate α = 0.5, obtaining similar results to the ones presented in [3]. To compare the different learning algorithms, we used the root mean-squared error (RMSE) to determine the quality of the learned
models. This error at time t is calculated by comparing, for each node i, the correct target z*i with the prediction y(i)t of the learned model. The correct targets are computed using full knowledge of the environment. Hence, if the RMSE converges to 0, a correct answer network has been learned. The experiments are performed in an episode-based fashion. All experiments present the average RMSE as a function of the number of episodes over 10 different runs.
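As a concrete rendering of this error measure (assuming the per-node predictions and correct targets are collected as arrays):

import numpy as np

def rmse(y_pred, z_true):
    """Root mean-squared error over the question-network nodes."""
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(z_true)) ** 2))

print(rmse([0.9, 0.1, 0.5], [1.0, 0.0, 0.5]))  # ~0.0816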
Figure 2. RMSE-curves for the symmetric action-conditional networks in the 5-state ring world: average RMSE versus the number of learning episodes, for fW and fT with question networks of depth 1, 2 and 3.
Figure 2 shows the results for the ring world with question networks of depth 1, 2 and 3. f T always converges faster than f W for networks of equal depth. For networks of depth 1 it is not possible to provide a completely accurate answer network, as the question network is too small to represent the full environment, but the probability tree learner quickly learns the best approximation. As expected, networks with larger depth perform better. The results for the grid world also show a faster learning performance for the decision trees.
4 Conclusion
We introduced the use of probability trees as answer networks in TD-Networks. We illustrated how to translate the standard TD(1) learning approach into a training example generator and evaluated the performance of a simple incremental and greedy tree induction algorithm. The experimental evaluation shows consistently that probability trees outperform the original logistic regression approach in learning performance. We consider this an important step towards the wider applicability of TD-Networks. As we only regard this work as a proof of concept, a wide range of future work is possible. The current implementation of the tree induction algorithm could be significantly improved, for example by including tree restructuring operators or by extending the learning algorithm to learn model-trees that combine the advantages of regression trees and logistic regression. Also, other non-parametric learners could be substituted for the probability tree learner. One exciting direction is the use of more elaborate observations, such as those for relational worlds. In this context decision trees offer the advantage of a more flexible parameterisation.
REFERENCES
[1] D. Fierens, J. Ramon, H. Blockeel, and M. Bruynooghe, 'A comparison of approaches for learning probability trees', in Proceedings of the 16th European Conference on Machine Learning (ECML-05), pp. 556–563, (2005).
[2] R.S. Sutton and B. Tanner, 'Temporal-difference networks', in Advances in Neural Information Processing Systems 17, pp. 1377–1384, (2004).
[3] B. Tanner and R.S. Sutton, 'TD(λ) networks: temporal-difference networks with eligibility traces', in Proceedings of the 22nd International Conference on Machine Learning (ICML-05), pp. 888–895, (2005).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-849
An Efficient Deduction Mechanism for Expressive Comparative Preferences Languages
Nic Wilson1

1 INTRODUCTION
Recent years have seen a considerable literature develop in the Artificial Intelligence community on formalisms for reasoning with comparative preferences in combinatorial problems, involving statements which compactly express the relative preference of outcomes (complete assignments to a set of variables). A fundamental task for reasoning with preferences is the following: given input preference information from a user, and outcomes α and β, should we infer that the user will prefer α to β? For CP-nets and related comparative preference formalisms, inferring a preference of α over β using the standard definition of derived preference appears to be extremely hard, and has been proved to be PSPACE-complete in general for CP-nets [5]. Such inference is also rather conservative, only making the assumption of transitivity, and tends to lead to weak preferences with a great deal of incomparability. It is very often desirable to be able to fill out the user's direct preferences in a plausible way by some kind of extrapolation, generating a fuller relation. This paper, generalising the approach in [9], defines a less conservative approach to inference which can be applied for very general forms of input. It is shown to be efficient for rather expressive comparative preference languages, allowing comparisons between arbitrary partial tuples (including complete assignments), and with the preferences being ceteris paribus or not. No acyclicity conditions are required regarding the input statements, and consistency (i.e., acyclicity of the preference relation) is not assumed. This paper is a short version of [10].

Terminology. Let V be a set of n variables. For each X ∈ V let dom(X) be the set of possible values of X; we assume dom(X) has at least two elements. For a subset of variables A ⊆ V let dom(A) = ∏X∈A dom(X) be the set of possible assignments to the set of variables A. (We write the unique assignment to the empty set of variables simply as the empty assignment.) An outcome is an element of dom(V), i.e., an assignment to all the variables. If a ∈ dom(A) is an assignment to A, and b ∈ dom(B), where A ∩ B = ∅, then we may write ab for the assignment to A ∪ B which combines a and b. For partial tuples a ∈ dom(A) and u ∈ dom(U), we may write a |= u, or say a extends u, if A ⊇ U and a(U) = u, i.e., a projected to U gives u. More generally, we say that a is compatible with u if there exists an outcome α ∈ dom(V) extending both a and u, i.e., such that α(A) = a and α(U) = u. This holds if and only if u and a agree on common variables, i.e., u(A ∩ U) = a(A ∩ U). Otherwise, we say that a and u are incompatible.

2 COMPARATIVE PREFERENCE STATEMENTS

In this paper we will focus especially on comparative preference statements ϕ of the form p > q - T, where P, Q and T are subsets of V, and p ∈ dom(P) is an assignment to P, and q ∈ dom(Q) is an assignment to Q. (We can assume, without loss of generality, that P ∩ T = ∅ and Q ∩ T = ∅.) Informally, the statement p > q - T represents the following: p is preferred to q if T is held constant. Formally, the semantics of this statement is given by the relation ϕ*, which is defined to be the set of pairs (α, β) of outcomes such that α extends p, β extends q, and α and β agree on T: α(T) = β(T). Each pair (α, β) in ϕ* represents a preference (e.g., of a single user) for outcome α over outcome β. Many comparative preference statements may be elicited, to form a set Γ. This set Γ thus directly represents the preferences Γ* = ⋃ϕ∈Γ ϕ*. We are particularly interested in such statements ϕ when P = Q. The statement can then be written as us > us′ - T, where U, S and T are disjoint sets of variables, and u ∈ dom(U), and s and s′ are assignments to S which differ on each variable: s(Z) ≠ s′(Z) for all Z ∈ S. Ceteris paribus preferences [other values being equal] are represented by statements with T = V − (U ∪ S); this includes CP-nets [1, 2], TCP-nets [3, 4] statements, and the feature vector rules in [6]. A CP-theory [8, 7] statement u : x > x′ [W] is exactly equivalent to the statement us > us′ - T when we set S = {X}, x = s, x′ = s′ and T = V − (U ∪ {X} ∪ W). A preference of outcome α over outcome β can be expressed by a statement of the form us > us′ - ∅ with S = V − U.
Selection-projections. The computational technique described in this paper is efficient essentially if and only if one can efficiently compute a particular compound operation on the input comparative preference statements: the projection of a selection. Fortunately, this operation is efficient for a broad class of natural comparative preference statements. Let a be an assignment to a set of variables A, and let Y be a set of variables disjoint from A. For a relation R on the set of outcomes, define the a-selection Ra of R to consist of all pairs (α, β) in R such that both α and β extend a. We define, for Y ⊆ V, the projection R↓Y of R to be the set of all pairs (y, y′) ∈ dom(Y) × dom(Y) such that there exist tuples z and z′ with (yz, y′z′) ∈ R. We write R_a^Y for (Ra)↓Y, the projection to Y of the a-selection of R. We call this compound operation a selection-projection. Let y, y′ ∈ dom(Y) be assignments to Y. We have (y, y′) ∈ R_a^Y if and only if there exist assignments z, z′ to V − (A ∪ Y) such that (ayz, ay′z′) ∈ R. We have the following important property:
1 Cork Constraint Computation Centre, Department of Computer Science, University College Cork, Cork, Ireland, n.wilson@4c.ucc.ie
Proposition 1 (Decomposition) For i in some index set I, let Ri be a relation on outcomes, and let R = ⋃i∈I Ri. Let a be an
assignment to a set of variables A, and let Y be a set of variables disjoint from A. Then R_a^Y = ⋃i∈I (Ri)_a^Y.

For a comparative preference statement ϕ and a set of comparative preference statements Γ, we abbreviate (ϕ*)_a^Y to ϕ_a^Y and abbreviate (Γ*)_a^Y to Γ_a^Y. We thus have Γ_a^Y = ⋃ϕ∈Γ ϕ_a^Y. We are interested in sets Y whose associated product set dom(Y) is not large (so, small sets of variables whose domains are fairly small). Then the relations Γ_a^Y are of manageable size, even though Γ* may very well be exponentially large.

Proposition 2 Let P, Q and T be subsets of V, and let p ∈ dom(P) be an assignment to P, and q ∈ dom(Q) be an assignment to Q. Let ϕ be a comparative preference statement of the form p > q - T, as defined above, where (P ∪ Q) ∩ T = ∅. Let a be an assignment to a set of variables A, and let Y be a set of variables disjoint from A. ϕ_a^Y is empty unless a is compatible with both p and q. If a is compatible with both p and q then ϕ_a^Y consists of all pairs (y, y′) such that (i) y and y′ agree on Y ∩ T, i.e., y(Y ∩ T) = y′(Y ∩ T); (ii) y is compatible with p; and (iii) y′ is compatible with q.

Each of these conditions can be checked in time at worst linear in n, the number of variables, and so the relation ϕ_a^Y can be computed in time linear in n, given that the size of dom(Y) is bounded by a constant. Proposition 2 therefore shows that computing selection-projections can be achieved efficiently for statements of the form p > q - T, and hence, by Proposition 1, for sets Γ of such statements.
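As an illustration of Proposition 2, the following brute-force Python sketch enumerates ϕ_a^Y for a statement p > q - T over small domains; it checks the three conditions by direct enumeration rather than in the linear-time fashion of the proposition, and all names in it are ours.

from itertools import product

def compatible(a, u):
    """Two partial assignments (dicts) agree on common variables."""
    return all(a[x] == u[x] for x in a.keys() & u.keys())

def selection_projection(p, q, T, a, Y, domains):
    """phi_a^Y for the statement p > q - T (cf. Proposition 2).
    p, q, a: partial assignments; T, Y: variable collections;
    domains: dict mapping each variable to its list of values."""
    if not (compatible(a, p) and compatible(a, q)):
        return set()
    assignments = [dict(zip(Y, vals))
                   for vals in product(*(domains[x] for x in Y))]
    result = set()
    for y in assignments:
        for y2 in assignments:
            if (all(y[x] == y2[x] for x in set(Y) & set(T))  # agree on Y ∩ T
                    and compatible(y, p) and compatible(y2, q)):
                result.add((tuple(sorted(y.items())),
                            tuple(sorted(y2.items()))))
    return result

# toy example: boolean X1, X2; p: X1=1, q: X1=0, T = {X2}
doms = {'X1': [0, 1], 'X2': [0, 1]}
print(len(selection_projection({'X1': 1}, {'X1': 0}, {'X2'},
                               {}, ['X1', 'X2'], doms)))   # 2 pairs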
3 Y-ENTAILMENT
This section defines a form of entailment, which we call Y-entailment, which is polynomial for a wide range of comparative preference statements; in particular, statements of the form p > q - T as described in Section 2, or any other comparative statements for which computing selection-projections is polynomial. Throughout this section, we consider a fixed family Y of sets of variables, which parameterises the inference relation, and a fixed (and completely arbitrary) input relation R on outcomes. We assume that Y satisfies the following property: if Y ∈ Y and non-empty Y′ is a subset of Y then Y′ ∈ Y. For example, Y might be defined to be all singleton subsets of V (i.e., sets with cardinality one), or, alternatively, all subsets of cardinality at most two, and so on. We also consider a fixed comparative preference statement ψ of the form us > us′ - ∅, where U and S are disjoint sets of variables, and u ∈ dom(U), and s and s′ are assignments to S which differ on each variable in S.

Definition 1 (Pickable and Decisive) Given a set Y ⊆ V and an assignment a to some subset A of V − Y, we define the relation ⪰_a^Y to be the transitive closure of R_a^Y. Suppose that u is compatible with a ∈ dom(A). A set of variables Y is said to be ψ-pickable given a if Y ∩ A = ∅ and either (i) Y ⊆ U and u(Y) is not ⪰_a^Y-equivalent to any other assignment in dom(Y); or (ii) Y ⊈ U and there exist y, y′ ∈ dom(Y) with y ⪰_a^Y y′, where y is compatible with us and y′ is compatible with us′. In this case we say that Y is ψ-decisive given a. (y is said to be ⪰_a^Y-equivalent to y′ if both y ⪰_a^Y y′ and y′ ⪰_a^Y y.)

The following algorithm defines Y-entailment: we say that R Y-entails ψ if and only if the algorithm returns true.
procedure Does R Y-entail ψ?
  for j := 1, . . . , n
    let aj be u restricted to Y1 ∪ · · · ∪ Yj−1 (in particular, a1 is the empty assignment);
    if there exists a set in Y which is ψ-decisive given aj then
      return false and stop;
    if there exists a set in Y which is ψ-pickable given aj then
      let Yj be any such set;
    else
      return true and stop;
  next j;
  return true.
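A direct Python transcription of the procedure is sketched below; the predicates decisive and pickable are assumed to be supplied by the caller (for example, built from selection-projections as in the earlier sketch), and the function names are ours.

def y_entails(family, n, restrict_u, decisive, pickable):
    """family: the sets of variables in Y; restrict_u(vars) returns u
    restricted to vars; decisive(Y, a) / pickable(Y, a) implement the
    psi-decisiveness / psi-pickability tests of Definition 1."""
    chosen = []                                  # Y_1, ..., Y_{j-1}
    for _ in range(n):
        covered = set().union(*chosen) if chosen else set()
        a = restrict_u(covered)
        if any(decisive(Y, a) for Y in family):
            return False
        picks = [Y for Y in family if pickable(Y, a)]
        if not picks:
            return True
        chosen.append(picks[0])                  # any pickable set will do
    return True

# trivial stub demo: nothing decisive, nothing pickable -> entailed
print(y_entails([{'X'}], 1, lambda vs: {},
                lambda Y, a: False, lambda Y, a: False))   # True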
Application to Deduction for Comparative Preference Statements. The relation R will very often be exponentially large, and so will need to be represented compactly, in particular as a set Γ of comparative preference statements (in some language), where Γ represents the relation R = Γ* on outcomes. We infer ψ of the form us > us′ - ∅ from Γ if and only if Γ* Y-entails ψ. Applying the algorithm requires us to compute selection-projections of the form Γ_a^Y, which we can compute as ⋃ϕ∈Γ ϕ_a^Y using Proposition 1.

Complexity: Assume that the domain sizes are bounded above by a constant, and that the elements of Y have cardinality at most k, so that |Y| is less than n^k. The algorithm is then O(m·n^(k+1)), where m = |Γ|.

Semantics: In [10] it is shown how Y-entailment can be given a semantics. (In fact, Y-entailment is defined semantically there, with the correctness of the algorithm then being a theorem.) In the standard entailment, Γ entails ψ if and only if every total pre-order extending Γ* also extends ψ*. (Equivalently, ψ* is a subset of the transitive closure of Γ*.) In contrast, Γ Y-entails ψ if and only if every total pre-order of a particular generalised lexicographic form which extends Γ* also extends ψ*. This implies that Y-entailment is more adventurous than the standard entailment.
REFERENCES
[1] C. Boutilier, R. Brafman, H. Hoos, and D. Poole, 'Reasoning with conditional ceteris paribus preference statements', in Proceedings of UAI-99, pp. 71–80, (1999).
[2] C. Boutilier, R. I. Brafman, C. Domshlak, H. Hoos, and D. Poole, 'CP-nets: A tool for reasoning with conditional ceteris paribus preference statements', Journal of Artificial Intelligence Research, 21, 135–191, (2004).
[3] R. Brafman and C. Domshlak, 'Introducing variable importance tradeoffs into CP-nets', in Proceedings of UAI-02, pp. 69–76, (2002).
[4] R. Brafman, C. Domshlak, and E. Shimony, 'On graphical modeling of preference and importance', Journal of Artificial Intelligence Research, 25, 389–424, (2006).
[5] J. Goldsmith, J. Lang, M. Truszczyński, and N. Wilson, 'The computational complexity of dominance and consistency in CP-nets', in Proceedings of IJCAI-05, pp. 144–149, (2005).
[6] M. McGeachie and J. Doyle, 'Utility functions for ceteris paribus preferences', Computational Intelligence, 20(2), 158–217, (2004).
[7] N. Wilson, 'Consistency and constrained optimisation for conditional preferences', in Proceedings of ECAI-04, pp. 888–892, (2004).
[8] N. Wilson, 'Extending CP-nets with stronger conditional preference statements', in Proceedings of AAAI-04, pp. 735–741, (2004).
[9] N. Wilson, 'An efficient upper approximation for conditional preference', in Proceedings of ECAI-06, pp. 472–476, (2006).
[10] N. Wilson, 'An efficient deduction mechanism for expressive comparative preferences languages', longer version of the current paper, available at the 4C website: http://www.4c.ucc.ie/, (2008).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-851
An Analysis of Bayesian Network Model-Approximation Techniques Adamo Santana1 and Gregory Provan2 Department of Computer Science, University College Cork, Cork, Ireland Abstract. Two approaches have been used to perform approximate inference in Bayesian networks for which exact inference is infeasible: employing an approximation algorithm, or approximating the structure. In this article we compare two structure-approximation techniques, edge-deletion and approximate structure learning based on sub-sampling, in terms of relative accuracy and computational efficiency. Our empirical results indicate that edge-deletion techniques dominate the subsampling/induction strategy, in both accuracy and performance of generating the approximate network. We show, for several large Bayesian networks, how edge-deletion can create approximate networks with order-of-magnitude inference speedups and relatively little loss of accuracy.
1 Introduction
Bayesian networks (BNs) have become an important tool for modeling and probabilistic inference. As the size and complexity of BN models increase, so too do the demands of performing inference. In cases where exact inference is intractable, it is important to use approximation techniques to enable inference to take place. Such approximation may apply to the inference algorithms (e.g., stochastic sampling algorithms [2], or other approaches [3, 8]), or to the BN model B (e.g., edge-reduction [1, 10], or probability-table/state-space approximation approaches [5, 7]). In this article, we focus on generating a space-bounded, approximate model, in cases where we have limitations on the space for embedding a BN model. Our objective is to examine the tradeoff between space and performance of different approximations, i.e., given an approximate model B′, what kinds of inference speedups do we obtain for what levels of inference accuracy, with respect to B? This goal contrasts with the objectives of previous network-approximation (e.g., edge-deletion) approaches [1, 10], where the primary interest was deleting edges while remaining within a certain error bound. Our contributions are the following. First, we compare two BN-approximation approaches, one using BN threshold-based sub-sampling and network induction, and the other using threshold-based edge deletion. We show that the sampling/induction approach is limited by the accuracy of the induction algorithm, and produces networks which are inferior to the edge-deletion approach, due to the network induction. We also show that, on a range of networks, the edge-deletion
1 adamo@ufpa.br; Supported by CAPES CBE/PDEE 0005/2007.
2 g.provan@cs.ucc.ie; Supported by SFI grant 04/IN3/I524.
approach can produce several orders-of-magnitude speedups in inference with small penalties in inference accuracy.
2 Technical Preliminaries
A BN model B is defined as a tuple (G, P), where G is a directed acyclic graph (DAG), and P is a set of probability distributions constructed from the vertices V = {Vi} of G such that Pr{V} = ∏i=1..n Pr{Vi | pa(Vi)}, where pa(Vi) are the parents of Vi in G. We compare two approaches, sub-sampling plus machine learning (SSML), and edge deletion (ED).

SSML Approach: In SSML, we generate from B a training dataset T composed of 10,000 random samples, using the GeNIe tool [4]. We then used a sampling threshold φ to prune from T all cases for which Pr(B) < φ, to create Tφ. For each value of φ examined, we induced an approximate network Bφ from Tφ using the constraint-based PC algorithm [9].

ED Approach: In ED, we generate from B an approximate network Bκ by pruning from B all those edges whose Kullback-Leibler (KL) divergence is below a threshold κ. The KL divergence [6] was chosen as the metric for indicating the importance of the dependence related to each edge of the network, since it is one of the most widely used methods for measuring the distance between distributions.

We adopt several metrics for the "quality" of an approximate network B′ with respect to B: the error on a test set is the difference in posterior probability averaged over the set Vt of target nodes; the complexity reduction factor, CT(B′)/CT(B), is the relative network complexity, based on using the maximum clique table size of B′, CT(B′), as an inference complexity measure; and the network reduction factor, S(B′, B), is a measure of the degree of isomorphism between B′ and B.
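To illustrate the ED approach, here is a small Python sketch; the particular per-edge score (average KL divergence between the child's conditional distributions and their mixture) is one standard choice consistent with the description above, not necessarily the authors' exact formula, and all function names are ours.

import math

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def edge_score(cond_dists, weights):
    """Importance of an edge U -> X: weighted average KL between
    P(X | u) for each parent value u and the mixture distribution P(X)."""
    marginal = [sum(w * d[i] for d, w in zip(cond_dists, weights))
                for i in range(len(cond_dists[0]))]
    return sum(w * kl(d, marginal) for d, w in zip(cond_dists, weights))

def prune_edges(edges, scores, kappa):
    """ED: keep only edges whose KL-based score reaches the threshold."""
    return [e for e in edges if scores[e] >= kappa]

# toy example: binary X with two parent configurations
dists = [[0.9, 0.1], [0.2, 0.8]]       # P(X | u0), P(X | u1)
print(edge_score(dists, [0.5, 0.5]))   # > 0: the edge carries information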
3 Experimental Analysis
We empirically compared the SSML and ED approaches to BN approximation using 7 benchmark networks: C17, Alarm, Hailfinder, Pignet, Barley, Munin and C250 (a circuit with 250 nodes and 500 arcs). In our experiments, we created networks based on sub-sampling thresholds of φ = e−10 , e−5 , 5e−10 , and KL thresholds κ = 0.1, 0.15 and 0.2. To test the error of each approximate network, we sampled to create a testing data set of 500 cases, such that we chose a set of “target” nodes whose posterior distributions we computed during testing. We computed several comparative measures, including the error rate for classification, the KL-divergence between
the distributions, and the maximum clique table size CT of the networks.

Figure 1 shows that, whereas for ED the CT values never increase with increasing threshold κ (meaning that the network never becomes computationally harder to evaluate), with SSML the CT values can increase with φ. This anomalous performance is due to the induction process, in which we cannot guarantee that the network structure learned will monotonically decrease in size and CT values with φ; in contrast, with ED, this is guaranteed as edge pruning occurs. This failure to guarantee that approximate networks will be computationally simpler with SSML means that it may not be possible to use this approach unless structure-based constraints can be applied during the induction phase.

Figure 1. Comparison of SSML and ED for approximation of network complexity: complexity reduction factor versus network reduction factor for the networks C17, Alarm, Hailfinder, Pignet, Barley, Munin and C250 (upper panel: Complexity Reduction Using SSML; lower panel: Complexity Reduction Using ED).

The other major difference is the computational cost of the structure approximation. The ED approach, since it only requires computing divergences and pruning the edges with low KL values, proves very efficient. In contrast, SSML has a high computational cost, since it involves computing posteriors for the original network and inducing an approximate network from a training set; both are expensive tasks for complex BNs. Since exact inference was computationally infeasible for the larger networks, we used several sampling-based inference algorithms [2], all of which generate 10,000 samples to ensure close convergence of the results to the exact value.

Figure 2 displays the results of tradeoffs made over a range of KL threshold values using ED, showing that a significant reduction in relative inference complexity occurs, with little loss of accuracy. For example, our data indicate that for the C250 network, we have O(10^6) faster inference with > 90% accuracy; for Munin, we have O(10^5) faster inference with ~ 80% accuracy.

Figure 2. Tradeoff curves for four larger networks using ED: accuracy versus relative inference complexity.

4 Conclusions

This paper compared two models for BN structure approximation, based on sub-sampling with network induction (SSML) and edge deletion (ED), to identify the types of tradeoff of inference speedup and loss of accuracy possible with each approach. We showed that SSML cannot guarantee monotonically faster inference with increasing network approximations; this arises because the network structure induced from approximate data (sampled from the original network) has high variance. In contrast, with ED, the tradeoffs of accuracy for faster inference are guaranteed to be monotonic. We have shown, for several large BNs, how ED can create approximate networks with order-of-magnitude inference speedups and relatively little loss of accuracy.

REFERENCES
[1] A. Choi and A. Darwiche, 'An Edge Deletion Semantics for Belief Propagation and Its Practical Impact on Approximation Quality', Proceedings AAAI, 21(2), 1107, (2006).
[2] B. D'Ambrosio, 'Inference in Bayesian networks', AI Magazine, 20(2), 21–36, (1999).
[3] R. Dechter and I. Rish, 'Mini-buckets: A general scheme for bounded inference', J. of the ACM, 50(2), 107–153, (2003).
[4] M.J. Druzdzel, 'GeNIe: A development environment for graphical decision-analytic models', Proc. AMIA, (1999).
[5] C.U. Kjaerulff, 'Reduction of computation complexity in Bayesian networks through removal of weak dependencies', Proc. of 10th Conf. on UAI, (1994).
[6] S. Kullback and R.A. Leibler, 'On Information and Sufficiency', Annals of Math. Stat., 22(1), 79–86, (1951).
[7] C.L. Liu and M.P. Wellman, 'Bounding probabilistic relationships in Bayesian networks using qualitative influences: methods and applications', International Journal of Approximate Reasoning, 36(1), 31–73, (2004).
[8] F.T. Ramos and F.G. Cozman, 'Anytime anyspace probabilistic inference', Int. J. of Approx. Reasoning, 38, 53–80, (2005).
[9] P. Spirtes, C.N. Glymour, and R. Scheines, Causation, Prediction, and Search, MIT Press, 2000.
[10] R.A. van Engelen, 'Approximating Bayesian belief networks by arc removal', IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(8), 916–920, (1997).
7. Distributed and Multi-Agents Systems
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-855
Verifying the Conformance of Agents with Multiparty Protocols
Laura Giordano1 and Alberto Martelli2
1 Dipartimento di Informatica, Università del Piemonte Orientale, Alessandria
2 Dipartimento di Informatica, Università di Torino, Torino
Abstract. The paper defines a notion of conformance of a set of k agents with a multiparty protocol with k roles, requiring the agents to be interoperable and to produce correct executions of the protocol. Conditions are introduced that enable each agent to be independently verified with respect to the protocol.
1 Introduction

In an open environment, the interaction of agents is ruled by interaction protocols on which agents commonly agree. An important issue, in this regard, concerns agent conformance with the protocol. Although agent policy may somehow deviate from the behavior dictated by the protocol, in some cases we want, nevertheless, to regard the policy as being compatible with the protocol. In this paper, we define a notion of conformance of a set of agents A1, . . . , Ak with a multiparty protocol P. This notion must assure that agents A1, . . . , Ak interoperate and that their interactions produce correct executions of the protocol. We introduce a notion of interoperability among a set of agents, which guarantees that the agents interact properly. More precisely, each agent can freely choose among its possible emissions without the computation getting stuck. Verifying conformance of a set of agents altogether, however, is not feasible in an open environment, as, in general, the internal behavior of all agents participating in a protocol is not known. The verification of each agent participating in the protocol must be done independently. To this purpose, we introduce a definition of conformance of a single agent Ai (playing role i) with the protocol P. We prove that a set of agents, independently conformant with the protocol P, are guaranteed to be interoperable and to produce correct executions of P.

2 Protocol Specification

The specification of interaction protocols we adopt is based on the Product version of Dynamic Linear Time Temporal Logic (DLTL) [5], a temporal logic which extends LTL by allowing the until operator to be indexed by programs in Propositional Dynamic Logic (PDL). The Product version of DLTL allows one to capture the behavior of a network of sequential agents, which coordinate their activities by performing common actions together. In our proposal, the specification of agents and protocols is given in a temporal action theory [4], by means of temporal constraints, and the communication among agents is synchronous. Protocols are given a declarative specification consisting of: (i) the specification of communicative actions by means of their effects and preconditions on the social state which, in particular, includes commitments; (ii) a set of temporal constraints, which specify the wanted interactions. Protocols with nonterminating computations, modeling reactive services, can also be captured in this framework. We define a multiparty protocol P with k roles by separately specifying the behavior of all roles P1, . . . , Pk in the protocol. Consider the following example.

Example 1 (Purchase protocol) We have three roles: the merchant (mr), the customer (ct) and the bank (bk). ct sends a request to mr; mr replies with an offer or by saying that the requested good is not available. If ct receives the offer, it may either accept the offer and send a payment request to bk, or refuse the offer. If ct accepts the offer, then mr delivers the goods. If ct requires bk to pay mr, bk sends the payment. ct can send the request for payment to bk even before he has received the goods.
The Purchase protocol Pu can be specified by separately defining the protocols of the three participating agents: Pct, Pmr and Pbk. The role Pi in the protocol is specified by a domain description Di, which is a pair (Πi, Ci), where Πi is a set of formulas describing the effects of actions, including action laws and causal laws (the action theory) of agent i, and Ci is a set of constraints that the executions of agent i must satisfy (including precondition laws). A social approach is adopted and an interaction protocol is specified by describing the effects of communicative actions on the social state, including agents' commitments and permissions. The approach is a generalization of the one proposed in [4]. Given Di = (Πi, Ci) as defined above, we let Pi = Πi ∧ Ci. Once the protocols Pct, Pmr and Pbk are defined, the specification Pu of the Purchase protocol can be given as follows: Pu = Pct ∧ Pmr ∧ Pbk. The runs of the protocol are defined to be the linear models of Pu, namely, infinite linear sequences of worlds (propositional interpretations), each one reachable from the initial world by a finite sequence τ of actions. The runs of Pu are all runs that can be obtained by interleaving the actions of the runs of Pct, Pmr and Pbk, while synchronizing on common actions. By projecting the runs of the protocol Pu to the alphabets of the participating roles, we get runs of each role Pct, Pmr and Pbk. The i-th projection of a run σ of a protocol P is an infinite run σ|i of Pi.
3 Interoperability
Let A1, . . . , Ak be a set of agents given through a logical specification, such as the one introduced in Section 2. The executions of A1, . . . , Ak are the runs of A1 ∧ . . . ∧ Ak, obtained by interleaving the executions of the Ai's. As the properties we will consider
in this paper regard only the sequences of communicative actions exchanged between agents, in the following we will consider runs as infinite sequences of actions, and disregard worlds. We want to define interoperability of A1, . . . , Ak so that their interaction cannot get stuck, when each agent is free to choose its emissions at each step. This requirement is stronger than simply requiring absence of deadlock. Let πi be the prefix of a run of Ai. To model the fact that Ai must be able to choose which action to execute after πi, we define a function choice(Ai, πi), whose value is either a send action m(i, j), taken from the set {m1(i, j1), . . . , mn(i, jn)} of all the actions Ai can execute after πi, or the value receive, if Ai can execute a receive action after πi. In the latter case, Ai expects to receive a message from another agent. While we have assumed that agents can choose among the messages they want to send, we have postulated that they cannot decide which message they will receive among those they are able to receive in a given state. This choice is left to the environment.

Definition 1 We say that A1, . . . , Ak are interoperable if, given a function choice and a sequence π of actions such that, for each i, π|i is a prefix of a run of Ai, there exists a run σ of A1 . . . Ak with prefix πm(i, j), such that choice(Ai, π|i) = m(i, j) and choice(Aj, π|j) = receive, for some i and j.

According to the above definition, any prefix obtained by the execution of A1, . . . , Ak can be extended with a new action according to the choice function. In particular, each agent can choose which action it wants to execute at each stage of the computation and, eventually, it can execute such an action.
4 Conformance
Given a protocol P = P1 ∧ . . . ∧ Pk with k roles, we define the conformance of a set of agents A1, . . . , Ak with P by requiring that the interaction of the agents cannot give rise to executions which are not runs of P. Moreover, we require A1, . . . , Ak to be interoperable.

Definition 2 Agents A1, . . . , Ak are conformant with P if: (a) A1, . . . , Ak are interoperable, and (b) the executions of A1, . . . , Ak are runs of the protocol P.

In this section we want to introduce a notion of conformance of a single agent Ai with the protocol P, so that the conformance of each Ai, proved independently, guarantees the conformance of the overall set of agents A1, . . . , Ak with P, according to Definition 2. Most proposals in the literature rely on a notion of conformance based on the policy: less emissions and more receptions [2, 1, 3]. Consider, for instance, a customer agent Act whose behavior differs from that of the role "customer" of protocol Pu as follows: whenever it receives an offer from the merchant, it always accepts it; after accepting the offer it expects to receive from Pmr either sendGoods or cancelDelivery. Although the behavior of Act and that of the corresponding role of the protocol are different, we could nevertheless consider the agent to be conformant with the protocol, since the customer can choose which messages to send, and thus it is not forced to send all the messages required by the protocol. Also, the agent can receive more messages than those required by the protocol, since these receptions will never be executed. Unfortunately, this argument holds only for two-party protocols, as shown in the next example.

Example 2 Consider protocol Pu. Assume that mr has the requirement that he can receive the payment from bk only after he has received the acceptance of the offer from ct. According to the protocols of ct and bk, the message sendPayment can be sent from bk to mr either before or after the message sendAccept is sent from ct to mr. Although ct and bk do not put constraints on the order in which they send the acceptance of the offer and the payment to mr, in the overall protocol Pu they are forced to respect the constraint of the merchant. Only the runs in which sendAccept is executed before sendPayment are accepted as runs of Pu. Let us now consider an agent Amr with the following behavior: either it receives a message sendAccept followed by a message sendPayment, or it receives a message sendPayment followed by a message sendAccept. Although Amr respects the policy "less emissions and more receptions", it cannot be considered to be conformant with Pu, as it may produce an execution in which sendPayment comes before sendAccept.

A similar example is discussed in [6], where the problem of conformance checking is analyzed for models of asynchronous message passing software. There, a very restrictive policy is proposed to solve the problem, namely, the policy less emissions and same receptions. Here, we propose a definition of the conformance of an agent Ai with respect to the overall protocol P, rather than to its role Pi. Besides referring to the runs of a protocol P = P1 ∧ . . . ∧ Pk, we need to refer to the runs obtained by executing an agent Ai in the context of the protocol P. Let: P[Ai] = P1 ∧ . . . ∧ Pi−1 ∧ Ai ∧ Pi+1 ∧ . . . ∧ Pk. The definition of conformance we introduce below, on the one hand, requires (condition C1) that, for each agent Ai, P[Ai] is interoperable. On the other hand, it requires that the executions of Ai are correct for both emissions and receptions (condition C2), and complete for receptions (condition C3), when Ai is interacting with other agents respecting the protocol P.

Definition 3 An agent Ai is conformant with a protocol P = P1 ∧ . . . ∧ Pk when the following conditions are satisfied:
(C1) INTEROPERABILITY – P[Ai] is interoperable;
(C2) CORRECTNESS – all runs of P[Ai] are runs of P;
(C3) COMPLETENESS – whenever there are two runs, σP of P and σP[Ai] of P[Ai], such that π is a prefix of σP and of σP[Ai], if action m(j, i) is executed after the prefix π in σP, then there is a run σ′P[Ai] of P[Ai] with prefix πm(j, i).

We can prove the following result:

Theorem 1 Let P = P1 ∧ . . . ∧ Pk be an interoperable protocol and let, for each i = 1, . . . , k, the agent Ai be conformant with the protocol P according to Definition 3. Then agents A1, . . . , Ak are conformant with P according to Definition 2.

The problem of verifying the conformance of an agent with a protocol can be solved by working on the Büchi automaton which can be extracted from the logical specification of the protocol. For the two-party case, automata-based techniques have been studied in [3].

REFERENCES
[1] M. Baldoni, C. Baroglio, A. Martelli, Patti. Verification of protocol conformance and agent interoperability. CLIMA'06, LNCS 3900, 265–283.
[2] L. Bordeaux, G. Salaün, D. Berardi, M. Mecella. When are two web-agents compatible? VLDB-TES 2004.
[3] L. Giordano and A. Martelli. Verifying Agent Conformance with Protocols Specified in a Temporal Action Logic. AI*IA 2007, LNAI 4733.
[4] L. Giordano, A. Martelli, C. Schwind. Specifying and Verifying Interaction Protocols in a Temporal Action Logic. J. Applied Logic, 5 (2007).
[5] J.G. Henriksen and P.S. Thiagarajan. A product version of Dynamic Linear Time Temporal Logic. CONCUR'97, LNCS 1243, 45–58, 1997.
[6] S.K. Rajamani and J. Rehof. Conformance checking for models of asynchronous message passing software. CAV'02, 166–179, Springer, 2002.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-857
Simulated Annealing for Coalition Formation
Helena Keinänen and Misa Keinänen

1 INTRODUCTION

We study coalition formation in characteristic function games (CFGs) [4, 5]. Consider an n-person cooperative game where A is the set of agents. A coalition C is any non-empty subset of A, i.e., C ⊆ A such that C ≠ ∅. In CFGs a characteristic function v assigns real values (worths) to coalitions, such that the function may be incomplete. A coalition structure CS is a partition of A into mutually disjoint coalitions, in which, for all distinct coalitions Ci, Cj ∈ CS, we have Ci ∩ Cj = ∅, and the union of all coalitions in CS is A. The value of a coalition structure is called social welfare, and it is defined as V(CS) = Σ_{C∈CS} v(C). Given a set of agents A together with a characteristic function v, our aim is to find a coalition structure CS with maximum social welfare. It is shown in [5] that finding a social welfare maximizing coalition structure is an NP-complete problem, and that the number of coalition structures is O(n^n) and ω(n^{n/2}). Motivated by the observations in [6, 7] that genetic algorithms provide a useful tool for searching the maximal sum of the values of coalitions, we show that simulated annealing (SA) [1, 3] also provides a very competitive approach to the problem. We observe that the SA algorithm with a suitable neighbourhood relation often finds better values, or even the optimal coalition structures, well before the state-of-the-art algorithms in [2, 4, 5].
2 SA FOR COALITION FORMATION

Algorithm 1 shows our SA algorithm for optimizing the social welfare of a CFG. The algorithm takes a characteristic function v for an n-agent CFG as its input. Additional inputs are an iteration limit c_max, an initial temperature t_init, and a cooling ratio alpha. The counter c keeps track of the number of iterations. CS_best records the coalition structure with the highest social welfare among the ones seen. At each iteration a random neighbour CS' of the current coalition structure CS is picked according to a specific neighbourhood Neighbour(CS). The search proceeds with the adjacent coalition structure CS' if CS' yields a better social welfare than the original coalition structure CS. Otherwise, the search is continued with CS' with probability e^((V(CS')−V(CS))/t), where t is the current temperature. The temperature decreases after each iteration according to the annealing schedule t = alpha·t, where 0 < alpha < 1. The performance of SA algorithms is very sensitive to parameter adjustments as well as to neighbourhood selection. Given a set of agents together with a characteristic function v, let S denote the set of all coalition structures that can be formed. The neighbourhood is a function which maps coalition structures to the sets of their neighbour coalition structures.

1 Helsinki University of Technology, Finland, helena.keinanen@tkk.fi
2 Sampo Life Insurance Company Ltd., Finland, misa.keinanen@gmail.com
We found that the following two neighbourhoods are particularly appropriate for Algorithm 1. Split/merge neighbourhood, in which CS' ∈ Neighbour(CS) if and only if CS' can be obtained from CS by either (i) splitting one coalition in CS into two disjoint coalitions in CS', or (ii) merging two distinct coalitions of CS into a single coalition in CS'. Shift neighbourhood, in which CS' ∈ Neighbour(CS) if and only if CS' can be obtained from CS by shifting exactly one agent from a coalition to another coalition.

Algorithm 1
Inputs: c_max, t_init, alpha
External: V()
c = 0; t = t_init;
CS = random initial coalition structure;
CS_best = CS;
while c < c_max do
  CS' = random neighbour of CS in Neighbour(CS);
  if V(CS') > V(CS) then
    CS = CS';
    if V(CS) > V(CS_best) then CS_best = CS;
  else
    with probability e^((V(CS')-V(CS))/t) CS = CS';
  c = c+1; t = alpha*t;
return CS_best;
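A compact Python rendering of Algorithm 1 with the shift neighbourhood is sketched below; starting from the grand coalition and the particular parameter values are illustrative assumptions of ours.

import math
import random
from itertools import combinations

def social_welfare(cs, v):
    """V(CS): sum of coalition values; coalitions are frozensets."""
    return sum(v[c] for c in cs)

def shift_neighbour(cs):
    """Shift neighbourhood: move exactly one agent to another coalition
    (possibly a newly created one)."""
    parts = [set(c) for c in cs]
    src = random.choice(parts)
    agent = random.choice(sorted(src))
    parts.append(set())                 # allow shifting into a new coalition
    dest = random.choice([c for c in parts if c is not src])
    src.remove(agent)
    dest.add(agent)
    return frozenset(frozenset(c) for c in parts if c)

def simulated_annealing(agents, v, c_max=2000, t_init=50.0, alpha=0.999):
    cs = frozenset([frozenset(agents)])  # start from the grand coalition
    best, t = cs, t_init
    for _ in range(c_max):
        cand = shift_neighbour(cs)
        delta = social_welfare(cand, v) - social_welfare(cs, v)
        if delta > 0 or random.random() < math.exp(delta / t):
            cs = cand
        if social_welfare(cs, v) > social_welfare(best, v):
            best = cs
        t *= alpha                       # annealing schedule t = alpha*t
    return best

agents = [1, 2, 3]
v = {frozenset(c): random.random()
     for r in range(1, 4) for c in combinations(agents, r)}
print(simulated_annealing(agents, v))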
3 EXPERIMENTAL RESULTS

We have implemented Algorithm 1 in C, and evaluated its performance on CFG problems. As our benchmarks we use problems from [2, 4, 5, 6]. In the following we present experimental results considering in particular solution quality, robustness, and runtime performance of the various algorithms. Fig. 1 (left) shows a robustness comparison of the split/merge and shift neighbourhoods for the SA algorithm on 300 randomly generated 10-agent CFG problem instances. For each problem instance, an incomplete characteristic function was generated by assigning random coalition values drawn from a fixed interval. An exhaustive search was first used to find social welfare maximizing coalition structures, and then the SA algorithm was used to find optimal solutions in the following way. For both neighbourhoods, we executed 11 runs on every problem instance with approximately optimal parameters t_init and alpha. The runtime limit for each run was set to 100000 coalition structures. We plot the minimum execution times of the 11 runs to find an optimal social welfare. The shift neighbourhood is much more robust than the split/merge. SA with the shift neighbourhood is able to find the optimum solution in 298 of the 300 instances. In contrast, SA with the split/merge times out in 136 instances without finding an optimum solution. The SA with the shift neighbourhood is mostly able to find the optimum values with substantially fewer search steps than SA with the split/merge. Irrespective of the parameter variation, the behaviour of the shift neighbourhood was superior. To compare the solution qualities of SA with the two different neighbourhoods, we investigate the behaviours on 100 randomly
Figure 1. Comparisons of neighbourhood relations on random CFGs. Left: minimum runtime (number of seen coalition structures) to find optimal social welfare, split/merge versus shift. Right: maximum social welfares found, split/merge versus shift.
generated 20-agent CFG problem instances, again with random coalition values drawn from a fixed interval. Fig. 1 (right) shows the correlation between the solution quality of SA with the split/merge neighbourhood and SA with the shift neighbourhood. The plot illustrates the maximum social welfares found, measured from 11 runs per neighbourhood. The runtime limit was set to a fixed number of coalition structures, and we used an approximately optimal annealing schedule. These results clearly show that SA with the shift neighbourhood outperforms SA with the split/merge neighbourhood. We have also implemented in C the algorithms presented in [2, 4, 5], and a random search on the graph induced by the neighbourhood relations. We compared the performances of SA, Random search and the anytime algorithms on a set of randomly generated 10-agent CFGs with coalition values picked randomly from a uniform distribution. Fig. 2 shows the cumulative solution qualities over runtime (measured as seen coalition structures) on a representative problem instance. The cooling ratio alpha for SA is fixed to 0.8. Both SA and Random search are run only once. The SA algorithm finds good solutions very quickly. The SA with the shift neighbourhood finds the optimum within a short runtime, and
also SA with the split/merge neighbourhood climbs very close to the optimum. Random search with both neighbourhoods quickly finds relatively good solutions. However, like SA with the split/merge neighbourhood, Random search does not find any maximal social welfare. The anytime algorithm searches for a long time without finding good solutions, but then finally sees a coalition structure with maximal social welfare. Finally, we conducted further experiments on 100 random 20-agent CFGs with random coalition values. For each problem instance, we collected the minimal, median and maximal social welfares measured from 11 runs per algorithm. In these tests, the runtime limit for all algorithms was set to a fixed number of seen coalition structures. We used SA with the shift neighbourhood, and the SA parameters were the approximately optimal ones. The results are consistent with the results of the previous experiments. For all problem instances the SA algorithm substantially outperforms the anytime algorithms of [2, 4, 5]. Notably, every social welfare found with the anytime algorithms [2, 4, 5] is smaller than 2, whereas SA always finds social welfares better than 9. The results with SA thus provide an improvement on the order of a factor of 5.
Figure 2. A comparison of SA, Random search and Anytime algorithms. (Cumulative relative solution quality over run time, measured as the number of seen coalition structures, for the Anytime algorithm, Random search and SA with the split/merge and shift neighbourhoods.)

REFERENCES
[1] V. Černý, ‘Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm’, J. of Optimization Theory and Applications, 45, 41–51, (1985). [2] V.D. Dang and N.R. Jennings, ‘Generating coalition structures with finite bound from the optimal guarantees’, in Proc. 3rd Int. Conf. on Autonomous Agents and Multi-Agent Systems, 546–571, 2004. [3] S. Kirkpatrick, C.D. Gelatt, Jr. and M.P. Vecchi, ‘Optimization by simulated annealing’, Science, 220, 671–680, (1983). [4] K.S. Larson and T.W. Sandholm, ‘Anytime coalition structure generation: An average case study’, J. Expt. Theor. Artif. Intell., 12, 23–42, (2000). [5] T. Sandholm, K. Larson, M. Andersson, O. Shehory and F. Tohmé, ‘Coalition structure generation with worst case guarantees’, Artificial Intelligence, 111, 209–239, (1999). [6] S. Sen and P.S. Dutta, ‘Searching for optimal coalition structures’, in Proc. 4th Int. Conf. on Multi-agent Systems, 286–292, 2000. [7] J. Yang and Z. Luo, ‘Coalition formation mechanism in multi-agent systems based on genetic algorithms’, Applied Soft Computing, 7, 561–568, (2007).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-859
A Default Logic Based Framework for Argumentation Emanuel Santos 1 and João Pavão Martins 2 Abstract. We extend the logic-based framework of Besnard and Hunter for Default Logic. We present structurally sound results that provide a natural extension, and introduce new concepts that enable us to characterize an argument based on its use of incomplete information.
1 Introduction
We extend the logic-based framework of Besnard and Hunter [1], for a non-monotonic logic, namely Default Logic (DL), in order to make possible the construction of arguments based on non-monotonic reasoning. We present structurally sound results that provide a natural extension for the concepts defined in the framework by Besnard and Hunter and for the model-theoretic evaluation introduced by Hunter [4]. We also introduce the concepts of justificative and counter-justificative argument, which enable us to characterize an argument based on the use of incomplete information to support its conclusions.
2 Definitions and Results
We define some basic concepts that are used in the extended definition of argument. In DL [6], a default theory is a pair (R, Δ), composed of a set of default rules, R, and a set of closed wffs, Δ (Δ ⊂ L_FOL). We only consider default theories (R, Δ) such that Δ ⊂ L_PL. Definition 1 Let (R, Δ) be a default theory. (R, Δ) is unique if it has only one extension, Ω, given by Ext1((R, Δ)). (R, Δ) is minimum if it is unique, with an extension Ω, and there is no unique default theory (R′, Δ′) such that R′ ⊂ R, Δ′ ⊆ Δ and Ext1((R′, Δ′)) = Ω. Definition 2 Let (R, Δ) be a default theory. (R, Δ) ⊩ α if (R, Δ) is unique, with an extension Ω, and α ∈ Ω. Theorem 1 (Monotonicity of ⊩) Let (R, Δ) and (R′, Δ′) be minimum default theories. If (R, Δ) ⊩ α, R ⊆ R′ and Δ ⊆ Δ′, then (R′, Δ′) ⊩ α. Definition 3 Let (R, Δ) be a default theory. (R, Δ) is minimum with respect to α if (R, Δ) ⊩ α and there is no default theory (R′, Δ′) such that R′ ⊂ R, Δ′ ⊆ Δ and (R′, Δ′) ⊩ α. We use α, β, γ, ... to denote formulae, Δ, Φ, Ψ, ... to denote sets of formulae, A, B, C, ... to denote arguments and R, T, S, ... to denote sets of default rules. Θ = (R_Θ, Δ_Θ) denotes a default theory
1 PhD Student, Instituto Superior Técnico, Technical University of Lisbon, Portugal, email: esantos@ist.utl.pt. Supported by Fundação para a Ciência e Tecnologia under PhD grant SFRH/BD/27253/2006. 2 Instituto Superior Técnico, Technical University of Lisbon, Portugal.
that represents an information repository from which arguments can be constructed, and which may have no extensions. We assume that for every subset of a repository Θ = (R_Θ, Δ_Θ) there is a unique canonical enumeration φ1, ..., φn, ϕ1, ..., ϕl, ψ1, ..., ψm, such that Δ_Θ = {φ1, ..., φn}, C(R_Θ) = {ϕ1, ..., ϕl} and J(R_Θ) = {ψ1, ..., ψm}.3 To enable the construction of arguments based on unknown information using DL, we extend the definition of argument presented in [1]. Definition 4 An argument is a pair ⟨(R, Δ), α⟩ such that: 1) (R, Δ) ⊮ ⊥; 2) (R, Δ) is minimum wrt α; 3) there is no Δ′ ⊂ Δ such that (R, Δ′) ⊩ α and (Ext1((R, Δ)) − Ext1((R, Δ′))) ∩ J(R) = ∅. We say that ⟨(R, Δ), α⟩ is an argument for α, that α is the consequent (conclusion) of the argument, and that (R, Δ) is the support of the argument. The use of minimum default theories enables us to easily extend the concept of "classical" argument to a non-monotonic context, using DL. We are able to construct arguments based on unknown information and easily extend all the other definitions and results presented in [1] and [4], which are mainly based on Theorem 1. Definition 5 An argument A = ⟨(R, Δ), α⟩ is a sub-argument of an argument B = ⟨(T, Ψ), β⟩ if R ⊆ T and Δ ⊆ Ψ. If also β ⊢ α, A is said to be more conservative than B. From Definition 4, we can construct arguments that depend on unknown information to derive their conclusions. In order to distinguish these arguments from the rest, the definitions of justificative and total justificative argument are introduced. Definition 6 An argument ⟨(R, Δ), α⟩ is justificative if ∀β ∈ J(R): Δ ⊢ β.
Definition 7 (Recursive) An argument ⟨(R, Δ), α⟩ is total justificative if ∀r ∈ R, ∀β ∈ J(r), there exists an argument ⟨(R′, Δ′), β⟩, with R′ ⊆ R − {r} and Δ′ ⊆ Δ, which is justificative or total justificative. Given Definition 7, an argument is total justificative if it doesn't "use" unknown information, through the justifications of its default rules, to derive its conclusion. Definition 8 An argument A = ⟨(R, Δ), α⟩ is justificative wrt β if there exists a total justificative sub-argument of A with consequent β.
3 For a set R of default rules and a default rule r = α : β1, ..., βm / γ, we define P(r) = α, J(r) = {β1, ..., βm}, C(r) = γ, and P(R), J(R) and C(R) as their respective unions.
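For concreteness, a small sketch of this bookkeeping (our own illustrative encoding, not the authors' implementation), with P, J and C lifted to rule sets by union as in the footnote:

    from typing import NamedTuple, FrozenSet

    class Default(NamedTuple):
        prereq: str                 # P(r) = alpha
        justifs: FrozenSet[str]     # J(r) = {beta_1, ..., beta_m}
        conseq: str                 # C(r) = gamma

    def P(rules): return {r.prereq for r in rules}
    def C(rules): return {r.conseq for r in rules}
    def J(rules):
        out = set()
        for r in rules:
            out |= r.justifs
        return out

    # Example: the rule alpha : beta / gamma used by argument A below.
    r = Default("alpha", frozenset({"beta"}), "gamma")
    print(P({r}), J({r}), C({r}))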
Given Definition 8, an argument is justificative wrt β if it is possible to derive β without using unknown information. In the following we extend the notion of undercut, introduced in [4], which is used to represent a counter-argument. Definition 9 An argument ⟨(R′, Δ′), ¬(φ1 ∧ ... ∧ φn ∧ ϕ1 ∧ ... ∧ ϕl) ∧ ¬ψ1 ∧ ... ∧ ¬ψm⟩ is an undercut of an argument ⟨(R, Δ), α⟩ if {φ1, ..., φn} ⊆ Δ, {ψ1, ..., ψm} ⊆ J(R) and {ϕ1, ..., ϕl} ⊆ C(R).
Definition 15 [4] The recursive empathy (r.e.) for an argument tree T with a beliefbase Γ, denoted EPRΓ(T), is given by Fe(Ar), where Ar is the root of T. Example 1 Let Θ = (R_Θ, Δ_Θ) be a repository such that Δ_Θ = {α, β, π → ¬α, π, δ, ω, λ → ¬γ, λ, λ → ω, η} and R_Θ = {α:β/γ, δ:ω/α, η:ρ/β, δ:¬β/¬β}. For this repository, we construct the following argument tree T for γ:6
Definition 10 An undercut A = ⟨(R, Δ), α⟩ for B = ⟨(T, Ψ), γ⟩ is counter-justificative if there is no argument C, that is an undercut of A and a sub-argument of B, such that ∀β ∈ JustArg(C): α ⊬ ¬β.4
[Argument tree T for γ; arrows in the original figure give the undercut relation:
A = ⟨({α:β/γ}, {β, α}), γ⟩ (root);
undercuts of A: B = ⟨(∅, {π → ¬α, π}), $⟩, C = ⟨(∅, {λ → ¬γ, λ}), $⟩ and D = ⟨({δ:¬β/¬β}, {δ}), $ ∧ ¬β⟩;
further down: E = ⟨({δ:ω/α}, {δ}), $⟩ and F = ⟨({δ:ω/α}, {δ, λ → ω, λ}), $⟩ undercut B, and G = ⟨({η:ρ/β}, {η}), $ ∧ ¬¬β⟩ undercuts D.]
An undercut A for an argument B is counter-justificative (wrt B) if B can't "defend" itself from the attack. This often happens when an undercut "attacks" a justification of the target argument with respect to which the target is not justificative. Definition 11 An argument A = ⟨(R, Δ), ¬(φ1 ∧ ... ∧ φn ∧ ϕ1 ∧ ... ∧ ϕl) ∧ ¬ψj ∧ ... ∧ ¬ψk⟩ is a canonical undercut for ⟨(T, Ψ), α⟩ if there is no argument B = ⟨(R, Δ), ¬(φ1 ∧ ... ∧ φn ∧ ϕ1 ∧ ... ∧ ϕl) ∧ ¬ψi ∧ ... ∧ ¬ψp⟩ less conservative than A, where φ1, ..., φn, ϕ1, ..., ϕl, ψ1, ..., ψm is a canonical enumeration of (T, Ψ), 0 ≤ j ≤ k ≤ m and 0 ≤ i ≤ p ≤ m. We extend the definition of canonical undercut, because of the existence of default-rule justifications, in order to let one argument represent an infinite set of "equivalent" arguments. Based on this concept we also extend the definition of argument tree presented in [1]: Definition 12 An argumentation tree for α is a tree whose nodes are arguments such that: 1) the root is an argument for α; 2) no node ⟨(T, Ψ), β⟩ has ancestor nodes ⟨(T1, Ψ1), β1⟩, ..., ⟨(Tn, Ψn), βn⟩ such that Ψ ⊆ Ψ1 ∪ ... ∪ Ψn and T ⊆ T1 ∪ ... ∪ Tn; 3) the children nodes of a node A consist of all canonical undercuts for A that obey 2. The evaluation of an argument is done through the comparison of its support with a consistent set of formulae, called a beliefbase, which denotes the beliefs of the intended audience of the argument. We use the concept of degree of entailment (DE) introduced in [4] to extend the notions of empathy and recursive empathy. Definition 13 Let Γ be a beliefbase and A = ⟨(R, Δ), α⟩ an argument. The empathy for the argument A, EPΓ(A), is defined as EPΓ(A) = DE(Γ, Δ ∪ C(R) ∪ J(R)). An empathy value of one, between a beliefbase and an argument, means that the audience, represented by the beliefbase, completely agrees with that argument. A zero empathy means that the audience disagrees with that argument. Definition 14 Let T be an argument tree. The function Fe is defined for every node Ai of T in the following manner, where eAi = EPΓ(Ai), aAi = Max({−1} ∪ {Fe(Aj) | Aj ∈ ChildrenNCJ(T, Ai)}) and a′Ai = Max({−1} ∪ {Fe(Aj) | Aj ∈ ChildrenCJ(T, Ai)}):5

    Fe(Ai) = −aAi    if aAi > eAi
             −a′Ai   if aAi ≤ eAi and a′Ai > 0
             0       if aAi = eAi and a′Ai ≤ 0
             eAi     if aAi < eAi and a′Ai ≤ 0
4 JustArg(A) = {β | β ∈ J(R) and A is not justificative wrt β}. 5 ChildrenCJ(T, Ai) = {Aj | Aj ∈ Children(T, Ai) and Aj is a counter-justificative undercut of Ai}. ChildrenNCJ(T, Ai) = {Aj | Aj ∈ Children(T, Ai) and there is no sub-argument of Ai that is a counter-justificative undercut of Aj}.
For Γ = {¬λ, δ}, the arguments of T are evaluated below:

    Argument   EPΓ    Fe
    A          1/8    1/8
    B          1/4    0
    C          0      0
    D          1/2    −1/8
    E          1/4    1/4
    F          0      0
    G          1/8    1/8
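A sketch of the recursive computation of Fe (Definition 14) that reproduces this table; which children are counter-justificative, and the primed/unprimed cases, follow our reading of the garbled original and should be checked against [4]:

    def fe(node, ep, ncj, cj):
        # ep[node]: empathy EP_Gamma of the argument at this node;
        # ncj/cj: non-counter-justificative and counter-justificative
        # children, as in footnote 5.
        e = ep[node]
        a = max([-1.0] + [fe(c, ep, ncj, cj) for c in ncj.get(node, [])])
        a2 = max([-1.0] + [fe(c, ep, ncj, cj) for c in cj.get(node, [])])
        if a > e:
            return -a
        if a2 > 0:                       # reached only when a <= e
            return -a2
        return 0.0 if a == e else e      # a2 <= 0

    # Example 1: D is a counter-justificative child of A, G of D;
    # the remaining edges are taken as non-counter-justificative.
    ep = {"A": 1/8, "B": 1/4, "C": 0.0, "D": 1/2,
          "E": 1/4, "F": 0.0, "G": 1/8}
    ncj = {"A": ["B", "C"], "B": ["E", "F"]}
    cj = {"A": ["D"], "D": ["G"]}
    print(fe("A", ep, ncj, cj))          # -> 0.125 = EPR_Gamma(T)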
Given these results we have that EPRΓ(T) = 1/8. This means that T has a low but positive recursive empathy for the beliefbase Γ. Theorem 2 (Extension) If every argument ⟨(R, Δ), α⟩ has R = ∅, then Definitions 4, 5, 9, 11, 12, 13 and 14 are equivalent to the respective definitions introduced in [4].
3 Discussion, Conclusions and Future Work
The goal of this paper has been to extend the framework presented in [1], and some of the techniques introduced in [4], to a non-monotonic logic, namely DL. Theorem 2 shows that our framework is an extension of the framework of [1]. Besides providing an extension, we also introduced the concepts of justificative argument and counter-justificative undercut, which allow us to differentiate arguments based on their use of unknown information through justifications. The choice of DL is justified by the desire to implement this framework in a system which uses this logic. The generalization of the concepts of information repository and beliefbase, the development of further evaluation techniques, and the implementation of such a system in knowledge-based agents are subjects of ongoing research.
REFERENCES [1] P. Besnard and A. Hunter, ‘A logic-based theory of deductive arguments’, Artificial Intelligence, 128, 203–235, (2001). [2] P. Besnard and A. Hunter, ‘Practical first-order argumentation’, AAAI, (2005). [3] P. Dung, R. Kowalski, and F. Toni, ‘Dialectic proof procedures for assumption-based, admissible argumentation’, Artificial Intelligence, 170, 114–159, (2006). [4] A. Hunter, ‘Making argumentation more believable’, AAAI, (2004). [5] H. Prakken and G. Vreeswijk, ‘Logical systems for defeasible argumentation’, Handbook of Philosophical Logic, 1–87, (2002). [6] R. Reiter, ‘A logic for default reasoning’, Artificial Intelligence, 13, 81–132, (1980). [7] E. Santos, ‘Argumentação baseada em lógica não-monótona’, Simpósio Doutoral de Inteligência Artificial - EPIA, 125–135, (2007).
6 As a notational convenience, the symbol $ is used to denote the wff ¬(φ1 ∧ ... ∧ φn ∧ ϕ1 ∧ ... ∧ ϕl) with respect to the canonical enumeration φ1, ..., φn, ϕ1, ..., ϕl, ψ1, ..., ψm.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-861
An Empirical Investigation of the Adversarial Activity Model Inon Zuckerman and Sarit Kraus 1 and Jeffrey S. Rosenschein 2 Abstract. Multiagent research provides an extensive literature on formal Belief-Desire-Intention (BDI) based models describing the notions of teamwork and cooperation, but adversarial and competitive relationships have received very little formal BDI treatment. Moreover, one of the main roles of such models is to serve as design guidelines for the creation of agents, and while there is work illustrating that role in cooperative interaction, there has been no empirical work done to validate competitive BDI models. In this work we use the Adversarial Activity model, a BDI-based model for bounded rational agents that are operating in a general zero-sum environment, as an architectural guideline for building bounded rational agents in two adversarial environments: the Connect-four game (a bilateral environment) and the Risk strategic board game (a multilateral environment). We carry out extensive simulations that illustrate the advantages and limitations of using this model as a design specification.
1 Introduction
Formal Belief-Desire-Intention (BDI) [1] based models of cooperation and teamwork have been extensively explored in multiagent worlds. They provide firm theoretical foundations and guidelines for the design of cooperative automated agents [4, 2]. However, as cooperation and teamwork led the research agenda, little work was done on providing BDI-based models for adversarial or competitive interactions that naturally occur in multiagent environments. The desire to adapt BDI-based models for competitive interactions comes from their successful implementation in teamwork domains [5] and the limitations of classical solutions in complex adversarial interactions. Recently, the Adversarial Activity (AA) model [6] was presented: a formal BDI-based model for bounded rational agents in zero-sum adversarial environments. Alongside the model were also presented several behavioral axioms that should be used when an agent finds itself in an Adversarial Activity. However, the discussion in [6] lacked empirical work to validate the advantages as well as the limitations of those behavioral axioms in adversarial domains. Our aim here is to fill that gap, demonstrate how the AA model can be used as a design specification, and investigate its usefulness for bounded rational agents. We will explore whether AA-based agents can outperform state-of-the-art solutions in various adversarial environments.
2 Overview of the Adversarial Activity Model
The AA model provides the specification of capabilities and mental attitudes of an agent in an adversarial environment from a single adversarial agent’s perspective. The model describes both bilateral 1 2
Bar-Ilan University, Israel, email: {zukermi,sarit}@cs.biu.ac.il The Hebrew University, Israel, email: jeff@cs.huji.ac.il
and multilateral instantiations of zero-sum environments, in which all agents are adversarial (i.e., there are no cooperative or neutral agents). Alongside the model, there exist several behavioral axioms that the agent can follow: A1. Goal Achieving Axiom. This axiom is a simple and intuitive one, stating that if the agent can take an action that will achieve its main goal (or one of its subgoals), it should take it. A2. Preventive Act Axiom. This axiom relies on the fact that the interaction is zero-sum. It says that the agent might take actions that will prevent its adversary from taking highly beneficial future actions, even if they do not explicitly advance the agent towards its goal. A3. Suboptimal Tactical Move Axiom. This axiom relies on the fact that the agent's reasoning resources are bounded, as is the knowledge it has about its adversaries. In such cases the agent might decide to take actions that are suboptimal with respect to its limited search boundary, but that might prove to be highly beneficial in the future, depending on its adversaries' reactions. A4. Profile Manipulation Axiom. This provides the ability to manipulate agents' profiles (the knowledge one agent holds about the other), by taking actions such that the adversary's reactions to them would reveal some of its profile information. A5. Alliance Formation Axiom. This axiom allows the creation of temporary task groups when, during the interaction, several agents have some common interests that they wish to pursue together. A6. Evaluation Maximization Axiom. When all other axioms are inapplicable, the agent will proceed with the action that maximizes the heuristic value as computed in its evaluation function.
3 Empirical Evaluation
We will use two different experimental domains. The first one is the Connect-Four board game, which will allow us to evaluate the model in a bilateral interaction. The second domain is the well-known Risk strategic board game of world domination. The embedding of behavioral axioms into the agent design, in both domains, was done by providing new functions, one for each of the implemented axioms (denoted AxiomNValue(), where N is the number of the axiom in the model). These functions return a possible action if the relevant precondition holds. The preconditions are the required beliefs, as stated in the axiom formalizations, formulated according to the relevant domain. The resulting architecture resembles a rule-based system, where each function returns its value and the final selection among the potential actions is computed in a "Decide" function, whose role is to select among the actions (if there is more than a single possible action) and return its final decision.
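A sketch of this wiring (the dictionary-based state, the axiom subset and the priority-order "Decide" rule are our own illustrative stand-ins; the real AxiomNValue() functions test the axioms' belief preconditions in the concrete domain):

    def axiom1_value(state):
        # A1, goal achieving: take an immediately winning action if any.
        return next((m for m in state["moves"] if m in state["winning"]), None)

    def axiom2_value(state):
        # A2, preventive act: block the adversary's high-benefit action.
        return next((m for m in state["moves"] if m in state["blocks"]), None)

    def axiom6_value(state):
        # A6, evaluation maximization: fall back to the heuristic.
        return max(state["moves"], key=state["heuristic"])

    AXIOMS = (axiom1_value, axiom2_value, axiom6_value)

    def decide(state):
        # Each axiom function returns an action when its precondition
        # holds; the final selection here is simply by axiom priority.
        return next(a(state) for a in AXIOMS if a(state) is not None)

    state = {"moves": [0, 1, 2, 3], "winning": set(), "blocks": {2},
             "heuristic": lambda m: -abs(m - 3)}
    print(decide(state))   # -> 2, the preventive move of axiom A2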
3.1 A Bilateral Domain—Connect4
We built an experimental environment where computer agents play the connect-four game against one another, and we have control over
the search depth, reasoning time, and other variables. We built six different agents, each with a different evaluation function (H1–H6), ranging from a naive function to a reasonable function that can win when playing against an average human adversary. We had 12 different agents: 6 alpha-beta and 6 axiom-augmented agents, each using one of the evaluation functions. We staged a round-robin tournament among all agents, where each agent played with 3 different search depths (3, 5, and 7) against all other agents and possible search depths. The tournament was played twice: once for the agents playing as the first player (yellow), and once for them playing as the second (red) player (i.e., 11 opponents * 3 own depths * 3 opponent depths * 2 disc colors = 198 games). The results of the tournament are summarized in Figure 1. The figure shows the percentage of games won by each of the 12 agents, where R 1 denotes the regular agent using H1, and A 3 the axiom-embedded agent using H3. The results clearly indicate that all agents improved their performance following the integration of axioms. The agents with naive heuristics (A 1 and A 2) showed only a small improvement, which usually reflected additional wins over their "regular" versions (R 1 and R 2), while the mid-ranged functions (H4 and H5) showed the largest improvement, with additional wins over different agents that were not possible prior to the embedding of axioms. Overall, we see that the best two agents were A 4 and A 6, with a single-win advantage for the A 6 player, which in turn led A 5 by 7 wins.
Figure 1. Connect-Four experiment results
3.2 A Multilateral Domain—Risk
Our next domain is a multilateral interaction in the form of the Risk board game, a strategy board game that incorporates probabilistic elements and strategic reasoning in various forms. Risk is too complicated to solve using classical search methods. We used the Lux Delux3 environment, which provides a large number of computer opponents implemented by different programmers and employing varying strategies. We chose to work with exactly the same subset of adversaries that was used in [3], which contains 12 adversaries of different difficulty levels (Easy, Medium, and Hard): (1) Angry (2) Stinky (3) Communist (4) Shaft (5) Yakool (6) Pixie (7) Cluster (8) Bosco (9) EvilPixie (10) KillBot (11) Que (12) Nefarious. The basic agent implementation and evaluation function were based on the one described in [3], as it proved to be a very successful evaluation-function-based agent which does not use expert knowledge about the strategic domain. The next step was to augment the original agent with the implementation of the adversarial axioms (we used continent ownership as a subgoal). Experiment 1: The game map was "Risk classic", card values were set to "5, 5, 5, . . . ", the continent bonus was constant, and starting position and initial army placement were randomized. Each game had 6 players, randomized from the set of 14 agents described above. Figure 2 shows the results of running 1741 such games, with the winning percentage of each of the agents (we use the agent number from the above list instead of their names). The worst agent was Angry (#1) with a 0.44% win percentage, while the best was KillBot (#10) with 32.54%. Looking at our agents, we can see that the basic heuristic agent (denoted "He", whose bar is colored blue) managed to achieve only 11.79%, whereas its axiom-augmented version Ax (colored red on the graph) climbed all the way up to 26.84%, more than doubling the winning percentage of its regular version.
Downloadable from http://sillysoft.net/lux/.
Figure 2. Winning percentage on “Risk classic” map
Experiment 2: In the second experiment we compared the performance of both kinds of agents on randomly generated world maps. The results show approximately the same improvement: from 9.16% with the regular heuristic agent to a total of 21.36% with its axiom-augmented version. Experiment 3: We fixed a five-agent opponent set (agents 1 through 5), and ran a total of 2000 games on the classic map setting: 1000 games with agent He and the opponent set, and 1000 games with agent Ax and the opponent set. The results show that even when playing against very easy opponents, against which the regular heuristic agent already led the group with a winning percentage of 31.8%, the integration of the axioms lifted the agent to an impressive winning percentage of 57.1%.

Figure 3. Winning percentage with fixed opponents
4 Conclusions
We have presented an empirical evaluation of the Adversarial Activity model for bounded rational agents in a zero-sum environment. Our results show that bounded-rational agents can improve their performance when their original architectures are augmented with the model’s behavioral axioms, even as their evaluation functions remained unchanged.
5 Acknowledgments
This work was supported in part by NSF under grant #IS0705587 and ISF under grant #1357/07. Sarit Kraus is also affiliated with UMIACS.
REFERENCES [1] Michael E. Bratman, Intention, Plans and Practical Reason, Harvard University Press, Cambridge, MA, 1987. [2] Barbara J. Grosz and Sarit Kraus, ‘Collaborative plans for complex group action’, AIJ, 86(2), 269–357, (1996). [3] Stefan J. Johansson and Fredrik Olsson, ‘Using multi-agent system technology in risk bots.’, in AIIDE, pp. 42–47, (2006). [4] H. J. Levesque, P. R. Cohen, and J. H. T. Nunes, ‘On acting together’, in Proc. of AAAI-90, pp. 94–99, Boston, MA, (1990). [5] M. Tambe, ‘Agent architectures for flexible, practical teamwork’, in National Conference on Artificial Intelligence (AAAI), (1997). [6] Inon Zuckerman, Sarit Kraus, Jeffrey S. Rosenschein, and Gal A. Kaminka, ‘An adversarial environment model for bounded rational agents in zero-sum interactions’, in AAMAS 2007, pp. 538–546, (2007).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-863
Addressing Temporal Aspects of Privacy-Related Norms Guillaume Piolle1 and Yves Demazeau2 Abstract. Agents interacting in open environments such as Internet are often in charge of personal information. In order to protect the privacy of human users, such agents have to be aware of the normative context regarding personal data protection (applicable laws and other regulations). These privacy-related norms usually refer to deadlines and durations. To represent these regulations, we introduce the Deontic Logic for Privacy; this logic represents privacy-related obligations while providing the required temporal expressiveness.
1 INTRODUCTION
Any personal agent designed to evolve in an environment like the Internet and to assist a human user with her online activities should then be aware of privacy issues and regulations, in order to protect the user's personal information. These regulations appear as laws, contracts, company policies, user requirements... Six dimensions have been identified that can be used to analyze regulations dealing with personal data protection [7, 6]. These are user information, user consent, data update, justification of data collection and usage, data retention and data forwarding. Many privacy-enhancing technologies, protocols and architectures try to address parts of the issue [4]. The Platform for Privacy Preferences (P3P), for instance, aims to deal with the first two dimensions, by providing websites with means to communicate their privacy policies [9]. However, none is able to provide a cognitive agent with means to reason on the regulations themselves, so that it could adapt to the context of a transaction in a dynamic and autonomous fashion. In this paper, we propose a logic designed specifically to represent privacy-related regulations concerned with all six dimensions. This Deontic Logic for Privacy (DLP) is able to deal with obligations regarding personal data processing and its temporal organization. We explain why specific operators are needed to represent dated norms, we identify the requirements for expressing obligations with deadlines, we build such an operator on the basis of existing proposals, and we put it in the context of privacy norms.
2 THE DLP LOGIC
When dealing with privacy management, norms are often linked with notions of delays, deadlines, precedence between actions; an explicit representation of time would then provide valuable reasoning means. Much work has been done on temporal deontic logics in general [1], but to the best of our knowledge none of them deals with privacyrelated norms in a specific way. A prominent temporal feature of 1
Universit´e Joseph Fourier, Laboratoire d’Informatique de Grenoble, France, email: guillaume.piolle@imag.fr 2 CNRS, Laboratoire d’Informatique de Grenoble, France, email: yves.demazeau@imag.fr
privacy regulations is the notion of deadline. We will examine how existing proposals can be of use in privacy-based reasoning, but we must first introduce a common formalism to compare them. This is why, in the light of this background, we present here the DLP language, a temporal deontic logic able to represent specific privacy-related norms, and in particular the deadlines associated with them. DLP is a language where the SDL obligation modality Ob is freely mixed with LTL operators. The well-formed formulae ϕ of the DLP language are defined as follows, where p is a proposition from a language L_DLP to be specified later:

    ϕ ::= p | ϕ ∨ ϕ | ¬ϕ | Ob ϕ | ϕ U ϕ | ϕ S ϕ    (1)
We have chosen the U, S temporal language (U and S being the strict versions of the "until" and "since" connectives) for its expressiveness, but we will use the common abbreviations F, G, H, P. We also define U− and F− as the loose versions of U and F, including the present. The X^i operators, based on a "neXt" operator X and its counterpart in the past X^−1, can be used to travel step-wise along a time flow. The DLP logic is interpreted over bidimensional Kripke-like structures, where a world is defined by its history h (the linear flow of time it belongs to) and a date ti in the time flow. The temporal accessibility relation relates a world in a history to its successor in the same history, and the deontic accessibility relation relates a world to all its acceptable deontic alternatives (in all histories).
3 OBLIGATIONS WITH DEADLINES
We have said that in order to express privacy-related norms, we need the notion of deadlines, to which obligations will be attached. Indeed, it is often argued that obligations without deadlines are void [3]: one can fail to fulfill them and yet never be in violation of a norm (since one can always postpone and pretend the obligation will be fulfilled later). In order to deal with deadlines, we introduce specialized constants in our language, which we call dated propositions. They are noted {δi}i∈N, δi being true only at date ti. Our aim here is to build an operator Ob(ϕ, δ) expressing the obligation for ϕ to be true before the date represented by δ (i.e. before the propositional date δ becomes true). We have identified eight requirements that an operator in our formalism should meet in order to bear the right meaning in privacy-related norms:
1. Failed obligations should be dropped after the deadline;
2. Violations should be made punctual, not persistent in time;
3. Deadlines that are not dated propositions have no meaning;
4. Obligations on ⊥ should be impossible to fulfill;
5. Obligations on ⊤ should be trivially respected;
6. It should be impossible to express obligations with past deadlines;
7. The operator must comply with the propagation principle [2], saying that an obligation must be maintained until the deadline is reached or the obligation is fulfilled;
8. The operator must comply with the monotony principle [2], saying that an obligation with a given deadline implies an obligation with a further deadline.
Some work has already been done on obligations with deadlines; our first six requirements regard choice points already discussed by Dignum et al [5]. However, our conclusions differ slightly from theirs, for instance on the fact that they take violation as a state rather than as an event. In their own work, they introduce an operator that defines an obligation jointly with its violation. Because of their strictly temporal definition, dated obligations can then be derived whenever they seem to be respected, which is a significant drawback for us. From another point of view, it is not monotonic, and deadlines with a value of ⊤ can be defined, resulting in an immediate obligation. Brunel et al [2] extend a temporal deontic logic with explicit quantification over time, in order to reason on delays rather than on deadlines. For that reason, it cannot be directly expressed in DLP. Furthermore, it is not monotonic. The operator proposed by Demolombe et al [3], although not expressed in temporal deontic logic, can be translated. It satisfies a kind of semi-monotony, ensuring the property provided that the obligation is not violated. This key property makes it our best candidate. The operator matches most of our other requirements, but needs to be adapted to dated propositions in order to comply with the third and sixth points. We integrate these conditions into a DLP translation of the original proposition, and end up with the dated operator Ob(ϕ, δ) (2). The authors propose a persistent violation for their operator; we transform it into a punctual one (3) in order to match our second requirement. One can see that semi-monotony has a nice side-effect: it prevents us from deriving multiple violations for the same initial obligation, while still ensuring monotony if the obligation is fulfilled.
    Ob(ϕ, δ) =def F(δ ∧ G¬δ ∧ H¬δ) ∧ [ Ob(F−(ϕ ∧ Fδ)) U− (ϕ ∨ δ) ]    (2)

    viol(ϕ) =def δ ∧ P( Ob(ϕ, δ) ∧ (¬ϕ U− δ) )    (3)
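To make the punctual-violation reading concrete, a small sketch that checks clause (3) on a finite discrete trace (the set-of-atoms trace encoding and the finite-horizon treatment of the temporal operators are our own simplifying assumptions):

    def violation_times(trace, phi, delta):
        # trace[t] is the set of atoms true at step t. An obligation
        # Ob(phi, delta) raised at step 0 is punctually violated at the
        # unique step where delta holds, if phi was false at every step
        # up to and including it (finite-horizon reading of clause (3)).
        deadline = [t for t, atoms in enumerate(trace) if delta in atoms]
        if len(deadline) != 1:      # delta must be a dated proposition
            return []
        d = deadline[0]
        fulfilled = any(phi in trace[t] for t in range(d + 1))
        return [] if fulfilled else [d]

    # Usage: the credit-card rule below with a short 3-step deadline.
    trace = [set(), {"d"}, set()]                 # "forget" never happens
    print(violation_times(trace, "forget", "d"))  # -> [1]
    trace2 = [{"forget"}, {"d"}, set()]
    print(violation_times(trace2, "forget", "d")) # -> []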
4 APPLICATION TO PRIVACY NORMS
Our deontic and temporal formalism can be used to express privacy-related norms by its application on a base language L_DLP. L_DLP is based on predicates related to the six dimensions of personal data protection mentioned in the introduction. Argument domains are finite or countable sets, so we end up with a countable set of propositional terms. L_DLP includes for instance a predicate perform representing the actual process involving personal information, a predicate consent representing the user's authorization, a predicate forget representing data deletion3 ... As an application, let us see how an example regulation about data retention (one must not keep somebody else's credit card number more than one week after a transaction) translates into DLP (4). It is an interesting example since it involves antecedence and a deadline. Formally, it says that whenever an agent A performs a process of type transaction to which is attached an information of type creditCardNum owned by an agent B (and not by agent A), if δ
3 Due to page limitations, we are not able to include the full specifications of L_DLP here.
represents a date one week in the future, then there is a dated obligation that A should forget this information before the deadline δ:

    perform(A, ID) ∧ owner(ID, creditCardNum, B) ∧ ¬owner(ID, creditCardNum, A) ∧ actiontype(ID, transaction) ∧ X^(7∗24) δ → Ob(forget(A, ID, creditCardNum), δ)    (4)

5 CONCLUSION AND FUTURE WORK
We have proposed the DLP language, based on temporal deontic logic, to represent privacy-related norms. DLP is expressive enough to represent obligations with deadlines, as well as other (more classical) temporal notions, in an acceptable way. DLP is based on a propositional language specifically oriented towards personal data processing. Some work remains to be done on this logic, including a better basis for the temporal operators of the language. Currently, it is based on the U, S logic, which is very general but somewhat too expressive. Indeed, we must then question the inclusion of "since" in the logic, since we do not seem to need it, and it has already been argued that adding it to an until-based logic is not trivial from the point of view of complexity [8]. An automated procedure is to be proposed to generate DLP formulae on the basis of information extracted from P3P policies [9]. DLP, along with these associated tools, is then to be integrated in a privacy-aware cognitive agent that should be able to model and reason on its privacy-related normative context.
ACKNOWLEDGEMENTS This research has been supported by the Rhône-Alpes region Web Intelligence project. We would also like to thank Andreas Herzig and Philippe Balbiani for their valuable comments on our work.
REFERENCES [1] Lennart Åqvist, ‘Combinations of tense and deontic modalities’, in 7th International Workshop on Deontic Logic in Computer Science (DEON 2004), eds., Alessio Lomuscio and Donald Nute, volume 3065 of LNCS, pp. 3–28, Madeira, Portugal, (2004). Springer. [2] Julien Brunel, Jean-Paul Bodeveix, and Mamoun Filali, ‘A state/event temporal deontic logic’, in Eighth International Workshop on Deontic Logic in Computer Science (DEON’06), number 4048 in LNCS, (2006). [3] Robert Demolombe, ‘Formalisation de l’obligation de faire avec délais’, in Troisièmes journées francophones des modèles formels de l’interaction (MFI’05), Caen, France, (2005). [4] Yves Deswarte and Carlos Aguilar-Melchor, ‘Current and future privacy enhancing technologies for the internet’, Annales des Télécommunications, 61(3-4), 399–417, (2006). [5] Frank Dignum, Jan Broersen, Virginia Dignum, and John-Jules Meyer, ‘Meeting the deadline: Why, when and how’, in Third International Workshop on Formal Approaches to Agent-Based Systems (FAABS’04), eds., Michael G. Hinchey, James L. Rash, Walt Truszkowski, and Christopher Rouff, pp. 30–40, (2004). Springer Verlag. [6] Guillaume Piolle and Yves Demazeau, ‘Une logique pour raisonner sur la protection des données personnelles’, in 16e congrès francophone AFRIF-AFIA sur la Reconnaissance de Formes et l’Intelligence Artificielle (RFIA’08), Amiens, France, (2008). AFRIF-AFIA. [7] Guillaume Piolle, Yves Demazeau, and Jean Caelen, ‘Privacy management in user-centred multi-agent systems’, in 7th Annual International Workshop “Engineering Societies in the Agents World” (ESAW 2006), eds., Gregory O’Hare, Michael O’Grady, Oguz Dikenelli, and Alessandro Ricci, pp. 354–367, Dublin, Ireland, (2006). Springer Verlag. [8] Mark Reynolds, ‘The complexity of the temporal logic with until over general linear time’, Journal of Computer and System Sciences, 66(2), 393–426, (2003). [9] World Wide Web Consortium. Platform for Privacy Preferences specification 1.1. http://www.w3.org/P3P/.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-865
Evaluation of global system state thanks to local phenomena CONTET Jean-Michel and GECHTER Franck and GRUER Pablo and KOUKAM Abder 1 Abstract. This paper presents a new approach for the evaluation of a system's global state properties. The approach is intended for application to reactive multiagent systems (RMAS) and addresses the evaluation of emergent properties such as global stabilisation. This approach is inspired by statistical physics and thermodynamics, as a way to link the microscopic and the macroscopic points of view. It gives an important role to the partition function Z as defined in statistical physics. From this mathematical function, indicators can be extracted that evaluate the global system state on the basis of local phenomena. In this paper, the approach is put into practice by considering a classical reactive multiagent system: bird flock simulation. The methodology was applied to analyze system stability. Experimental results obtained with a multiagent simulation platform are presented.
1 Introduction
Multiagent systems (MAS) can now be considered a widespread technique for the simulation of complex systems, and they have been applied to a wide range of applications. In order to simulate complex systems, the reactive approach, in which interaction and emerging phenomena prevail in the definition of the agents themselves, is pertinent. It brings relevant properties such as adaptation skills, reliability, and robustness to parameter changes. The main drawback of the reactive approach is the lack of theoretical background for convergence proofs and emergence characterization. The goal of this article is to propose a method based on the partition function Z [2] as it is defined in statistical physics. Statistical physics is generally considered to be one of the first scientific disciplines where statistical methods succeeded in linking the microscopic and the macroscopic points of view. From this mathematical function one can extract indicators whose computation is based on local estimations and which represent global measurements of the system state. This article is structured as follows. After a paragraph dealing with the related works found in the literature, the partition function is defined in relation to physics. Then, we explain through a simple physics-inspired example how to apply partition function theory. The last part deals with the application of the partition function to a classical multiagent model based on Reynolds' Boids [7]. Finally, we conclude by drawing some extensions to the work presented.
2 Related Works
As stated in the introduction, one of the main problems in MAS is the evaluation of the accuracy/efficiency of the system relative to the task to perform and to the local mechanisms involved. Those evaluation methods can be classified in 3 categories: (i) indicators tied to the application field [5]; (ii) indicators based on a global point of view on the system and on its topology [6]; (iii) global indicators based on local estimation [4]. Solutions found in the literature usually take inspiration from biology (fitness functions, etc.), sociology (altruism, etc.), agency theories (utility functions, etc.) or physics (state functions, etc.). Moreover, some of them rest on a strong mathematical background, such as [8], but seem hardly applicable to arbitrary practical MAS. Among these methods, the physics-inspired solutions are the most widespread. For instance, entropy [1, 6] has been widely used in reactive MAS, in particular in order to represent disorder/organisation in the system. Even if this measurement can be useful in many cases, it has two main drawbacks: it depends on the past transformations of the system, and it is a global measurement that does not take into account local mechanisms of the system. In order to overcome these drawbacks, other approaches can be used. One generic solution is the computation of energy as a state function on both agent and system levels [3].

1 University of Technology of Belfort-Montbeliard (UTBM), Systems and Transportation Laboratory (SET), Belfort, France, email: firstname.name@utbm.fr
3 Evaluating global state properties

3.1 Description of the approach
The approach applies to reactive multiagent systems based on interaction models inspired by physics. The environment must be limited and the number of elements fixed. If the system respects these conditions, the following methodology can be applied:
1. All interaction forces are computed.
2. The system energy is computed from the agents' energies at every time step.
3. The partition function Z is computed from the system energy.
4. The thermodynamic potential A is plotted in real time.
5. Studying the evolution of the Helmholtz free energy A explains the system evolution and the time to equilibrium.
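A sketch of steps 2–4 for a single time step (the pure-Python encoding and the convention β = 1 are our own assumptions):

    import math
    import random

    def free_energy(agent_energies, beta=1.0):
        # Step 3: partition function Z = sum_i exp(-beta * E_i);
        # step 4: Helmholtz potential A = -ln(Z), with T, V and N_i
        # held constant.
        Z = sum(math.exp(-beta * e) for e in agent_energies)
        return -math.log(Z)

    # Step 5: track A over the simulation; convergence of A(t) to a
    # constant value signals that the system has reached equilibrium.
    history = [free_energy([random.random() for _ in range(100)])
               for _ in range(50)]
    print(history[-1])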
3.2 Application to a classical reactive multiagent system
Flocking represents an approach to solving some kinds of problems, such as spatial distribution. It is a model [7] for the coordinated motion of groups of entities called boids. Craig Reynolds [7] realized that the motion of a flock of birds could be modeled by applying three simple rules to be followed by each boid: Cohesion: steer to move toward the average position of local flockmates; Separation: steer to avoid crowding local flockmates; Alignment: steer towards the average heading of local flockmates.
3.3 Interaction model
The environment is closed and the number of boids is fixed. Each bird corresponds to an agent. An agent only perceives flockmates inside its perception distance. The interaction model is based on the three forces defined before, with N the number of agents in the neighbourhood, R_i the relative position between the agent and the neighbourhood agent i, and P_agent the agent's position:
    F_Cohesion = [ (Σ_{i=1..N} R_i) / N ] − P_agent    (1)

    F_Separation = − Σ_{i=1..N} R_i ‖R_i‖^(−3)    (2)

    F_Alignment = (Σ_{i=1..N} Ṙ_i) / N    (3)

3.4 System energy

According to the interaction model, the energy measurement can be detailed as follows:
• Kinetic energy: in the following equation, the agent i is represented by its mass m_i and its speed V_i:

    E_K = (1/2) m_i V_i · V_i    (4)
• Potential energy: it is computed, for agent i, using the classical expression of the energy U (U = δW + δQ), where δW represents the work done on the system and δQ the heat flow (here δQ = 0, since no heat is dissipated). The work done on the system, δW, is expressed considering a conservative force (cf. Equation 5), with du a unit vector in the direction of the agent's speed:

    E_p = δW = ∫ F_total · du = ∫ F_C · du + ∫ F_S · du + ∫ F_A · du    (5)

From each boid's energy E_i = E_K + E_p we can then compute the partition function Z and the thermodynamic potential A, with T, V and N_i constant:

    A(T, V, N_i) = −ln(Z),  with  Z = Σ_i e^(−β E_i)    (6)

[Figure 1. Free energy A evolution during simulation; the curve shows an oscillation phase followed by stability.]

3.5 Boids simulation

The simulations run a group of 100 boids. Every simulation begins with a random dispersion of the boids in the environment. The simulation starts (cf. Figure 1, top left) with the lowest free energy A, because of the great agent dispersion. Then, following the Reynolds model, the boids form a moving group similar to a bird flock. During this phase, the system tends towards stability. The free energy oscillations indicate that the system is not yet in a stable state. Finally, the boids form a flock (cf. Figure 1, top right) and the system is in equilibrium. Thus, the free energy tends toward a constant value representing system stability.

4 Conclusion

Reactive multiagent systems are becoming an important field of research within application domains characterized by distributed aspects. In particular, development approaches for reactive MAS should include the possibility to evaluate the quality of emergent phenomena and even, in some cases, the fit to the application objective. The aim of this article was to present a new conceptual frame for the evaluation of the global state of reactive MAS. This evaluation is based on a local-to-global approach, inspired from statistical physics and thermodynamics. Statistical physics is generally considered to be one of the first scientific disciplines where statistical methods succeeded in linking the microscopic and the macroscopic points of view. In this work, we present an approach for the application of statistical physics to RMAS. Great attention has been given to the justifications and conditions of the use of statistical physics. This approach has been put into practice through a classical example: boids flocking. Simulation experiments have shown the relation between the indicator proposed in this paper and the system evolution. Additional research work is needed to extend the applicability of the approach to more complex phenomena.

REFERENCES
[1] Tucker Balch, ‘Hierarchic social entropy: an information theoretic measure of robot group diversity’, Autonomous Robots, 8(3), 209–237, (2000). [2] Roger Balian, From Microphysics to Macrophysics, Springer, 2007. [3] Jean-Michel Contet, Franck Gechter, Pablo Gruer, and Abder Koukam, ‘Multiagent system model for vehicle platooning with merge and split capabilities’, Third International Conference on Autonomous Robots and Agents ICARA, 41–46, (2006). [4] Nicolas Gaud, Franck Gechter, Stéphane Galland, and Abderrafiâa Koukam, ‘Holonic multiagent multilevel simulation: Application to real-time pedestrians simulation in urban environment’, Twentieth International Joint Conference on Artificial Intelligence, IJCAI’07, 1275–1280, (2007). [5] Franck Gechter, Vincent Chevrier, and François Charpillet, ‘A reactive agent-based problem-solving model: Application to localization and tracking’, ACM Transactions on Autonomous and Adaptive Systems, (November 2006). [6] H. Van Dyke Parunak and Sven Brueckner, ‘Entropy and self-organization in multi-agent systems’, in AGENTS ’01: Proceedings of the fifth international conference on Autonomous agents, pp. 124–130. ACM, (2001). [7] Craig W. Reynolds, ‘Flocks, herds, and schools: A distributed behavioral model’, Computer Graphics (ACM), 21(4), 25–34, (1987). [8] Daniel Yamins, ‘The emergence of global properties from local interactions’, volume 2006, pp. 1122–1124, Hakodate, Japan, (2006).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-867
Experience and Trust — A Systems-Theoretic Approach Norman Foo1 and Jochen Renz2 Abstract. An influential model of agent trust and experience is that of Jonker and Treur [Jonker and Treur 99]. In that model an agent uses its experience of the interactions of another agent to assess that agent's trustworthiness. We showed that key properties of that model are subsumed by classical mathematical systems theory. Using the latter theory we also clarify the issue of when two experience sequences may be regarded as equivalent. An intuitive feature of the Jonker and Treur model is that experience sequence orderings are respected by functions that map such sequences to trust orderings. We raise a question about another intuitive property, that of continuity of these functions, viz. that they map experience sequences that resemble each other to trust values that also resemble each other. Using fundamental results in the relationship between partial orders and topologies we also showed that these two intuitive properties are essentially equivalent.
1 INTRODUCTION In electronic internet trading systems like eBay an agent can rank other agents based on its assessment of the behavior of those agents in transactions. For an agent A observing another agent B over time (possibly even B's interactions with agents other than A), such sequential assessments may be said to form A's experience sequence of B, and result in its judgement of the trustworthiness of B. In an influential model of agent trust due to [Jonker and Treur 99], agents assess the quality of their interactions and map such experience sequences into a trust space. They required the experience sequence and trust spaces to be at least partially ordered, and the mapping to be order-preserving. They established properties of their model, including conditions for the updating of trust ranks that depend only on the existing rank and a new assessment of experience. In our paper we showed that the update and a number of other properties are in fact subsumed by classical mathematical systems theory. Space limitations restrict us to merely outlining our results, but a fuller version is in [Foo and Renz 07].
2 SYSTEMS-THEORETIC IMPLICATIONS We took the work of [Jonker and Treur 99] as a starting point, accepting in particular the discrete time framework (modelled as the natural numbers) for all functions. A sequel to that work is that by Treur [Treur 07] on properties of states arising from it. We used systems theory to (i) connect established propositions with their work, (ii) showed constraints on trust structure imposed by experience structure, (iii) suggested a way to topologize these and other derivative structures, and (iv) showed that order-preservation of the map from experience sequences to trust is equivalent to its continuity in the topologies.
The School of Computer Science and Engineering, University of New South Wales, Sydney NSW 2052, Australia 2 Research School of Information Science and Engineering, Australian National University, Canberra ACT 0200, Australia
[Figure 1. Nerode Equivalence: input segments ω1 and ω2, each followed by the same continuation μ, yield equal outputs F(ω1, μ) and F(ω2, μ).]
Conceptually the system we consider is a black box that accepts experience sequences as inputs and produces trust sequences as outputs. A basic result from systems theory (see [Padulo and Arbib 74] and [Zeigler, et.al. 2000]) guarantees that this black box can be endowed with a state space of trust values iff the input-output function F representing it is causal, i.e., for any point k in time the trust output at k depends only on the initial segment of the input sequence. This subsumes a key result of [Jonker and Treur 99]. Denoting the initial segment space of experience sequences by Ω̄, we then showed that the canonical state space is in fact a quotient space (see [Kelley 55]) of Ω̄, with the quotient arising from an equivalence relation known in systems theory as the Nerode equivalence (see [Padulo and Arbib 74]), denoted here by ≡N. Indeed, it follows that the trust space can be most succinctly identified with Ω̄/≡N. To explain ≡N we first make Ω̄ a semigroup, using the concatenation (denoted by ◦) of segments as the binary operation. Next, we use the input-output function F to induce a function F̄ that maps Ω̄ to corresponding-length output segments. For any two input segments ω1 and ω2 we defined ω1 ≡N ω2 if for any arbitrary segment μ, F̄(ω1 ◦ μ) agrees with F̄(ω2 ◦ μ) from the respective times when μ is appended. This is only well-defined if F is causal. See Figure 1 for intuition. It is then intuitive that ω1 and ω2 cannot be distinguished once their end points are reached. Thus, Ω̄/≡N qualifies as a state space of the system. It is a corollary of that result, known as the State Realization Theorem (see [Padulo and Arbib 74]), that there is an update function δ from inputs and current state to the next state as follows: δ([ω], e) = [ω ◦ ê], where ê is the unit-length segment with value e. Figure 2 illustrates the main points. In the figure, γ is the map that "reads" the trust and outputs it into the trust value space Vout, and η is the map induced by the combination of γ and the state update function δ. Also, F is the earlier defined map F̄ restricted to the (value at the) end of its input segment.
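On a finite alphabet and bounded horizon the Nerode classes of a causal F̄ can be computed by brute force; a sketch with a toy running-sum system (our own example):

    from itertools import product

    def nerode_classes(alphabet, F_bar, seg_len, probe_len):
        # Group input segments by their outputs under every probe
        # continuation mu, read from the point where mu is appended.
        def signature(omega):
            return tuple(
                tuple(F_bar(omega + list(mu))[len(omega):])
                for mu in product(alphabet, repeat=probe_len)
            )
        classes = {}
        for omega in product(alphabet, repeat=seg_len):
            classes.setdefault(signature(list(omega)), []).append(omega)
        return list(classes.values())

    # Toy causal F_bar: the output at time k is the running sum so far,
    # so the canonical state is just that sum.
    def F_bar(segment):
        out, s = [], 0
        for e in segment:
            s += e
            out.append(s)
        return out

    print(len(nerode_classes([0, 1], F_bar, seg_len=3, probe_len=2)))  # 4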
[Figure 2. State Realization – The Key Ideas: the input-output map F : Ω̄ → Vout, the quotient map ψ : Ω̄ → Ω̄/≡, the read-out γ, and the induced map η on Ω̄/≡ × Ω̄.]
It can be shown that the state realization above, call it R, is in a strong sense the most economical among all possible state representations. Formally, it is said that this realization is canonical, in that if there is another realization R′ that reproduces the same F̄, then there is a unique homomorphism that maps R′ to R. In particular, a typical assumption (see e.g., Treur [Treur 07]) that input (trust, etc.) sequences and system states are both viable primitives in formalizing temporal dynamics is subject to this canonical constraint.
3 EXPERIENCE AND TRUST ORDERINGS One example ordering considered by [Jonker and Treur 99] was worst < bad < neutral < good < best for experience values. They then used these to partially order, say, experience sequences. We may as well identify these sequences with the Ω̄ above, and the trust space T with Ω̄/≡N, which can be partially ordered by, say, ≤T. The order-preservation postulate of [Jonker and Treur 99] then translates in systems theory to the requirement that the quotient map ψ from Ω̄ to T (= Ω̄/≡N), defined by ψ(ω) = [ω]≡N, be order-preserving. That is an intuitive requirement: good experiences should lead to good trust. If a measure of "nearness" is placed on experience sequences and trust values, we may also desire the property that ψ maps near sequences to near trust. The formalization of this is the continuity of ψ. The most abstract way to do this is via topologies for both Ω̄ and Ω̄/≡N. Fortunately, there is already much classical machinery [Kelley 55] to do this. We switch notation to the near synonyms of Ω̄ (calling it E) and Ω̄/≡N (calling it T) for brevity. If a topology τE is given to E, then since ψ is the quotient map, a natural topology τT is induced by ψ that makes it both continuous and open. We then showed that under a simple topology, the Alexandrov topology (see [Arenas 99] or [Wiki Alexandrov]), the two requirements above, viz., order-preservation and continuity, are equivalent. We now outline how this was done. There is a close connection between partial orders (in fact pre-orders will do) and topologies on a space. Given a partial order ≼ on a space S, the Alexandrov topology defined by it has as open sets the so-called up-sets, viz., subsets θ such that x ∈ θ and x ≼ z implies z ∈ θ. Conversely, given a topology τ on a set S, the specialization pre-order ≤ is defined by x ≤ y iff y is in every open set that contains x. It is easily seen that ≤ so defined is indeed a pre-order. If we had started with some partial order ≼ and used it to define the Alexandrov topology as before, it is natural to ask what is the specialization order that arises from that topology. The answer is that we get back ≼, and although there are other topologies (e.g. the Scott topology [Abramsky and Jung 94] or [Stoy 77]) that have this "reversal" property, the Alexandrov topology is the finest one. In this way the partial order ≤E defines the Alexandrov topology on
the input segment space Ω̄ (which in our context is identified with the space of experience sequences E) and is induced by it. Any topology that is placed on the trust space T will induce a specialization pre-order (partial orders are special cases). So what is a suitable topology for it? If we identify T with the range of ψ, i.e., Ω̄/≡N, then T is the quotient space of Ω̄ and can thus be given the quotient topology. Experience values in the real interval [−1, 1], rather than finite or even discrete values, may alter the character of the results and observations, because the experience and trust spaces can now be infinite and continuous. Continuous values lend themselves to measurements of nearness using the metrics well known in functional analysis, and it is an obvious question whether the nexus between order-preservation for E and T and continuity of the map ψ still holds. Unfortunately, if the space is Hausdorff (which is the most familiar case), its corresponding Alexandrov topology reduces to the discrete topology, which is trivial for convergence. Therefore the requirements for order-preservation and continuity are then distinct.
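The order-topology round trip described above can also be checked mechanically on a finite order. The following Python sketch (our illustration) enumerates the Alexandrov open sets, i.e. the up-sets, of the chain worst < bad < neutral < good < best, and verifies that the specialization pre-order of the resulting topology recovers the original order.

from itertools import chain, combinations

def alexandrov_opens(elements, leq):
    # All up-sets of a finite (pre)order: subsets theta with x in theta and
    # leq(x, z) implying z in theta.
    def is_up_set(theta):
        return all(z in theta for x in theta for z in elements if leq(x, z))
    subsets = chain.from_iterable(combinations(elements, k) for k in range(len(elements) + 1))
    return [set(s) for s in subsets if is_up_set(set(s))]

def specialization(elements, opens):
    # Specialization pre-order of a topology: x <= y iff every open set
    # containing x also contains y.
    return lambda x, y: all(y in o for o in opens if x in o)

vals = ["worst", "bad", "neutral", "good", "best"]
rank = {v: i for i, v in enumerate(vals)}
leq = lambda x, y: rank[x] <= rank[y]
opens = alexandrov_opens(vals, leq)
recovered = specialization(vals, opens)
assert all(recovered(x, y) == leq(x, y) for x in vals for y in vals)  # order recovered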
4 CONCLUSION

We used classical mathematical systems theory to underpin the foundations of an influential model of agent trust and experience. It was shown that many of the properties of that model follow from results in systems theory. Moreover, the latter provides deep insights into the structural interaction between experience and trust sequences, in particular what it means to say that trust is condensed experience. An intuitive feature of that model is that experience sequence orderings are respected by functions that map such sequences to trust orderings. We raised a question about another intuitive property — that of continuity of these functions, viz. that they map experience sequences that resemble each other to trust values that also resemble each other. Using fundamental results on the relationship between partial orders and topologies, we showed that these two intuitive properties are essentially equivalent.
REFERENCES
[Abramsky and Jung 94] S. Abramsky and A. Jung: Domain theory. In S. Abramsky, D. M. Gabbay, T. S. E. Maibaum (eds), Handbook of Logic in Computer Science, vol. III, Oxford University Press, 1994.
[Wiki Alexandrov] The Wikipedia entry on Alexandrov topology: http://en.wikipedia.org/wiki/Alexandrov_topology.
[Arenas 99] F.G. Arenas, Alexandroff spaces, Acta Math. Univ. Comenianae, Vol. LXVIII, 1 (1999), pp. 17–25.
[Foo and Renz 07] N. Foo and J. Renz, Experience and Trust: A Systems-Theoretic Approach, UNSW CSE Tech Report UNSW-CSE-TR-0717, ftp://ftp.cse.unsw.edu.au/pub/doc/papers/UNSW/0717.pdf.
[Jonker and Treur 99] C.M. Jonker and J. Treur, Formal Analysis of Models for the Dynamics of Trust based on Experiences. In: F.J. Garijo, M. Boman (eds.), Multi-Agent System Engineering, Proceedings of the 9th European Workshop on Modelling Autonomous Agents in a Multi-Agent World, MAAMAW'99, Lecture Notes in AI, vol. 1647, Springer Verlag, Berlin, 1999, pp. 221–232.
[Padulo and Arbib 74] L. Padulo and M.A. Arbib, System Theory: A Unified State-Space Approach to Continuous and Discrete Systems, Saunders, Philadelphia, 1974.
[Kelley 55] J.L. Kelley, General Topology, Springer Verlag, 1955 (reprinted).
[Stoy 77] J.E. Stoy, Denotational Semantics: The Scott-Strachey Approach to Programming Language Semantics, MIT Press, Cambridge, Massachusetts, 1977.
[Treur 07] J. Treur, Temporal Factorisation: Realisation of Mediating State Properties for Dynamics, Cognitive Systems Research, Volume 8, Issue 2, June 2007, pp. 75–88.
[Zeigler et al. 2000] B.P. Zeigler, H. Praehofer and T.G. Kim, Theory of Modeling and Simulation: Integrating Discrete Event and Continuous Complex Dynamic Systems, 2nd ed., Academic Press, San Diego, 2000.
Trust-Aided Acquisition Of Unverifiable Information Eugen Staab and Volker Fusenig and Thomas Engel Abstract. We propose a mechanism for the acquisition of information from potentially unreliable sources. Our mechanism addresses the case where the acquired information cannot be verified. The idea is to intersperse questions ("challenges") for which the correct answers are known. By evaluating the answers to these challenges, probabilistic conclusions about the correctness of the unverifiable information can be drawn. Fewer challenges need to be used if an information provider has proven to be trustworthy. Our approach can resist collusion and shows great promise for various application scenarios such as grid computing or peer-to-peer networks.
1 Introduction
Much research addresses trust that is based on direct experiences [8, 6]. These direct experiences result from evaluating the outcomes of interactions with other agents. Such an evaluation, however, is not possible when the outcome of an interaction is information that cannot be verified, or whose verification is too costly. We give an example to illustrate this problem. Example 1 Assume agent Alice needs to know the first 100 digits of π. However, Alice cannot compute these digits because she is either incapable of performing the necessary calculations on her own or she is out of resources. So Alice asks another agent, Bob, for these digits of π. Although Bob knows how to calculate them, he returns three correct digits followed by 97 random digits to Alice in order to save resources. Consequently, Alice, who cannot verify the information, uses the wrong digits of π in her further work. This will cause additional costs for her, and if she is not aware of them, she will not even classify the experience with Bob as a negative one. To overcome this problem, an agent could ask several information providers and compare their answers; if the answers are not the same, they are discarded. However, this so-called redundancy in computation, which is used for example in distributed computing (e.g. [1]), can fail to detect collusion between several malicious information providers. Therefore, we propose an approach which allows for estimating the correctness of acquired information without using redundancy.
2 Mechanism for Information-Acquisition
In the following, we describe one run of the mechanism. An agent wants to get answers to m questions, the real requests.

1 University of Luxembourg, Luxembourg, email: eugen.staab@uni.lu
In our particular case, the agent will not be able to verify the corresponding answers, because he is incapable of doing so or does not want to spend resources on it. Instead, before sending the request, the agent adds n challenges for which the answers are known to him. These challenges are chosen in such a way that another agent is not able to easily distinguish them from the real requests – how this choice can be made depends on the concrete setting (see Sect. 4 for examples). The number of challenges n depends on the number of real requests m and on how trustworthy the selected information provider has proven to be: the more accurate the information acquired from him was in the past (the more trustworthy he seems to be), the fewer challenges are used. However, a minimal number of challenges is always retained to account for the first-time offender problem (see [7]). The agent randomly merges the m real requests and the n challenges into a vector of size m + n. This request-vector is then transferred to the information provider, who is expected to reply with a response-vector of the same size. After having received the response-vector, the agent verifies the answers to the challenges and finds r correct and s incorrect answers, with r + s = n. The agent uses r and s as the basis for the following three computations:
1. Estimate the error rate of the answers to the real requests: probability theory can be applied here because the n challenges and m real requests were randomly distributed, which is, from a probabilistic point of view, tantamount to randomly picking n samples out of the m + n answers.
2. Decide whether the answers to the real requests seem to be accurate enough: if the estimated error rate is too high, the information is requested again from other agents.
3. Assess (or reassess) the trustworthiness of the information provider, based on the current and past response-vectors: the number of challenges used for future requests to the same information provider is decreased if he is now seen to be more trustworthy than before (and vice versa).
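The following Python sketch illustrates one run under assumed data structures (the paper does not prescribe a concrete estimator): requests and challenges are shuffled into a single vector, the challenge outcomes give the point estimate s/n of the error rate, and the random-sampling argument in item 1 yields a hypergeometric likelihood for any hypothesised number k of wrong answers among the real requests.

import random
from math import comb

def build_request_vector(real_requests, challenges):
    # Randomly merge real requests and challenges; remember where the challenges sit.
    tagged = [(q, False) for q in real_requests] + [(q, True) for q in challenges]
    random.shuffle(tagged)
    positions = [i for i, (_, is_challenge) in enumerate(tagged) if is_challenge]
    return [q for q, _ in tagged], positions

def error_rate_estimate(r, s):
    # Point estimate of the provider's error rate from r correct / s incorrect challenges.
    return s / (r + s)

def likelihood_k_errors(k, m, n, s):
    # P(s of the n challenge answers are wrong | k of the m real answers are wrong),
    # treating the challenges as a uniform random sample of n out of the m + n answers.
    wrong_total = k + s
    return comb(wrong_total, s) * comb(m + n - wrong_total, n - s) / comb(m + n, n)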
3 Discussion
In this section we discuss several issues concerning the practical use of the mechanism. Optimal number of challenges Whenever a request-vector is composed, it has to be decided how many challenges are used. The more challenges are used, the better the accuracy of the mentioned error rate estimate will be. At the same time the number of challenges should be kept small: one can assume that certain costs arise when generating a challenge,
requesting the answer (the information provider may get some payment) and evaluating the answer. This optimization problem is subject to future work. In scenarios where a lack of resources is the only reason for not being able to verify acquired information, real requests can be declared to be challenges after a response has been received. This has the advantage that challenges cannot be disclosed by the information provider (there are no challenges beforehand) and that lower costs arise. Moreover, an optimal number of challenges can be determined during verification, based on statistical considerations. This is also possible in scenarios where, for reasons of practicability, real requests and challenges are not bundled in a vector but distributed over time. Here, an agent can decide on the fly whether to intersperse more or fewer challenges. Collusion For the choice of challenges, two important rules have to be obeyed in order to avoid the possibility of collusion: 1. If a request is resent to another agent because it was not answered satisfactorily, the same challenges are to be used. 2. For two requests for differing information, different challenges are to be used. The reader can easily verify that otherwise colluding agents would be able to identify real requests and challenges simply by comparing the different request-vectors and checking what has changed. Malicious providers A malicious information provider might try to answer all challenges correctly and at the same time answer all real requests incorrectly. However, assuming that he cannot distinguish challenges from real requests, the only way for him to achieve this objective would be to guess the number and the positions of the challenges. Note here that our mechanism does not aim at distinguishing between malicious, incompetent or unmotivated information providers. Context-sensitivity The context-sensitivity of trust is important in certain fields of information acquisition. Agents may be competent in some areas ("what is the prime factorization of 12345?") and incompetent in others ("will it rain today?"). To make probability theory applicable, it is necessary to choose all questions in one request-vector from the same context. Apart from that, the approach to context-sensitive trust by Rehák and Pechoucek [7] seems suitable for a combination with our work. Alternatively, techniques such as Latent Semantic Indexing (LSI) [4] or Concept Indexing [5] could be used. These techniques would allow the context space to be defined on the basis of acquired natural-language text.
4 Application Scenarios
As motivated by Ex. 1, where Alice requested some digits of π, the mechanism is intended for cases where calculations are outsourced and the results cannot be verified. Thus, our mechanism can be applied to the scenarios of grid-computing [2] or cloud-computing [10]. In these cases, challenges can either be provided by trusted nodes or be computed whenever the system of the requesting agent is idle. In Wireless Ad Hoc Networks [9], exchanged routing information can only be verified by trial and error. Our mechanism
would help to detect incorrect routing information provided by malicious or incompetent nodes without testing the route. The challenges can be chosen to be questions about routes that are known to exist (e.g. because packets have been sent over these routes in the recent past). In peer-to-peer networks, our mechanism can be used against pollution and poisoning attacks (see [3]). Challenges would consist of requests for files that have already been verified by a human to match their description and to be free of "bad chunks". Note that in these settings, a small number of challenges for a given number of real requests would be essential for the practicability of the mechanism. The verification of a partly downloaded response to a challenge should start as soon as a certain number of packets has been received.
5 Conclusion & Future Work
We presented a mechanism for information acquisition that can be used in cases where the acquired information cannot be verified. The mechanism mixes the real requests with some challenges in a random fashion. This way, an agent can use the evaluated answers to the challenges in order to probabilistically estimate the error rate of the unverifiable answers to the real requests. Currently, we are working out the mechanism by focusing on the following four issues: how to estimate the error rate for the real requests, how to find an "optimal" number of challenges for a given number of real requests, how to use trust to reduce the number of challenges, and how to decide whether some acquired information is accurate enough.
REFERENCES [1] David P. Anderson, Jeff Cobb, Eric Korpela, Matt Lebofsky, and Dan Werthimer, ‘SETI@home: an experiment in publicresource computing’, Commun. ACM, 45(11), 56–61, (2002). [2] Fran Berman, Geoffrey Fox, and Anthony J. G. Hey, Grid Computing: Making the Global Infrastructure a Reality, John Wiley & Sons, Inc., New York, NY, USA, 2003. [3] Nicolas Christin, Andreas S. Weigend, and John Chuang, ‘Content availability, pollution and poisoning in file sharing peer-to-peer networks’, in EC ’05: Proc. of the 6th ACM Conf. on Electronic commerce, pp. 68–77. ACM, (2005). [4] S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester, and R. Harshman, ‘Using latent semantic analysis to improve access to textual information’, in CHI ’88: Proc. of the SIGCHI Conf. on Human Factors in Computing Systems, pp. 281–285, New York, NY, USA, (1988). ACM. [5] George Karypis and Euihong Han, ‘Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval and categorization’, Technical Report TR-000016, University of Minnesota, (2000). [6] Sarvapali D. Ramchurn, T. D. Huynh, and Nicholas R. Jennings, ‘Trust in multi-agent systems’, Knowl. Eng. Rev., 19(1), 1–25, (2004). [7] Martin Reh´ ak and Michal Pechoucek, ‘Trust modeling with context representation and generalized identities’, in CIA ’07: Proc. of the 11th Int. Workshop on Cooperative Information Agents, pp. 298–312. Springer Verlag, (2007). [8] Jordi Sabater and Carles Sierra, ‘Review on computational trust and reputation models’, Artif. Intell. Rev., 24(1), 33– 60, (2005). [9] C.-K. Toh, Ad Hoc Wireless Networks: Protocols and Systems, Prentice Hall PTR, Upper Saddle River, NJ, USA, 2001. [10] Aaron Weiss, ‘Computing in the clouds’, netWorker, 11(4), 16–25, (2007).
BIDFLOW: a New Graph-Based Bidding Language for Combinatorial Auctions Madalina Croitoru1 and Cornelius Croitoru2 and Paul Lewis3 Abstract. In this paper we introduce a new graph-based bidding language for combinatorial auctions. In our language, each bidder submits to the arbitrator a generalized flow network (netbid) representing her bids. The interpretation of the winner determination problem as an aggregation of individual preferences represented as flowbids allows building an aggregate netbid for its representation. Labelling the nodes with appropriate procedural functions considerably improves upon the expressivity of previous bidding languages.
1 Introduction
A Combinatorial Auction (CA) is an abstraction of a market-based centralized distributed system for the determination of welfare allocations of heterogeneous indivisible resources. In such a Resource Allocation (RA) system, there is a central node a, the auctioneer, and a set of n nodes, I = {1, . . . , n}, the bidders, which concurrently demand bundles of resources from a common set of available resources, R = {r1, . . . , rm}, held by the auctioneer. The auctioneer broadcasts R to all n bidders, asking them to submit, in a specified common language, the bidding language, their R-valuations over bundles of resources. Bidder i's R-valuation, vi, is a non-negative real function on P(R), expressing for each bundle S ⊆ R the individual interest (value), vi(S), of bidder i in obtaining S. It is assumed that vi(∅) = 0, and vi(S) ≤ vi(T) whenever S ⊆ T. No bidder i knows the valuation of any of the other n − 1 bidders, but all the participants in the system have agreed on a welfare outcome: based on the bidders' R-valuations, the auctioneer will determine a resource allocation O = (O1, . . . , On), specifying for each bidder i her obtained bundle Oi. O is a (weak) n-partition of R, that is, Oi ∩ Oj = ∅ for any different bidders i and j, and ∪i=1,n Oi = R. Furthermore, the global (social) value of the outcome, va(O) = ∑j=1,n vj(Oj), is a maximum value allocation, that is, va(O) = max{va(O′) | O′ is an n-partition of R}. The task of the auctioneer to find a maximum value allocation for a given set of bidder valuations is called the Winner Determination Problem (WDP). This is an NP-hard problem, being equivalent to weighted set-packing ([6]). WDP is expressed as an integer linear program and solved using standard methods. WDP can be parameterized by the set R of resources, considering a fixed set I of bidders and bidders' R-valuations {vi | i ∈ I}. Therefore we can write WDP(R) and its corresponding maximum value va(R). With these notations, WDP(S) and va(S) are well defined for each subset S ⊆ R (by considering the restriction of vi to P(S)). We have obtained a global R-valuation va assigning to each bundle S ⊆ R the maximum value
1 University of Southampton, UK; mc3@ecs.soton.ac.uk; work supported by the OpenKnowledge STREP project IST-FP11V341.
2 Al. I. Cuza University, Iasi, Romania; croitoru@infoiasi.ro
3 University of Southampton, UK; phl@ecs.soton.ac.uk; work supported by the OpenKnowledge STREP project IST-FP11V341.
of an S-allocation to the bidders from I. Therefore WDP can be viewed as the problem of constructing a social aggregation of the R-valuations of the bidders. If we denote by V(R) the set of all R-valuations, it is natural to consider in our RA system the set of superadditive R-valuations, due to the synergies among the resources: SV(R) = {v ∈ V(R) | v(A1 ∪ A2) ≥ v(A1) + v(A2) for all A1, A2 ⊆ R, A1 ∩ A2 = ∅}. It is easy to see that if all vi, i ∈ I, are superadditive then va is superadditive, and the following theorem holds:
Theorem 1 If all bidders' R-valuations are superadditive, then the aggregate R-valuation va satisfies va(A) = max{va(B) + va(A − B) | B ⊆ A} for all A ⊆ R.
Let v ∈ V(R). A v-basis is any B ⊆ P(R) such that for each A ⊆ R we have v(A) = max{v(B) + v(A − B) | B ∈ B, B ⊆ A}. In other words, if B is a v-basis, then the value of v(A) is uniquely determined by the values of v on the elements of the basis contained in A, for each A ⊆ R. The elements of a v-basis, B ∈ B, are called bundles, and the pairs (B, v(B)), B ∈ B, are called bids. It is not difficult to prove that an R-valuation v ∈ V(R) has a v-basis iff v ∈ SV(R) ([5]), and furthermore, the following representational theorem holds:
Theorem 2 If in a RA system the bidders' superadditive R-valuations vi are represented using vi-bases Bi for each i ∈ I, then the aggregate R-valuation va is represented by the va-basis Ba = ∪i∈I Bi, by taking va(B) = max{vi(B) | i ∈ I and B ∈ Bi}, for all B ∈ Ba.
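The recursion in Theorem 1 makes a valuation directly computable from a basis. The Python sketch below (with made-up bundle values, not from the paper) evaluates v(A) = max over basis bundles B ⊆ A of v(B) + v(A − B) by memoized recursion.

from functools import lru_cache

# Hypothetical v-basis given as bids (B, v(B)); the synergy value 7 > 2 + 3 is invented.
BIDS = {frozenset({"r1"}): 2, frozenset({"r2"}): 3, frozenset({"r1", "r2"}): 7}

@lru_cache(maxsize=None)
def valuation(A):
    # v(A) = max over non-empty basis bundles B contained in A of v(B) + v(A - B);
    # 0 when no bundle fits (in particular, v(empty set) = 0).
    best = 0
    for B, value in BIDS.items():
        if B and B <= A:
            best = max(best, value + valuation(A - B))
    return best

print(valuation(frozenset({"r1", "r2"})))  # 7: the synergy bundle beats 2 + 3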
2 Approach
In the new language, each bidder submits to the arbitrator a generalized flow network called a NETBID, which represents the valuation of the bidder by specifying a basis for it. More precisely, if the set of resources is R = {r1, r2, . . . , rm}, then in the NETBID of each agent there is a special starting node s connected to all nodes rj by directed edges with capacity 1. An integer flow in a NETBID will represent an assignment of resources to the agent, by considering the set of resources rj with flow value 1 on the directed edge (s, rj). The node rj is a usual node, i.e. it satisfies the conservation law: the total (sum) of incoming flows equals the total of outgoing flows. In the network there are also bundle nodes, which do not satisfy the conservation law and which are used to combine (via their input flows) different goods into subsets of goods. The combination is governed by the (integer) directed edge flows, together with appropriate lower and capacity bounds. Once the NETBID is constructed, any maximum value flow (in the sense described below) will represent the valuation function of the agent. For example, the NETBID in Figure 1 expresses that the bidder is interested in a bundle consisting of two or three resources of type E, together with the resource M, which adds 10 to the sum of the values of the particular resources of type E. Formally, a NETBID, the bidflows and their values are defined as follows:
Figure 1. The NETBID for an example from [2]

Definition 1 An R-NETBID is a tuple N = (D, s, t, c, l, λ), where:
1. D = (V, E) is a digraph with two distinguished nodes s, t ∈ V; the other nodes, V − {s, t}, are partitioned into R ∪ B ∪ I: R is the set of resource nodes, B is the set of bundle nodes and I is the set of interior nodes. There is a directed edge (s, r) ∈ E for each r ∈ R, and also (b, t) ∈ E for all b ∈ B. There are no other directed edges entering a resource node or leaving a bundle node.
2. c, l are nonnegative integer partial functions defined on the set of edges of D; if (i, j) ∈ E and c is defined on (i, j), then c((i, j)) ∈ Z+, denoted cij, is the capacity of edge (i, j); l((i, j)) ∈ Z+, if defined, is the lower bound on the edge (i, j) and is denoted lij; if (i, j) has been assigned both a capacity and a lower bound, then lij ≤ cij. All edges (s, r) have csr = 1 and lsr = 0. No edge (b, t) has a capacity or a lower bound.
3. λ is a labelling function on V − {s, t} which assigns to a vertex v a pair of rules (λ1(v), λ2(v)) (described in the next definitions).

Definition 2 Let N = (D, s, t, c, l, λ) be an R-NETBID. A bidflow in N is a function f : E → Z+ such that (fij denotes f((i, j))):
1. For each directed edge (i, j) ∈ E: if fij > 0 and cij is defined, then fij ≤ cij; if fij > 0 and lij is defined, then fij ≥ lij.
2. If v ∈ V − {s, t} has λ1(v) = conservation, then ∑(i,v)∈E(D) fiv = ∑(v,i)∈E(D) fvi.
3. For each v ∈ B, fvt ∈ {0, 1}; fvt = 1 if and only if for each w ∈ R ∪ I such that (w, v) ∈ E we have fwv > 0.
The set of all bidflows in N is denoted by FN. In order to simplify our presentation we have assumed here that for each v ∈ V − {s, t}, λ1(v) ∈ {conservation, bundle}, giving rise to the flow rules in 2 and 3 above. In Figure 1, the function λ1(v) is indicated by the colour of the node v: a gray node is a bundle node and a white node is a conservation node.

Definition 3 Let f be a bidflow in the R-NETBID N = (D, s, t, c, l, λ). The value of f, val(f), is defined as val(f) = ∑b∈B val(b) fbt, where val(v) is 0 if v = s, and val(v) = λ2(Df⁻¹(v)) if v ≠ s, t. Here Df⁻¹(v) is the set of all vertices w ∈ V(D) such that (w, v) ∈ E(D) and fwv > 0, and λ2(Df⁻¹(v)) is the rule (specified by the second label associated to vertex v) for computing val(v) from the values of its predecessors which send flows into v.

Definition 4 Let N = (D, s, t, c, l, λ). The R-valuation designated by N is the function vN : P(R) → R+, where for each S ⊆ R, vN(S) = max{val(f) | f ∈ FN, fsr = 0 for all r ∈ R − S}.

By the above two definitions, the value associated by N to a set S of resources is the maximum sum of the values of the (disjoint) bundles which are contained in the set (assignment) S. This is in concordance with the definition of a v-basis given in Section 1 for a superadditive valuation v. However, the NETBID structure defined above is flexible enough to express any valuation. If the bidder desires to express that at most k bundles from some set of bundle nodes may be considered, then these nodes are connected to a new interior node, and this last node is linked to a new superbundle node by a directed edge having lower bound 1 and capacity k. Clearly, any valuation represented in a XOR language can be obtained in such a way, and any R-valuation can be represented [5, 3]. The NETBIDs submitted by the bidders are merged by the arbitrator into a common NETBID sharing only the nodes corresponding to s and R, plus a common t node onto which the corresponding t nodes of the individual NETBIDs are projected. This common NETBID, Na, is a symbolic representation of the aggregate valuation of the society and is illustrated in Figure 2 below.

Figure 2. Aggregate NETBID
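Definition 2 can be checked mechanically. The Python sketch below (an assumed dict-based encoding, not part of the paper) verifies the three bidflow conditions for a NETBID whose non-terminal nodes carry a λ1 label of either 'conservation' or 'bundle'.

def is_bidflow(edges, flow, label, lower, capacity):
    # edges: list of (i, j) pairs; flow/lower/capacity: dicts keyed by edge;
    # label: maps each node in V - {s, t} to 'conservation' or 'bundle'.
    for e in edges:                        # condition 1: bounds where flow is positive
        f = flow.get(e, 0)
        if f > 0 and e in capacity and f > capacity[e]:
            return False
        if f > 0 and e in lower and f < lower[e]:
            return False
    nodes = {v for e in edges for v in e} - {"s", "t"}
    for v in nodes:
        incoming = [(i, j) for (i, j) in edges if j == v]
        outgoing = [(i, j) for (i, j) in edges if i == v]
        if label[v] == "conservation":     # condition 2: flow conservation
            if sum(flow.get(e, 0) for e in incoming) != sum(flow.get(e, 0) for e in outgoing):
                return False
        else:                              # condition 3: bundle fires iff all inputs flow
            f_vt = flow.get((v, "t"), 0)
            if f_vt not in (0, 1) or (f_vt == 1) != all(flow.get(e, 0) > 0 for e in incoming):
                return False
    return True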
From this construction, the following theorem can be proved:
Theorem 3 If each bidder i's R-valuation vi is represented by an R-NETBID Ni (i ∈ I), then the aggregate R-valuation va is designated by the aggregate R-NETBID Na, that is, va = vNa.
3 Conclusion
In this paper we proposed a new visual framework for bidding languages. Several bidding languages for CAs have previously been proposed: "logical bidding languages" [5, 3, 7]; the LGB language [1]; TBBL, a tree-based bidding language that has several novel properties [2]; Petri net formalisms [4]; etc. Our bidflows can be viewed as concise static tools for representing (following the TBBL spirit) dynamic resource allocation processes. Simple NETBID constructions can simulate "phantom variables" ([5]), hence expressing any R-valuation. Future work will investigate the proposed alternative in practice for business intelligence.
REFERENCES [1] C. Boutilier and H. Hoos. Bidding languages for combinatorial auctions. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI)., pages 1211–1217, 2001. [2] R. Cavallo, D. Parkes, A. Juda, A. Kirsch, A. Kulesza, S. Lahaie, B. Lubin, L. Michael, and J. Shneidman. Tbbl: A tree-based bidding language for iterative combinatorial exchanges. In Int’l Joint Conf’s on A.I.: Workshop on Advances in Preference Handling, 2005. [3] Y. Fujisima, K. Leyton-Brown, and Y. Shoham. Taming the computational complexity of combinatorial auctions. In Proceedings of the 16th Int’l Joint Conf on AI, pages 548–553, 1999. [4] A. Giovannucci, J. Rodriguez-Aguilar, J. Cerquides, and U. Endriss. Winner determination for mixed multi-unit combinatorial auctions via petri nets. In Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems. ACM, 2007. [5] N. Nisan. Bidding and allocations in combinatorial auctions. In ACM Conference on Electronic Commerce (EC-2000), 2000. [6] M. Rothkopf, A. Pekec, and R. Harstad. Computationally manageable combinatorial auctions. Management Science, 44:1131–1147, 1998. [7] T. Sandholm. emediator: a next generation electronic commerce server. In Proceedings of the 4th Int’l Conf on Autonomous Agents, pages 341– 348, 2000.
Multi-Agent Reinforcement Learning for Intrusion Detection: A Case Study and Evaluation Arturo Servin and Daniel Kudenko1 Abstract. In this paper we propose a novel approach to train Multi-Agent Reinforcement Learning (MARL) agents to cooperate to detect intrusions in the form of normal and abnormal states in the network. We present an architecture of distributed sensor and decision agents that learn how to identify normal and abnormal states of the network using Reinforcement Learning (RL). Sensor agents extract network-state information using tile coding as a function approximation technique and send communication signals in the form of actions to decision agents. By means of an online process, sensor and decision agents learn the semantics of the communication actions. In this paper we detail the learning process and the operation of the agent architecture. We also present tests and results of our research work in an intrusion detection case study, using a realistic network simulation where sensor and decision agents learn to identify normal and abnormal states of the network.
1 Introduction
Intrusion Detection Systems (IDS) play an important role in the protection of computer networks and information systems from intruders and attacks. Despite previous research efforts, there are still areas where IDS have not satisfied all the requirements of modern computer systems. Specifically, Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks have received significant attention due to the increased security vulnerabilities in end-user software and bot-nets. A special case of DoS are flooding-based DoS and flooding-based DDoS attacks. These are generally based on a flood of packets with the intention of overfilling the network resources of the victim. It is especially difficult to create a flexible hand-coded IDS for such attacks, and machine learning is a promising avenue for tackling the problem. Due to the distributed nature of this type of attack and the complexities that its detection involves, we propose a distributed reinforcement learning (RL) approach. In RL, agents learn to act optimally via observations and feedback from the environment in the form of positive or negative rewards [7]. Multi-Agent RL has been successfully used to solve some challenging problems in various areas. Despite its apparent appeal, MARL needs to deal with problems such as the size of the action-state space, which makes scalability an issue; the partial information that agents have of other agents' observations and actions; a non-stationary environment as a result of the actions of other agents; and the credit assignment problem. To overcome these problems we present an architecture of distributed sensor agents that get information from the environment and share it, in the form of communication signals, with other agents
1 University of York, United Kingdom, email: {aservin, kudenko}@cs.york.ac.uk
higher up the hierarchy. Without any previous semantic knowledge about the signals, higher-level hierarchical agents interpret them and consequently interact with the environment. This results in a learning process where agents with partial observability make decisions and coordinate their own actions to reach a common goal. In order to evaluate our proposal we explore its use in Distributed Intrusion Detection Systems (DIDS).
2 Agent Architecture
We propose an architecture of autonomous agents divided into sensor agents (SA) and decision agents (DA). SAs collect and analyse state information about the environment. Each SA receives only partial information about the global state of the environment, and they map this local state to communication action-signals. These signals are received by the DA, which, without any previous knowledge, learns their semantics and how to interpret their meaning. In this way, the DA tries to model the local state of the cell environment. Then it decides which final action to trigger (in our case study it triggers an alarm to the network operator). When the DA triggers the action and this is appropriate in accordance with the goal pursued, all the agents receive a positive reward. If the action is not correct, all the agents receive a negative reward. The goal is to coordinate the signals sent by the SAs to the DA in order to represent the global state of the environment. To detect the abnormal states that DoS and DDoS attacks generate in a computer network, we have designed an architecture composed of four agents: a Congestion Sensor Agent (CSA), a Delay Sensor Agent (DSA), a Flow Sensor Agent (FSA) and the Decision Agent (DA). We need this diversity of sensor information to develop a more reliable IDS. The idea is that each sensor agent perceives different information depending on its capabilities, its operative task and where it is deployed in the network. Furthermore, not all the features are available at a single point in the network. Flow and congestion information may be measured at a border router between the Internet and the intranet, whilst delay information may only be available from an internal router.
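As a rough illustration of the reward and signal flow (not the authors' exact algorithm, which uses tile coding over continuous network features), the Python sketch below shows a decision agent that ε-greedily Q-learns an alarm policy over the joint vector of, here, symbolic sensor signals, with the shared global reward applied after each decision.

import random
from collections import defaultdict

ALPHA, EPSILON = 0.1, 0.1
ACTIONS = ("alarm", "no_alarm")
q = defaultdict(float)                     # Q-values keyed by (signals, action)

def decide(signals):
    # epsilon-greedy choice of the decision agent's final action
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(signals, a)])

def update(signals, action, reward):
    # single-step update: the global reward follows each alarm decision
    q[(signals, action)] += ALPHA * (reward - q[(signals, action)])

# One interaction: three symbolic sensor signals while the network is under attack.
signals = ("congested", "high_delay", "udp_flood")
a = decide(signals)
update(signals, a, +1.0 if a == "alarm" else -1.0)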
3 Results
We set up several tests to verify the learning capabilities of our agent architecture. We used a control test to train the agents to categorise basic normal and abnormal activity in the network. To simulate the normal traffic we randomly started and stopped connections from node 0 (TCP/FTP) and node 1 (UDP stream). Using another random pattern of connections we used node 4 to simulate the attacks to the network characterised by a flood of UDP traffic. To evaluate the adaptability of the agents we ran tests changing the normal and abnormal traffic patterns. We also ran tests designed to create more
complex scenarios where the attacker changes its attack to mimic authorised or normal traffic. We compared our learning algorithm against two hard-coded approaches. The first hard-coded approach (Hard-Coded 1) emulated a misuse IDS. In this case the IDS looks for the patterns that match an attack. The Hard-Coded 2 approach integrates the same variety of input information as our learning algorithm. We evaluated the learning and hard-coded approaches using test 2 and test 5. Test 2 only changes the traffic pattern of the attack, and it should be very simple to detect. In the attacks of test 5 we changed the packet size and the attack UDP port to be the same as those used by normal applications. This test is the hardest to detect because it emulates some of the signatures of normal traffic. The learning curves of the tests are shown in Fig. 1. Hard-Coded 1 had no problem identifying attacks and had low false negatives for test 2, but it completely failed to detect the attacks of test 5. This is the same problem that misuse IDS have when the signature of the attack changes or when they face unknown attacks. The results for Hard-Coded 2 and our learning approach confirm our argument that for more reliable intrusion detection we need a variety of information sources. Both solutions were capable of detecting the attacks even though one of the sensors was reporting incorrect information. This scenario could also be seen as the emulation of a broken sensor sending bogus information, or of a sensor compromised by the attacker and forced to send misleading signals. Either way it demonstrates that a system using more than one source to detect intrusions can be more reliable than a single-source IDS. Figure 1. Learning Curves (reward plotted against iteration, 0–250, for tests 2 and 5 under Hard-Coded 1, Hard-Coded 2 and the learning approach)
Both the Hard-Coded 2 and learning approaches present very good results regarding the identification of normal and abnormal states in the network. While the learning algorithm requires some time to learn to recognise normal and abnormal activity, it does not require any previous knowledge about the behaviour of the measured variables. Hard-Coded 2 reaches maximum performance from the beginning of the simulation, but it requires in-depth knowledge, on the part of the policy programmer, of the network traffic and the variables measured to detect intrusions.

4 Related Work

Problems such as the curse of dimensionality, partial observability and scalability in MARL have been analysed using a variety of methods and techniques, and they represent the foundation of our research. An application of MARL to networking environments is presented in [2], where cooperative agents learn how to route packets using optimal paths. Using the same approach of flow control and feedback from the environment, other researchers have expanded the use of RL in routing algorithms [6], explored the use of MARL to control congestion in networks [4] and to route using QoS [5], and more recently to control DDoS attacks [8]. The use of RL in the intrusion detection field has not been widely studied, and even less so in distributed intrusion detection. Some research works are [3], where the authors trained a neural network using RL, and [1], where game theory is used to train agents to recognise DoS attacks against routing infrastructure. Other recent research work includes the use of RL to detect host intrusions using sequences of system calls [9] and the previously mentioned [8].

5 Conclusion and Future Work

We have shown how a group of agents can coordinate their actions to reach the common goal of network intrusion detection. During this process, decision agents learn how to interpret the action-signals sent by sensor agents without any previously assigned semantics. These action-signals aggregate the partial information received by the sensor agents, and they are used by the decision agents to reconstruct the global state of the environment. In our case study, we evaluate our learning approach by identifying normal and abnormal states of a realistic network subjected to various DoS attacks. We have also successfully applied RL to a group of network agents under conditions of partial observability, restricted communication and global rewards in a realistic network simulation. Finally, we can conclude that using a variety of network data has generated good results in identifying the state of the network. In some cases the agents can generate good results even when some of this information is missing. Future work includes scaling up our learning approach to a large number of agents using a hierarchical approach. This architecture will allow us to create more complex network topologies and eventually the emulation of real packet streams inside the network environment.

REFERENCES
[1] B. Awerbuch, D. Holmer, and H. Rubens, ‘Provably Secure Competitive Routing against Proactive Byzantine Adversaries via Reinforcement Learning’, John Hopkins University, Tech. Rep., May, (2003). [2] J.A. Boyan and M.L. Littman, ‘Packet routing in dynamically changing networks: A reinforcement learning approach’, Advances in Neural Information Processing Systems, 6(1994), 671–678, (1994). [3] J. Cannady, ‘Next Generation Intrusion Detection: Autonomous Reinforcement Learning of Network Attacks’, NISSC00: Proc. 23rd National Information Systems Security Conference, (2000). [4] J. Dowling, E. Curran, R. Cunningham, and V. Cahill, ‘Using feedback in collaborative reinforcement learning to adaptively optimize MANET routing’, Systems, Man and Cybernetics, Part A, IEEE Transactions on, 35(3), 360–372, (2005). [5] E.G. Gelenbe, M. Lent, and R.P.L.P. Su, ‘Autonomous smart routing for network QoS’, Autonomic Computing, 2004. Proceedings. International Conference on, 232–239, (2004). [6] A. Nowe, K. Steenhaut, M. Fakir, and K. Verbeeck, ‘Q-learning for adaptive load based routing’, Systems, Man, and Cybernetics, 1998. 1998 IEEE International Conference on, 4, (1998). [7] R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998. [8] X. Xu, Y. Sun, and Z. Huang, ‘Defending DDoS Attacks Using Hidden Markov Models and Cooperative Reinforcement Learning’, LECTURE NOTES IN COMPUTER SCIENCE, 4430, 196, (2007). [9] X. Xu and T. Xie, ‘A Reinforcement Learning Approach for Host-Based Intrusion Detection Using Sequences of System Calls’, Proceedings of the International Conference on Intelligent Computing, (2005).
GR-MAS: Multi-Agent System for Geriatric Residences Javier Bajo1 and Juan M. Corchado and Sara Rodriguez2 Abstract. This paper presents a multiagent architecture (GR-MAS) developed to facilitate health care in geriatric residences. GR-MAS (Geriatric Residence Multi-Agent System) contains different agent types and takes into account the integration with RFID and Wi-Fi technologies and handheld devices. The core of GR-MAS is an autonomous deliberative case-based planner agent called GerAg (Geriatric Agent for monitoring Alzheimer patients). This agent, which provides adaptation and learning capabilities, has been designed to plan the nurses' working time dynamically, to maintain the standard working reports about the nurses' activities, and to guarantee that the patients assigned to the nurses are given the right care. A description of GerAg, its relationship with the complementary agents, and preliminary results of the multi-agent system prototype in a real environment are presented.
1 INTRODUCTION
There is an ever-growing need to supply constant care and support to the disabled and elderly, and the drive to find more effective ways to provide such care has become a major challenge for the scientific community [3]. During the last three decades the number of Europeans over 60 years old has risen by about 50%. Today they represent more than 25% of the population, and it is estimated that in 20 years this percentage will rise to one third of the population, meaning 100 million citizens [3]. In the USA, people over 65 years old are the fastest growing segment of the population; it is expected that by 2020 they will represent about 1 in 6 citizens, totaling 69 million by 2030. Furthermore, over 20% of people over 85 years old have a limited capacity for independent living, requiring continuous monitoring and daily care. The importance of developing new and more reliable ways to provide care and support to the elderly is underlined by this trend [3], and the creation of mechanisms for monitoring and optimizing health care will become vital. Some authors consider that tomorrow's health care institutions will be equipped with intelligent systems capable of interacting with humans. Multiagent systems and architectures based on intelligent devices have recently been explored as supervision systems for the medical care of elderly patients; these intelligent systems aim to support them in all aspects of daily life, predicting potential hazardous situations and delivering physical and cognitive support. Multiagent systems, together with the use of RFID and Wi-Fi technologies and handheld devices, offer new possibilities and open new fields such as ambient intelligence, which may facilitate the integration of distributed intelligent software applications into our daily life.
1 Pontifical University of Salamanca, Spain, email: jbajope@upsa.es
2 University of Salamanca, Spain, email: {corchado, srg}@usal.es
2 GR-MAS: A MULTIAGENT SYSTEM FOR GERIATRIC RESIDENCES
GR-MAS (Geriatric Residence Multi-Agent System) is a multiagent architecture proposed for improving health care services and their integration with complementary technologies. The GerAg agent, which is a deliberative planning agent, is the core of GR-MAS, and incorporates a planning mechanism that improves medical assistance in geriatric residences by optimizing the visiting schedules. GR-MAS is a dynamic system for the management of different aspects of the geriatric center. This distributed system uses Radio Frequency Identification (RFID) technology for ascertaining patients' locations in order to maximize their safety and to generate medical staff plans. The development of such a multiagent system has been motivated by one of the most distinctive characteristics of geriatric or Alzheimer residences, which is their dynamism, in the sense that the patients change very frequently (new patients arrive and others pass away), while staff rotation is also relatively high and staff normally work in shifts of eight hours. GR-MAS uses mobile devices and Wi-Fi technology to provide the personnel of the residence with updated information about the center and the patients, to provide the working plan and information about alarms or potential problems, and to keep track of their movements and actions within the center. From the user's point of view, the complexity of the solution has been reduced with the help of friendly user interfaces and a robust and easy-to-use multiagent system.
Figure 1. GR-MAS wireless technology organization schema
GR-MAS is composed of four different types of agent, as can be seen in Figure 1: the Patient agent manages the patient's personal data and behaviour (monitoring, location, daily tasks, and anomalies); the Manager agent plays two roles, security control and the management of the medical record database; the Doctor GerAg agent treats patients; and the GerAg agent schedules the nurse's working day, obtaining dynamic plans depending on the tasks needed for each assigned patient.
3 GERAG: AUTONOMOUS PLANNER AGENT FOR GERIATRIC RESIDENCES
GerAg is an autonomous deliberative case-based planner (CBP-BDI) agent [2] developed for integration within the multi-agent system GR-MAS. The goal of this agent is to provide efficient working schedules, at execution time, for geriatric residence staff, and therefore to improve the quality of health care and the supervision of patients in geriatric residences. Each of the GerAg agents is assigned to a nurse or a doctor of a residence, and also provides information about patient locations, historical data and alarms. As the members of the staff carry out their duties (following the plan provided by the agent), the initially proposed plan may need to be modified, due for example to delays or alarms; in this case the agent is capable of re-planning at execution time. The CBP planner used by the GerAg agent identifies a plan, for a given nurse, to provide daily nursing care in the residence. It is very important to maintain a map with the locations of the different patients at the time of planning or re-planning, which is why RFID technology is used to facilitate the location and identification of patients, nurses and doctors. The CBP agent calculates the most re-plan-able intention (MRPI), as shown in [4], which is the plan that can most easily be substituted by another in case the initial plan gets interrupted. In a dynamic environment, having an alternative plan is important for maintaining the efficiency of the system. This agent follows the 4 stages of a CBR system (Retrieval, Reuse, Review and Retain) [1].
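A toy Python sketch of those four stages applied to plan cases is given below; the case structure, similarity measure and adaptation step are illustrative assumptions, not the authors' design.

cases = []  # each case: (problem_features, plan, outcome)

def similarity(p1, p2):
    # fraction of matching features; a deliberately simple retrieval measure
    return sum(1 for k in p1 if p1[k] == p2.get(k)) / max(len(p1), 1)

def cbp_cycle(problem, execute):
    # Retrieval: recover the case whose problem description is most similar.
    best = max(cases, key=lambda case: similarity(problem, case[0]), default=None)
    # Reuse: adapt the retrieved plan (here simply copied; real adaptation would
    # reorder tasks around patient locations, alarms and delays).
    plan = list(best[1]) if best else ["default-round"]
    # Review: execute (or simulate) the plan and evaluate the outcome.
    outcome = execute(plan)
    # Retain: store the revised case for future planning and re-planning.
    cases.append((problem, plan, outcome))
    return plan, outcome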
Figure 2. Case-based planning cycle

Figure 2 shows the steps carried out in each of the stages of the CBP system. When an interruption occurs, the system initiates a new CBP cycle, taking into account the tasks previously accomplished. That is, in the new retrieval stage, plans with a problem description similar to the current situation (after the interruption) will be recovered. The MRPI guarantees that at least some of the plans closest to the initial geodesic plan will be recovered (the rest of the plans are no longer valid because of the restrictions, the tasks that have already been accomplished, etc.), together with new plans.

4 RESULTS OBTAINED

The GR-MAS system has been tested over the last few months. During the testing period the system's usefulness has been evaluated from different points of view. Figure 3 shows the average number of nurses working simultaneously (for each of the 24 hours of the day) at the residence before and after the implantation of the system prototype, with data collected from October 2006 to March 2007. The prototype was adopted on January 15th, 2007. The average number of patients was the same before and after the implementation. To test the system, 30 patient agents, 10 GerAg nurse agents, 2 doctor agents and 1 manager agent were instantiated. The tests focused on the GerAg nurse agents. As can be seen in Figure 3, the dotted line represents the average number of nurses required in the residence during each hour of a day without GR-MAS. The vertical bars represent the same measure after the implementation. As can be seen, the GR-MAS multiagent system helps the nurses to gain time, which can be dedicated to the care of special patients, or to learning or preparing new activities. The time spent on supervision and control tasks has been reduced substantially, as has the time spent attending false alarms, while the time for direct patient care has been increased.

Figure 3. Number of nurses working simultaneously

5 CONCLUSION

In the future, health care for Alzheimer's patients, the elderly and people with other disabilities will require the use of new technologies that allow medical personnel to carry out their tasks more efficiently. One of the possibilities is the use of multiagent systems. We have shown the potential of deliberative GerAg agents in a distributed GR-MAS applied to health care, providing a way to respond to some of the challenges of health care, related for example to identification, control and health care planning. In addition, the use of RFID technology on people provides a high level of interaction among users and patients through the system and is fundamental in the construction of the intelligent environment. Furthermore, the use of mobile devices, when used well, can facilitate social interactions and knowledge transfer.
ACKNOWLEDGEMENTS

This work has been partially supported by the MCYT Spanish Ministry of Science project TIN2006-14630-C03-03.
REFERENCES
[1] A. Aamodt and E. Plaza, 'Case-based reasoning: foundational issues, methodological variations, and system approaches', AI Communications, 7(1), 39–59, (1994).
[2] D.I. Tapia, A. de Luis, S. Rodríguez, J.F. de Paz, J. Bajo, and J.M. Corchado, Hybrid Architecture for a Reasoning Planner Agent, 461–468, Lecture Notes in Artificial Intelligence, 4693, Springer Verlag, Berlin, 2007.
[3] L. Camarinha-Matos and H. Afsarmanesh, Design of a virtual community infrastructure for elderly care, 635, PRO-VE02 3rd IFIP Working Conference on Infrastructures for Virtual Enterprises, Kluwer B.W., Deventer, 2002.
[4] M. Glez-Bedia and J.M. Corchado, 'A planning strategy based on variational calculus for deliberative agents', Computing and Information Systems Journal, 10(1).
Agent-Based and Population-Based Simulation of Displacement of Crime (extended abstract) Tibor Bosse and Charlotte Gerritsen and Mark Hoogendoorn and S. Waqar Jaffry and Jan Treur1 Abstract. Within Criminology, the process of crime displacement is usually explained by referring to the interaction of three types of agents: criminals, passers-by, and guardians. Most existing simulation models of this process are agent-based. However, when the number of agents considered becomes large, population-based simulation has computational advantages over agent-based simulation. This paper presents both an agent-based and a population-based simulation model of crime displacement, and reports a comparative evaluation of the two models. In addition, an approach is put forward to analyse the behaviour of both models by means of formal techniques.
1 INTRODUCTION Within Criminology one of the main research interests is the emergence of so-called criminal hot spots. These hot spots are places where many crimes occur. After a while the criminal activities shift to another location, for example, because the police has changed its policy and increased the numbers of officers at the hot spot. Another reason may be that the passers by move away, when a certain location gets a bad reputation. Such a shift between locations is called the displacement of crime. The reputation of specific locations in a city is an important factor in the spatio-temporal distribution and dynamics of crime. For example, it may be expected that the amount of assaults that take place at a certain location affect the reputation of this location. Similarly, the reputation of a location affects the attractiveness of that location for certain types of individuals. For instance, a location that is known for its high crime rates will attract police officers, whereas most citizens will be more likely to avoid it. As a result, the amount of criminal activity at such a location will decrease, which will affect its reputation again. The classical approaches to simulation of processes in which groups of larger number of agents and their interaction are involved are population-based: a number of groups is distinguished (populations) and each of these populations is represented by a numerical variable indicating their number or density (within a given area or location) at a certain time point. The simulation model takes the form of a system of difference or differential equations expressing temporal relationships for the dynamics of these variables. Well-known classical examples of such population-based models are systems of difference or differential equations for predator-prey dynamics (e.g., [8], [12], [13], [9], [4]) and the dynamics of epidemics (e.g., [10], [7], [4] [1], [6]). Such models can be studied by simulation and by using analysis techniques from mathematics and dynamical systems theory. From the more recently developed agent system area it is 1
1 Vrije Universiteit Amsterdam, Department of Artificial Intelligence, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands, email: {cg, tbosse, mhoogen, swjaffry, treur}@few.vu.nl
often taken as a presupposition that simulations based on individual agents are a more natural or faithful way of modelling, and thus will provide better results (e.g., [5], [11], [2]). Although for larger numbers of agents such agent-based modelling approaches are computationally more expensive than population-based modelling approaches, such a presupposition may provide a justification for preferring their use over population-based modelling approaches, in spite of the computational disadvantages. However, for larger numbers of agents (in the limit), agent-based simulations may equally well approximate population-based simulations. In such cases agent-based simulations can simply be replaced by population-based simulations. In this paper, these considerations are explored in more detail for the application area of crime displacement. Comparative simulation experiments have been conducted based on different simulation models, both agent-based (for different numbers of agents) and population-based. The results are analysed and related to the assumptions discussed above. This paper is organised as follows. First, Section 2 introduces the population-based model which has been defined for this domain, and briefly presents the outcomes of a mathematical analysis of the model and simulations using the model. Thereafter, Section 3 introduces the agent-based model and briefly describes the simulation results using that model. Finally, Section 4 is a discussion.
2 A POPULATION-BASED MODEL

In the population-based model, the densities of the different agent types (i.e. criminals, passers-by, and guardians) are calculated by means of differential equations. An example of an equation to determine the number of criminals at location L is specified as follows:

c(L, t + Δt) = c(L, t) + γ₁ · (β(L, c, t) − c(L, t)/C) · Δt

This expresses that the density c(L, t + Δt) of criminals at location L at time t + Δt is equal to the density of criminals at the location at time point t, plus a constant γ₁ (expressing the rate at which criminals move per time unit) times the movement of criminals from and to location L between t and t + Δt, multiplied by Δt. Here, the movement of criminals is calculated by determining the relative attractiveness β(L, c, t) of the location (compared to the other locations) for criminals. From this, the density of criminals at the location at time point t divided by the total number C of criminals (which is constant) is subtracted, resulting in the change of the number of criminals for this location. For the guardians and the passers-by similar formulae are used. The calculation of the attractiveness of locations has been omitted for the sake of brevity. A mathematical analysis has been conducted to investigate the behaviour of the model, and it was shown that in all cases attraction to the equilibrium will take place. Hence, given the set
of assumptions as described above, the model will eventually stabilise. Besides the mathematical analysis, simulation runs have been conducted as well and the outcomes confirm the results found in the mathematical analysis. The computation time needed to perform the simulations is approximately 1 second.
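To make the dynamics concrete, the following Python sketch iterates this update rule for a small set of locations. It is a minimal illustration, not the authors' implementation: since the attractiveness calculation is omitted in the paper, the fixed per-location weights used for β below are an assumption, as are all parameter values.

def beta(L, weights):
    # Relative attractiveness of location L for criminals (assumed form;
    # the paper's actual attractiveness function is not specified here).
    return weights[L] / sum(weights.values())

def step(c, weights, eta, dt, C_total):
    # One Euler step of c(L, t+dt) = c(L, t) + eta*(beta(L,c,t) - c(L,t)/C)*dt.
    return {L: c[L] + eta * (beta(L, weights) - c[L] / C_total) * dt for L in c}

c = {"A": 10.0, "B": 5.0, "C": 5.0}       # criminal densities per location
weights = {"A": 1.0, "B": 3.0, "C": 1.0}  # assumed static attractiveness
for _ in range(200):
    c = step(c, weights, eta=1.0, dt=0.1, C_total=20.0)
print(c)  # densities approach the equilibrium, illustrating the stabilisation

Consistent with the analysis above, the densities converge to the fixed point where β(L) equals c(L)/C for every location.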
3 AN AGENT-BASED MODEL For the agent-based model, the following algorithm is used:
1. initialise all agents on locations
2. for each time step repeat the following:
   a. each agent calculates the attractiveness of every location, depending on its type (passer-by, criminal, or guardian);
   b. a percentage of the agents of each type is selected at random to decide whether to move to a new location or stay at the old one;
   c. the selected agents move to a location with a probability proportional to the attractiveness of the specific location (i.e. a selected agent has a higher probability of moving to a relatively attractive location than to a non-attractive one).
Using the agent-based model, simulation runs have been performed, and the results are closely correlated to the results using the population-based model. The computation time needed to run the agent-based model (for 100 runs) is 16.39 seconds.
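One step of this procedure can be sketched as follows; the selection fraction p_move and the attractiveness table are illustrative assumptions, since the paper leaves the percentage and the attractiveness values unspecified.

import random

def agent_step(agents, attractiveness, p_move=0.25):
    # agents: dict agent_id -> (type, location);
    # attractiveness: dict type -> dict location -> weight (assumed values).
    locations = list(next(iter(attractiveness.values())))
    for aid, (atype, loc) in agents.items():
        if random.random() < p_move:          # only a selected subset may move
            weights = [attractiveness[atype][L] for L in locations]
            # move with probability proportional to the location's attractiveness
            agents[aid] = (atype, random.choices(locations, weights=weights)[0])

agents = {i: ("criminal" if i % 2 else "guardian", "A") for i in range(100)}
attractiveness = {"criminal": {"A": 1.0, "B": 3.0},
                  "guardian": {"A": 2.0, "B": 1.0}}
for _ in range(50):
    agent_step(agents, attractiveness)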
4 DISCUSSION In this paper two models have been introduced to investigate the criminological phenomenon of the displacement of crime: a population-based model as well as an agent-based model. These models have been presented in a generic format to allow for an investigation of a variety of different functions representing aspects such as the attractiveness of locations. Using mathematical analysis, and confirmed by simulation results, the population-based model was shown to end up in an equilibrium for one variant of the model. The parameter settings for these simulations have been determined in cooperation with criminologists. The simulation results for the agent-based model using the same parameter settings show an identical trend to the population-based model, except for some minor deviations that can be attributed to the fact that the agent-based model is discrete, as confirmed by the formal evaluation. The computation time of the population-based model was shown to be much lower than the computation time of the agent-based model. The results reported in this paper differ at some points from the results reported in [3]. In the results using an agent-based model reported in that paper, cyclic patterns were observed whereby there is a continuous movement of so-called hot spots (i.e. places where a lot of crime takes place). As already stated
before, this paper shows that the population of agents at the various locations stabilises over time. The difference can be attributed to the fact that in [3] all agents decide where to move to based upon the attractiveness of locations, whereas in the case of the models presented in this paper only a subset of the agents move. The results of [3] can however be reproduced using the model presented in this paper as well, by setting η = 1 and Δt = 1. Determining what settings are most realistic in real life is future work. The idea that population-based models approximate agent-based models for larger populations is indeed confirmed by the simulation results reported in this paper. Future work is to introduce a general framework to make a comparison between the models possible.
REFERENCES
[1] R.A. Anderson and R.M. May, Infectious Diseases of Humans: Dynamics and Control, Oxford University Press, Oxford, UK, 1992.
[2] L. Antunes and K. Takadama (eds.), Multi-Agent-Based Simulation VII, Proceedings of the Seventh International Workshop on Multi-Agent-Based Simulation, MABS'06, LNAI, vol. 4442, Springer Verlag, 2007.
[3] T. Bosse and C. Gerritsen, Agent-Based Simulation of the Spatial Dynamics of Crime: On the Interplay between Criminal Hot Spots and Reputation, in: Proceedings of the Seventh International Joint Conference on Autonomous Agents and Multi-Agent Systems, AAMAS'08, ACM Press, to appear, 2008.
[4] D.N. Burghes and M.S. Borrie, Modelling with Differential Equations, John Wiley and Sons, 1981.
[5] P. Davidsson, L. Gasser, B. Logan and K. Takadama (eds.), Multi-Agent and Multi-Agent-Based Simulation, Proceedings of the Joint Workshop on Multi-Agent and Multi-Agent-Based Simulation, MABS'04, LNAI, vol. 3415, Springer Verlag, 2005.
[6] S.P. Ellner and J. Guckenheimer, Dynamic Models in Biology, Princeton University Press, 2006.
[7] W.O. Kermack and W.G. McKendrick, A contribution to the mathematical theory of epidemics, Proceedings of the Royal Society of London, Series A 115, pp. 700-721, 1927.
[8] A.J. Lotka, Elements of Physical Biology, 1924; reprinted by Dover in 1956 as Elements of Mathematical Biology.
[9] J. Maynard Smith, Models in Ecology, Cambridge University Press, Cambridge, 1974.
[10] R. Ross, An application of the theory of probabilities to the study of a priori pathometry, Part I, Proceedings of the Royal Society of London, Series A 92, pp. 204-230, 1916.
[11] J.S. Sichman and L. Antunes (eds.), Multi-Agent-Based Simulation VI, Proceedings of the Sixth International Workshop on Multi-Agent-Based Simulation, MABS'05, LNAI, vol. 3891, Springer Verlag, 2006.
[12] V. Volterra, Fluctuations in the abundance of a species considered mathematically, Nature 118, pp. 558-560, 1926.
[13] V. Volterra, Variations and fluctuations of the number of individuals in animal species living together, in: Animal Ecology, McGraw-Hill, 1931; translated from the 1928 edition by R.N. Chapman.
Organizing Coherent Coalitions Jan Broersen and Rosja Mastop and John-Jules Ch. Meyer and Paolo Turrini 1 Abstract. In this paper we provide and discuss a language to talk about coherence, a property of interaction that ensures that players' abilities do not contradict one another and that the empty coalition does not make active choices. With this property we can model closed-world interaction, such as that of a Coordination Game or of a Prisoner's Dilemma, where all the outcomes are determined only by the choices of the agents that are present.
1 Introduction
Pauly's Coalition Logic has been shown to be a sound formal tool to analyze the properties of strategic interactions and games. One remaining issue is to define in that language what the interesting properties of an interaction are, as is possible for instance with regularity (abilities of coalitions do not contradict each other) or outcome monotonicity (if a coalition can force an outcome to lie in a set X, it can also force an outcome to lie in all supersets of X).
Table 1. Clothing Conformity

Row \ Column   White Dress   Black Dress
White Dress    (3, 3)        (0, 0)
Black Dress    (0, 0)        (3, 3)
In the situation of Table 1, a legislator who wants to achieve the socially optimal state (players coordinate) should declare that a discordant choice is forbidden, thereby labeling the combinations of moves (black, white) and (white, black) as violations. Suppose however that the environment were an active part of the game, and that it could decide to cut off the left side of the matrix, eliminating the possibility for Column to make a proper choice. Then what should a legislator do? It is quite clear that requiring the agents to choose something should depend on the moves that are available to the players. In order to regulate the system we need a proper agent-oriented normative system; in particular, we should avoid deontic statements that concern proper choices to be carried out by nature. This translates into ruling out all those systems in which nature plays an active role, i.e. isolating all closed-world interactions. In this paper we will pursue this idea formally, identifying all such interactions and axiomatizing their logic.
2 Coherence
We introduce the concept of an Effectivity Function, adopted from [6].
1 Universiteit Utrecht, The Netherlands, email: paolo@cs.uu.nl
Definition 1 (Effectivity Function) Given a finite set of agents Agt and a set of states W, an effectivity function is a function E : W → (2^Agt → 2^(2^W)). Any subset of Agt will henceforth be called a coalition. For elements of W we use variables u, v, w, . . .. The elements of W are called 'states' or 'worlds'; the subsets of Agt are called 'coalitions'; the sets of states X ∈ E(w)(C) are called the 'choices' of coalition C in state w. The set E(w)(C) is called the 'choice set' of C in w. The complement of a set is calculated from the obvious domain. Intuitively, if X ∈ E(w)(C) the coalition is said to be able to force that the next state after w will be some member of the set X. For studying closed-world interaction, we consider these minimal properties: (1) coalition monotonicity: for all X, w, C, D, if X ∈ E(w)(C) and C ⊆ D, then X ∈ E(w)(D); (2) regularity: for all X, w, C, if X ∈ E(w)(C), then (W \ X) ∉ E(w)(Agt \ C); (3) outcome monotonicity: for all X, Y, w, C, if X ∈ E(w)(C) and X ⊆ Y, then Y ∈ E(w)(C); (4) inability of the empty coalition (IOEC): for all w, E(w)(∅) = {W}. If an Effectivity Function has these properties, it will be called coherent. As noticed also by [2], with the last property the empty coalition cannot force non-trivial outcomes of a game. One important class of Effectivity Functions are the playable ones, which have been proved to correspond to strategic games ([6], Theorem 2.27). For any world w an Effectivity Function is playable if it has the following properties: (1) ∅ ∉ E(w)(C), for any C; (2) W ∈ E(w)(C), for any C; (3) E is Agt-maximal, that is, for any X ⊆ W, W \ X ∉ E(w)(∅) implies X ∈ E(w)(Agt); (4) E is superadditive, i.e. for C ∩ D = ∅, if X ∈ E(w)(C) and Y ∈ E(w)(D) then X ∩ Y ∈ E(w)(C ∪ D). In order to understand the types of interactions we are isolating, we need to compare coherent and playable effectivity functions. First some definitions: we call Agt-superadditive an Effectivity Function that is superadditive for C, D ⊆ Agt with C ∪ D = Agt, and C-superadditive one that is superadditive for C, D with C ∪ D ≠ Agt. We skip the proofs, for reasons of space. Proposition 1 (1) Not all playable games are coherent, and not all coherent games are playable. (2) Coherent Agt-maximal games are Agt-superadditive. (3) Coherent Agt-maximal C-superadditive games are playable.
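For finite W and Agt the four coherence conditions can be checked mechanically. The sketch below assumes a representation of E as a dictionary mapping each world to a dictionary from coalitions (frozensets) to choice sets; the concrete effectivity function at the end is an invented example, not one from the paper.

from itertools import combinations

def subsets(s):
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def coherent(E, W, Agt):
    W_fs, Agt_fs = frozenset(W), frozenset(Agt)
    for w in E:
        for C in subsets(Agt_fs):
            for X in E[w][C]:
                # (1) coalition monotonicity: X remains a choice of supersets of C
                if any(X not in E[w][D] for D in subsets(Agt_fs) if C <= D):
                    return False
                # (2) regularity: the complement coalition cannot force W \ X
                if (W_fs - X) in E[w][Agt_fs - C]:
                    return False
                # (3) outcome monotonicity: every superset of X is also a choice
                if any(X <= Y and Y not in E[w][C] for Y in subsets(W_fs)):
                    return False
        # (4) inability of the empty coalition: E(w)(empty) = {W}
        if E[w][frozenset()] != {W_fs}:
            return False
    return True

W, Agt = {"u", "v"}, {"a"}
E = {w: {frozenset(): {frozenset(W)},
         frozenset({"a"}): {frozenset({"u"}), frozenset(W)}} for w in W}
print(coherent(E, W, Agt))  # True: the single agent can force {u} or anything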
3 Axiomatization
Let Agt be a finite set of agents and Prop a countable set of atomic formulas. The syntax of Coherent Coalition Logic is defined as follows: φ ::= p | ¬φ | φ ∨ φ | [C]φ | Eφ, where p ranges over Prop and C ranges over the subsets of Agt. The other Boolean connectives are defined as usual. The informal reading
of the modalities is: “Coalition C can choose φ” and “There is a state that satisfies φ”. The dual Aφ is defined as ¬E¬φ. Notice that we have the syntax of standard Coalition Logic (see [6]) plus a global modality. Definition 2 (Models) A model for our logic is a tuple (W, E, R∃ , V ) W
where W is a nonempty set of states; E : W −→ (2Agt −→ 22 ) is a coherent Effectivity Function; R∃ = W × W is a global relation; V : W −→ 2P rop is a valuation function. The satisfaction relation of modal formulas (the rest is standard) with respect to a pointed model M, w is defined as follows: M, w |= [C]φ M, w |= Eφ
iff iff
[[φ]]M ∈ E(w)(C) ∃v s.t. wR∃ v and M, v |= φ
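On top of the same representation used in the previous sketch, the two modal clauses of Definition 2 can be evaluated directly; this is a minimal sketch, with the propositional machinery (extensions of formulas) assumed to be given.

def sat_coalition(E, w, C, phi_extension):
    # M, w |= [C]phi  iff  [[phi]] ∈ E(w)(C)
    return frozenset(phi_extension) in E[w][frozenset(C)]

def sat_exists(W, sat, phi):
    # M, w |= Ephi iff some world satisfies phi (R∃ is the global relation,
    # so the evaluation point w is irrelevant)
    return any(sat(v, phi) for v in W)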
In this definition, [[φ]]M =def {w ∈ W | M, w |= φ}. The modality for coalitional ability is standard from Coalition Logic [6]. What we look for now is a set of axioms and rules such that the corresponding maximally consistent sets generate a coherent Effectivity Function in the canonical models. However, the Inability of the Empty Coalition is not definable in Coalition Logic. To see this it is important to notice that Coalition Logic is a monotonic multimodal logic, and frame validity of formulas of monotonic modal logics is closed under taking disjoint unions. This is proved for modal satisfaction in [4] (Definition 4.1, Proposition 4.2).
Proposition 2 There is no formula of Coalition Logic that defines Inability of the Empty Coalition.
We can construct two models that have IOEC, while their disjoint union does not (see [4] and [1] for the definitions). We claim that Aφ ↔ [∅]φ defines the Inability of the Empty Coalition.
Proposition 3 |=C Aφ ↔ [∅]φ ⇔ E(w)(∅) = {W} for every w in any frame F in the class of Coalitional Frames C.
Proof (⇒) Assume that |=C Aφ ↔ [∅]φ while not E(w)(∅) = {W} for every w in any frame F in the class of Coalitional Frames C. Then there is an F in which there is a w such that E(w)(∅) ≠ {W}. Notice that both W and E(w)(∅) are nonempty. So there is a W′ ≠ W s.t. W′ ∈ E(w)(∅) and W′ ⊂ W. Take an atom p to be true in all w′ ∈ W′ and false in W \ W′. Now we have a model M based on a coalitional frame in C s.t. M ⊭ Ap ↔ [∅]p. (⇐) Assume E(w)(∅) = {W} for a given w in an arbitrary model M of a coalition frame in C, and that w |= Aφ. Then [[φ]]M = W and w |= [∅]φ follows. Assume now that w |= [∅]φ. It has to be the case that [[φ]]M = W by assumption. So also w |= Aφ, which concludes the proof.
Notice that it is enough to have Aφ ↔ [∅]φ to ensure that the global relation axiomatizes an equivalence relation and preserves modal satisfaction. To see this it is enough to check that taking a generated submodel with respect to the global relation, given this axiom, ensures the condition of taking also a generated submodel with respect to the neighbourhood modality.
Take a canonical model C∗ = ((W∗, E∗), V∗) with φ̂ = {w ∈ W∗ | φ ∈ w} as the truth set of φ in the canonical model. Take now the maximally consistent sets w ∈ W∗, closed under the proof system below. We take the following conditions to describe coherence of the Effectivity Function on the canonical relation:
• wE∗C X iff ∃φ with φ̂ ⊆ X : [C]φ ∈ w, and ∀ψ with ψ̂ ⊆ (W∗ \ X) : [C]ψ ∉ w (for C ≠ ∅)
• E∗C ⊆ E∗D (for C ⊆ D)
• wE∗C X iff X = W∗ (for C = ∅)
• wR∃v iff w, v ∈ W∗

Proof System
A1: [C]φ → [D]φ (for C ⊆ D)
A2: [C]φ → ¬[C]¬φ
A3: Aφ ↔ [∅]φ
A4: φ → Eφ
A5: EEφ → Eφ
A6: φ → AEφ
A7: A(φ → ψ) → (Aφ → Aψ)
R1: from φ and φ → ψ, infer ψ
R2: from φ → ψ, infer [C]φ → [C]ψ
R3: from φ, infer Aφ

Proposition 4 The set of axioms and rules above is sound and complete with respect to the class of Coherent Coalitional Frames.
Proof We need just to check the statement with respect to the generated submodel M∗ through the global relation. We make use of [6] (Theorem 3.10). We omit the detailed proof.
Notice that if we add Agt-maximality to Coherent Games, the following holds: M, w |= [Agt]φ ↔ Eφ. At the level of expressivity, coherent coalition logic is thus powerful enough to reason about global properties of the models.
4 Conclusion and Future Work
In this paper we studied those interactions in which nature does not play an active role and provided an axiomatization of the resulting logic. The work described here allows for several developments, such as the study of stability (Nash-consistency) of normative systems [6] or the efficiency of social procedures [5]. Following this line of reasoning it is possible, given a notion of optimality or efficiency, to construct a deontic language that requires this notion to hold, as done for instance in [3]. We can view Coherent Coalition Logic as a language to talk about those interactions for which it makes sense to construct a deontic language.
REFERENCES
[1] P. Blackburn, M. de Rijke, and Y. Venema. Modal Logic. Cambridge Tracts in Theoretical Computer Science, 2001.
[2] S. Borgo. Coalitions in action logic. In Proc. of IJCAI, pages 1822–1827, 2007.
[3] J. Broersen, R. Mastop, J.-J. Ch. Meyer, and P. Turrini. A deontic logic for socially optimal norms. Forthcoming, 2008.
[4] H.H. Hansen. Monotonic Modal Logics. Master's Thesis, ILLC, 2001.
[5] R. Parikh. Social software. Synthese, 132(3):187–211, 2002.
[6] M. Pauly. Logic for Social Software. ILLC Dissertation Series, 2001.
A probabilistic trust model for semantic peer-to-peer systems Nguyen Gia-Hien1 and Chatalic Philippe2 and Rousset Marie-Christine1
1 Preliminaries and illustrative example
We consider a network of semantic peers P = (Pi)i=1..n. Each peer Pi uses its own ontology, expressed on its own vocabulary Vi, for describing and structuring its knowledge as well as for annotating its resources. A class C ∈ Vi of a peer Pi is referred to as Pi:C, or simply C when no confusion is possible. Peers are connected to each other by means of mappings, corresponding to logical constraints linking classes of different peers. Users ask queries to one of the peers, using the vocabulary of this peer. When processing a query, the reasoning propagates from one peer to other peers thanks to those mappings. The mappings are exploited during information retrieval or query answering for query reformulation between peers. For example, let us consider a semantic P2P system sharing movies based on semantic annotations, where P1 organizes his video resources according to their genres (Suspense, Action, Animation), and P2 organizes his films based on the actors playing in the movies (BruceWillis, Jolie). While having different views for classifying movies, P1 and P2 can establish some mappings between their two classifications. For example, they can agree that the class BruceWillis of P2 (denoted by P2:BruceWillis) is more specific than the class Action of P1 (denoted by P1:Action). This results in the mapping P2:BruceWillis ⊑ P1:Action. Similarly, P1 and another peer P3 can have established the mapping P1:Action ⊓ P1:Suspense ⊑ P3:Thriller between their two classifications, in order to state that the category named Thriller by P3 is more general than what P1 classifies as both Action and Suspense. As a result, the movies that are classified by P1 as Suspense and by P2 as BruceWillis are returned as answers to the query Thriller asked by the user at the peer P3. We assume that each resource r returned as an answer to some query is associated with a label L(r) = {Ci1, . . . , CiL} corresponding to its logical justification. L(r) is a set of classes of the vocabularies of (possibly different) peers known to annotate the resource r and supposed to characterize a sufficient condition for r to be an answer. Any other resource annotated in the same way is thus equally supposed to be an alternative answer to the query. We also assume that the classes used in labels are independent, in the sense that for any two classes of a justification, neither is a subclass of the other. This important assumption means that for a returned answer, the only classes that appear in its justifications are those corresponding to the most specific classes of the network. Finally we assume that the user, when querying a peer Pi, is randomly asked to evaluate some of the returned answers as satisfying
1 University of Grenoble, LIG, France, email: gia-hien.nguyen@imag.fr, marie-christine.rousset@imag.fr
2 Univ. Paris-Sud, LRI, France, email: philippe.chatalic@lri.fr
or not satisfying, and to store the result of this evaluation in a local observation database Oi. Each evaluation is recorded into Oi as a pair S.L or S̄.L, where S (resp. S̄) denotes user satisfaction (resp. unsatisfaction) and L is the label of the evaluated resource.
Definition 1 (Observation relevant to a label L) Let Oi be the set of observations of a peer Pi and L be a label. An observation of Oi is said to be relevant to L if and only if its label contains all classes of L. The numbers of satisfying and unsatisfying observations of Pi that are relevant to L are respectively denoted by:

Oi+(L) = |{S.L′ ∈ Oi | L ⊆ L′}|
Oi−(L) = |{S̄.L′ ∈ Oi | L ⊆ L′}|

These two numbers summarize the past experience of the peer Pi relevant to the label L, i.e. of the evaluated resources justified by at least the classes of L. For instance, suppose that Peter is the user querying the peer P1. After a number of answers have been evaluated, Peter's past experience may be summarized as in Table 1.

Table 1. Summary of Peter's observations at P1

Label (L)                  O1+(L)   O1−(L)
P2:MyActionFilms             30        6
P2:MyCartoons                 3       15
P4:ScienceFiction            14       14
P5:Italian, P5:Western        0        6
P6:AnimalsDocum               8        2
P7:JeanRenoir                22       11
P8:Bollywood                  6       35
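Read as code, Definition 1 is a pair of subset-filtered counts. The sketch below assumes observations stored as (satisfied, label) pairs, which is an illustrative encoding rather than the paper's data structure.

def counts(observations, L):
    # Return (Oi+(L), Oi-(L)): observations whose label contains all classes of L.
    L = frozenset(L)
    plus = sum(1 for sat, lab in observations if sat and L <= lab)
    minus = sum(1 for sat, lab in observations if not sat and L <= lab)
    return plus, minus

O1 = [(True, frozenset({"P2:MyActionFilms"}))] * 30 \
   + [(False, frozenset({"P2:MyActionFilms"}))] * 6
print(counts(O1, {"P2:MyActionFilms"}))  # (30, 6), as in the first row of Table 1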
Among all the resources evaluated by Peter and annotated with the class M yActionF ilms of the peer P2 , 30 have been considered as satisfactory and 6 as not satisfactory. For the same peer P2 , only 3 out of 18 evaluated resources tagged by M yCartoons were positive. Similarly all evaluated resources annotated with both Italian and W estern by P5 , obtained negative feedbacks. Each peer Pi can progressively update its observation database Oi , as new answers are evaluated, and refine the trust it has towards answers justified by the different observed labels. The level of trust can vary according to the justification.
2 Bayesian model and estimation of trust
Given a label L, let XiL be the binary random variable defined on the set of resources annotated by L as follows:

XiL(r) = 1 if the resource r is satisfying for Pi, and 0 otherwise.
We define the trust of a peer Pi towards a label L as the probability that the random variable XiL is equal to 1, given the observations resulting from the past experiences of Pi.
Definition 2 (Trust of a peer towards a label L) Let Oi be the set of observations of a peer Pi and L be a label; the trust T(Pi, L) of Pi towards L is defined as follows: T(Pi, L) = Pr(XiL = 1 | Oi).
The following theorem provides a way to estimate the trust T(Pi, L) of a peer Pi towards a label L, and the associated error of estimation.
Theorem 1 Let Oi be the set of observations of a peer Pi and L be a label. After Oi+(L) satisfying and Oi−(L) unsatisfying observations relevant to L have been performed, T(Pi, L) can be estimated as

(1 + Oi+(L)) / (2 + Oi+(L) + Oi−(L))

with a standard deviation of

√( (1 + Oi+(L)) × (1 + Oi−(L)) / ( (2 + Oi+(L) + Oi−(L))² × (3 + Oi+(L) + Oi−(L)) ) )

This follows from a well-known result in probability theory (e.g., [3], page 336) on the application of Bayes' rule to random variables following a Bernoulli distribution whose parameter is unknown. Table 2 summarizes the estimations, with their associated standard deviations, obtained by applying Theorem 1 to Peter's observations summarized in Table 1.

Table 2. Estimated trust of P1 towards the labels of Table 1

Label (L)                Estimated trust   Standard deviation
P2:MyActionFilms              0.815              0.062
P2:MyCartoons                 0.2                0.087
P4:ScienceFiction             0.5                0.089
P5:Italian, P5:Western        0.125              0.11
P6:AnimalsDocum               0.75               0.12
P7:JeanRenoir                 0.657              0.079
P8:Bollywood                  0.162              0.055
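The estimate of Theorem 1 is the posterior mean of a Beta(1 + Oi+(L), 1 + Oi−(L)) distribution, and the standard deviation is that distribution's; a few lines of Python reproduce Table 2 from the counts of Table 1 (function and variable names are illustrative).

from math import sqrt

def trust(o_plus, o_minus):
    # Posterior mean and standard deviation of a Beta(1+o_plus, 1+o_minus).
    a, b = 1 + o_plus, 1 + o_minus
    mean = a / (a + b)
    std = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std

print(trust(30, 6))  # ~(0.815, 0.062): P2:MyActionFilms
print(trust(0, 6))   # ~(0.125, 0.110): P5:Italian, P5:Western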
3 Propagation of trust
When the observation database does not contain enough observations relevant to a label for computing trust with a good precision, we have to use some propagation mechanism to compensate for the lack of local relevant observations. Instead of propagating trust between peers, our approach consists in propagating the pairs of numbers used for computing trust. Propagating two numbers instead of one does not represent a significant overhead. Yet, it has the significant advantage of providing a well-founded way to compute a joint trust using the same Bayesian model as the one presented in Section 2. Instead of using an ad-hoc aggregation function for combining local coefficients of trust, the numbers Oi1+(L), . . . , Oil+(L) (respectively Oi1−(L), . . . , Oil−(L)) coming from solicited peers Pi1, . . . , Pil are cumulated to compute the joint trust of the subset Pi1, . . . , Pil towards L, by applying the formula of Theorem 1. Different strategies are possible to gather on the querying peer the relevant information from the solicited peers' observations.
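Concretely, the querying peer sums the propagated count pairs and feeds the totals to the Theorem 1 estimator; the sketch below reuses the trust function from the previous sketch, and the count pairs are invented for illustration.

def joint_trust(count_pairs):
    # count_pairs: iterable of (O+, O-) pairs gathered from the solicited peers.
    o_plus = sum(p for p, _ in count_pairs)
    o_minus = sum(m for _, m in count_pairs)
    return trust(o_plus, o_minus)   # same Bayesian estimator as for local trust

print(joint_trust([(2, 1), (5, 0), (1, 1)]))  # pooled estimate over three peers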
• The lazy strategy consists in waiting until some answer justified by a label L has been obtained, and then asking one or several trusted neighbors for their direct feedbacks about the label L. Since it applies after answers have been obtained, such a strategy can be used as a post-processing step and does not require changing the query evaluation mechanism itself. As a consequence it can be applied to different kinds of semantic P2P systems, provided they are able to justify answers by means of such labels (e.g. sets of independent semantic annotations).
• The greedy strategy consists in collecting the direct feedbacks likely to be relevant (i.e., concerning the classes in the annotation being built) during the query processing. It thus requires some adaptation of the query answering algorithm. In a system like SomeWhere [1], the DeCA algorithm [2] is first used to infer, from the ontologies and mappings, all the possible reformulations (i.e. rewritings) of the initial query into conjunctions of extensional classes (i.e. containers of instances) C1 ⊓ . . . ⊓ Cn. Each instance in C1 ⊓ . . . ⊓ Cn is then produced as an answer, C1 ⊓ . . . ⊓ Cn being the semantic annotation justifying it. The DeCA algorithm can be slightly modified in order to convey, when transmitting back rewritings from a queried peer P to the querying peer P′, those feedbacks likely to be relevant. When a rewriting Cj ⊓ . . . ⊓ Cm is transmitted from P to P′ within a message, P uses that message to convey its direct observations (O+(L), O−(L)) for all labels L containing the classes of the rewriting. By construction, those classes will be part of the annotation of an answer. Therefore, observations relevant to these classes may be relevant for computing (if needed) the joint trust towards the labels annotating answers returned to the peer the initial query was issued from. Note that this strategy leads to combining feedbacks from the very peers that have contributed to obtaining an answer. Those peers may thus be considered as naturally relevant for obtaining appropriate feedbacks. However, such sets of peers are determined at query time and may vary according to the query and the returned answer.
4 Perspectives
One of the objectives of reputation systems is the detection and handling of malicious agents in an electronic environment. In a P2P system, a peer can be malicious by providing other peers with virus-affected resources, or by simply lying when reporting its feedbacks about others. In our model, when a peer has enough direct experiences, it does not have to rely on other peers and thus avoids malicious peers. When it has to rely on observations of other peers for estimating its trust towards a label, it is reasonable to assume that the number of malicious peers is small. Therefore, it is possible either to increase the number of peers solicited for observations (in order to decrease the impact of wrong observations coming from a few peers) or to discard the peers whose observations change the joint trust a lot (they are likely to be malicious).
REFERENCES
[1] P. Adjiman, Philippe Chatalic, François Goasdoué, Marie-Christine Rousset, and Laurent Simon, 'Somewhere in the semantic web', in PPSWR, pp. 1–16, (2005).
[2] Philippe Adjiman, Philippe Chatalic, François Goasdoué, Marie-Christine Rousset, and Laurent Simon, 'Distributed reasoning in a peer-to-peer setting: Application to the semantic web', Journal of Artificial Intelligence Research, 25, 269–314, (2006).
[3] Morris H. DeGroot and Mark J. Schervish, Probability and Statistics, Addison Wesley, 2002.
Conditional Norms and Dyadic Obligations in Time Jan Broersen1 and Leendert van der Torre2
1 Introduction
Reasoning about norm violation and time is of central concern to the regulation of multi-agent system behavior. Here we continue work in [2] on an approach to reasoning about norms, obligations, time and agents, involving three main ingredients. First, we assume a branching temporal structure representing the change of propositions over time. Second, we use an algorithm that, given the input of the branching temporal structure and a set of norms, produces an 'obligation labeling' of the temporal structure. Finally, we reason about the norms represented by these deontically labeled temporal structures to determine norm redundancy and equivalence of normative systems. We distinguish between conditional norms and conditional obligations. General directives like "if an agent receives a request, it has to accept or reject within five seconds" are conditional norms. We interpret norms by defining which conditional and/or temporal obligations they give rise to. For example, if at any moment in time for which the norm is in force the agent receives a request, then for the following five seconds, if it has not accepted or rejected the request yet, it has the obligation to do so. So, norms 'detach' obligations. The deontic logic literature distinguishes between so-called factual and deontic detachment [5]. The former is based on a match between the condition of the norm and the facts, and the latter is based on a match between the condition and the obligations.
2 Norms and Obligations
For the temporal structures we use trees, i.e., possibly infinite, branching-time temporal structures.
Definition 1 (Temporal structure) Let L be a propositional language built on a set of atoms P. A temporal structure is a tuple T = ⟨N, E, |=⟩ where N is a set of nodes, E ⊆ N × N is a set of edges obeying the tree properties, and |= ⊆ N × L is any minimal satisfaction relation for nodes and propositional formulas of L closed under propositional logic.
We consider only regulative norms like obligations and prohibitions, since they are the most basic and most often used kind of norms. Following input/output logic [8, 9], we write a conditional norm "if i, then o is obligatory" as a pair of propositional formulas (i, o).
Definition 2 (Normative system) A norm "if i, then obligatory o" is represented by a pair of formulas of L, and written as (i, o). It is also read as the norm "if i, then forbidden ¬o." A normative system S is a set of norms {(i1, o1), . . . , (in, on)}.
1 University of Utrecht, The Netherlands, email: broersen@cs.uu.nl
2 Computer Science and Communication, University of Luxembourg, Luxembourg, email: leon.vandertorre@uni.lu
Example 1 (Manuscript) The norms are "if owexy, then obligatory payxy" (owexy, payxy), and "if payxy, then obligatory receiptyx" (payxy, receiptyx). Here x and y are variables ranging over the set of agents, in the sense that each norm is treated as a set of proposition-based norms, one for each instance of the agent variables.
Norms are used to detach obligations. The detached obligations are a labeling of the temporal structure.
Definition 3 (Obligation labeling) An obligation labeling is a function O : N → 2^L.
The way we label the temporal structure determines the meaning of the norms. For the 'persistent norm semantics' we assume persistence and deductive closure of obligatory formulas.
Definition 4 (Persistent norm semantics) The persistent norm semantics of a normative system S is the unique obligation labeling O : N → 2^L such that for each node n, O(n) is the minimal set such that:
1. for all norms (i, o), all nodes n1 and all paths (n1, n2, . . . , nm) with m ≥ 1, if n1 |= i and nk ⊭ o for all 1 ≤ k ≤ m − 1, then o ∈ O(nm)
2. if O(n) |= ϕ then ϕ ∈ O(n)
We now define how to reason about norms, obligations and time. A norm is redundant if it does not affect the obligation labeling of a temporal structure.
Definition 5 (Norm redundancy) In a normative system S, a norm (i, o) ∈ S is redundant if and only if for all temporal structures, the obligation labeling of S is the same as the obligation labeling of S \ {(i, o)}. Two normative systems S1 and S2 are equivalent if and only if each norm of S1 would be redundant when added to S2, and vice versa.
We have the following result for the semantics of Definition 4.
Theorem 1 (Completeness of norm persistent reasoning) In a normative system S, a norm (i, o) ∈ S is redundant under the persistence semantics if and only if we can derive it from S \ {(i, o)} using replacement of logical equivalents in input and output, together with the following rules:
SI: from (i1, o), derive (i1 ∧ i2, o)
WO: from (i, o1 ∧ o2), derive (i, o1)
OR: from (i1, o) and (i2, o), derive (i1 ∨ i2, o)
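The persistent labeling of Definition 4 can be computed by a single walk over a finite tree; the sketch below omits the deductive-closure clause 2 and assumes a simple fact-based satisfaction relation, so the tree encoding and the owe/pay example are illustrative.

def label(tree, root, holds, norms):
    # tree: dict node -> list of children; holds(node, formula) -> bool;
    # norms: list of (i, o) pairs. Returns dict node -> set of obligations.
    O = {n: set() for n in tree}
    def walk(n, pending):
        # detach o wherever the condition i holds, and keep pending obligations
        active = pending | {o for i, o in norms if holds(n, i)}
        O[n] |= active
        # an obligation persists to the children only while o has not yet held
        carry = {o for o in active if not holds(n, o)}
        for child in tree[n]:
            walk(child, carry)
    walk(root, set())
    return O

tree = {"n0": ["n1"], "n1": ["n2"], "n2": []}
facts = {"n0": {"owe"}, "n1": set(), "n2": {"pay"}}
holds = lambda n, f: f in facts[n]
print(label(tree, "n0", holds, [("owe", "pay")]))
# {'n0': {'pay'}, 'n1': {'pay'}, 'n2': {'pay'}}: obligatory until fulfilled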
3 Fulfilling obligations before they are detached
The persistent norm semantics is not always appropriate. If Peter gives the receipt to John before John has given him the money, maybe
because they are in a long-standing relationship and Peter trusts John, or maybe because Peter wrongly believed that John had already transferred the money, then after John gives him the money, the obligation to write a receipt is still detached, and persists indefinitely. In this section we define a semantics avoiding this property, using a labeling with dyadic obligations O(o|c), read as "o is obligatory in context c."
Definition 6 (Dyadic obligation labeling) A dyadic obligation labeling is a function Od : N → 2^(L×L).

[Figure 1. Labeling of the temporal structure using the receipt normative system S = {(owejp, payjp), (payjp, receiptpj)} with persistent dyadic obligations. The obligations persist in time until they are fulfilled. The obligation O(receiptpj | payjp) does not persist after receiptpj holds.]
Example 2 (Receipt, continued) See the temporal structure in Figure 1. The desired labeling with dyadic obligations is as follows. From the root node, we detach dyadic obligations O(o|i) for all the norms (i, o). Then, a monadic obligation is detached from the dyadic obligation when the context holds in a node, and the obligation persists until its consequent holds in a node. In particular, the obligation for receiptpj in the context of payjp does not persist after receiptpj holds. The look-ahead norm semantics adds to the persistence semantics that obligations are possibly not generated because, before the moment where their condition becomes true, the obligation has already been satisfied. We achieve this by keeping track of all dyadic obligations generated by the norms at any temporal initial state.
Definition 7 (Look-ahead norm semantics) The look-ahead norm semantics of a normative system S is the obligation labeling O : N → 2^L together with the dyadic obligation labeling Od : N → 2^(L×L) such that for each n, O(n) and Od(n) are the minimal sets such that:
1. for all norms (i, o) and the root node n0, (i, o) ∈ Od(n0)
2. for all paths (n1, n2, . . . , nm), if (i, o) ∈ Od(n1) and nk ⊭ o for all 1 ≤ k ≤ m − 1, then (i, o) ∈ Od(nm)
3. for all paths (n1, n2, . . . , nl, m), if (i, o) ∈ Od(n1) and n1 |= i and nk ⊭ o for all 1 ≤ k ≤ l, then o ∈ O(m)
4. if O(n) |= ϕ then ϕ ∈ O(n)
Notably, reasoning is not different in this semantics.
Theorem 2 (Completeness of look-ahead reasoning) In a normative system S, a norm (i, o) ∈ S is redundant under the look-ahead semantics if and only if we can derive it from S \ {(i, o)} using replacement of logical equivalents in input and output, together with the following rules:
SI: from (i1, o), derive (i1 ∧ i2, o)
WO: from (i, o1 ∧ o2), derive (i, o1)
OR: from (i1, o) and (i2, o), derive (i1 ∨ i2, o)
This semantics also has a drawback. Unlike for the persistence labeling, it is not the case that an obligation is detached every time the condition of the norm is true. For example, if at some future moment in time we again have that owejp, then we no longer detach the obligations for payjp and receiptpj. This is left for further research.
4
Concluding remarks
The distinction between norms and obligations goes back to the philosophical problem known as Jørgensen's dilemma [7], which roughly says that a proper logic of norms is impossible because norms do not have truth values. Systems without explicit norms are difficult to use in multi-agent systems. However, most formal systems in the deontic literature [1, 11, 5, 10, 6, 7, 4] are restricted to obligations, prohibitions and permissions, and do not consider the originating norms explicitly. Furthermore, systems that do explicitly represent the norms of the system usually do not provide a way to reason about them. Finally, systems for reasoning about norms [7, 8, 3] do not consider the intricacies of time. Our approach aims at filling this gap. Our approach also gives temporal interpretations to well-known issues discussed in the deontic logic literature, such as the distinction between 'conditions' and 'contexts', and the distinction between creating a new obligation and detaching an obligation.
REFERENCES
[1] C.E. Alchourrón and E. Bulygin, Normative Systems, Springer, Wien, 1971.
[2] J.M. Broersen and L. van der Torre, 'Reasoning about norms, obligations, time and agents', in Proceedings PRIMA '07, eds., A. Ghose and G. Governatori, Lecture Notes in Computer Science, Springer, (2008).
[3] J. Hansen, 'Sets, sentences, and some logics about imperatives', Fundamenta Informaticae, 48, 205–226, (2001).
[4] J. Horty, Agency and Deontic Logic, Oxford University Press, 2001.
[5] B. Loewer and M. Belzer, 'Dyadic deontic detachment', Synthese, 54, 295–318, (1983).
[6] D. Makinson, 'Five faces of minimality', Studia Logica, 52, 339–379, (1993).
[7] D. Makinson, 'On a fundamental problem of deontic logic', in Norms, Logics and Information Systems. New Studies on Deontic Logic and Computer Science, eds., P. McNamara and H. Prakken, pp. 29–54. IOS, (1999).
[8] D. Makinson and L. van der Torre, 'Input-output logics', Journal of Philosophical Logic, 29(4), 383–408, (2000).
[9] D. Makinson and L. van der Torre, 'Constraints for input-output logics', Journal of Philosophical Logic, 30(2), 155–185, (2001).
[10] J. J. Ch. Meyer, 'A different approach to deontic logic: Deontic logic viewed as a variant of dynamic logic', Notre Dame Journal of Formal Logic, 29(1), 109–136, (1988).
[11] J. van Eck, 'A system of temporally relative modal and deontic predicate logic and its philosophical applications', Logique et Analyse, 25, 339–381, (1982).
Trust Aware Negotiation Dissolution Nicolás Hormazábal, Josep Lluis de la Rosa i Esteva and Silvana Aciar1 Abstract. In this paper we propose a recommender system that suggests the best moment to end a negotiation. The recommendation is made from a trust evaluation of every agent in the negotiation, based on their past negotiation experiences. For this, we introduce the Trust Aware Negotiation Dissolution algorithm.
1 INTRODUCTION
Negotiation and cooperation are critical issues in multi-agent environments [3], such as in Multi-Agent Systems and research on Distributed Artificial Intelligence. In distributed systems, high costs and time delays are associated with operators that make high demands on the communication bandwidth [1]. Considering that agents are aware of their own preferences, which guide their decision making during the negotiation process, the negotiation can go through several steps depending on these values, as each agent does not know the others' preferences. This can lead to an increase of communication bandwidth costs affecting the general performance, and might put agents in undesirable negotiation situations (such as a negotiation that probably will not end with an acceptable agreement). Termination of the negotiation process, or a negotiation dissolution action, should be considered when the negotiation is in a situation where the expected result of the following steps cannot be better than the current result. This will not only help to determine when to end a negotiation process, but also help to decide whether to end it with or without an agreement.
2 TRUST AWARE DISSOLUTION
The Trust Aware Negotiation Dissolution algorithm (TAND from now on) takes into account direct interactions from similar situations in the past (Situational Trust [4]). The basic formula used to calculate this type of trust is:

Ta(y, α) = Ua(α) × T̂a(y, Pα)    (1)

Where:
• a is the evaluator agent.
• y is the target agent.
• α is the situation.
• Ua(α) represents the utility that a gains from a situation α, calculated by its utility function.
• Pα is a set of past situations similar to α.
• T̂a(y, Pα) is an estimated general trust for the current situation. We will calculate this value considering two possible results for each situation in the set of past interactions Pα that are similar to α: a successful result or an unsuccessful one (whether or not an agreement was reached). This leads to the calculation of the probability that the current situation could end in an agreement, based on past interactions (following the work in [6]). It is calculated by:

T̂a(y, Pα) = e / n    (2)

Where e is the number of times that an agreement has been made with the agent y in each situation from Pα, and n is the total number of observed cases in Pα with the agent y: n = |Pα|. A function g based on agent a's decision process returns the set S of j possible negotiation situations (the offers the agent is willing to make) σ, based on the current situation α the agent is in:

g : α → S    (3)
S = {σ1, σ2, ..., σj}    (4)

From the possible situations, we obtain the best expected situational trust Ea(y, S), which is the trust for the best expected case from among the possible situations in which the agents can find themselves in the future, given the current negotiation:

Ea(y, S) = max_{σ ∈ S} Ta(y, σ)    (5)

We know the trust in the current situation Ta(y, α). We also have the best expected situational trust Ea(y, S). With these two values, we can calculate a rate that will help the agent decide whether or not it should continue the negotiation. The situational trust at the present time, divided by the potential situational trust, gives us the Dissolution Rate R, which in conjunction with a minimum acceptable trust value M will help to decide whether or not to dissolve the negotiation process.

R = Ta(y, α) / Ea(y, S)    (6)

The dissolution decision depends on the value of R:

R ≥ 1 ⇒ Dissolve
(R < 1) ∧ (Ea(y, S) < M) ⇒ Dissolve
(R < 1) ∧ (Ea(y, S) ≥ M) ⇒ Continue Negotiating    (7)

In other words, if, based on future steps, the expected situation does not have a better trust value than the current one, the best thing to do is to end the negotiation now. Otherwise, it is better to continue negotiating.
1 Agents Research Lab, University of Girona, Catalonia, Spain, email: (nicolash, peplluis, saciar)@eia.udg.edu
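Read together, equations (1)-(7) amount to a small decision procedure; the following sketch is an illustration under assumed inputs, with all function names and numbers invented for the example.

def situational_trust(utility, agreements, total):
    # Ta(y, alpha) = Ua(alpha) * e/n over the past similar situations (eqs. 1-2).
    return utility * (agreements / total if total else 0.0)

def decide(T_current, expected_trusts, M):
    # expected_trusts: Ta(y, sigma) for each possible next situation (eqs. 3-5);
    # M is the minimum acceptable trust value.
    E = max(expected_trusts)                  # best expected situational trust
    R = T_current / E if E else float("inf")  # dissolution rate (eq. 6)
    return "dissolve" if R >= 1 or E < M else "continue"   # eq. 7

T_now = situational_trust(utility=0.6, agreements=3, total=10)  # 0.18
print(decide(T_now, expected_trusts=[0.2, 0.45, 0.3], M=0.25))  # continue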
3 EXPERIMENT AND RESULTS
For testing the TAND algorithm, we implemented a negotiation environment where two agents negotiate to reach an agreement from a limited number of options; agents consecutively offer their next best option at each step until the offer is no better than the received one. The scenario consists of different agents that each represent a person who wants to go to a movie with a partner, so they negotiate between them, from different available movie genres, to choose which movie to go to together. The environment was developed in RePast.2 In order to avoid affecting the system performance, agents will only save the trust of a limited number of potential partners in their knowledge base; that is, they will maintain a limited contact list instead of recording the experience of every partner they have negotiated with. There will be a fixed number of available movie genres (for example, drama, comedy, horror, etc.) during the whole simulation. Each agent will have a randomly generated personal preference value (from a uniform distribution) for each genre between 0 and 1, where 0 is a genre it does not like at all, and 1 is its preferred movie genre. One of these genres, randomly chosen for each agent, will have a preference value of 1, so each agent will always have a favorite genre. We assume that there is always a movie in the theaters available for each genre. Each movie genre will be used to identify the situation α the negotiation is in, for the calculation of the trust from equation 1. The result of the utility function Ua(α) will be the preference for each movie genre randomly assigned to each agent. Partners involved in the negotiation will be randomly chosen. An agent can participate in only one negotiation at a time. The experiment will run through three different cases, each one with 100 agents and 10 different movie genres:
• Case 1: Contact list of the 20 most trusted partners.
• Case 2: Unlimited contact list size.
• Case 3: No TAND, simple negotiation.
Every experiment will run through 2000 steps. At each step, 1/4 of the total population (25 agents for the cases described above) will invite another partner to a negotiation for a movie. For evaluating the performance, we will use three values:
• Average steps used for all agreements made: AS (lower is better).
• Average preference (the final preference value during the agreement for each successful negotiation): AP (higher is better).
• Average distance from the perfect pitch: AD (lower is better).
We define the perfect pitch P as the highest value, over every possible agreement d, of the product of the preferences of each agent a in the set A of participating agents (the result of the utility function Ua(α) for each movie genre):

P = max_{d∈D} ∏_{a∈A} fa    (8)

The distance from the perfect pitch is the difference between the negotiation preference K and the perfect pitch P:

AD = P − K    (9)

After 20 experiments for each case, we averaged the results obtained at each experiment, as seen in Table 1.
2 http://repast.sourceforge.net
Table 1. Average Final Steps

             Case 1   Case 2   Case 3
AS  Avg      5,2894   4,6683   5,6993
    Std Dev  0,0283   0,0249   0,0282
AP  Avg      0,8001   0,8168   0,7892
    Std Dev  0,0073   0,0064   0,0080
AD  Avg      0,1370   0,1125   0,1548
    Std Dev  0,0034   0,0030   0,0048
The results improve in cases 1 and 2, in terms of the average steps AS needed for closing a negotiation with an agreement, compared to case 3, where TAND is not used. Moreover, the average preference AP has a higher value, and the distance from the perfect pitch AD is reduced by more than 35% from case 3 to case 2. The contact list size is a critical issue: as we can see from comparing the results of cases 1 and 2, the improvement is higher when there are no limits on the contact list's size.
4 CONCLUSIONS AND FUTURE WORK
We have presented TAND and its preliminary results, where we can see that it improves the negotiation process in terms of agents' preferences and the number of steps needed to achieve an agreement. Taking into account possible agent performance issues, a limited contact list should be considered, but its size limitation could negatively affect the TAND results, as we can see in Table 1, so work on finding the optimal contact list size should be done. So far the contact-list filling criteria are simple: agents with higher trust replace the agents with lower values when the contact list is full. Improved results are expected using other criteria for dealing with the contact list, for example using different levels of priorities, or a relation with the partner selection criteria (in the experiments the selection is made randomly). TAND has been tested on a simple bilateral negotiation process, but it can also be used on other types of temporary coalitions, such as dynamic electronic institutions [5], for supporting their dissolution phase. Future work will focus on this, expanding its scope to generic types of coalitions. In addition, work on implementing other ways to calculate trust should be done, as well as other methods to manage the dissolution (such as Case-Based Reasoning [2]), in order to compare results. The topic of dissolution of coalitions is not a new one, but it is not a topic that has been studied in depth [2], so this research topic provides a wide open field that needs to be explored.
REFERENCES
[1] H. Bui, D. Kieronska, and S. Venkatesh, 'Learning other agents' preferences in multiagent negotiation', Proceedings of the National Conference on Artificial Intelligence (AAAI-96), 114–119, (1996).
[2] N. Hormazabal and J. Ll. de la Rosa, 'Dissolution of dynamic electronic institutions, a first approach: Relevant factors and causes', in 2008 Second IEEE International Conference on Digital Ecosystems and Technologies, (February 2008).
[3] S. Kraus, 'Negotiation and cooperation in multi-agent environments', Artificial Intelligence, 79–97, (1997).
[4] S. Marsh, Formalising Trust as a Computational Concept, Ph.D. dissertation, Department of Mathematics and Computer Science, University of Stirling, 1994.
[5] E. Muntaner-Perich and J. Ll. de la Rosa, 'Dynamic electronic institutions: from agent coalitions to agent institutions', Workshop on Radical Agent Concepts (WRAC'05), Springer LNCS (Volume 3825), (2006).
[6] M. Schillo, P. Funk, and M. Rovatsos, 'Using trust for detecting deceitful agents in artificial societies', Applied Artificial Intelligence (Special Issue on Trust, Deception and Fraud in Agent Societies), 14, 825–848, (2000).
On the Role of Structured Information Exchange in Supervised Learning Ricardo M. Araujo and Luis C. Lamb, Institute of Informatics - UFRGS, Brazil
1 Introduction
When considering Multi-Agent Systems (MAS) composed of learning agents, an important aspect is how agents interact with each other, i.e. their social structure [8, 6]. Several studies are concerned with understanding the role of structures in multi-agent learning (MAL). Research on this topic is typically directed at static structures that constrain interaction between agents [3, 5]. However, in several cases it may be preferable to have a self-organized social structure, by letting agents decide the connections they will make [1]. We propose a multi-agent framework for modeling agents that exchange information through a self-organized network. This framework is composed of communication stages that separate the dynamics of and on the network [7], i.e. how the underlying network is constructed and how it is used. In several ways, our model resembles social network models and the propagation of memes [2] between individuals. We thus call the framework Memetic Networks. We apply the framework to build a distributed learning algorithm that learns concepts by exchanging information about hypotheses constructed by individual agents. We show that this algorithm is able to learn reasonably complex concepts in a real-world scenario. We use this algorithm to explore questions such as (i) is it advantageous to be connected to many sources of information? and (ii) is it beneficial to have access only to good sources of information?
2 On Memetic Networks
We define a Memetic Network as a set of agents that may exchange information through a network following rules that guide how information flows and is used. Our scenario can be formalized as follows. Let A be an ordered set of agents a1 , a2 , ..., aN and E an unordered set of pairs of distinct agents in A. The ordered pair (A, E) is thus a graph where vertices represent agents and edges represent the possibility of two agents to interact. Three rules define how the graph is wired and how information is processed. We describe conceptually each rule below and shall present specific implementations for each one in the next section. Connection rule. It specifies how individuals will connect to and disconnect from each other. This rule guides the construction of the network structure. For example, an instance of such rule could be “a connection between node n1 and n2 exists if and only if n1 is better evaluated than n2 ” or “connect randomly to a certain number of individuals”. The connection rule is executed at every step of the algorithm, thus the network is dynamic. Aggregation rule. Given a connection, this rule specifies how information is to flow through it. It guides how the solution contained in each node is to be modified as a function of the connected nodes. For example, if every node contains a single bit, this rule could be “adopt the bit that is present in the majority of connected nodes”. It defines the dynamics on the network.
Appropriation rule. After information has been aggregated, this rule specifies any local changes to the information contained in a node (e.g. the application of a hill-climbing search).
3 Concept Learning in Memetic Networks
Using the framework defined above, we can define a Memetic Network Algorithm that is able to inductively learn concepts by searching the hypothesis space. The search is guided by information exchange between agents, and new hypotheses are formed by aggregating multiple existing hypotheses. We are interested in learning binary concepts from a set of (possibly noisy) examples. In order to define each rule and thus instantiate a Memetic Network, we must choose a representation for the hypotheses and an evaluation criterion. We use propositional rules in Disjunctive Normal Form (DNF) to represent our search space, using the binary codification proposed in [4]. An agent ai thus contains a binary string hi which represents a hypothesis. Hypotheses are evaluated by the number of examples correctly classified. In what follows, we propose and discuss an implementation of each of the proposed general Memetic Network rules.
Connection rule: a directed edge (ai, aj) from agent ai to agent aj exists if and only if eval(aj) > eval(ai), where eval(ak) is the evaluation of the hypothesis of agent ak.
Aggregation rule: the bit in position j of hypothesis hk is set to the bit value in position j that occurs most often among all agents that ak connects to. Ties are broken by coin toss.
Appropriation rule: each bit of the aggregated hypothesis is flipped with a (small) probability pn and the whole hypothesis is evaluated; if this new hypothesis is better than the previous one, it becomes the hypothesis for the agent; otherwise, the previous hypothesis is kept unchanged.
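The three rules translate almost literally into code. The sketch below is a toy instantiation: eval_h is a stand-in evaluation function (the paper uses the number of training examples correctly classified), and the bit strings are much shorter than the 1020-bit hypotheses used in the experiments.

import random

def aggregate(h, neighbours):
    # Majority bit per position among the connected hypotheses; ties by coin toss.
    out = []
    for j in range(len(h)):
        ones = sum(g[j] for g in neighbours)
        if 2 * ones == len(neighbours):
            out.append(random.randint(0, 1))
        else:
            out.append(1 if 2 * ones > len(neighbours) else 0)
    return out

def appropriate(h, eval_h, pn=0.001):
    # Flip each bit with probability pn; keep the mutant only if it is better.
    mutant = [b ^ (random.random() < pn) for b in h]
    return mutant if eval_h(mutant) > eval_h(h) else h

def step(hyps, eval_h):
    new = []
    for h in hyps:
        better = [g for g in hyps if eval_h(g) > eval_h(h)]   # connection rule
        h2 = aggregate(h, better) if better else h            # aggregation rule
        new.append(appropriate(h2, eval_h))                   # appropriation rule
    return new

target = [1, 0, 1, 1, 0, 1, 0, 0]
eval_h = lambda h: sum(a == b for a, b in zip(h, target))     # toy evaluation
hyps = [[random.randint(0, 1) for _ in range(8)] for _ in range(20)]
for _ in range(30):
    hyps = step(hyps, eval_h)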
4 Experiments
We have experimented with the algorithm described in the previous section on the Breast Cancer Recurrence dataset [9]. This dataset is composed of 286 examples with 9 features (each varying between 2 and 13 distinct possible values, all discrete) and 2 classes; 201 examples are positive and 85 are negative. We randomly partition the whole set of examples into a training set (200 examples) and a validation set (86 examples). Each agent was randomly initialized with a hypothesis and evaluated using the training set by counting how many examples it correctly classified (batch mode). We let the algorithm run for 1000 rounds and repeat the whole process 20 times. We set pn = 0.001, N = 100 and limited the rules to a maximum of 20 disjuncts (hypotheses are 1020 bits in length). Figure 1(a) shows convergence results over the validation set. Network Diversity. In order to better understand whether there are benefits to being connected to many agents, we modify the connec-
tion rule so as to limit the number of connections. In what follows, indegree(ak) and outdegree(ak) measure respectively the number of incoming and outgoing connections of agent ak.
Connection Rule (2nd version): a directed edge (ai, aj) from agent ai to agent aj exists if and only if eval(aj) > eval(ai) and indegree(aj) < α and outdegree(ai) < β.
Figures 1(b) and 1(c) show how performance changes with changes in α and β. We can observe a logarithmic increase in the accuracy of the best solution when we increase α. Thus, allowing agents to source information to many other agents is beneficial in our model. By setting a high α value we are allowing good information to reach as many agents as possible; thus one could argue that the above results were expected. However, the excessive spread of good information could cause early stagnation, as most agents would converge to the same solution. If this were the case, an intermediate value of α would be the best setting, but this is not what happens. Variations in β are responsible for smaller variations in performance when compared to changes in α. However, unlike for α, there is an optimum intermediate value for β: setting this parameter to lower or higher values than the optimum causes the algorithm's performance to worsen. When connecting to very few agents, not enough information is being recombined and agents are effectively cloning better hypotheses (thus exploiting good solutions but not exploring the search space efficiently). When connecting to too many agents, the high diversity of solutions seems to be detrimental to the aggregation rule's ability to perform recombinations that are useful. Agent Diversity. We assumed in our learning algorithm that it is beneficial to be connected only to agents that are better evaluated. To test this assumption, we modified the algorithm so as to let agents connect to other agents that have lower evaluations with probability pd. This increases the diversity of solutions that can be used to compose new hypotheses. Figure 1(d) shows the results of varying pd from 0.0 to 1.0 in 0.1 increments. We observe that the algorithm's performance quickly drops as we increase the probability of bad solutions taking part in the recombination process. Allowing agents to have access to worse evaluated hypotheses is detrimental to the algorithm's performance.
Figure 1. Experiment results: (a) convergence results for the validation set; (b) accuracy after 100 rounds for varying α; (c) accuracy after 100 rounds for varying β; (d) accuracy after 100 rounds for varying pd.
Acknowledgments. Work partly supported by CNPq - Brazil.
REFERENCES
[1] R. Araujo and L. Lamb, 'Memetic Networks: Analyzing the Effects of Network Properties in Multi-Agent Performance', in Proc. of AAAI-08, (2008).
[2] Richard Dawkins, The Selfish Gene, Oxford University Press, 1976.
[3] M. Giacobini, M. Tomassini, and A. Tettamanzi, 'Takeover time curves in random and small-world structured populations', in GECCO '05: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, (2005).
[4] K. De Jong and W. Spears, 'Using genetic algorithms for concept learning', Machine Learning, 13, 161-188, (1993).
[5] E. Lieberman, C. Hauert, and M. A. Nowak, 'Evolutionary dynamics on graphs', Nature, 433(7023), 312-316, (January 2005).
[6] P. Mathieu, J. C. Routier, and Y. Secq, 'Dynamic organization of multi-agent systems', in Proc. of the 1st International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 451-452, New York, NY, USA, (2002). ACM.
[7] M. Newman, A.-L. Barabási, and D. Watts (eds.), The Structure and Dynamics of Networks, Princeton University Press, 2006.
[8] Z.-G. Wang and X.-H. Liang, 'A graph based simulation of reorganization in multi-agent systems', in Proc. of the IEEE/WIC/ACM International Conference on Intelligent Agent Technology, (2006). IEEE Computer Society.
[9] M. Zwitter and M. Soklic. Breast cancer data. Institute of Oncology, University Medical Centre Ljubljana, Yugoslavia, 1988.
Magic Agents: Using Information Relevance to Control Autonomy 1 B. van der Vecht 2,3 and F. Dignum 3 and J-J. Ch. Meyer 3 Abstract. Autonomous agents are believed to have control over their internal state and over their behaviour. For that reason, an agent should control how and by whom it is being influenced. We introduce a reasoning component for BDI-agents that deals with the control over external influences, and we propose heuristics using local knowledge to process incoming stimuli. One of those heuristics is based on information relevance with respect to the agent's current plans and goals. We have developed a way to determine the relevance of information in BDI-agents using magic sets from database research as a basis. The method presented shows a new application of magic sets, applying the theory to agent systems.
1
INTRODUCTION
Agents are believed to be autonomous, meaning that they have control over their internal state and over their behaviour [4]. In a multi-agent environment where coordination of group behaviour is required, agents will influence each other. We argue that an autonomous agent should control how and by whom it is being influenced. The key issue is to find general heuristics to control external influences. In this paper we investigate information relevance as such a heuristic. An agent that can determine the relevance of information with respect to its goals is able to deal dynamically with external input and is less sensitive to information overload [2]. We describe a way to determine information relevance in BDI agents based on magic sets [1], a method developed for efficient deductive database searching. We introduce a new use of magic-sets theory that is beneficial for agent reasoning.
2
CONTROLLING EXTERNAL INFLUENCES
A popular approach to agent reasoning is BDI reasoning. We can view a BDI-agent as a mental state and a reasoning process. The mental state captures the beliefs, goals and plans, and the reasoning process decides upon actions based on the mental state. The mental state is updated via internal actions and via external influences, such as the agent's own observations or messages from other agents. In [5] an extension to the classic reasoning model has been proposed, such that an agent analyzes incoming stimuli based on internal knowledge before it adopts them in its mental state. This requires a separate process next to the goal-directed decision making. The model uses reasoning rules in order to decide on the adoption or rejection of external influences. The reasoning rules contain predicates that consult knowledge from the agent's internal state. Reasoning rules contain a head, a guard and a body. The head indicates the activation event for the rule, the guard contains predicates that should match the agent's belief base, and the body describes the resulting action. An example of a reasoning rule:

observe(X) <- relevant(X) | Adopt(X)

1 The research reported here is part of the Interactive Collaborative Information Systems (ICIS) project, supported by the Dutch Ministry of Economic Affairs, grant no. BSIK03024.
2 TNO Defense, Security and Safety, The Netherlands
3 Department of Information and Computing Sciences, Universiteit Utrecht, The Netherlands, email: bobv, dignum, jj@cs.uu.nl
The rule is activated by an observation event. The rule states that if something is observed and it is relevant, the agent will add the information to its belief base. The agent evaluates the predicates using its local knowledge, and therefore it actively chooses whether to reject or adopt an event. This process is opposed to models that adopt observations or messages directly into the belief base, thereby taking control over the internal state away from the agent.
3
INFORMATION RELEVANCE
An agent is continuously receptive to input from the environment. In our model an agent has the option to adopt or reject incoming stimuli based on local knowledge. It evaluates predicates in reasoning rules to determine whether to adopt or reject influences. An immediate question is: what are valid reasons to adopt or reject influences? Castelfranchi introduced information relevance as a typical heuristic to control external influences [2]. He defines relevance of information with respect to a certain goal. Intuitively, an agent might want to focus on a specific type of information given its goals. A typical reason can be that the agent does not want to be distracted, or that it wants to prevent information overload. Therefore it should be able to determine whether information is relevant for the goal. Relevance of information is a heuristic for an agent to determine how it is influenced. The agent should also control by whom it is being influenced. The reasoning rules can use knowledge about the organization or about the social context, for example by evaluating whether the sender of a message can be trusted. Agents then achieve coordination by allowing influence on the internal state based on social and organizational knowledge. One can think of several other reasons to allow or refuse influence on the internal state; for example, domain knowledge always plays a role. In this paper we focus on the heuristic of information relevance.
4
MAGIC AGENTS
In work on agent autonomy Castelfranchi has described information relevance for agents with respect to goals [2]. According to his definition, information is relevant for a goal if the information is about
the goal, if it is about the content of a goal, or if it is about plan relations of the goal. Castelfranchi did not include any method for how the relevance of information could be determined. We analyze the notion of relevance based on the 3APL model [3]. 3APL provides a reasoning model and a programming language for BDI-agents. The agent's internal state consists of a belief base, a goal base, a set of reasoning rules and a set of capabilities. The deliberation cycle describes the decision-making process, which makes use of the concepts in the internal state. During the deliberation process, queries are asked of the belief base. From the formal semantics of the 3APL model we know that the following types of queries exist in the deliberation process: guard checks, test actions and goal achievement checks. Guard checks are used to activate a reasoning rule, test actions are tests on the belief base as part of a plan, and goal achievement checks are used to check whether current goals have been reached. All information used to solve those queries is relevant for the agent's reasoning process. However, the belief base also contains deduction rules, and it is difficult to tell in advance which facts will be used to solve a query. We could evaluate all possible queries in order to determine whether information is relevant at the moment of perception, but this would be a computationally intensive task; therefore we have developed a method for quick evaluation of information relevance based on magic sets.
4.1
Using magic sets for relevance determination
The magic set method is a bottom-up query evaluation technique developed in deductive database research. A straightforward algorithm for the Magic Set transformation is explained in [1]. Magic sets are used to define the relevant elements in a database for a specific query, in order to speed up the search process significantly. We can use the theory behind magic sets to determine information relevance. Consider a program P containing logical derivation rules. A query is written as q(c, X), where some variables of the query are bound (c) and others are open (X) and need to be derived. The solution to the query is a set of bindings for the variables in X that make the query expression true. The Magic Set method evaluates program P with the information of the bound variables from the query. The program P is rewritten into a new program P', which is equivalent to P with respect to the query, and which uses the bindings in the query to direct the computation. A new predicate is defined based on the query, in which all values of predicates that need to be computed are stored. This new predicate is the magic set. We can use magic sets to determine information relevance in a BDI agent: the agent wants to know the relevance of an observation with respect to the queries it may pose, so we create magic sets for those queries. We then create a predicate relevant(X), which is derived using the magic sets of the queries.
4.2 Example

Consider an agent traveling from A to B. It continuously receives traffic information, which might lead to a reconsideration of the planned route. Intuitively we know that only information concerning the agent's planned route is relevant. We will construct the predicate relevant(traffic_message(X)). The belief base contains the route the agent has planned as planned_route(X, Y) facts. Furthermore, the agent can determine whether a location is on the planned route: the fact on_route(X, Y, Z) implies that Y is on the route from X to Z. The reasoning rules contain the decision to start replanning the route based on traffic information, and otherwise to execute the action Go.

BELIEFBASE:
on_route(X, Y, Z) :- planned_route(X, Y).
on_route(X, Y, Z) :- planned_route(X, W), on_route(W, Y, Z).
traffic(X, Y) :- traffic_message(Z), on_route(X, Z, Y).

REASONING RULES:
<- traffic(X, Y) | Replan
<- NO traffic(X, Y) | Go

We want to determine the relevance of traffic messages. Based on the reasoning rules of the agent, we know that there is one query on the belief base that uses traffic messages in its deduction: traffic(X, Y)?. Therefore we need to create a magic set for the traffic predicate. The free variable Z from traffic_message(Z) is also used in on_route(X, Z, Y). Now we can define the relevance of a traffic message as follows:

relevant(traffic_message(Z)) :-
    magic_traffic(X, Y), on_route(X, Z, Y).

The construction of the magic_traffic(X, Y) predicate is done using the algorithm for the Magic Set transformation [1]. Furthermore, the traffic and on_route predicates are rewritten in the magic set transformation, which ensures quick evaluation using the immediate bindings for X and Y. The set of relevant messages is captured by the relevant predicate: whenever the agent receives a message, it can derive whether it contains relevant information or not. We can use the predicate in reasoning rules to control external influences. For example, the following agent only accepts relevant messages and ignores all others:

traffic_message(Y) <- relevant(traffic_message(Y)) | Accept(Y)
traffic_message(Y) <- TRUE | Ignore(Y)
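For intuition, here is a minimal Python sketch of the same filtering idea (ours, not the authors' 3APL implementation): the "magic set" is precomputed from the bound route facts, and each incoming traffic_message(Z) is tested against it before being adopted.

    def on_route_locations(planned_route):
        # Collect every location bound by the agent's planned route;
        # this plays the role of the magic set in this toy example.
        return set(planned_route)

    class TrafficAgent:
        def __init__(self, planned_route):
            self.magic_set = on_route_locations(planned_route)  # precomputed once
            self.beliefs = set()

        def relevant(self, location):
            return location in self.magic_set

        def observe(self, location):
            # Reasoning rule: adopt the message only if it is relevant.
            if self.relevant(location):
                self.beliefs.add(("traffic", location))  # Accept
            # otherwise Ignore: the belief base is left untouched

    agent = TrafficAgent(planned_route=["A", "C", "D", "B"])
    agent.observe("C")  # adopted: C is on the planned route
    agent.observe("X")  # ignored: X is irrelevant to the current plan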
5
CONCLUSION
We have argued that an autonomous agent should control how and by whom it is being influenced. We have introduced a reasoning component that deals with the control over external influences, and we have described heuristics based on local knowledge with which the agent can decide to adopt or reject incoming stimuli. Information relevance is such a heuristic. Agents that control external influences based on information relevance can improve their performance and are less sensitive to information overload. We have proposed a way to determine the relevance of information in BDI-agents based on magic sets theory. Our approach shows a new use of magic sets within agent systems.
REFERENCES
[1] C. Beeri and R. Ramakrishnan, 'On the power of magic', Journal of Logic Programming, 10, 255-300, (1991).
[2] C. Castelfranchi, 'Guarantees for autonomy in cognitive agent architecture', Intelligent Agents, (890), 56-70, (1995).
[3] M. Dastani, B. van Riemsdijk, F. Dignum, and J-J. Ch. Meyer, 'A programming language for cognitive agents: Goal directed 3APL', in ProMAS'03, volume 3067 of LNAI, pp. 111-130. Springer, (2004).
[4] N. R. Jennings, 'On agent-based software engineering', Artificial Intelligence, 117(2), 277-296, (2000).
[5] B. van der Vecht, F. Dignum, J-J. Ch. Meyer, and M. Neef, 'A dynamic coordination mechanism using adjustable autonomy', in COIN III, volume 4870 of LNCS, pp. 83-96. Springer, (2007).
Infection-Based Norm Emergence in Multi-Agent Complex Networks Norman Salazar and Juan A. Rodriguez-Aguilar and Josep Ll. Arcos 1 Abstract. We propose a computational model that enables agents in a MAS to collaboratively evolve their norms in order to reach the best norm conventions. Our approach borrows from the social contagion phenomenon to exploit the notion of positive infection: agents with good behaviors become infectious and spread their norms in the agent society. By combining infection and innovation, our computational model helps a MAS establish better norm conventions even when a sub-optimal one has fully settled in the population.
1
Introduction
Norms have become a common mechanism to regulate the behavior of agents in multi-agent systems (MAS). They exist to balance agents' interests with respect to the society's, in such a way that each agent can pursue its individual goals without preventing other agents from pursuing theirs. However, learning and establishing an adequate set of norms is not trivial. This process is usually referred to as either self-organization or emergence. In societies, conventions result when members agree upon a specific behavior. Thus, a norm convention refers to a set of norms that has been established among the members of a society. One of the trends of thought in social studies is that norm conventions emerge by propagation or contagion, where social facilitation and imitation are key factors [2, 1]. From a MAS perspective, the studies in [8] [7] show that norm emergence is possible. However, these works are limited to analyzing norm propagation, leaving out norm innovation (the discovery of new norms), a key factor in the evolution of societies. When the aim is to help a MAS establish conventions in dynamic environments, propagating norms may not be enough, since propagation assumes that at least some agent in the society knows the correct set of norms, which is not always the case. Additionally, the problem can become even more difficult when the aim is not only to establish (any) convention(s), but the best convention(s). We propose an evolutionary computational model that enables agents in a MAS to collaboratively evolve their norms to reach the best norm conventions for a wide range of interaction topologies. To this aim, we take inspiration from the argument in the social sciences literature that behavior conventions arise from social contagion [1]. Although further evolutionary approaches appear in the literature [4], they are usually applied either (i) as a centralized process, or (ii) as an individual self-contained process for each agent. Both approaches can be potentially slow and tend to be off-line processes, and are thus unsuitable for our purpose of dynamically adapting norms.
IIIA, Artificial Intelligence Research Institute, CSIC, Spanish National Research Council, Spain, email:{norman, jar, arcos}@iiia.csic.es
2
An Evolutionary Infection-Based Model
We propose a computational model that helps agents in a MAS reach norm conventions that maximize the social welfare. To this aim, we assume, in line with the distributed nature of the problem, that we can achieve our goal by maximizing agents' individual welfares. The social sciences literature argues that conventions in societies are reached through social contagion [1]: behaviors spread between individuals akin to an infectious disease. Hence, we chose to model social contagion in a MAS framework. However, we target beneficial conventions that, if possible, tend to maximize the social welfare. Considering the social welfare as a composition of individual welfares, it makes sense to let the individual behaviors that impact positively on it, here named good behaviors, be more infectious. Nevertheless, positive infection at most achieves a total replication of the best-known behavior among agents. Therefore, we also require a norm innovation mechanism. Hence we expect that a MAS can reach norms that are dominant in the society, so that no better ones can be found and no worse ones can upstage them. However, if some unaccounted factor(s) alter(s) the MAS so that the current norms become obsolete (the social welfare deteriorates), the infectious process will re-configure the norms toward a better social welfare. We propose an evolutionary algorithm (EA) approach that helps agents in a MAS reach the best norm conventions. In our infection-based EA, each agent has genes that encode its behavior. Agents can infect other agents with their genes, following the survival-of-the-fittest principle: the higher an agent's individual welfare, the more infectious it is. Furthermore, the model realizes innovation (exploration) by letting agents mutate their genes. This process runs in a distributed fashion: each agent decides whether to infect or mutate based on local knowledge. Thus, each agent is endowed with: i) an evaluation function to assess its individual welfare; ii) a selection process to choose a peer to infect, out of its local neighborhood, based on fitness; iii) an infection operator to inject some of its genes into the selected agent; and iv) an innovation operator to mutate its genes to create new behaviors. A sketch of this loop is given below.
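The following Python sketch illustrates one tick of this infection/innovation loop. The peer-selection detail and the real-valued gene encoding are our assumptions for illustration, not the exact operators of the paper's model.

    import random

    def infection_step(genes, welfare, neighbors, p_mutation=0.003):
        # genes[i]: norm parameters of agent i; welfare[i]: its individual
        # welfare; neighbors[i]: indices of i's neighbors in the topology.
        new = [g[:] for g in genes]
        for i, nbrs in enumerate(neighbors):
            if nbrs:
                j = random.choice(nbrs)
                # Infection: the fitter agent injects part of its genes into
                # the selected peer (survival of the fittest).
                if welfare[i] > welfare[j]:
                    k = random.randrange(len(genes[i]))
                    new[j][k] = genes[i][k]
            # Innovation: mutate own genes with a small probability.
            for k in range(len(new[i])):
                if random.random() < p_mutation:
                    new[i][k] = random.random()
        return new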
3
Empirical Results
Agents in a MAS interact with each other by engaging in iterative games with multiple rounds. During a round, each agent randomly selects a neighbor agent to play with (an opponent). A play consists of both agents performing an action, either A or B (actions constrained by their current norms). Plays are rewarded with a payoff, which is accumulated after each game round. The payoff for a round can be −1, 1 or α, depending on the agent's current action and the action selected by the neighbor (different actions, both B, and both A, respectively). This payoff can help capture pure coordination games [8][7]
(α = 1) and coordination games with equilibria differing in social efficiency [6] (α > 1). Each agent ag_i has two parameterized norms: one to help it decide what action to take based on the opponent's last action, and another to decide the action to take when no past action is known. To this end, the agent keeps in its memory the action performed by its last opponent, without distinguishing who the opponent was. Thus, our model has the task of finding, for each agent, the norm parameters that maximize the social welfare u. It is well known that the behavior of infections is affected by the type of topology on which a population interacts [9, 5]. Therefore, in order to empirically analyze such effects in our infection-based model, we chose the following interaction topologies: small-world, W_1000^(10,0.1); scale-free, S_1000^(10,-3); and random graphs, R_1000^(10). We know beforehand that four cooperative-only norms exist (norms that always try to cooperate), and also that they are the strongest attractors. Two of them always make agents do A (A-conventions) and the other two always do B (B-conventions). A-conventions give higher payoffs when α > 1. Our experiments aimed at showing that our model can help establish the best norm convention(s), i.e., maximize the social welfare, for a wide range of initial agent settings (norm configurations) and under the most common interaction topologies. Therefore, each experiment is composed of: i) an interaction topology model; ii) a payoff α ∈ {1, 1.5, 2}; and iii) an initial norm distribution, consisting of initializing the norms of every agent using one of five distributions: a) random (norms are randomly set); b) attractor-free (norms set from the non-cooperative-only norms); c) low sub-optimal (norms of 25% of the agents set from the B-conventions); d) high sub-optimal (75% of agents with norms from the B-conventions); and e) fully sub-optimal (norms of all agents set from the B-conventions). We ran 50 simulations of each experiment. In a simulation, agents interact and infect each other, as described above, during 20,000 ticks. To measure whether a convention is established, we counted the agents with the same norms per tick, and the agents doing A or B per tick. The counts of each simulation in the experiment were then aggregated using the inter-quartile mean. Pure coordination game [α = 1]. The experiments show that the population converges to an A-convention if initially more than 50% of the agents do action A; otherwise, a B-convention settles down. Importantly, a MAS establishes the cooperative-only norms even though for this game other conventions can achieve the same result. Since the A- and B-conventions are equally valuable, in this case the MAS establishes one of the best conventions regardless of the initial norm distribution and independently of the interaction topology. Different social efficiencies [α > 1]. When using the random initial distribution, a MAS readily settles in an A-convention for α > 1.0, independently of the interaction topology. The same occurs for the attractor-free and the low sub-optimal initializations, even though in the former no agent knew the best norms at startup. Departing from a high sub-optimal distribution, a MAS settles in a B-convention when α = 1.5 for all interaction topologies. However, by setting α to 2.0, the small-world networks manage to establish an A-convention. Thus, agents will not consider a new convention unless its benefit is significant enough.
As to the scale-free case, a greater benefit is needed. The fully sub-optimal distribution represents the worst-case scenario (Figure 1). In this case, innovation becomes a key factor. When the innovation probability is low (p_mutation = 0.003), the MAS is unable to converge to the best convention, because innovating agents are not able to overcome the high peer pressure.
Figure 1. Results for scale-free with fully sub-optimal initialization. Left: agents per norm (A-Conventions, Tit-For-Tat); Right: agents per action (Action A, Action B). [Plots: number of agents (0-1000) vs. ticks (0-20,000).]
Even more, infected scale-free networks are hard to overcome [3, 5]. Hence, we increased the mutation probability (p_mutation = 0.055) so that scale-free (α = 2.0) and small-world (α > 1) converged to an A-Convention. This occurred because a small group of agents playing tit-for-tat-like norms starts to appear. Agents with this strategy can coexist with B-Convention agents with little or no negative effect on their accumulated payoffs. Therefore, when agents with A-convention norms appear, they have a higher chance of having neighbors that will cooperate with them. However, a high mutation rate presents the disadvantage that a small part of the population will be constantly trying to innovate (in our case around 20%). We conclude that highly-clustered agent communities (e.g. small-world) are more open to positive infections, whereas low-clustered ones (e.g. scale-free) are harder to infect once some infection has settled. This is similar to some results shown in [6]. However, our evolutionary model can overcome the difficulty of re-infecting low-clustered networks (by using a high innovation-through-mutation rate) whenever we are ready to pay the following cost: a small subgroup of agents unable to settle on a set of norms. Finally, we claim that i) a convention is always reached, and ii) under certain conditions this convention is the best one for all topologies. Moreover, when these conditions are not met, e.g. when a sub-optimal convention is fully established, our model can still reach the best convention through innovation.
ACKNOWLEDGEMENTS The first author thanks the CONACyT scholarship. The work was funded by IEA (TIN2006-15662-C02-01), AT (CSD2007-0022), Ssourcing, and the Generalitat of Catalunya grant 2005-SGR-00093.
REFERENCES
[1] R. Burt, 'Social contagion and innovation: Cohesion versus structural equivalence', American J. of Sociology, 92, 1287-1335, (1987).
[2] R. Conte and M. Paolucci, 'Intelligent social learning', Artificial Society and Social Simulation, 4(1), 1-23, (2001).
[3] Z. Dezso and A.-L. Barabási, 'Halting viruses in scale-free networks', Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), 65(5), 055103, (2002).
[4] D. Moriarty, A. Schultz, and J. Grefenstette, 'Evolutionary algorithms for reinforcement learning', Journal of Artificial Intelligence Research, 11, 241-276, (1999).
[5] R. Pastor-Satorras and A. Vespignani, 'Epidemic dynamics and endemic states in complex networks', Physical Review E, 63, 066117, (2001).
[6] J.M. Pujol, J. Delgado, R. Sangüesa, and A. Flache, 'The role of clustering on the emergence of efficient social conventions', in IJCAI 2005, pp. 965-970, (2005).
[7] Y. Shoham and M. Tennenholtz, 'On the emergence of social conventions: Modeling, analysis, and simulations', Artificial Intelligence, 94(1-2), 139-166, (1997).
[8] A. Walker and M. Wooldridge, 'Understanding the emergence of conventions in multi-agent systems', in ICMAS 1995, pp. 384-389, (1995).
[9] D.J. Watts and S.H. Strogatz, 'Collective dynamics of 'small-world' networks', Nature, 393(6684), 440-442, (June 1998).
Opponent Modelling in Texas Hold'em Poker as the Key for Success Dinis Félix and Luís Paulo Reis1 Abstract. Over the last few years, research in Artificial Intelligence has focussed on games with incomplete information and non-deterministic moves. The game of Poker is a perfect theme for studying this subject. The best-known Poker variant is Texas Hold'em, which combines simple rules with a huge number of possible playing strategies. This paper focusses on developing algorithms for performing simple online opponent modelling in Texas Hold'em Poker, enabling the selection of the best strategy to play against each given opponent. Several autonomous agents were developed in order to simulate typical Poker players' behaviour, and an observer agent was developed, capable of using simple opponent-modelling techniques in order to select the best playing strategy against each opponent. The results obtained in realistic experiments using eight distinct poker-playing agents showed the usefulness of the approach. The observer agent is clearly capable of outperforming all of its counterparts in all tests performed.
1 INTRODUCTION
Incomplete knowledge, risk management, opponent modelling and dealing with unreliable information are topics that identify Poker as an important research area in Artificial Intelligence (AI). Unlike games of perfect information, in poker players face hidden information resulting from the opponents' cards and future actions. In such a domain, to be successful, players need to use opponent-modelling techniques in order to understand and adapt themselves to the opponents' playing styles [1,2]. However, the huge number of possible playing strategies in Poker makes opponent modelling a very hard task in this domain.1 Poker is a popular card game in which players bet on the value of the card combination in their possession. The winner is the one who holds the highest-valued hand according to an established hand-rankings hierarchy, or otherwise the player who remains "in the hand" after all others have folded. Texas Hold'em is the most popular poker game. It is a community card game where each player may use any combination of the five community cards and the player's own two hidden cards to make a poker hand. This characteristic makes it a very good game for strategic analysis. The main goal of the project is to prove that a poker agent that considers the opponents' behaviour achieves better results against players that use typical poker-playing strategies than an agent that doesn't, even when playing the same global betting strategy.
1 FEUP – Faculty of Engineering of the University of Porto, Portugal, LIACC – Artificial Intelligence and Computer Science Lab., Portugal, email: felixdinis@gmail.com, lpreis@fe.up.pt.

2 RELATED WORK
This project is based on previous betting strategies developed at the University of Alberta [1,2,3,4]. They are divided into betting strategy
before the flop and after the flop [4]. There are 1326 possible hands prior to the flop. The value of one of these hands is called an income rate and is based on an off-line computation that consists of playing several million games where all players call the first bet [5,6]. The basic betting strategy after the flop is based on computing the hand strength (HS), positive potential (PPot), negative potential (NPot), and effective hand strength (EHS) of the agent's hand relative to the board. EHS is a measure of how well the agent's hand stands in relation to the remaining active opponents in the game. The hand strength (HS) is the probability that a given hand is better than that of an active opponent. Suppose an opponent is equally likely to have any possible two hole-card combination. Then it is possible to calculate the hand strength as:

HandStrength(ourcards, boardcards) {
    ahead = tied = behind = 0
    ourrank = Rank(ourcards, boardcards)
    for each case(oppcards) {
        opprank = Rank(oppcards, boardcards)
        if (ourrank > opprank) ahead += 1
        else if (ourrank == opprank) tied += 1
        else behind += 1
    }
    handstrength = (ahead + tied/2) / (ahead + tied + behind)
    return (handstrength)
}
After the flop, there are still two more board cards to be revealed and it is essential to determine their potential impact. The positive potential (PPot) is the chance that a hand that is not currently the best improves to win at the showdown. The negative potential (NPot) is the chance that a currently leading hand ends up losing. PPot and NPot are calculated by enumerating over all possible hole cards for the opponent, as in the hand strength calculation, and also over all possible board cards. The effective hand strength (EHS) combines hand strength and potential to give a single measure of the relative strength of a hand against an active opponent. A simple formula for computing the probability of winning at the showdown is:

Pr(win) = HS × (1 − NPot) + (1 − HS) × PPot

Since the interest is in the probability that the hand is either currently the best, or will improve to become the best, one possible formula for EHS sets NPot = 0, giving:

EHS = HS + (1 − HS) × PPot
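The two formulas transcribe directly into code (input validation and multi-opponent adjustments are omitted here):

    def win_probability(hs, ppot, npot):
        # Pr(win) = HS * (1 - NPot) + (1 - HS) * PPot
        return hs * (1 - npot) + (1 - hs) * ppot

    def effective_hand_strength(hs, ppot):
        # EHS with NPot = 0: the hand is either best now or improves to best.
        return hs + (1 - hs) * ppot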
3 OPPONENT MODELLING
No poker strategy is complete without a good opponent modelling system [7]. A strong poker player must develop an adaptive model of each opponent, to identify potential weaknesses. In poker, distinct opponents make different kinds of errors that may be exploited [4]. The intelligent agents developed in this project observe the moves of the other players at the table. There are many possible approaches to opponent modelling [2,8,9], but in this work the observation model is based on basic observation of the starting moves of the players, so that a fast, online estimate of their starting hands in future rounds can be created.
Players can generally be classified into four models that depend on two parameters: loose/tight and passive/aggressive. Knowing the types of hole cards various players tend to play, and in what position, is probably the starting point of opponent modelling. Players are classified as loose or tight according to the percentage of hands that they play. These two concepts are obtained by analysing the percentage of the time a player puts money into a pot to see a flop in Hold'em, the VP$IP (voluntarily put money in the pot). Players are also classified as passive or aggressive. These concepts are obtained by analysing the Aggression Factor (AF), which describes the player's nature.
4 INTELLIGENT AGENTS
Based on the player classification developed, 8 intelligent agents were created, two for each player style: LA - Loose Aggressive (Maniac and Gambler); LP - Loose Passive (Fish and Calling Station); TA - Tight Aggressive (Fox and Ace); TP - Tight Passive (Rock and Weak Tight). A general observer agent was also created, capable of keeping the information of every move made by the opponents and calculating playing information such as the VP$IP and AF of each opponent at every moment of the game. The opponents are classified into 4 types of players: loose if VP$IP is above 28%, tight otherwise; aggressive if AF is above 1, passive otherwise. After player classification, the agent can consider a different range of possible hands for different opponents. A general consideration is that tight players have a smaller range of possible hands than loose players. In order to pass this information to the Hand Strength calculation, a parameter called "sklansky" is determined for each player. This parameter represents the lowest value of a hand that belongs to the most probable range of hands that the player plays with that specific move (call or raise). Since the opponent's actual hand would otherwise often be wrongly ignored, the refined Effective Hand Strength calculation given by this technique should compensate for this. Hand Strength and Hand Potential can now be calculated with a better approach: they are calculated considering only the hands with a rank better than the "sklansky" parameter.
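The style classification itself reduces to two threshold tests using the values stated above; the counter bookkeeping behind vpip and af is assumed, not spelled out in the paper:

    def classify(vpip, af):
        # vpip: fraction of hands where the player voluntarily put money in
        # the pot pre-flop; af: aggression factor.
        style = "L" if vpip > 0.28 else "T"  # loose vs. tight
        style += "A" if af > 1.0 else "P"    # aggressive vs. passive
        return style                         # "LA", "LP", "TA" or "TP"

    classify(0.35, 1.4)  # -> "LA" (e.g. the Maniac/Gambler profile)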
5 RESULTS
In order to obtain results, several simulations were made with the agents created. In each simulation, 8 normal agents and 1 observer were used at the table, with the intention of giving the Observer Agent the possibility to play at a table with all the different kinds of players: LA in the first round of simulations, LP in the second, TA in the third and TP in the final round of simulations. The hand selection in the pre-flop of the Observer was equal to that of the type of agent modelled, using the opponent-modelling strategy to change the hand strength potential according to the opponents. Each of the simulations performed was repeated 3 times and ended when one of the two agents lost all of its bankroll, or after 2000 games. Figure 1 shows the bankroll variation of the four observer agents compared with the corresponding non-observer agents. In the 12 complete experiments performed (more than 10,000 games in total), the Observer achieved better results than the non-observer agent that uses the same hand selection in the pre-flop. The most conclusive results are with passive agents: the Observer, besides always having a big advantage over the non-observer, also achieves very good results, reaching a good level of bankroll. With aggressive agents, the simulations seem somewhat inconclusive due to big variations in bankroll that sometimes cause the game to end too soon for an agent. Still, we can conclude that opponent modelling could help these kinds of agents to stay in the game for a long time.

Figure 1: Bankroll of LA (top-left), LP (top-right), TA (bottom-left) and TP (bottom-right) observer agents (dark blue) compared with corresponding non-observer agents (magenta). [Four plots of bankroll vs. games played.]
6 CONCLUSIONS AND FUTURE WORK
From the results achieved, it is possible to verify that the Observer agent achieves better results than a non-observer agent, even when the hand-selection strategy is not very good. This proves that even with simple opponent-modelling strategies it is possible to achieve good results. However, when playing normal poker, due to the reduced number of games and the incomplete information gathered, only simple opponent models can be created online, and thus the proposed approach is very useful. At the end of this project, we have a good, stable simulator to test future work and an Observer Agent capable of playing poker at an acceptable level, improving the capabilities of the original agent and prepared to be extended with new functionalities. Future work may explore topics like learning to play depending on the position at the table, and bluffing. Regarding opponent modelling in Texas Hold'em, future work may include: considering more than 4 types of players; analysing other player-style variables; and retrieving information from the cards shown at showdown.
REFERENCES
[1] D. Billings, D. Papp, J. Schaeffer, and D. Szafron. Opponent modeling in poker. In American Association of Artificial Intelligence National Conference, AAAI'98, pages 493-499, 1998.
[2] A. Davidson, D. Billings, J. Schaeffer, and D. Szafron. Improved opponent modeling in poker. In International Conference on Artificial Intelligence, ICAI'00, pages 1467-1473, 2000.
[3] UA GAMES Group. The University of Alberta GAMES Group, http://www.cs.ualberta.ca/~games [consulted in March 2008].
[4] D. Billings, A. Davidson, J. Schaeffer, and D. Szafron. The challenge of poker. Artificial Intelligence, Vol. 134(1-2), pages 201-240, January 2002.
[5] D. Papp. Dealing with imperfect information in poker. Master's thesis, Department of Computing Science, University of Alberta, 1998.
[6] L. Peña. Probabilities and simulations in poker. Master's thesis, Department of Computing Science, University of Alberta, 1999.
[7] F. Southey, M. Bowling, B. Larson, C. Piccione, N. Burch, D. Billings, and C. Rayner. Bayes' bluff: Opponent modelling in poker. In 21st Conference on Uncertainty in Artificial Intelligence, UAI'05, pages 550-558, July 2005.
[8] A. Davidson. Opponent modeling in poker. Master's thesis, Department of Computing Science, University of Alberta, 2002.
[9] D. Carmel and S. Markovitch. Incorporating opponent models into adversary search. In American Association of Artificial Intelligence National Conference, AAAI'96, pages 120-125, 1996.
8. Constraints and Search
LRTA* Works Much Better with Pessimistic Heuristics Aleksander Sadikov and Ivan Bratko1
Abstract. Recently we showed that under very reasonable conditions, incomplete, real-time search methods like RTA* work better with pessimistic heuristic functions than with optimistic, admissible heuristic functions of equal quality. The use of pessimistic heuristic functions results in a higher percentage of correct decisions and in shorter solution lengths. We extend this result to learning RTA* (LRTA*) and demonstrate that the use of pessimistic instead of optimistic (or mixed) heuristic functions of equal quality results in a much faster learning process at the cost of only marginally worse quality of converged solutions.

1 Artificial Intelligence Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, 1000 Ljubljana, Slovenia, email: {aleksander.sadikov;ivan.bratko}@fri.uni-lj.si

1 INTRODUCTION

Recently, we proved [4] that under very reasonable conditions, incomplete, real-time search methods like RTA* [3] commit fewer decision errors and produce better solutions when used with pessimistic heuristics instead of optimistic ones of equal quality. Since LRTA* [3] is basically a repetitive running of the RTA* search with a slightly modified update rule to enable the agent to learn from previous runs, we decided to test how pessimistic heuristics behave in the LRTA* setting, thus logically extending our results in [4]. In this paper we experimentally demonstrate that using pessimistic heuristics with LRTA* dramatically speeds up the convergence process at the cost of just marginally worse converged solutions.

2 EXPERIMENTAL DESIGN

We have performed our experiments on classical testbeds for single-agent search methods, the Eight and Fifteen puzzles used in experiments by many authors. We conducted two series of experiments. The first series used artificially constructed heuristic functions for the 8-puzzle. These artificial heuristics enabled good experimental control over the properties of the heuristics: optimistic vs. pessimistic vs. mixed (neither optimistic nor pessimistic), all of comparable quality. The construction of these heuristics is the same as in our preceding paper and is described in detail in [4]. In the second series we used a "naturally" constructed pessimistic heuristic function, whose construction is based on problem decomposition; we will therefore refer to it as the "decomposition heuristic". The decomposition heuristic is by construction guaranteed to be pessimistic and applies to sliding-tile puzzles of any size. This heuristic was used on the 15-puzzle for comparison with the performance of the Manhattan distance heuristic and for direct comparison with other real-time search algorithms. It is based on the decomposition of solving an N × N puzzle into a partial solution of this puzzle plus the solving of an (N − 1) × (N − 1) puzzle. Accordingly, the heuristic upper bound on the solution length is computed as the cost of solving the left-most column and the top-most row of the N × N puzzle, plus (recursively) the heuristic estimate of the cost of solving the remaining (N − 1) × (N − 1) puzzle. The complete details of the realization of this idea are somewhat involved and, due to space limitations, cannot be presented here.

Figure 1. The comparison of convergence speed between optimistic (solid line), mixed 50:50 (dashed line), and pessimistic (dash-dotted line) heuristics. [Plot: number of moves to convergence (0-30,000) vs. search depth (5-30).]
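For reference, a generic depth-1 LRTA* sketch with the heuristic supplied as a mutable table, so that optimistic or pessimistic initial values can be plugged in; this illustrates the algorithm under study and is not the authors' experimental code.

    def lrta_star_trial(start, goal, neighbors, cost, h):
        # One trial of LRTA* with one-ply lookahead; h maps states to current
        # heuristic values and is updated (learned) as the agent moves.
        s, moves = start, 0
        while s != goal:
            # Lookahead: pick the most promising successor.
            best = min(neighbors(s), key=lambda t: cost(s, t) + h.get(t, 0))
            # Learning step: raise h(s) to the backed-up estimate.
            h[s] = max(h.get(s, 0), cost(s, best) + h.get(best, 0))
            s, moves = best, moves + 1
        return moves

    def moves_to_convergence(start, goal, neighbors, cost, h):
        # Repeat trials until no heuristic value changes between trials.
        total = 0
        while True:
            before = dict(h)
            total += lrta_star_trial(start, goal, neighbors, cost, h)
            if h == before:
                return total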
3
EXPERIMENTAL RESULTS
We were interested in the following characteristics of the LRTA* search: the speed of the convergence process and the total search effort needed, and the converged solution quality, all depending on the heuristic used. We compared these characteristics between various heuristics and between various search algorithms.
3.1
Artificial Heuristics
When measuring the speed of convergence we varied the nature and quality of heuristics and the depth of lookahead. Results of the experiments with artificial heuristics of various nature and quality are presented in Figure 1. For a given quality of heuristics and a given depth of lookahead, we measured the average speed of convergence on a set of 1,000 randomly chosen puzzles of various levels of difficulty — this way of testing is quite common and was used for example in [3]. The x-axis represents the depth of lookahead, and the y-axis represents the time needed for convergence to take place (measured
by the total number of moves performed by the underlying RTA* search in all trials needed to complete the solving of one puzzle). Figure 1 shows a very small (but representative) subset of the results. The chart shows results obtained with heuristics of quality similar to that of the Manhattan heuristic (in terms of root mean squared error, σ = 2.5). The three curves on the chart relate to the three types of heuristics used: the solid line represents the optimistic heuristic, the dashed line the mixed heuristic (50% optimistic and 50% pessimistic), and the bold dash-dotted line the pessimistic heuristic. It is obvious from the chart that the pessimistic heuristic causes LRTA* to converge much faster than the optimistic one. The mixed heuristic is not somewhere in the middle, as might be expected, but much closer to the optimistic one. Further experiments with mixed heuristics confirmed that they behave similarly to optimistic ones (results not shown due to lack of space).

As we have seen, pessimistic heuristics cause LRTA* to converge much faster than optimistic ones. The relevant question is how much we sacrifice in terms of the quality of converged solutions for this speed-up; if the solutions thus obtained were worthless, there would be no benefit in the speed-up. The results for artificial heuristics on the 8-puzzle are as follows. For σ = 2.5, in the worst case, with one-ply lookahead, on average over 1,000 test puzzles we lose about a single move, or in other words about 5% (ε ≈ 0.05). With deeper lookahead this suboptimality decreases, and by search depth 2 or 3 it has already halved. Heuristics of worse quality benefit even more from increased lookahead: by search depth 2 or 3 the suboptimality of their converged solutions more than halves. The results for the other qualities of heuristics tested are very similar to the reported ones.

3.2 Decomposition Heuristic

In this section we experimented with 15-puzzles, comparing the popular optimistic Manhattan distance heuristic with our decomposition-based pessimistic heuristic. We also compared the performance of LRTA* with two other related real-time search algorithms, FALCONS [1] and ε-LRTA* [5]. However, the main point of this comparison is the study of optimistic vs. pessimistic heuristics, which constitute the main difference between the compared variants of LRTA*. Table 1 shows experimental results averaged over the selected 100 15-puzzles for which Korf [2] gave the costs of optimal solutions. For each entry in the comparison, the table gives the heuristic used, the average total number of moves to convergence per puzzle, the average number of trials to convergence per puzzle, the average cost of the first solution obtained and of the converged solution, the average degradation factor (the ratio between the average cost of the converged solution and the average cost of the optimal solution), and the average CPU time. CPU times are important for the comparison because different heuristic functions take different CPU times to compute (the decomposition heuristic being more expensive than Manhattan distance). All the experiments were run on the same platform (Python interpreter on a 2.6 GHz PC). The average optimal solution cost over the 100 puzzles was 53.05. Experimenting with the greater lookahead of five with the decomposition heuristic is justified by the fact that it is affordable due to the convergence efficiency of the pessimistic heuristic.

Table 1. Experimental results for the 15-puzzle. All statistics are per puzzle and averaged over the whole set of 100 optimally solved puzzles.

Algorithm               Heuristic      #moves       #trials  First sol.  Conv. sol.  Deg. factor  CPU time (s)
FALCONS                 Manhattan      no convergence on any instance within 4 × 10^7 states [5]
LRTA* (d = 1)           Manhattan      no convergence on any instance within 4 × 10^7 states [5]
ε-LRTA* (d = 1, ε = 2)  Manhattan      2,391,847.4  1311.07  6564.59     76.96       1.45         1420.01
LRTA* (d = 1)           Decomposition  2,612.7      21.58    114.93      93.55       1.76         6.02
LRTA* (d = 5)           Decomposition  1,922.8      17.17    107.93      83.23       1.57         88.04

4 DISCUSSION

The obvious point of the obtained results is that, used with LRTA*, the pessimistic heuristic offers orders of magnitude better search efficiency (in terms of CPU time) than the optimistic and mixed heuristics, at a relatively low cost in solution quality. The speed efficiency of the pessimistic heuristic in terms of the number of moves is relatively even greater. It is important to note that these performance results are not due to the quality of the heuristics used here. The average value of the (optimistic) Manhattan distance evaluation over the 100 puzzles is 69% of the true solution cost, whereas the average value of the (pessimistic) decomposition heuristic evaluation is 250% of the true cost. Of even greater interest is the discrimination power of the two heuristics in deciding which of two given 15-puzzles is easier (has the shorter optimal solution). Manhattan distance decides correctly in 74.2% of all ½ × 100 × 99 possible pairs of Korf's 100 15-puzzles, whereas the decomposition heuristic gives the correct decision in only 56.9% of these pairs. It should be admitted that increasing the lookahead depth of LRTA* with the decomposition heuristic beyond five does not significantly improve the quality of converged solutions; it appears to reach a plateau. On the other hand, it is possible to improve the quality of solutions by decreasing the parameter ε below 2, although again at the expense of convergence time.
ACKNOWLEDGEMENTS This work was partly funded by ARRS, Slovenian Research Agency.
REFERENCES
[1] David Furcy and Sven Koenig, 'Speeding up the convergence of real-time search', in Proceedings of the Seventeenth National Conference on Artificial Intelligence, pp. 891-897, (2000).
[2] Richard E. Korf, 'Depth-first iterative deepening: An optimal admissible tree search', Artificial Intelligence, 27(1), 97-109, (1985).
[3] Richard E. Korf, 'Real-time heuristic search', Artificial Intelligence, 42(2-3), 189-211, (1990).
[4] Aleksander Sadikov and Ivan Bratko, 'Pessimistic heuristics beat optimistic ones in real-time search', in Proceedings of the Seventeenth European Conference on Artificial Intelligence, ed., Gerhard Brewka, pp. 148-152, Riva di Garda, Italy, (August 2006).
[5] Masashi Shimbo and Toru Ishida, 'Controlling the learning process of real-time heuristic search', Artificial Intelligence, 146(1), 1-41, (2003).
Thinking Too Much: Pathology in Pathfinding Mitja Luštrek 1 and Vadim Bulitko 2

1 INTRODUCTION
Incomplete single-agent search methods are often better suited to real-time pathfinding tasks than complete methods (such as A*). Incomplete methods conduct a limited-depth lookahead search, i.e., expand a part of the space centered on the agent, and heuristically evaluate the distances from the frontier of the expanded space to the goal. Actions selected this way are not necessarily optimal, but it is generally believed that deeper lookahead increases the quality of decisions. However, in two-player games, where similar methods are used, it has long been known that this is not always the case [7, 1]. This phenomenon has been termed minimax pathology. More recently pathological behavior was discovered in single-agent search as well [3]. Some attempts to explain it have been made [5, 6], but the pathology in single-agent search is largely still not understood. In this paper we investigate lookahead pathology in real-time pathfinding on maps from commercial computer games. First, we present an empirical study showing a degree of pathology in over 90% of the problems considered. Second, we give four explanations for such wide-spread pathological behavior.
2
THE PATHOLOGY OBSERVED
We study the problem of an agent trying to find a path from a start to a goal state in a two-dimensional grid world. The agent plans its path using the Learning Real-Time Search (LRTS) algorithm [2]. LRTS conducts a lookahead search centered on the current state and generates all the states up to d moves away. It heuristically estimates the distances from the frontier states to the goal state and moves to the most promising frontier state. Upon reaching it, it conducts a new search. The initial heuristic is the shortest distance assuming an empty map. After each search, the heuristic of the current state is updated to the estimated distance through the most promising frontier state, which constitutes the process of learning. We conducted two types of experiments: on-policy and off-policy. In the first type the agent follows a path from the start state to the goal state as directed by the LRTS algorithm. In the second type the agent appears in a (randomly selected) state and selects the first move towards the goal state. If the move does not lie on the shortest path to the goal state, it is erroneous. The error e(S_d) is the fraction of erroneous moves taken in the set of states S_d visited using lookahead depth d. The degree of error pathology in the sequence of sets S_1, ..., S_dmax is k iff e(S_{d+1}) > e(S_d) for k different d < dmax. We generated 1,000 problems on maps from a commercial role-playing game. The lookahead depth ranged from 1 to 10 = dmax. First we conducted the basic on-policy experiment: the agent solved the problems, we measured the degree of error pathology for each problem and counted the number of problems with each of the possible degrees. The on-policy row in Table 1 shows that over 90% of the problems are pathological.

1 Jožef Stefan Institute, Department of Intelligent Systems, Jamova cesta 39, 1000 Ljubljana, Slovenia, email: mitja.lustrek@ijs.si
2 University of Alberta, Department of Computing Science, Edmonton, Alberta T6G 2E8, Canada, email: bulitko@ualberta.ca

Table 1. Pathology in the basic on- and off-policy experiments.

Degree                         0     1     2     3    ≥4
Pat. problems on-policy [%]   6.3  13.1  24.8  29.0  26.7
Pat. problems off-policy [%] 83.1  14.9   2.0   0.0   0.0
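Given the definition above, the degree of error pathology is a one-line computation over the per-depth error rates (this helper is ours, purely illustrative):

    def pathology_degree(errors):
        # errors[d-1] holds e(S_d) for lookahead depths d = 1 .. dmax; the
        # degree counts the depths d < dmax where deeper lookahead has a
        # strictly larger error.
        return sum(1 for e_d, e_next in zip(errors, errors[1:]) if e_next > e_d)

    pathology_degree([0.30, 0.25, 0.27, 0.22, 0.24])  # -> 2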
The first possible explanation of the on-policy results in Table 1 is that the maps contain a lot of states where deeper lookaheads lead to suboptimal decisions, whereas shallower ones do not. If this were the case, the basic off-policy experiment, where the pathology is measured in randomly selected states, should yield comparable pathology. However, the off-policy row in Table 1 shows much less pathology. In the rest of the paper, we will investigate the reasons for this.
3
EXPLANATIONS OF THE PATHOLOGY
The first explanation is that the LRTS algorithm's behavioral policy steers the search to pathological states. This explanation was verified by computing off-policy pathology from the error in the states visited during the basic on-policy experiment instead of randomly selected ones. The results in Table 2 do show more pathology compared to the basic off-policy experiment in Table 1 (23.2% vs. 16.9%), but they are still far from the basic on-policy experiment (23.2% vs. 93.7%).

Table 2. Pathology measured off-policy in the states visited on-policy.

Degree                      0     1    2    3   ≥4
Pathological problems [%] 76.8  13.8  5.7  2.3  1.4
The basic on-policy experiment involves learning, but no learning takes place in the basic off-policy experiment. It is harder to find the path to the goal when the lookahead depth is small. Consequently the agent backtracks more, encountering updated states more often than when the lookahead depth is large. This leads us to the second explanation: smaller lookahead depths benefit more from the updates to the heuristic. This can be expected to make their decisions better than the mere depth would suggest and thus closer to larger depths. If they are closer to larger depths, cases where a deeper lookahead actually performs worse than a shallower one should be more common. The first test of the second explanation is an on-policy experiment where the agent is directed by the LRTS algorithm that uses learning (to prevent infinite loops), but the error is measured using only the initial, non-updated heuristic. The results in Table 3 suggest that learning is indeed responsible for the pathology, because the pathology in the new experiment is markedly smaller than in the basic on-policy experiment shown in Table 1: 70.4% vs. 93.7%.

Table 3. Pathology on-policy with error measured without learning.

Degree                      0     1     2     3    ≥4
Pathological problems [%] 29.6  20.4  19.3  18.2  12.5
The second test is to measure the volume of heuristic updates, which reflects the benefit of learning. This volume is the sum of the differences between the updated and the initial heuristics in the states generated during search. Figure 1 shows the results for the basic on-policy experiment and for the basic off-policy experiment (where no learning takes place). We see that in the on-policy experiment the volume of updates decreases with lookahead depth (unlike in the off-policy experiment), which confirms our explanation.
Figure 1. The volume of heuristic updates encountered per move with respect to the lookahead depth in the basic on- and off-policy experiments.
The results in Table 3 still show more pathology than in the basic off-policy experiment, so there must be a third explanation. Let αoff(d) and αon(d) be the average number of states generated per move in the basic off-policy and on-policy experiments respectively. In off-policy experiments a search is performed every move, whereas in on-policy experiments a search is performed every d moves. Therefore αon(d) = αoff(d)/d. This means that in the basic on-policy experiment fewer states are generated at larger lookahead depths than in the basic off-policy experiment. Consequently the depths in the basic on-policy experiment are closer to each other with respect to the number of states generated. Since the number of states generated can be expected to correspond to the quality of decisions, cases where a deeper lookahead actually performs worse than a shallower one should be more common. The first test of the third explanation is an on-policy experiment where a search is performed every move instead of every d moves. The results in Table 4 confirm the explanation. The percentage of pathological problems is considerably smaller than in the basic on-policy experiment shown in Table 1: 34.7% vs. 93.7%. Since LRTS that searches every move is very similar to LRTA* [4], LRTA* can also be expected to be less pathological.
Figure 2. The number of states generated per move with respect to the lookahead depth in different experiments.
The second test is to measure the number of states generated per move. Figure 2 shows that in the basic off-policy experiment and in the on-policy experiment when searching every move, the number increases more quickly with lookahead depth than in the basic on-policy experiment. The depths are thus less similar than in the basic on-policy experiment, which again confirms our explanation.

Experiments with the eight-puzzle [8] showed that pessimistic heuristics can prevent the pathology. This inspired the fourth explanation of the pathology. During lookahead search, states with low heuristic values are favored. If the heuristic values are optimistic (as in our case), the lowest heuristic value is likely to be particularly far from the true value. With deeper lookahead, more states are considered and the chances of selecting a state with an especially inaccurate heuristic increase. If the heuristic values are pessimistic, the opposite is true: the states with accurate heuristic values are favored, and the more states are considered, the more likely a state with a very accurate heuristic value will be selected.

We verified the fourth explanation with an on-policy experiment with pessimistic heuristic values. If the regular heuristic value of a state s is h(s) = h*(s) − e, where e is the heuristic error, then the pessimistic heuristic value is hp(s) = h*(s) + e. Such a heuristic is unrealistic, but it should give us an idea of what to expect from realistic pessimistic heuristics, should we be able to design them. The results in Table 5 do show a decrease in pathology compared to the basic on-policy experiment shown in Table 1: 86.1% vs. 93.7%.

Table 5. Pathology on-policy with pessimistic heuristic.

Degree              0    1    2     3     4    ≥5
Pat. problems [%]  13.9  4.1  8.3  22.9  27.7  23.1

4 CONCLUSION
The first two explanations of the pathology do not seem to offer practical ways of avoiding the pathology. When investigating the third explanation, we learned that searching every move the way LRTA* does brings the pathology from 93.7% to 34.7%. It also generates up to 2.6 times shorter solutions. However, it increases the number of states generated per move roughly by a factor of d. This means that the number of states generated per problem when searching every move is up to 4.5 times larger (at d = 10) than with the regular LRTS. A promising direction of research therefore seems to be a method for dynamically selecting the point at which a new search is needed. Finally, the fourth explanation suggests that pessimistic heuristics may be less prone to the pathology. In addition, the solutions found using the pessimistic heuristic were nearly optimal (3.8–7.2 times shorter than with the regular heuristic), so pessimistic heuristics deserve further attention.
Table 4. Pathology on-policy when searching every move.

Degree                       0     1    2    3   ≥4
Pathological problems [%]  65.3  14.6  8.6  7.1  4.4
REFERENCES
[1] Donald F. Beal, 'An analysis of minimax', in Advances in Computer Chess, volume 2, pp. 103–109, (1980).
[2] Vadim Bulitko and Greg Lee, 'Learning in real time search: A unifying framework', Journal of Artificial Intelligence Research, 25, 119–157, (2006).
[3] Vadim Bulitko, Lihong Li, Russell Greiner, and Ilya Levner, 'Lookahead pathologies for single agent search', in Proceedings of IJCAI, poster section, pp. 1531–1533, Acapulco, Mexico, (2003).
[4] Richard E. Korf, 'Real-time heuristic search', Artificial Intelligence, 42(2–3), 189–211, (1990).
[5] Mitja Luštrek, 'Pathology in single-agent search', in Proceedings of Information Society Conference, pp. 345–348, Ljubljana, Slovenia, (2005).
[6] Mitja Luštrek and Vadim Bulitko, 'Lookahead pathology in real-time path-finding', in Proceedings of AAAI, Learning for Search Workshop, pp. 108–114, Boston, USA, (2006).
[7] Dana S. Nau, Quality of Decision versus Depth of Search on Game Trees, Ph.D. dissertation, Duke University, 1979.
[8] Aleksander Sadikov and Ivan Bratko, 'Pessimistic heuristics beat optimistic ones in real-time search', in Proceedings of ECAI, pp. 148–152, Riva del Garda, Italy, (2006).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-901
Dynamic Backtracking for Distributed Constraint Optimization1
Redouane Ezzahir2 and Christian Bessiere3 and Imade Benelallam2 and El Houssine Bouyakhf2 and Mustapha Belaissaoui4

Abstract. We propose a new algorithm for solving Distributed Constraint Optimization Problems (DCOPs). Our algorithm, called DyBop, is based on branch and bound search with dynamic ordering of agents. A distinctive feature of this algorithm is that it uses the concept of valued nogood. Combining lower bounds on inferred valued nogoods computed cooperatively helps to dynamically prune unfeasible sub-problems and speeds up the search. DyBop requires polynomial space at each agent. Experiments show that DyBop has significantly better performance than other DCOP algorithms.
1 INTRODUCTION
The Distributed Constraint Optimization Problem (DCOP) is a powerful formalism to model a wide range of applications in multi-agent coordination. The major motivation for research on DCOPs is that they are an elegant model for many everyday combinatorial problems that are distributed by nature, such as distributed resource allocation, distributed scheduling, or sensor networks. In this paper, we present a new distributed algorithm, called DyBop: Dynamic Backtracking search for DCOPs. It is based on branch and bound search and uses the concept of valued nogood introduced in [2, 7]. DyBop is guaranteed to terminate and requires polynomial space. The agents assign their variables sequentially and compute asynchronously a lower bound on the cost of the current context. Whenever an agent is successfully assigned, it sends copies of the current context to all unassigned agents concurrently. These unassigned agents send back the cost of their cheapest assignment w.r.t. this context. If the aggregation of the costs received by the current agent is greater than the current upper bound, the current agent changes its value. If no value remains available, a valued nogood is sent from the current agent to the lowest agent involved in the nogood. Experimental results on random Max-DisCSPs and a real structured problem (Distributed Meeting Scheduling) show that DyBop outperforms AFB-BJ [5], NCBB [1], and ABFS [4].
2 BACKGROUND
The Distributed Constraint Optimization Problem is a tuple (A, X, D, C, F), where A is a set of agents {A1, A2, ..., Ak}, X is a set of variables {X1, X2, ..., Xn}, and D = {D1, D2, ..., Dn} is a set of domains, where each Di in D is a finite set containing the
1 This work has been supported by the Maroc-France PAI project no. MA/05/127 and the ANR project ANR-06-BLAN-0383-02.
2 LIMIARF/FSR, Morocco, email: ezzahir@lirmm.fr, bouyakhf@fsr.ac.ma, imade.benelallam@ieee.org
3 LIRMM (CNRS/U. Montpellier), France, email: bessiere@lirmm.fr
4 Université Hassan I, Morocco, email: m.belaissaoui@encg-settat.ma
values to which its associated variable Xi may be assigned. Only the agent that owns a variable has control of its value and knowledge of its domain. C = {cij : Di × Dj → R+, with i, j ∈ 1...n, i ≠ j} is a set of constraints, represented by a cost function cij for each pair of variables Xi and Xj. The goal is to find a global assignment I of values to variables in X such that the objective function F is minimized. The valued nogood [2] is an extension of the classical nogood for Valued CSPs. Recently, Silaghi et al. have introduced the inference power of valued nogoods in DCOP solving [7].

Definition 1 (Valued Nogood [2]) A valued nogood has the form (I, v, C). It specifies that, given a set of constraints C, a global assignment extending the partial assignment I = {(X1, v1), . . . , (Xk, vk)} has cost at least v. C is a set of reference constraints called justification in [7].
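Definition 1 can be read operationally as a small data structure plus an aggregation operator. The following Python sketch is our own illustrative reading (names and the compatibility test are our assumptions, not the authors' code); it mirrors the sum-inference aggregation and the pruning test described in Section 3:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ValuedNogood:
        assignment: tuple   # partial assignment I as ((variable, value), ...)
        cost: int           # lower bound v on the cost of any extension of I
        just: frozenset     # reference constraints C (the justification)

    def sum_inference(n1, n2):
        # Combine two valued nogoods whose assignments agree on shared
        # variables and whose justifications are disjoint: the partial
        # assignments merge and the lower bounds add up.
        a1, a2 = dict(n1.assignment), dict(n2.assignment)
        assert all(a1.get(x, v) == v for x, v in a2.items()), "incompatible"
        assert not (n1.just & n2.just), "justifications must be disjoint"
        return ValuedNogood(tuple(sorted({**a1, **a2}.items())),
                            n1.cost + n2.cost,
                            n1.just | n2.just)

    def should_prune(accumulated, upper_bound):
        # The assigning agent abandons its current value once the
        # accumulated lower bound can no longer improve on the best
        # known full assignment.
        return accumulated.cost >= upper_bound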
3 DYBOP
In DyBop, each agent stores a nogood per value. During search, each agent holds its view of the current state of search in a data structure called the current context CCTX.

Definition 2 (Context) For a partial assignment PA that we try to extend, we associate a current context CCTX of the form ⟨PA, N, CS⟩, where N = {NX1, . . . , NXk} is the set of all nogoods associated to the variable assignments in PA, and CS = {CSX1, . . . , CSXn} is a list of conflict sets. Each conflict set CSXi contains all agent IDs that are used to identify the assignments in any nogood stored by Xi.

In DyBop, only one agent performs an assignment on the current context CCTX at a time. Whenever an agent is successfully assigned, it sends copies of CCTX to all unassigned agents concurrently and awaits response messages. All unassigned agents compute asynchronously a valued nogood with a valuation equal to the lower bound of the cost of assigning a value to their variables, and send this nogood back to the agent which performed the assignment. The assigning agent accumulates these valued nogoods using sum-inference, an aggregation operator based on the objective function F. Once the valuation of the accumulated nogood exceeds that of the best known solution found so far, the agent prunes its current value. The accumulated nogood is stored as the explanation of the value removal. On the other hand, when the cost of the aggregation of all valued nogoods coming from unassigned agents is less than the cost of the best known full assignment, the agent sends the current context to the agent selected as next. So, the current context is propagated forward sequentially. Whenever the current agent cannot find a valid value, it
performs the min-resolution of its stored set of nogoods, and sends back the resulting nogood to the last assigned agent in this nogood. When an agent receives such a valued nogood due to a backtrack, before storing it the agent performs a partial reduction [2] of this nogood, by using the last stored nogood related to the current value and the nogood related to the current assignment CCTX. The communication among DyBop agents is performed by five types of messages.

CTX: A message that carries the current context CCTX.
FB CTX: A forward bounding message that is an exact copy of a CCTX. Every agent that assigns its variables on a CCTX creates an exact copy in the form of a FB CTX and sends it forward to all unassigned agents. An agent receiving an FB CTX message computes a valued nogood with a valuation equal to the lower bound on the cost increment caused by adding an assignment for its variables to the CCTX. This estimated nogood is sent back to the agent which sent the FB CTX message via an ESTIMATE message.
BACK: A message that is sent back when dynamic backtracking is performed. It carries a valued nogood that justifies the conflict and the current context CCTX. The receiver of this message is chosen as the last assigned agent in the carried nogood.
Figure 1. Results on random instances of Max-DisCSP with 10 agents.
Theorem 1 DyBop is correct and terminates.
Proof. During DyBop search, all operations on valued nogoods are logically sound. Thus, if DyBop terminates, the upper bound is optimal, so an optimal solution is found. Termination can be proved if we consider a simple version of DyBop where FB CTX messages are not used. This version is a complete algorithm because the nogoods produced by min-resolution are similar to classical nogoods rejecting an assignment. Therefore, it is sufficient to apply the method used by Ginsberg to show the termination of centralized dynamic backtracking. When we add FB CTX messages, the stored nogoods coming with these messages cannot break termination, because they follow the same inference principle used for nogoods coming from BACK messages. Thus, DyBop terminates. □
4 EXPERIMENTAL RESULTS
We considered two different problems for our experiments. The first was a random Max-DisCSP with 10 agents, each containing a single variable, in which all constraint costs are equal to 1 and the density of the constraint graph is 40%. The second was a real structured problem, the Distributed Meeting Scheduling problem (DMS). We have tested three DMS problem classes. Each DMS problem class is represented by the pair (m, p) = (#meetings, #participants), in which there are p agents with multiple variables (at most m variables each). There are 5 values in each domain, each of them representing a possible meeting start time. All experiments were performed on the DisChoco platform [3], in which agents are simulated by threads which communicate only through message passing. We evaluate the algorithms' performance by the mean number of non-obsolete messages (NO-MSGs) and of Equivalent Non-Concurrent Constraint Checks (ENCCCs) [1] over 10 instances. ENCCCs are a weighted sum of processing and communication time. For random problems, we simulate two scenarios of communication: fast communication (message delay cost = 0 CCCs) and slow communication (message delay cost = 1000 CCCs). Experimental results are shown in Fig. 1. We observe that for almost all parameter settings, DyBop is significantly better than both AFB-BJ and ABFS. For DMS problems, Table 1 presents the results in a slow communication system. It shows that DyBop does even better than on random problems. On instances (5, 7), with a cutoff set at 1,800 seconds,
DyBop provides optimal solutions for all instances against 70% for ABFS and AFB. We point out that Gershman et al. showed that AFB-BJ is faster than DPOP, which has been shown faster than NCBB [6].

Table 1. Results on DMS. #msg = #NO-MSGs and #cc = #ENCCCs.

            DyBop               ABFS                  AFB-BJ
(m, p)    #msg     #cc       #msg      #cc         #msg       #cc
(5, 5)      931    2,299      1,315     3,730       1,946     7,982
(5, 6)    1,676    3,797      3,365    12,450       5,403    34,188
(5, 7)    2,597    5,791       316K    1,513K      5,387K   80,345K
5 CONCLUSION
We have presented DyBop, a new algorithm for solving Distributed Constraint Optimization Problems (DCOPs). DyBop is based on branch and bound search with dynamic ordering of agents, and it uses the concept of valued nogood introduced in [2, 7]. Experiments show that the proposed approach of combining lower bounds on inferred valued nogoods computed cooperatively speeds up the search significantly with respect to existing techniques.
REFERENCES
[1] A. Chechetka and K. Sycara, 'No-commitment branch and bound search for distributed constraint optimization', in AAMAS, pp. 1427–1429, (2006).
[2] P. Dago and G. Verfaillie, 'Nogood recording for valued constraint satisfaction problems', in ICTAI, pp. 132–139, (1996).
[3] R. Ezzahir, C. Bessiere, M. Belaissaoui, and E.H. Bouyakhf, 'DisChoco: A platform for distributed constraint programming', in IJCAI'07 Workshop on DCR, pp. 16–21, (2007).
[4] R. Ezzahir, C. Bessiere, E. H. Bouyakhf, I. Benelallam, and M. Belaissaoui, 'Asynchronous breadth-first search DisCOP algorithm', in EUMAS'07, (2007).
[5] A. Gershman, A. Meisels, and R. Zivan, 'Asynchronous forward-bounding with backjumping', in IJCAI'07 Workshop on DCR, pp. 28–39, (2007).
[6] A. Gershman, R. Zivan, T. Grinshpoun, A. Grubshtein, and A. Meisels, 'Measuring distributed constraint optimization algorithms', in AAMAS'08 Workshop on DCR, (2008).
[7] M.C. Silaghi and M. Yokoo, 'Nogood based asynchronous distributed optimization (adopt-ng)', in AAMAS, pp. 1389–1396, (2006).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-903
Integrating Abduction and Constraint Optimization in Constraint Handling Rules
Marco Gavanelli and Marco Alberti and Evelina Lamma1

1 University of Ferrara, Italy, email: name.surname@unife.it

1 Abduction and CHR
Abductive Logic Programming (ALP) [10] is a set of languages supporting hypothetical reasoning; the corresponding proof-procedures feature a simple, sound implementation of negation by failure [6]. An ALP is a logic program KB with a distinguished set A of predicates, called abducibles, that do not have a definition, but whose truth value can be assumed. A set of implications called Integrity Constraints (IC) restricts the possible sets of hypotheses, in order to avoid unrealistic assumptions. Given a goal G, the aim is to find a set Δ ⊆ A such that KB ∪ Δ |= G and KB ∪ Δ |= IC.

ALP and Constraint Logic Programming (CLP) have been merged in works by various authors [11, 12, 5]. However, while almost all CLP languages provide algorithms for finding an optimal solution with respect to some objective function (and not just any solution), the issue has received little attention in ALP. We believe that adding optimisation meta-predicates to abductive proof-procedures would improve research and practical applications of abductive reasoning. However, abductive proof-procedures are often implemented as Prolog meta-interpreters, which makes the strong intertwining with CLP required to fully exploit optimisation meta-predicates clumsy. In line with previous research [1, 9, 4, 2], we implemented the SCIFF abductive proof-procedure [3] in Constraint Handling Rules (CHR) [7], which provides a strong integration between abduction and constraint solving/optimisation. In SCIFF, the abductive logic program can invoke optimisation meta-predicates, which can invoke abductive predicates, in a recursive way.

Previous implementations of abduction in CHR mapped abducibles into CHR constraints, and integrity constraints into CHR rules [1, 9, 4, 2]. In this way, the implementation is very efficient, but there are limitations on the language: only abducibles can occur in the condition of ICs. This limits the applicability of sound negation by failure to abducibles, while negative literals of other predicates inherit "the dubious semantics of Prolog" [4]. E.g., the following IC (where abducibles are in bold)

    a(X, Y), b(Y) → c(X) ∧ p(Y) ∨ q(X)    (1)

can be rewritten as a propagation CHR

    a(X, Y), b(Y) ==> c(X), p(Y) ; q(X)

because in the antecedent only abducibles occur, thus in the head of the propagation CHR there are only CHR constraints. Instead, the IC

    a(X, Y), p(Y) → r(X) ∧ q(Y) ∨ q(X)    (2)

cannot be represented in this way, because p/1 is not abducible. This means that it is not possible to deal with negation by failure in a sound way, since not(p(X)) should be rewritten as p(X) → false.

In SCIFF, an abducible a(X, Y) is represented as the CHR constraint abd(a(X, Y)). We do not map integrity constraints to CHR rules, but to other CHR constraints. IC (2) is mapped to the constraint ic([abd(a(X, Y)), p(Y)], [[r(X), q(Y)], [q(X)]]). The operational semantics (derived from the IFF [8]) is defined by a set of transitions [3]. The transitions are then easily implemented as CHR rules; for example, transition propagation (joined with case analysis) [8] propagates an abducible with an implication:

    abd(P), ic([P1|Rest], Head) ==>
        rename(ic([P1|Rest], Head), ic([RenP1|RenRest], RenHead)),
        reif_unify(RenP1, P, B),
        (B = 1, ic(Rest, Head) ; B = 0).

We first rename the variables (considering their quantification), and then apply reified unification [12]: a CHR constraint that imposes that either the two first arguments unify and B = 1, or that the two arguments do not unify and B = 0.

One of the features of the CHR implementation is that the abductive program written by the user is directly executed by the Prolog engine, and the resolvent of the proof-procedure coincides with the Prolog resolvent. This also means that every Prolog predicate can be invoked, and, in particular, we can invoke optimisation meta-predicates: in some cases, it is not enough to find one abductive solution, but the best solution with respect to some criteria is requested. CLP offers an answer to this practical need by optimisation meta-predicates (minimize and maximize), that select the best solution amongst those provided by a goal.
2 An example from Game Theory
N grim pirates plundered a treasure of M golden coins. They have to divide their treasure, and they want to have fun. Since they are bloodthirsty, they adopt rules in which blood might be shed:
1. The lowest pirate in grade proposes a full division: he decides how many coins are given to each pirate (including himself).
2. All the pirates vote: if the majority votes for the proposal, the money is shared as in the division. Otherwise, the proposer is killed, and the process restarts from step 1.

Knowing that all pirates are greedy and bloodthirsty (i.e., they mostly care about money, and in case of parity they like to see someone die), we have to propose a division.
This is clearly an optimisation problem, as pirates want to get as much money as possible; moreover, the proposer has to hypothesise how the other pirates will vote, in order to stay alive. The lowest in grade will abduce an atom bearing the information for each pirate; at the first proposal, there is a literal for each pirate:

    E(pirate(Grade, Vote, Coins, Alive), 1)    (3)

meaning that the proposer gives to the pirate with given Grade (1 being the highest) a number Coins of coins; we suppose his vote is expressed with a boolean Vote (an integer: 0 = false, 1 = true), and that at the end he will be alive if and only if the boolean Alive = 1. Moreover, we had better try to foresee what could possibly happen in the next protocol iterations, in the unlucky case our proposal does not get the majority. We suppose each proposal happens at a time step indicated by an integer (the last argument of Eq. 3).

Now we can see the rules of the protocol. Predicate npirates/1 defines the number of pirates. The N-th pirate makes the first proposal, the (N−1)-th has the second choice, and so on:

    turn(Grade, Turn) :- npirates(N), Turn = N + 1 − Grade.

Each pirate is alive if his turn of proposing has not come yet:

    E(pirate(Grade, Vote, Coins, Alive), T) ∧ turn(Grade, Turn) ∧ T < Turn → Alive = 1

After his proposal, a pirate is dead: he gets 0 coins and does not vote:

    E(pirate(Grade, Vote, Coins, Alive), T) ∧ turn(Grade, Turn) ∧ T > Turn → Alive = 0 ∧ Vote = 0 ∧ Coins = 0

Each pirate votes for his own proposal:

    E(pirate(Grade, Vote, Coins, Alive), T) ∧ turn(Grade, Turn) ∧ T = Turn → Vote = 1

If in the current proposal I suppose to get more money than in the next, I will vote for the current one. Otherwise, I will vote against: either I hope to get more money, or I hope to see the proposer die.

    E(pirate(Grade, Vote1, Coin1, Alive1), T1) ∧ T2 = T1 + 1 ∧ E(pirate(Grade, Vote2, Coin2, Alive2), T2) → (Coin1 > Coin2 ∧ Vote1 = 1) ∨ (Coin1 ≤ Coin2 ∧ Vote1 = 0)

If I suppose next iteration I will be dead, I will accept any proposal:

    E(pirate(Grade, Vote1, Coins1, 1), T1) ∧ T2 = T1 + 1 ∧ E(pirate(Grade, Vote2, Coins2, 0), T2) → Vote1 = 1

The predicate pirates(Lcoins, Lvotes, T) is the program entry point. Its arguments are the coins assignment (list Lcoins), the result of the voting (list Lvotes), and the iteration number T (initially 1). In the following code, CLP predicates appear underlined in the original paper. The abduce predicate abduces the atom (3) for each pirate.

    pirates([], [], T) :- npirates(N), T > N.
    pirates(Lcoins, Lvotes, T) :-
        npirates(N), ncoins(M), T ≤ N,
        % Define variables' domains
        length(Lcoins, N), domain(Lcoins, 0, M), sumlist(Lcoins, M),
        length(Lvotes, N), domain(Lvotes, 0, 1), sumlist(Lvotes, Nvotes),
        2*Nvotes > N − T + 1 ⇔ Win,     % one wins if he gets the majority
        % The pirate gets the coins only if he wins
        nth(T, Lcoins, CoinsPirate), GotCoins = Win*CoinsPirate,
        % The proposer will be alive only if he wins
        length(Lalive, N), nth(T, Lalive, Win),
        maximize( ( T1 is T+1,
                    pirates(_, _, T1),
                    abduce(Lcoins, Lvotes, Lalive, N, T),   % Abduce a division
                    labeling(Lcoins), labeling(Lvotes)
                  ), GotCoins).   % Maximise the number of obtained coins

The result for N = 4 pirates and M = 9 coins is the following:

    E(pirate(4,1,7,1),1)  E(pirate(4,0,0,0),2)  E(pirate(4,0,0,0),3)  E(pirate(4,0,0,0),4)
    E(pirate(3,0,0,1),1)  E(pirate(3,1,9,1),2)  E(pirate(3,0,0,0),3)  E(pirate(3,0,0,0),4)
    E(pirate(2,1,1,1),1)  E(pirate(2,1,0,1),2)  E(pirate(2,1,0,0),3)  E(pirate(2,0,0,0),4)
    E(pirate(1,1,1,1),1)  E(pirate(1,0,0,1),2)  E(pirate(1,0,9,1),3)  E(pirate(1,1,9,1),4)
Pirate 4 (first row of the table) takes 7 coins for himself, gives 1 coin each to pirates 1 and 2, and nothing to pirate 3. He is sure to get 3 votes: his own, plus those of pirates 1 and 2. How can he be so sure of surviving? Because if he dies (second row), pirate 3 gets all the money, while 1 and 2 get nothing, and nevertheless pirate 2 votes for the proposal! In fact, in iteration 3, pirate 2 is sure to die: whatever proposal he makes, pirate 1 will vote against, getting in the last iteration all the money, and making pirate 2 die. Besides the correct game theory result, this example shows remarkable features of SCIFF. First, a SCIFF program is a real CLP(FD) program. The user is not restricted to a subset of the available constraints, and, in particular, she can use global constraints (e.g., sumlist) in the knowledge base. Second, we have recursion through the optimisation meta-predicate maximize. SCIFF tightly integrates CLP(FD) and abduction, thanks to its CHR implementation. Finally, SCIFF is efficient: it took 49s to solve the above problem with N = 4 pirates on a Pentium M715, 1.5GHz, 512MB RAM computer, which is reasonable considering that the problem is at the fourth level of the polynomial hierarchy.
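The game-theoretic outcome can also be cross-checked independently of SCIFF by direct backward induction. The following Python sketch is our own illustration (not part of the paper's code); it encodes the strict-majority rule and the bloodthirsty tie-breaking (a vote is bought only with strictly more coins than the voter expects in the next iteration) and reproduces the division above:

    def divide(g, M):
        # Optimal division when pirates 1..g are alive and pirate g proposes.
        # Returns (coins, alive): dicts mapping grade -> coins / survival.
        if g == 1:
            return {1: M}, {1: True}
        coins, alive = divide(g - 1, M)        # outcome if pirate g dies
        need = g // 2 + 1                      # strict majority of g voters
        # Price of voter i: free if i is dead in the next iteration (he then
        # accepts any proposal), otherwise one coin more than his next share.
        price = sorted((0 if not alive[i] else coins[i] + 1, i)
                       for i in range(1, g))
        cheapest = price[:need - 1]            # the proposer's own vote is free
        total = sum(c for c, _ in cheapest)
        if total > M:                          # cannot buy a majority: g dies
            return {**coins, g: 0}, {**alive, g: False}
        division = {i: 0 for i in range(1, g + 1)}
        for c, i in cheapest:
            division[i] = c
        division[g] = M - total
        return division, {i: True for i in range(1, g + 1)}

    coins, _ = divide(4, 9)
    print(coins)   # {1: 1, 2: 1, 3: 0, 4: 7}: pirate 4 keeps 7 coins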
REFERENCES
[1] S. Abdennadher and H. Christiansen, 'An experimental CLP platform for integrity constraints and abduction', in FQAS 2000, pp. 141–152.
[2] M. Alberti, F. Chesani, D. Daolio, M. Gavanelli, E. Lamma, P. Mello, and P. Torroni, 'Specification and verification of agent interaction protocols in a logic-based system', Scalable Computing: Practice and Experience, 8(1), 1–13, (2007).
[3] M. Alberti, F. Chesani, M. Gavanelli, E. Lamma, P. Mello, and P. Torroni, 'Verifiable agent interaction in abductive logic programming: the SCIFF framework', ACM Trans. on Computational Logic, 9(4), (2008).
[4] H. Christiansen and V. Dahl, 'HYPROLOG: A new logic programming language with assumptions and abduction', in ICLP, (2005).
[5] U. Endriss, P. Mancarella, F. Sadri, G. Terreni, and F. Toni, 'The CIFF proof procedure for abductive logic programming with constraints', in JELIA 2004, eds., J. Alferes and J. Leite, volume 3229 of LNAI, (2004).
[6] K. Eshghi and R. Kowalski, 'Abduction compared with negation by failure', in ICLP'89, eds., G. Levi and M. Martelli, pp. 234–255, (1989).
[7] T. Frühwirth, 'Theory and practice of constraint handling rules', Journal of Logic Programming, 37(1–3), 95–138, (October 1998).
[8] T. Fung and R. Kowalski, 'The IFF proof procedure for abductive logic programming', Journal of Logic Programming, 33(2), (1997).
[9] M. Gavanelli, E. Lamma, P. Mello, M. Milano, and P. Torroni, 'Interpreting abduction in CLP', in APPIA-GULP-PRODE Joint Conf. on Declarative Programming, Reggio Calabria, Italy, (2003).
[10] A. Kakas, R. Kowalski, and F. Toni, 'The role of abduction in logic programming', in Handbook of Logic in Artificial Intelligence and Logic Programming, eds., D. Gabbay, C. Hogger, and J. Robinson, (1998).
[11] A. C. Kakas, A. Michael, and C. Mourlas, 'ACLP: Abductive Constraint Logic Programming', J. of Logic Programming, 44(1–3), (2000).
[12] A. C. Kakas, B. van Nuffelen, and M. Denecker, 'A-System: Problem solving through abduction', in IJCAI-01, ed., B. Nebel, (2001).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-905
Symbolic Classification of General Multi-Player Games1
Peter Kissmann and Stefan Edelkamp2

Abstract. For general two-player turn-taking games, first solvers have been contributed. Algorithms for multi-player games like Maxn, however, cannot classify general games robustly, and Maxn's extension Soft-Maxn, which can play optimally against unknown and weak opponents, demands large amounts of memory. As RAM is a scarce resource, this paper proposes a memory-efficient implementation of the Soft-Maxn algorithm, exploiting the functional representation of state and evaluation sets with BDDs.
1 INTRODUCTION
In General Game Playing (GGP), the game is provided implicitly in form of a set of rules transforming the initial state eventually into some terminal one. The games we consider, described in the Game Description Language GDL [4], are fully observable, discrete, finite, and deterministic, and provide discrete outcomes 0, . . . , 100 for each player. Thus, an automated player has no prior information on which game it actually plays. While much effort is spent in developing good players (e.g. [3]), the ultimate goal is to solve games and to provide a playing strategy. Most of the work in GGP research has focused on two-player games or on transformations into such. In [2] we proposed a symbolic classification algorithm for general two-player turn-taking games. Games with three or more players have received less attention. One line of research extends minimax search from the two-player to the multi-player scenario. As with two-player turn-taking games, opponents attempt to maximize their individual outcome. At each node in the game tree an evaluation vector (the Maxn vector) denotes the reward for each player. The successor with the maximum for the active player is selected. This results in the Maxn algorithm [5]. Even though this computes an equilibrium strategy, its deficiency is that it calculates only one of many equilibria and keeps no information about alternatives. As a bypass, Soft-Maxn [6] backs up the non-dominated information from the leaves to the root. All sets that are not dominated with respect to the active player are propagated bottom-up. Soft-Maxn has been shown to calculate a superset of all equilibria [6]. Unfortunately, the space demands of Soft-Maxn are large, as for each state a set of reward vectors has to be stored. Binary Decision Diagrams (BDDs) [1] have shown considerable advances in form of memory savings in the analysis of large systems. The crucial advantage compared to ordinary explicit-state retrograde analysis algorithms is a compact representation of state sets. As this allows studying much larger games, in this paper we consider a symbolic implementation of the Soft-Maxn algorithm to classify general multi-player games, using BDDs to compactly compute and store evaluation sets. We apply the extension to a selection of small games given in GDL.
1 Thanks to DFG for support in ED 74/3 and 74/2.
2 Dortmund University of Technology, Germany, email: {peter.kissmamn, stefan.edelkamp}@cs.uni-dortmund.de
Figure 1. Example graph for 3-player Nim with 5 matches. The nodes show the active player and the classified rewards, the edges the number of matches taken (dotted: 1, dashed: 2, solid: 3).
2 SOFT-MAXn GAME TREE SEARCH
The value of a game state is formalized by an evaluation vector, where component i denotes the value for player i. The Soft-Maxn algorithm [6] avoids the prediction of opponents' tie-breaking strategies used in Maxn and thus allows to compute a robust player. When a tie is encountered, instead of choosing a single vector, a set of vectors is selected. This set represents the possible outcomes for a particular branch of a tree. The set of Soft-Maxn vectors for player i is computed as follows. For a leaf, the Soft-Maxn vector set consists of the exact evaluation vector. At an internal node, the Soft-Maxn vector set for that node is the union of all sets of its non-dominated children with respect to the current player3. At the game tree's root, the player to move can use any decision rule to select the best of the non-dominated moves. An example for Soft-Maxn is given in Figure 1.

Here, we present our symbolic extension of Soft-Maxn (cf. Algorithm 1). All states are stored as BDDs and we extend the state description by the possible rewards for each player. After calculating the set of reachable states, the rewards for the players are set for each of the goal states by conjunction with the corresponding reward BDDs. These classified goal states are stored in the set class. For the backward search, we then take the set of states front, which are those unclassified states (unclass) whose successors are all classified. These are determined by the strong preimage: front ← ∀s′, v′. trans ⇒ class (s′ and v′ represent the effect state and reward variable sets). From these we take one state after the other and construct an array of all successors (succ), all of which we check for domination. This is done by calculating the conjunction with the less relation, which is defined for player i on the reward variable sets v and v′ in the following way:

    less_i := ⋁_{j=1}^{|V_i|−1} ( v_i^j ∧ ⋀_{k=j+1}^{|V_i|} ¬v_i^k ∧ ⋀_{l=1}^{j} ¬v′_i^l ∧ ⋁_{m=j+1}^{|V_i|} v′_i^m )

with V_i being the set of possible rewards for player i and v_i^j the BDD representation of the j-th reward for player i.
3 A Soft-Maxn vector set V1 strictly dominates another Soft-Maxn vector set V2 with respect to player i iff for all v1 ∈ V1 and v2 ∈ V2, the i-th component of v1 is strictly greater than the i-th component of v2.
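Before the symbolic version, the domination test of footnote 3 can be stated explicitly over vector sets. A plain-Python sketch (our own illustration, not the authors' code) of the Soft-Maxn backup at an internal node, where i indexes the active player's component:

    def strictly_dominates(V1, V2, i):
        # V1 strictly dominates V2 w.r.t. player i iff every vector of V1
        # gives player i a strictly higher reward than every vector of V2.
        return all(v1[i] > v2[i] for v1 in V1 for v2 in V2)

    def soft_maxn_backup(children_sets, i):
        # Union of the vector sets of all children that are not strictly
        # dominated by some other child w.r.t. the active player i.
        kept = [V for V in children_sets
                if not any(strictly_dominates(W, V, i)
                           for W in children_sets if W is not V)]
        return [v for V in kept for v in V]

    # Two tied children survive; the dominated one is dropped.
    assert soft_maxn_backup([[(0, 100, 0)], [(0, 100, 0)], [(0, 0, 100)]], 1) \
        == [(0, 100, 0), (0, 100, 0)]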
Algorithm 1: Symbolic Soft-Maxn

    reach ← reachable()
    class ← reach ∧ goal
    for p = 1, . . . , |player| do
        class ← class ∧ ⋁_i reward(p, i)
    unclass ← reach ∧ ¬class
    front ← ⊥
    while unclass ≠ ⊥ do
        for p = |player|, . . . , 1 do
            front ← unclass ∧ move(p) ∧ ∀s′, v′. trans ⇒ class
            if front ≠ ⊥ then
                foreach state ∈ front do
                    for a = 1, . . . , |A| do
                        succ_a ← ∃s, v. state ∧ trans_a
                        succ_a ← ∃s′, v′. succ_a ∧ s = s′ ∧ v = v′
                    dominated ← ∅
                    for i = 1, . . . , |A| − 1 do
                        if i ∈ dominated then continue
                        for j = i + 1, . . . , |A| do
                            if j ∈ dominated then continue
                            if succ_i ∧ succ_j ∧ less_p ≠ ⊥ then
                                dominated ← dominated ∪ {i}; break
                            if succ_j ∧ succ_i ∧ less_p ≠ ⊥ then
                                dominated ← dominated ∪ {j}
                    class ← class ∨ (state ∧ ⋁_{i ∉ dominated} ∃s. succ_i)
        unclass ← unclass ∧ ¬class
If this conjunction is not false, the index of the dominated successor will be inserted into the list of dominated states and it will not be considered any more. Once we are done, we calculate the BDD representing the set of dominating rewards by calculating the disjunction of the rewards of all non-dominated successors. This BDD is attached to the current state by calculating the conjunction. The classified current state will then be inserted into the set of classified states. With this we continue until finally all states are classified.

3 EXPERIMENTAL RESULTS
We implemented the algorithm using JavaBDD (http://javabdd.sourceforge.net), which provides an interface to Fabio Somenzi's BDD library CUDD (http://vlsi.colorado.edu/~fabio/CUDD/). We performed the experiments on an AMD Opteron with 2.4 GHz and 4 GB RAM. We chose two games to experiment on: a three-player version of Tic-Tac-Toe and a multi-player version of Nim.
3.1 Three-player Tic-Tac-Toe

Three-player Tic-Tac-Toe works similar to the two-player version with the exception of a third player, the Eraser, who erases one of the symbols from the board. It is the Eraser's turn each time the O player is done. We tried three different reward distributions. The results are shown in Table 1. The first distribution is similar to the two-player version and results in a filled board, i.e. a victory for the Eraser. In the second distribution, the X and O players do not mind if their opponent or the Eraser wins. Thus, the Eraser has no chance of winning, while the others can both win, most likely depending on the choice of the Eraser. With the third distribution, the X and O players cooperate against the Eraser, so that in the end the Eraser cannot prevent both of them from completing a line. This results in a victory for both players, though we cannot say who actually created a line.

Table 1. Results for the three-player version of Tic-Tac-Toe. The Rewards give the rewards for the X player, the O player and the Eraser, respectively, for a line of X, a line of O and a full board with no line (from top to bottom), while Opt Outcome determines the possible outcomes in case of optimal play. t is the time needed for classification (in minutes), sreach the number of reachable states, sclass the number of classified states and nclass the number of BDD nodes needed to represent them. (A classified state is a state along with one of its associated classifications; thus, a state with several classifications results in several classified states.)

Rewards                                     Opt Outcome                 t   sreach   sclass  nclass
(100, 0, 0), (0, 100, 0), (50, 50, 100)     (50, 50, 100)              20   39,742   44,319   5,257
(100, 0, 0), (0, 100, 0), (0, 0, 100)       (100, 0, 0), (0, 100, 0)   25   39,742    5,693   5,257
(100, 100, 0), (100, 100, 0), (0, 0, 100)   (100, 100, 0)              10   39,742   39,742   5,693

3.2 Multi-player Nim
In Nim, we have a row of matches. In turn, each player may take one to three of these. The player to take the last match wins the game. Table 2 shows the results obtained by the symbolic algorithm for different numbers of players n and different numbers of matches m. The resulting classification of the initial state is always a set of n − 1 different classifications, thus it suffices to give the player who surely loses. All the examples took less than one second to compute.

Table 2. Results for the game Nim, showing the number of players n, the number of matches m, the losing player l, the number of BDD nodes b, and the number of classified states s.

 n    m   l    b     s        n    m   l    b      s
 3    7   3   30    23        4    7   2   41     28
 3   10   1   44    41        4   10   4   57     58
 3   15   1   46    71        4   15   4   66    118
 3   20   1   52   101        4   20   3   73    178
 3   25   1   54   131        4   25   2   73    238
 3   30   1   55   161        4   30   1   74    298
 3   50   1   62   281        4   50   3   81    538
 3   75   1   67   431        4   75   4   83    838
 3  100   1   69   581        4  100   4   88  1,138
 5   50   2  174   831        6   50   3  216  1,200
References
[1] Randal E. Bryant, ‘Graph-based algorithms for boolean function manipulation’, IEEE Transactions on Computers, 35(8), 677–691, (1986). [2] Stefan Edelkamp and Peter Kissmann, ‘Symbolic exploration for general game playing in PDDL’, in ICAPS-Workshop on Planning in Games, (2007). [3] Hilmar Finnsson and Yngvi Bj¨ornsson, ‘Simulation-based approach to general game playing’, in AAAI, (2008). [4] Nathaniel C. Love, Timothy L. Hinrichs, and Michael R. Genesereth, ‘General game playing: Game description language specification’, Technical Report LG-2006-01, Stanford Logic Group, (April 2006). [5] Carol A. Luckhardt and Keki B. Irani, ‘An algorithmic solution of Nperson games’, in AAAI, pp. 158–162, (1986). [6] Nathan R. Sturtevant and Michael Bowling, ‘Robust game play against unknown opponents’, in AAMAS, pp. 713–719. ACM, (2006).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-907
907
Redundancy in CSPs Assef Chmeiss and Vincent Krawczyk and Lakhdar Sais 1 Abstract. In this paper, we propose a new technique to compute irredundant sub-sets of constraint networks. Since, checking redundancy is Co-NP Complete problem, we use different polynomial local consistency entailments for reducing the computational complexity. The obtained constraint network is irredundant modulo a given local consistency. Redundant constraints are eliminated from the original instance producing an equivalent one with respect to satisfiability. Eliminating redundancy might help the CSP solver to direct the search to the most constrained (irredundant) part of the network.
1
Constraint Satisfaction
Constraint-satisfaction problems (CSPs) involve the assignment of values to variables which are subject to a set of constraints. The modeling and solving phases are known to be heavily interconnected. Indeed, the efficiency of the solver depends on the way the problem instance is modeled. Until recently, these two phases are considered separately. Many improvements have been proposed for the solving side and many other approaches have been suggested to simplify the crucial modeling step [1, 3]. As there exists several ways to model the same problem, this means that the user is not safe from introducing redundancies in such modeling process. Also redundancies might result from an incorrect encoding or merging different parts from several sources. The obtained constraint network (CN), might contain parts that can be removed without losing the information it carries. However, several forms of redundancies can be characterized. In this paper, we address constraint redundancies. A CSP is redundant if and only if some its constraints can be removed while preserving its set of models. As stated by Paolo Liberatore [4] in the context of propositional clausal formulae, the deletion of redundant constraints is clearly important for several reasons. First, removing redundant constraints can simplify the CN by reducing its size. A large amount of redundancies might obscure the real set of constraints (the irredundant part of the CN). In other cases, redundancy might indicate that some pieces of the CN are more important than the others. Consequently, depending on the application domain, redundancy might be either a positive or a negative concept. Our main goal is to measure the relationships between constraints redundancies and the efficiency of CSP solvers. As a side effect, our approach can be seen as a possible technique that can be used to check the degree of redundancy of a given CN. On the current available CSP instances, our approach might give a nice way to approximate their irredundant part. However, checking constraint redundancy, meaning that deciding if a given constraint can be deduced from the remaining part of the CN is known to be Co-NP complete 1
Universit´e Lille-Nord de France, CNRS UMR 8188, Artois, Rue Jean Souvraz, SP-18, F-62307 Lens, email:{chmeiss, krawczyk, sais}@cril.fr
[4]. To deal with this main drawback, in this paper, different polynomial local consistency entailments are used for reducing the computational complexity. The obtained CN is irredundant modulo a given local consistency.
1.1
Definitions and notations
A CSP is defined as a tuple P =< X , C >. X is a finite set of n variables {x1 , . . . , xn }. Each variable xi ∈ X is defined on a finite set of di values, denoted dom(xi ) = {vi1 , . . . vidi }. C is a finite set of m constraints {c1 , . . . , cm }. Each constraint ci ∈ C of arity k is defined as a couple (scope(ci ), Rci ) where scope(ci ) = {xi1 , . . . , xik } ⊆ X is the set of variables involved in ci and Rci ⊆ dom(xi1 ) × . . . × dom(xik ) the set of allowed tuples i.e. t ∈ Rci iff the tuple t satisfies the constraint ci . A CSP P is called binary iff ∀ci ∈ C, |scope(ci )| ≤ 2. A model (solution) is an assignment of a value for each variable x ∈ X which satisfies all the constraints. In this paper, we limit our presentation to binary CSPs. However, our proposed approach can be easily extended to n-ary CSPs. We define φ(P) as the CSP P obtained after applying a local consistency φ. For φ = AC this means that all arc-inconsistent values are removed from P. If there is a variable with an empty domain in φ(P), we denote φ(P) =⊥. The sub-network obtained after the assignment of a variable x to a value v is denoted P|x=v .
1.2
Tuple Arc Consistency
In this section, we propose a new filtering technique, called Tuple Arc Consistency (TAC). This local consistency is introduced to be exploited in our redundancy framework. The main idea is that instead of fixing one value as for SAC, we fix one tuple of a constraint c i.e. we assign the variables involved in c and we apply AC on the obtained sub-network. Definition 1 Let P be a CSP. A constraint cij ∈ C is Tuple Arc Consistent (TAC) iff ∀(a, b) ∈ Rcij , AC(P |xi =a,xj =b ) =⊥. P is TAC iff ∀c ∈ C, c is TAC.
2
Constraints redundancies
The redundancy, in a CSP, occurs when some informations are present several times, that is, a subset of constraints can be deduced from others. To determine if a constraint c is redundant or not, we need to solve the CSP with the negation of a constraint c. This problem is known to be Co-NP complete [4]. However, it’s possible to detect in polynomial time some redundant constraints while using entailment modulo a given local consistency. In this section, we define formally the notion of constraint redundancy in CSP and we
908
A. Chmeiss et al. / Redundancy in CSPs
instance name (#var, #ctr) bqwh-15-106 (106, 644) domino-1000-800 (1000, 1000) driverlogw-02c-sat ehi-85-297-0 (297, 4094) frb30-15-1 (30, 208) rlfap-graph1 (200, 1134) rlfapscen11-f10 (680, 4103)
AC, RedAC time % 0,03 572 (11%) 123,01 0 (100%) 5,5 1910 (53%) 0,26 4094 (0%) 0,01 208 (0%) 0,15 1134 (0%) 0,476 4103 (0%)
AC, RedT AC time % 0,16 570 (11%) 123,85 0 (100%) 10,36 1756 (57%) 12,29 776 (81%) 5,87 208 (0%) 101,83 885 (22%) 197,742 2954 (28%)
TAC, RedAC time % 0,13 559 (13%) 123,06 0 (100%) 6,78 1428 (65%) 10,72 0,11 208 (0%) 964 1134 (0%) 69,827 -
TAC, RedT AC time % 0,25 555 (14%) 123,75 0 (100%) 9,58 1367 (66%) 11,25 5,88 208 (0%) 1042,26 522 (54%) 72,548 -
Table 1. Results on benchmarks from the second international CSPs competition
show how we can use local filterings to detect some redundant constraints. In satisfiability problem redundancy modulo unit propagation has been shown very useful in practice namely on real-world instances [2]. A CSP P is redundant if it contains a subset of redundant constraints otherwise it is called irredundant. For a CSP P =< X , C > and a constraint c ∈ C , we define P \ {c} as the CSP P =< X , C\c >. We define the negation of a constraint c, noted ¬c, as the constraint c such that scope(c ) = scope(c) and Rc = {t|t ∈ ∀x∈scope(c) dom(x), t ∈ / Rc }. Definition 2 Let P be a CSP and c ∈ C. c is redundant iff P \ {c} ∪ ¬c is unsatisfiable. P is redundant iff ∃c ∈ C such that c is redundant. Otherwise P is said to be irredundant. To avoid solving the problem P \ {c} ∪ ¬c to see if c is redundant or not, we consider an incomplete but polynomial time algorithm to detect redundant constraints. We apply a local filtering φ such AC and TAC. Any other local consistency can be used. Checking if a constraint is redundant can be done using a refutation procedure. Namely, a constraint c ∈ C is redundant iff the constraint network in which c is substituted by its negation is unsatisfiable. This is clearly intractable. That’s why, we define weaker form of refutation inducing a weaker form of redundancy. Definition 3 Let P be a CSP and φ a local consistency. A constraint c ∈ C is φ-redundant iff φ(P\{c} ∪ {¬c}) =⊥. A CSP P is called φ-redundant (respectively φ-irredundant) iff it (respectively does not) contains φ-redundant constraints. Algorithm 1: Computing a φ irredundant constraint network Input: P =< X , C > Output: A φ-irredundant CSP P 1 for each c ∈ C do 2 P ← P \ {c} ∪ ¬c; 3 if φ(P ) =⊥ then 4 C ← C\c;
3
Preliminary experiments
In this section, we show the practical interest of our approach. We present the reduction power in terms of the percentage of deleted φ-redundant constraints with φ instantiated to AC and T AC. As a CSP solver, we used the MAC algorithm with dom/WDeg. In table 1, which presents results on some instances, we provide the percentage of φ-redundant constraints. In the four double-columns, the results obtained by applying a φ consistency as a preprocessing and φ-redundancy checking are given. For example, in the second double column (AC, RedT AC ) means that we apply AC on the original problem then we check constraints redundancy using T AC. For each case, we give the run time (in seconds), the number of remaining constraints and the percentage of φ-redundant constraints. Instances solved in the preprocessing step, are indicated with a dash ”-”. On the domino-1000-1000 instance, we can see that all constraints are deleted. In fact, all the constraints become redundant since, after the preprocessing, there is one value for each domain and the instance is proven satisfiable. On the contrary, for some instances like frb30-15-1, the technique does not detect any redundant constraint. For the bqwh-15-106 and driverlogw-02c-sat instances, we remark that a stronger filtering like TAC detects more redundant constraints than AC. This remark is confirmed for other classes like ehi-85 and rlfap-graph1 where the detection of redundant constraints by TAC is more significant than with AC. For classes ehi-85 and rlfap-scen11, the filtering technique TAC prove inconsistency during the preprocessing step.
4
Conclusions
In this paper, a new approach to compute irredundant sub-sets of CN is proposed. Using polynomial time local consistency techniques for redundancy checking, significant reductions in the size of the have been obtained on many classes of CSP instances. The obtained subnetwork is irredundant modulo a local consistency entailment. The new filtering TAC we propose is clearly powerful for detecting redundant constraints. Used as a preprocessing some classes of instances are solved without search.
REFERENCES In algorithm 2, φ can be replaced by any local consistency filtering like AC and TAC. The complexity of Algorithm 1 is polynomial. If we use an AC filtering whose the time complexity is O(md2 ), then the time complexity of the algorithm 1 is bounded by O(m2 d2 ). Let us note that using different constraint orderings in the algorithm 1, might lead to different φ-irredundant constraints subnetworks.
[1] C. Bessi`ere, R. Coletta, B. O’Sullivan, and M. Paulin, ‘Query-driven constraint acquisition’, in IJCAI’2007, pp. 50–55, (2007). [2] O. Fourdrinoy, E. Gr´egoire, B. Mazure, and L. Sa¨ıs, ‘Reducing hard sat instances to polynomial ones’, in IEEE-IRI’07, pp. 18–23, (2007). [3] A. M. Frisch, C. Jefferson, B. Mart´ınez Hern´andez, and I. Miguel, ‘The rules of constraint modelling’, in Ijcai’2005, pp. 109–116, (2005). [4] P. Liberatore, ‘Redundancy in logic i: Cnf propositional formulae’, Artif. Intell., 163(2), 203–232, (2005).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-909
909
Reinforcement Learning and Reactive Search: an adaptive MAX-SAT solver Roberto Battiti and Paolo Campigotto 1 1
Introduction
This paper investigates Reinforcement Learning (RL) applied to online parameter tuning in Stochastic Local Search (SLS) methods. In particular, a novel application of RL is proposed in the Reactive Tabu Search (RTS) scheme, where the appropriate amount of diversification in prohibition-based local search is adapted in a fast online manner to the characteristics of a task and of the local configuration. The experimental tests demonstrate promising results on Maximum Satisfiability (MAX-SAT) instances when compared with state-of-the-art SLS SAT solvers, such us AdaptNovelty+ , rSAPS and gNovelty+ .
2
Reinforcement Learning for Reactive Tabu Search
This paper investigates a novel application of Reinforcement Learning in the framework of Reactive Tabu Search (RTS) proposed in [1]. Tabu Search (TS) is a prohibition-based search technique based on local search. At a given iteration some local search moves (e.g., variable flips in the case of the SAT) are prohibited, only a non-empty subset of them is allowed: the local search move executed at iteration t will not be allowed for the next T iterations, where T is the prohibition parameter. In this work, T is assumed to take values over the interval [Tmin , Tmax ]. RTS is a proposal to determine a dynamic value of the prohibition parameter which is appropriate to a specific instance and to the local characteristics of the fitness surface around the current configuration. Among all the RL methods developed, we consider the LeastSquares Policy Iteration (LSPI) algorithm [4], a form of model-free approximate policy iteration using a set of training samples collected in any arbitrary manner. In [6], we present an off-line application of LSPI to tune the prohibition parameter, in particular by considering an application to the MAX-SAT problem. The parameter-tuning policy is modeled as a Markov Decision Process (MDP) where the states summarize relevant information about the recent history of the search, and a near-optimal policy is determined by using the LSPI method. In this work, we consider an online version of the method to determine a critical algorithm parameter while the algorithm is running on a selected instance. The impact of different choices for designing the Markov states and the definition of the basis function for the approximation architecture are discussed. The effect of changing the prohibition parameter on the algorithm’s behavior can only be evaluated after a reasonable number of local moves. We therefore divide the algorithm’s trace into epochs 1
Dipartimento di Ingegneria e Scienza dell’Informazione, University of Trento, Italy, email: {battiti, campigotto}@disi.unitn.it
(E1 , E2 , . . . ) composed of a suitable number of local moves, and allow changes of T only between epochs. The state at the end of epoch Ei is a collection of features extracted from the algorithm’s execution up to that moment. Assume n and m the number of variables and clauses of the input SAT instance, respectively. Let f (x) the score function counting the number of unsatisfied clauses in the truth assignment x. Each state of the MDP is created by observing the behavior of the Tabu search algorithm over an epoch of 2∗Tmax consecutive variable flips. In particular, let us define the following: • xbsf is the “best-so-far” (BSF) configuration before the current epoch; • Tf is the current fractional prohibition value (the actual prohibition period is T = nTf ); • f epoch is the average value of f during the epoch; • H epoch is the average Hamming distance during the current epoch from the configuration at the beginning of the current epoch itself. These variables have been chosen because of the Reactive Search paradigm’s concern on the trade-off between diversification (the ability to explore new configurations in the search space by moving away from local minima) and bias (the preference for configurations with low objective function values). The compact state representation chosen to describe an epoch is the following triplet: « „ f epoch − f (xbsf ) H epoch where Δf = , Tf , . s ≡ Δf, n m The first component is the mean change of f in the current epoch with respect to the best value; all components of the state have been normalized. The actions set is composed by two choices: A = {increase, decrease}, with the following effects: j max {Tf · 1.1, Tf + 1/n} if a = increase (1) Tf = if a = decrease min {Tf /1.1, Tf − 1/n} Changes in Tf are designed in order to ensure variation of at least 1 in the actual prohibition period T . In addition, Tf is bounded between a minimum and a maximum value (0 and .2 in our experiments). An alternative definition for the actions set consists of setting Tf from scratch by one of the 20 uniformly distributed values in the range [0.01, 0.2]: Tf = 0.01 ∗ i, where i ∈ [1, 20]
(2)
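Stated as code, the two action types are simple update rules. The following is a minimal Python sketch (the function names are ours; the factor 1.1 and the bounds 0 and 0.2 follow the setting described above):

TF_MIN, TF_MAX = 0.0, 0.2   # bounds on Tf used in the experiments

def update_tf(tf, action, n):
    # Eq. 1: multiplicative step, guaranteeing a change of at least 1
    # in the actual prohibition period T = n * Tf.
    if action == "increase":
        tf = max(tf * 1.1, tf + 1.0 / n)
    else:  # action == "decrease"
        tf = min(tf / 1.1, tf - 1.0 / n)
    return min(max(tf, TF_MIN), TF_MAX)

def set_tf(i):
    # Eq. 2: set Tf from scratch to one of 20 uniformly spaced values.
    assert 1 <= i <= 20
    return 0.01 * i

print(update_tf(0.1, "increase", 500))  # 0.11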
The reward signal is given by the normalized change of the best value achieved in the observed epoch with respect to the "best-so-far" value before the epoch: (f(x_bsf) − f(x_localBest)) / m.
[Figure 1. The comparison among our RL-based method and other SAT solvers: mean best-so-far value vs. iterations (1000 to 100000) for rsaps, AdNov+, h_rts, LSPI for SAT, and gNov+.]
[Figure 2. Performance of the two implemented actions for the update of the Tf value, increasing/decreasing vs. setting Tf from scratch: mean best-so-far value vs. iterations.]
For the case of the action set defined via Eq. 1, we use the basis function set presented in [6]. If the action set is defined by Eq. 2, assume action a is "set Tf to 0.01 · i", i ∈ [1, 20], and let Φj(s, a) be the j-th entry of the considered basis function vector Φ(s, a). We have:

Φj(s, a) = Δf             if j = 1
           H_epoch        if j = 2
           H_epoch · Δf   if j = 3
           (Δf)²          if j = 4
           (H_epoch)²     if j = 5
           i/100          if j = 5 + i
           0              otherwise   (3)
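As an illustration, the basis vector of Eq. 3 can be assembled as in the following Python sketch (the function name and the 0-based list indexing are our own; entry j of the vector corresponds to list index j − 1):

def phi(state, i, num_actions=20):
    # Basis vector Phi(s, a) of Eq. 3 for action a = "set Tf to 0.01*i".
    # state = (delta_f, tf, h_epoch) as defined above; tf itself is unused.
    delta_f, _tf, h_epoch = state
    features = [0.0] * (5 + num_actions)  # entries j = 1 .. 5 + num_actions
    features[0] = delta_f                 # j = 1
    features[1] = h_epoch                 # j = 2
    features[2] = h_epoch * delta_f       # j = 3
    features[3] = delta_f ** 2            # j = 4
    features[4] = h_epoch ** 2            # j = 5
    features[4 + i] = i / 100.0           # j = 5 + i; all other entries stay 0
    return features

print(phi((0.05, 0.1, 0.3), i=4))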
The training phase is executed online, while solving a single SAT instance. This design choice implies that the best policy learnt by the SAT solver is not defined a priori by an off-line training phase over selected SAT instances, but is determined by learning while the target optimization task is performed. During an initial set-up phase, 100 training examples for the input SAT instance are extracted to calculate the initial policy. Then, the solving phase is started. As soon as the search history provides a new example, it is added to the training set and the policy is updated.
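The overall online scheme can be summarized by the following skeleton; note that lspi(), run_epoch(), extract_state(), apply_action(), collect_random_sample() and solved() are hypothetical placeholders standing in for the LSPI solver and the tabu-search epoch of 2·Tmax flips described above, not the authors' actual code:

# Hypothetical skeleton of the online LSPI loop.
samples = [collect_random_sample() for _ in range(100)]  # initial set-up phase
policy = lspi(samples)                                   # initial policy
state = extract_state()                                  # features of the first epoch
while not solved():
    action = policy.best_action(state)   # choose how to update Tf
    apply_action(action)                 # via Eq. 1 or Eq. 2
    reward, next_state = run_epoch()     # 2*Tmax tabu-search flips
    samples.append((state, action, reward, next_state))
    policy = lspi(samples)               # re-run LSPI on the enlarged training set
    state = next_state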
3 Experimental results
For our tests, we use the benchmark described in [5], formed by MAX-3-SAT random instances with 500 variables and 5000 clauses. The Tf parameter has been bounded in [0, 0.2]. To evaluate our novel MAX-SAT solver based on Reinforcement Learning, we report here a comparison with some of the best-known SLS algorithms for MAX-SAT. In particular, the SLS techniques considered are AdaptNovelty+ [7], RSAPS (a reactive version of SAPS) [3], H_RTS [1], and gNovelty+ [2]. For each algorithm, 10 runs with different random seeds are performed for each of the 50 instances taken from the benchmark set, for a total of 500 tests. Fig. 1 shows the average results as a function of the number of iterations (flips). Fig. 1 indicates that our RL-based approach is competitive with the other existing SLS MAX-SAT solvers. In the experiment in Fig. 1, for our RL-based approach we consider the case where the update of the Tf value is performed by Eq. 1. However, in Sec. 2 we presented two possible definitions for the action that updates the value of Tf:
1. Tf is increased/decreased by the value 1/n (see Eq. 1);
2. Tf is set from scratch via Eq. 2.
Fig. 2 compares the two hypotheses, showing an improvement in the second case. E.g., at iteration 100000 an improvement of 2.4% in the mean best-so-far value is registered. Setting the Tf value from scratch, our algorithm reaches the optimal performance of H_RTS. Furthermore, for the first hypothesis, a bigger increase/decrease of the Tf parameter has also been tested. In particular, we replaced the factor 1.1 in Eq. 1 by the value 1.3. However, in this case we obtained slightly worse results.
4 Conclusions
This paper describes an application of Reinforcement Learning for the online tuning of the prohibition parameter in the Reactive Tabu Search algorithm. We discussed a couple of relevant architectural choices and presented preliminary experimental results. The results are promising: over the MAX-SAT benchmark considered, our algorithm performs better than gNovelty+, a gold medal winner in the random category of the SAT 2007 competition, and achieves results which are comparable with those obtained by the original RTS algorithm. These findings are confirmed by additional experimental work not presented in this paper because of space limits.
REFERENCES
[1] R. Battiti and M. Protasi, 'Reactive search, a history-sensitive heuristic for MAX-SAT', ACM Journal of Experimental Algorithmics, 2(ARTICLE 2), (1997). http://www.jea.acm.org/.
[2] D.N. Pham, J.R. Thornton, C. Gretton, and A. Sattar, 'Advances in local search for satisfiability', in 20th Australian Joint Conference on Artificial Intelligence, Gold Coast, Australia, December 2-6, 2007, eds., M. Orgun and J. Thornton, number 4830 in Lecture Notes in Computer Science, pp. 213–222. Springer, (2007).
[3] F. Hutter, D.A.D. Tompkins, and H.H. Hoos, 'Scaling and probabilistic smoothing: Efficient dynamic local search for SAT', in Proc. Principles and Practice of Constraint Programming - CP 2002: 8th International Conference, CP 2002, Ithaca, NY, USA, September 9-13, volume 2470 of LNCS, pp. 233–248. Springer Verlag, (2002).
[4] M.G. Lagoudakis and R. Parr, 'Least-Squares Policy Iteration', Journal of Machine Learning Research, 4(6), 1107–1149, (2004).
[5] D. Mitchell, B. Selman, and H. Levesque, 'Hard and easy distributions of SAT problems', in Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), pp. 459–465, San Jose, CA, (July 1992).
[6] R. Battiti, M. Brunato, and P. Campigotto, 'Learning while optimizing an unknown fitness surface', in Proceedings of the 2nd Learning and Intelligent OptimizatioN Conference (LION II), Trento, Italy, Dec 10-12, 2007. Springer LNCS, in press, (2008).
[7] D.A.D. Tompkins and H.H. Hoos. Novelty+ and adaptive novelty+. SAT 2004 Competition Booklet. (solver description).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-911
A MAX-SAT Algorithm Portfolio1 Paulo Matos and Jordi Planes and Florian Letombe and João Marques-Silva2 Abstract. The results of the last MaxSAT Evaluations suggest there is no universal best algorithm for solving MaxSAT, as the fastest solver often depends on the type of instance. Having an oracle able to predict the most suitable MaxSAT solver for a given instance would result in the most robust solver. Inspired by the success of SATzilla for SAT, this paper describes the first approach for a portfolio of algorithms for MaxSAT. Compared to existing solvers, the resulting portfolio can achieve significant performance improvements on a representative set of instances.
1 Introduction
In recent years, one of the optimization counterparts of the Boolean satisfiability problem (SAT) has attracted the interest of researchers: the maximum satisfiability (MaxSAT) problem. MaxSAT and its variations find a number of relevant applications, including scheduling and design automation [12, 13]. This work is the first attempt to implement and evaluate an algorithm portfolio for solving MaxSAT problems. The portfolio computes several features of an instance and estimates the runtime for each solver in the portfolio. Then, it solves the instance with the estimated fastest solver. A large number of instances have been considered, indicating that the portfolio is able to solve more instances, from the selected set of instances, than any other solver. Moreover, the total run time is lower for the portfolio, despite the time spent in the feature computation. The paper is organized as follows: Section 2 gives the notions for MaxSAT solving; Section 3 introduces the portfolio learning process; and Section 4 explains the steps to execute and test the portfolio, and discusses the experimental results. The paper concludes in Section 5.
2 Preliminaries
This section provides a brief introduction to MaxSAT problem solving. Familiarity with SAT and related topics is assumed [1]. The MaxSAT problem consists of finding an assignment which satisfies the maximum number of clauses in a CNF formula. MaxSAT algorithms have been the subject of significant improvements over the last decade (e.g., see [7, 5] for a review of past work). Despite the clear relation with the SAT problem, most modern SAT techniques cannot be applied directly to the MaxSAT problem (e.g. unit propagation or clause learning). As a result, the most successful MaxSAT algorithms, in the most recent MaxSAT Evaluations, implement branch and bound search, and integrate sophisticated lower
1 This work is partially supported by EPSRC grant EP/E012973/1, and by EU grants IST/033709 and ICT/217069.
2 School of Electronics & Computer Science, University of Southampton, UK, email: {pocm,jp3,fl,jpms}@ecs.soton.ac.uk
bounding and inference techniques. However, past MaxSAT Evaluations did not consider complex problem instances from practical applications. As a result, we have also considered for the portfolio a set of practical problem instances and a recent solver focused on such instances, msu [10]. We have built on the experience of an existing efficient portfolio, SATzilla [14], an algorithm portfolio for SAT, which has been demonstrated to be a robust solver and very competitive in the SAT Competitions (http://www.satcompetition.org/). Before SATzilla, Gomes and Selman [3] worked with stochastic search portfolios on several NP-complete problems. There is also other preliminary work on algorithm portfolios dealing with problems similar to MaxSAT [8, 6, 4].
3 Model Generation
The capacity to predict the time that a solver will spend on a given instance is one of the key aspects in the design of an algorithm portfolio. The prediction is done using a model created by a learning process over a set of instances. Once the model is created, the portfolio computes the features for a given instance and, based on the model, decides which solver to run. Our models are linear functions Σ_{i>0} βi·xi + β0, which compute the approximate runtime of a solver on a particular instance, where xi is the value of feature i of the instance and the βi are the coefficients to be found for each feature by the model generator. After several steps of forward selection and basis function expansion, in order to fit supra-linear data, we perform ridge regression [9] to obtain the unknowns βi. Forward selection is performed to reduce the number of interesting features. Basis function expansion of the feature set, on the other hand, allows a linear model such as the one we used to model supra-linear data (which allowed us to generate the quadratic model presented in Section 4). Data preprocessing also handles cases where a solver timed out on a specific instance by removing it from the training set. The process of generating the model is executed for every solver in the portfolio. After each model is computed, it is tested over a test set. Our model generator was tested for correctness by generating random data and finding a model for it. If the data can be fit using our model, the model output should be the same as the model used to generate the random data. The selected solvers are of three different kinds, for the sake of complementarity: a Pseudo-Boolean Optimization solver, minisat+ [2]; a recent solver that efficiently deals with real problem instances, msu [10]; and the strongest solver in the MaxSAT category in the MaxSAT Evaluation 2007, maxsatz [7]. The solver maxsatz implements a branch and bound search and integrates sophisticated lower bounds and inference techniques.
On the other hand, algorithm msu is a process that iteratively solves several SAT problem instances, until it reaches the MaxSAT solution. Three kinds of features have been considered [11]: problem size features, balance features and local search probe features. The most important features (among the first selected by forward selection) are in the set of local search probes.
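As a concrete illustration of the model-generation and solver-selection scheme, consider the following minimal Python sketch; the feature matrix and runtimes are synthetic stand-ins, and plain ridge regression replaces the full forward-selection and basis-expansion pipeline:

import numpy as np

rng = np.random.default_rng(0)

def fit_ridge(X, y, lam=1.0):
    # Ridge regression with intercept: beta = (X'X + lam*I)^-1 X'y.
    Xb = np.hstack([np.ones((len(X), 1)), X])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict(beta, x):
    return beta[0] + beta[1:] @ x

# Synthetic stand-ins: 50 training instances with 10 features each and
# per-solver runtimes (in reality, timed-out instances are dropped first).
X = rng.random((50, 10))
runtimes = {s: rng.random(50) * 1000 for s in ("maxsatz", "minisat+", "msu")}
models = {s: fit_ridge(X, runtimes[s]) for s in runtimes}

def pick_solver(features):
    # Run the solver with the smallest predicted runtime.
    return min(models, key=lambda s: predict(models[s], features))

print(pick_solver(rng.random(10)))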
4 Experimental Results
The experimentation has been performed on a Linux Intel Xeon 3.0 GHz machine. A timeout of 1000 seconds was used for all MaxSAT solvers considered. The memory limit was set to 3GB. Some of the sets of instances considered are from the MaxSAT Evaluation 2007, the ones considered hard to solve and close to real problems; and instances from real problems: circuit design and planning. There are 586 instances from the following sets: RAMSEY, SPINGLASS, MAXCUT from the MaxSAT Evaluation 2007; DEBUG, IBM, UCLID, PIMAG from circuit design; SATPLAN from planning problems converted to SAT instances. In order to check our portfolio, we have created the oracle, a virtual portfolio which always selects the best possible result. The entries pflin and pfquad correspond to our portfolios using a linear model of the features and a quadratic model of the features, respectively. A preprocessing time per instance has been added to their total times. In Table 1, we can notice the portfolio is the most robust MaxSAT solver, since it solves the largest number of instances.
solver   msu3.1  minisat+  maxsatz  pfquad  pflin  oracle
solved   507     211       135      524     548    582
Table 1. Total number of solved instances for each solver
[Figure 1. Total time spent in seconds for each solver in MaxSAT: bar chart over msu3.1, minisat+, maxsatz, pfquad, pflin, and the oracle; y-axis from 0 to 500000 seconds.]
Figure 1 shows the total time taken by each of the solvers in the portfolio, our two portfolio models and the oracle. The results obtained by our models are close to the oracle, and spend less time than the rest of the solvers. We are aware, however, that this can still be improved. As mentioned earlier, our learning method does not handle solver timeouts, which means that our portfolio is biased with respect to solvers which time out often and solve a few instances in a short time. Still, having a portfolio capable of achieving these initial results motivates additional research in algorithm portfolios for MaxSAT.
5 Conclusions
This paper presents a method to develop an algorithm portfolio for the MaxSAT problem. Given that no benchmark repository exists for MaxSAT, problem instances from real-world problems and from the MaxSAT Evaluation have been used. To the best of our knowledge, this is the first algorithm portfolio for the MaxSAT problem. From the experimental results we conclude that our MaxSAT algorithm portfolio is the most robust solver among the MaxSAT problem instances we have considered. Future research work includes adapting the model generator to handle timeouts, and also adapting the solver portfolio to deal with Partial MaxSAT and Weighted MaxSAT. Additional research on identifying suitable features will be required for further improving the model used.
REFERENCES
[1] Lucas Bordeaux, Youssef Hamadi, and Lintao Zhang, 'Propositional satisfiability and constraint programming: A comparative survey', ACM Computing Surveys, 38(4), (2006). Electronic Edition, 54 pages.
[2] Niklas Eén and Niklas Sörensson, 'Translating pseudo-boolean constraints into SAT', Journal on Satisfiability, Boolean Modeling and Computation, 2, 1–26, (2006).
[3] Carla P. Gomes and Bart Selman, 'Algorithm portfolios', Artificial Intelligence, 126(1-2), 43–62, (2001).
[4] Kevin Leyton-Brown, Eugene Nudelman, Galen Andrew, Jim McFadden, and Yoav Shoham, 'A portfolio approach to algorithm selection', in International Joint Conference on Artificial Intelligence - IJCAI'03, pp. 1542–1543, (2003).
[5] Javier Larrosa, Federico Heras, and Simon de Givry, 'A logical approach to efficient max-SAT solving', Artificial Intelligence, 172(2–3), 204–233, (2008).
[6] Kevin Leyton-Brown, Eugene Nudelman, and Yoav Shoham, 'Learning the empirical hardness of optimization problems', in Principles and Practice of Constraint Programming CP'02, volume 2470 of LNCS, pp. 556–572, (2002).
[7] Chu Min Li, Felip Manyà, and Jordi Planes, 'New inference rules for max-SAT', Journal of Artificial Intelligence Research, 30, 321–359, (2007).
[8] Lionel Lobjois and Michel Lemaître, 'Branch and bound algorithm selection by performance prediction', in National Conference on Artificial Intelligence - AAAI'98, pp. 353–358, (1998).
[9] Donald W. Marquardt and Ronald D. Snee, 'Ridge regression in practice', The American Statistician, 29(1), 3–20, (1975).
[10] João Marques-Silva and Jordi Planes, 'Algorithms for maximum satisfiability using unsatisfiable cores', in Design, Automation and Test in Europe - DATE'08, (2008).
[11] Eugene Nudelman, Alex Devkar, Yoav Shoham, and Kevin Leyton-Brown, 'Understanding random SAT: Beyond the clauses-to-variables ratio', in Principles and Practice of Constraint Programming CP'04, volume 3258 of LNCS, pp. 438–452, (2004).
[12] Sean Safarpour, Hratch Mangassarian, Andreas Veneris, Mark H. Liffiton, and Karem A. Sakallah, 'Improved design debugging using maximum satisfiability', in Formal Methods in Computer Aided Design FMCAD'07, pp. 13–19, (2007).
[13] Hui Xu, R. A. Rutenbar, and Karem A. Sakallah, 'sub-SAT: a formulation for relaxed boolean satisfiability with applications in routing', IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 22(6), 814–820, (2003).
[14] Lin Xu, Frank Hutter, Holger Hoos, and Kevin Leyton-Brown, 'SATzilla-07: The design and analysis of an algorithm portfolio for SAT', in Principles and Practice of Constraint Programming CP'07, volume 4741 of LNCS, pp. 712–727, (2007).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-913
On the Practical Significance of Hypertree vs. Tree Width Rina Dechter1, Lars Otten1, and Radu Marinescu2 Abstract. The recently introduced notion of hypertree width has been shown to provide a broader characterization of tractable constraint and probabilistic networks than the tree width. This paper demonstrates empirically that in practice the bounding power of the tree width is still superior to the hypertree width for many benchmark instances of both probabilistic and deterministic networks.
1 INTRODUCTION
Inference in graphical models is known to be time and space exponential in the problem graph's tree width. In practice, however, this measure is often inaccurate, since it ignores the effects of determinism in problem solving, which can, for instance, lead to pruning of large parts of the search space. To that end, in 2000 Gottlob et al. [3] introduced a parameter called hypertree width and showed that for constraint networks it is more effective in capturing tractable classes. In [4], its applicability was extended to inference algorithms over general graphical models having relational function specification. In this paper we examine the significance of the hypertree width as compared with the tree width from a practical angle. We show empirically, on probabilistic and deterministic benchmarks, that in most cases the tree width yields a far better predictor of instance-based complexity than the hypertree width, except when the problem has substantial determinism. The outline of this paper is as follows: Section 2 gives a brief overview of the two decomposition schemes, Section 3 provides the empirical results, and Section 4 concludes.
2 DECOMPOSITION SCHEMES
We assume the usual definitions of directed and undirected graphs, hypergraphs, primal and dual graphs, and hypertrees. A graphical model is typically defined to be a set of real-valued functions F = {f1, ..., fl} over a set of variables X = {x1, ..., xn} with domains D = {D1, ..., Dn}, together with a combination operator like summation or multiplication. The scope of a function fj, denoted scope(fj), is the set of variables on which fj is defined. A common approach to solving graphical model problems is to cluster variables and functions such that the resulting decomposition exhibits tree structure:
DEFINITION 1 A tree decomposition of a graphical model is a triple T = (T, χ, ψ), where T = (V, E) is a tree and χ and ψ are labeling functions that associate with each vertex v ∈ V two sets, χ(v) ⊆ X and ψ(v) ⊆ F, that satisfy the following conditions:
1. For each fj ∈ F, there is at least one v ∈ V such that fj ∈ ψ(v).
2. If fj ∈ ψ(v), then scope(fj) ⊆ χ(v).
3. For each xi ∈ X, the set {v ∈ V | xi ∈ χ(v)} induces a connected subtree of T.
The tree width of T is w = max_{v∈V} |χ(v)| − 1. T is also a hypertree decomposition if it satisfies the following additional condition:
4. For each v ∈ V, χ(v) ⊆ ∪_{fj ∈ ψ(v)} scope(fj).
In this case, the hypertree width of T is hw = max_{v∈V} |ψ(v)|.
Finding tree and hypertree decompositions of minimal width is known to be NP-complete, therefore heuristic algorithms are employed in practice [4, 2]. Once a tree or hypertree decomposition is available, it can be processed by the suitable version of a message passing algorithm like Cluster-Tree Elimination (CTE) [4]. Allowing a probabilistic function to be placed in more than one node will lead to incorrect processing by CTE for any graphical model other than constraint networks. To remedy this we modify multiple showings of a function by flattening all but one of them into a 0/1-valued constraint.
1 Bren School of Information and Computer Sciences, University of California, Irvine, CA 92697-3435. Email: {dechter,lotten}@ics.uci.edu
2 Cork Constraint Computation Centre, Department of Computer Science, University College Cork, Ireland. Email: r.marinescu@4c.ucc.ie
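To make the two width parameters concrete, the following small Python sketch computes w and hw from the labeling functions χ and ψ of a toy decomposition (the decomposition itself is invented for illustration):

# chi maps each vertex of the decomposition tree to a set of variables;
# psi maps it to the functions placed there, given here by their scopes.
chi = {1: {"x1", "x2"}, 2: {"x2", "x3", "x4"}}
psi = {1: [{"x1", "x2"}], 2: [{"x2", "x3"}, {"x3", "x4"}]}

w = max(len(chi[v]) - 1 for v in chi)   # tree width: max cluster size minus 1
hw = max(len(psi[v]) for v in psi)      # hypertree width: max functions per cluster

# Condition 4: every variable of a cluster is covered by the scope of
# some function placed in that cluster.
assert all(chi[v] <= set().union(*psi[v]) for v in chi)
print(w, hw)  # 2 2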
Complexity bounds. The time complexity of algorithm CTE, when executed on a tree decomposition T with tree width w, has been shown to be

O((r + m) · deg · k^(w+1)),   (1)
where r is the number of functions in the problem, m the number of clusters in T, and deg the maximum degree in T. The space complexity is O(m · k^w) [4]. By virtue of using a tree decomposition, however, the hypergraph structure of the problem is completely ignored. Therefore, bound (1) does not account for any determinism that might be present in the function specifications. To that end, if T is also a hypertree decomposition, algorithm CTE can be adapted to exploit the hypergraph structure. Assuming T has hypertree width hw, the time complexity of applying CTE can be shown to be

O(m · deg · hw · log t · t^hw),   (2)
where t bounds the size of the relational representation of each function in the problem (i.e., the number of zero-cost tuples in CSPs and the number of non-zero probability tuples in belief networks). Space complexity is O(t^hw) [3, 4]. We note that bound (2) indeed takes determinism into account by using the parameter t, which denotes the number of relevant tuples in a function table. It is clear that t ≤ k^r ≤ k^w, where r is the maximum function arity. Hence bound (2) can only yield tighter results when the problem instance possesses a high degree of determinism. While it has been shown that hypertree decompositions are strictly more general than tree decompositions, it is unclear how the asymptotic bounds compare for practical problem instances.
instance        n     k   r   t       w   hw  R
Genetic linkage
pedigree1       334   4   5   32      16  13  9.934
pedigree18      1184  5   5   50      22  18  15.204
pedigree20      437   5   4   50      24  16  10.408
pedigree23      402   5   4   50      29  15  5.214
pedigree25      1289  5   5   50      27  19  13.408
pedigree30      1289  5   5   50      25  18  13.107
pedigree33      798   4   5   32      31  21  12.944
pedigree37      1032  5   4   32      22  13  4.190
pedigree38      724   5   4   50      18  10  4.408
pedigree39      1272  5   4   50      25  18  13.107
pedigree42      448   5   4   50      24  16  10.408
pedigree50      514   6   4   72      18  10  4.567
pedigree7       1068  4   4   32      40  23  10.536
pedigree9       1118  7   4   50      31  21  9.480
pedigree13      1077  3   4   18      35  29  19.704
pedigree19      793   5   5   50      27  21  16.806
pedigree31      1183  5   5   50      34  29  25.505
pedigree34      1160  5   4   32      32  25  15.262
pedigree40      1030  7   5   98      31  24  21.591
pedigree41      1062  5   5   50      35  25  18.010
pedigree44      811   4   5   32      28  22  16.256
pedigree51      1152  5   4   50      44  33  25.311
Mastermind puzzle game (WCSP)
mm 03 08 03     1220  2   3   4       21  14  2.107
mm 03 08 04     2288  2   3   4       31  20  2.709
mm 03 08 05     3692  2   3   4       40  25  3.010
mm 04 08 03     1418  2   3   4       26  17  2.408
mm 04 08 04     2616  2   3   4       38  24  3.010
mm 10 08 03     2606  2   3   4       56  34  3.612
Coding networks
BN 126          512   2   5   16      56  21  8.429
BN 127          512   2   5   16      55  22  9.934
BN 128          512   2   5   16      50  20  9.031
BN 129          512   2   5   16      54  21  9.031
BN 130          512   2   5   16      53  21  9.332
BN 131          512   2   5   16      53  21  9.332
BN 132          512   2   5   16      52  21  9.633
BN 133          512   2   5   16      56  21  8.429
BN 134          512   2   5   16      55  21  8.730
Dynamic Bayesian networks
BN 21           2843  91  4   208     7   4   -4.441
BN 23           2425  91  4   208     5   3   -2.841
BN 25           1819  91  4   208     5   2   -5.159
BN 27           3025  5   7   3645    10  2   0.134
BN 29           24    10  6   999999  6   2   6.000
Digital circuits
c432.isc        432   2   10  512     28  22  51.175
c499.isc        499   2   6   32      25  25  30.103
s386.scan       172   2   5   16      19  8   3.913
s953.scan       440   2   5   16      66  38  25.889
Radio frequency assignment (WCSP)
CELAR6-SUB0     16    44  2   1302    8   4   -0.689
CELAR6-SUB1-24  14    24  2   301     10  5   -1.409
CELAR6-SUB1     14    44  2   928     10  5   -1.597
CELAR6-SUB2     16    44  2   928     11  6   -0.273
CELAR6-SUB3     18    44  2   928     11  6   -0.273
CELAR6-SUB4-20  22    20  2   396     12  6   -0.026
CELAR6-SUB4     22    44  2   1548    12  6   -0.583
Table 1. Selected experimental results comparing the tree width and hypertree width based bounds.
3 EXPERIMENTAL RESULTS
We evaluated empirically the tree width and hypertree width bounds on 112 practical probabilistic networks and 30 constraint networks. Problem instances were obtained from various sources; all of them are available online (repository at http://graphmod.ics.uci.edu/). To obtain a tree decomposition of a problem, we perform bucket elimination along a minfill ordering (random tie breaking, optimum over 20 iterations). The tree decomposition is then extended to a hypertree decomposition by the method described in [2], where variables in a decomposition cluster are greedily covered by functions. For each problem instance we collected the following statistics: the number of variables n, the maximum domain size k, the maximum function arity r, and the maximum function tightness t. We also report the best tree width and hypertree width found in the experiments described above. We define the measure R := log10(t^hw / k^w). This compares the two dominant factors of the w bound (1) and the hw bound (2). If R is positive, it signifies how many orders of magnitude tighter the w bound is when compared to the hw bound, and vice versa for negative values of R. Some selected instances are shown in Table 1; the full set of results is available in an extended version of this paper [1]. Out of the 112 belief networks, the hw bound was only superior for 5 instances, and not by many orders of magnitude. On the other hand, for genetic linkage instances with considerable determinism in their CPTs, the hw bound is significantly worse, as is the case for most other belief networks. This situation does not change much for constraint problems, except for radio frequency assignment, where the hw bound fares somewhat better, but only by a small margin. In summary, we can conclude that, in order for the hypertree width bound to be competitive with, or even superior to, the tree width bound, problem instances need to comply with several conditions; foremost among these are very tight function specifications. The latter is promoted by large variable domains and high function arity, which we found to be not the case for the majority of practical problem instances.
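The R values in Table 1 can be reproduced directly from the reported parameters, as in the short Python check below (computing in log space to avoid overflow):

import math

def R(k, t, w, hw):
    # R = log10(t**hw / k**w), evaluated as hw*log10(t) - w*log10(k).
    return hw * math.log10(t) - w * math.log10(k)

print(round(R(k=4, t=32, w=16, hw=13), 3))   # pedigree1 -> 9.934
print(round(R(k=2, t=512, w=28, hw=22), 3))  # c432.isc -> 51.175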
4 CONCLUSION
The contribution of this paper is in exploring empirically the practical benefit of the hypertree width compared with the tree width in bounding the complexity of algorithms over given problem instances. Statistics collected over 112 Bayesian network instances and 30 weighted CSPs provided interesting, yet somewhat sobering, information. We confirmed that while the hypertree width is always smaller than the tree width, the complexity bound it implies is often inferior to the bound suggested by the tree width. Only when problem instances possess substantial determinism and the functions have large arity can the hypertree width provide bounds that are tighter and therefore more informative than the tree width. This empirical observation raises doubts regarding the need to obtain good hypertree decompositions beyond the already substantial effort that has gone into the search for good tree decompositions, which has been ongoing for three decades now.
ACKNOWLEDGEMENTS This work was partially supported by NSF grant IIS-0713118 and NIH grant R01-HG004175-02.
REFERENCES
[1] R. Dechter, L. Otten, and R. Marinescu, 'On the Practical Significance of Hypertree vs. Tree Width', Technical Report, University of California, Irvine, (2008).
[2] G. Gottlob, M. Grohe, N. Musliu, M. Samer, and F. Scarcello, 'Hypertree Decompositions: Structure, Algorithms, and Applications', International Workshop on Graph-Theoretic Concepts in Computer Science, (2005).
[3] G. Gottlob, N. Leone, and F. Scarcello, 'A comparison of structural CSP decomposition methods', Artificial Intelligence, (2000).
[4] K. Kask, R. Dechter, J. Larrosa, and A. Dechter, 'Unifying tree decompositions for reasoning in graphical models', Artificial Intelligence, (2005).
9. Planning and Scheduling
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-917
A New Approach to Planning in Networks Jussi Rintanen NICTA & the Australian National University Canberra, Australia Abstract. Control of networks, such as those for transportation, power distribution, and communication, provides challenges to planning and scheduling. Many problems can be defined in terms of a basic state space model, but more general problems require an expressive language for talking about the topology and connectivity of the system, which are outside the scope of standard planning languages. In this work we introduce a general framework for defining planning languages for networked systems, with the capability to express properties of connectivity and topology of such systems.
1 Introduction
Other areas of computer science that use the transition system model of actions include computer-aided verification and validation, where reachability analysis and model-checking problems are described in languages like SMV [6] and PROMELA [5]. These languages describe transition systems in terms of concepts naturally occurring in the relevant application areas of the tools. Much of the high-level technological infrastructure has the form of networks: transportation, telecommunications, power distribution and water distribution are all based on networks with clearly definable nodes and edges connecting them. Also, most of the standard planning benchmark problems involve networks. Even minor extensions to many of these problems are difficult to express compactly in standard planning languages. For network-structured applications we propose the use of high-level planning languages with network features, as well as efficient algorithms for directly solving the problems expressed in these languages. A practical requirement for this approach to be feasible is that the overall complexity is not increased, in comparison to expressing the same problems in a standard classical planning language. A nominally "tractable" reduction of network-planning problems to standard planning languages is often possible, but this involves, in all but the simplest cases, a prohibitively high increase in the size of the problem instance and solution times. To illustrate the differences between the approaches, consider a real-world planning problem with network structure that has been considered in earlier work on AI planning: the power-supply restoration problem for electricity distribution networks [7]. The first modeling of this problem in a classical planning language similar to PDDL leads to huge problem descriptions even for small networks [1]. Using the axioms of PDDL [4] to express network connectivity properties leads to a much more practical representation of the problem [2]. This formulation is compact but arguably not very natural, as the axiom mechanism (inductive definitions) doesn't per se directly represent any natural features of this domain.
2 Problem Definition
We model systems in terms of a set of nodes and connections between them. The properties of each node are expressed in terms of state variables. Every node has the same set of state variables, but as the actions need not treat the nodes uniformly, this is not a restriction. In this paper we only consider a deterministic planning problem, similarly to classical planning, and hence we have a unique initial state for the system. The state of the system consists of the connections and the values of the state variables.
Definition 1 (State) For given sets A of state variables, V of nodes and E of (atomic) connections, a state is a pair (v, e) where
• v : V × A → {0, 1} assigns a value to each state variable at each node, and
• e : E → 2^(V×V) assigns each (atomic) connection a binary relation.
Definition 2 A network is defined as (A, V, E, O, I, G) where
• A is the set of (Boolean) state variables,
• V is the set of nodes,
• E is the set of (atomic) connections between the nodes,
• O is the set of actions (to be defined later),
• I is the initial state, and
• G is a modal formula representing the goal of the system (to be defined later).
2.1 Network Properties
We employ modal logic to express properties of systems. The modalities express connections between nodes.
• Atomic connections c ∈ E are connections.
• If c1 and c2 are connections then so are c1; c2, c1 ∪ c2 and c1 ∩ c2.
• If c is a connection then so are c∗ and c⁻¹.
• If φ is a formula then φ? is a connection.
Composite connection c1; c2 between nodes n and n′ means that n′ can be reached from n by first following c1 to some intermediate node and from there c2 to n′. Connection c1 ∪ c2 expresses disjunctivity: there is either connection c1 or c2. Analogously, c1 ∩ c2 expresses conjunctivity. The connection c∗ represents the reflexive transitive closure of c. The connection c⁻¹ is the inverse of c: a connection going from n′ to n whenever c goes from n to n′. The connection φ? conditionally connects a node with itself if φ is true.
Definition 3 The meaning [[c]]_s of a connection c in a state s = (v, e) of S = (A, V, E, O, I, G) is defined as follows.
• [[c]]_s = e(c) if c ∈ E
• [[c; c′]]_s = {(x, z) | (x, y) ∈ [[c]]_s, (y, z) ∈ [[c′]]_s}
• [[c ∪ c′]]_s = [[c]]_s ∪ [[c′]]_s
• [[c ∩ c′]]_s = [[c]]_s ∩ [[c′]]_s
• [[c∗]]_s = ([[c]]_s)∗
• [[c⁻¹]]_s = {(n, m) | (m, n) ∈ [[c]]_s}
• [[φ?]]_s = {(t, t) ∈ V × V | s |=_t φ}
Here the operations ∪, ∩ and ∗ on the right hand sides are the set-theoretic union and intersection and the reflexive transitive closure. The connections are used as a part of a modal language that includes classical propositional logic. The atomic formulas include the propositional variables and the names of nodes.
• The constants ⊥ and ⊤ (for false and true) are formulas.
• a, for a state variable a ∈ A, is a formula.
• n, for a node n ∈ V, is a formula.
• φ ∨ ψ is a formula, where φ and ψ are formulae.
• ¬φ is a formula, where φ is a formula.
• [c]φ is a formula if φ is a formula and c is a connection.
• n: φ is a formula if φ is a formula and n ∈ V is a node.
The modal operators [c] represent universal quantification over all nodes that are reachable by a path described by c. The formula n is true in the node n and false elsewhere. Formulae n: φ refer to the truth of φ in node n. The meaning of →, ∧ and ↔ is defined in the usual way, as is ⟨c⟩, by ⟨c⟩φ = ¬[c]¬φ.
Example 1 The next formula is true in cities (nodes) from which one can fly to a tropical destination with a direct flight or two flights without changing planes in the U.S.:
⟨flight ∪ (flight; ¬US?; flight)⟩tropics
The next formula is true if there are paths in a communications network from the current node to node n that go through a designated center node and only visit nodes that are safe on the way:
⟨(link; safe?)∗; center?; (safe?; link)∗⟩n
At this point it is apparent that our logic is a variant of the propositional dynamic logic (PDL) [3] with a mechanism for referring to the names of nodes. A truth-definition for this modal logic can be given in the obvious way. We define s |= φ iff s |=_n φ for all n ∈ V.
2.2 Actions
Actions can change the values of the state variables associated with the nodes and the connections between the nodes. An action consists of a precondition, which determines the circumstances under which the action can be taken, as well as the effects, which indicate when and how the values of the state variables change and which connections between nodes are added or removed.
Definition 4 (Action) An action is a pair ⟨p, e⟩ where p is a formula and e is a set of conditional effects q ⇒ r, where q is a formula and r is a set of literals. The literals that can be effects in an action are n: a, ¬n: a, (n, c, n′) and ¬(n, c, n′), where a ∈ A, n ∈ V, n′ ∈ V and c ∈ E.
Definition 5 (Successor state) Let S = (A, V, E, O, I, G) be a system. Let ⟨p, e⟩ be an action and s = (v, g) a state. The action is executable if s |= p and the set F = ∪{r | (q ⇒ r) ∈ e, s |= q} is consistent. The successor state of s is s′ = (v′, g′) where
• v′(n, a) = 1 if n: a ∈ F, 0 if ¬n: a ∈ F, and v(n, a) otherwise, for all n ∈ V and a ∈ A;
• g′(c) = (g(c) \ {(n, n′) | ¬(n, c, n′) ∈ F}) ∪ {(n, n′) | (n, c, n′) ∈ F} for all c ∈ E.
3 Examples
Many of the standard planning benchmark problems can be viewed as consisting of nodes and connections between them.
Example 2 In Blocks World the blocks are the nodes and the on-relation constitutes the connections. Moving x from y onto z is defined by
⟨x: ⟨on⟩y ∧ x: [on⁻¹]⊥ ∧ z: [on⁻¹]⊥, {¬(x, on, y), (x, on, z)}⟩.
Here x: [on⁻¹]⊥ says that false is true in all nodes related to x by on, meaning that there is no such node, i.e., the block is clear. We can introduce an action that allows moving stacks of blocks:
⟨x: ⟨on⟩y ∧ z: ¬⟨on∗⟩x ∧ z: [on⁻¹]⊥, {¬(x, on, y), (x, on, z)}⟩
Example 3 With the network planning language it is easy to express movement from one node to any of the reachable nodes. Here a and b are names of locations and p is an object that moves:
⟨a: (p ∧ ⟨road∗⟩b), {¬a: p, b: p}⟩
Further extensions are possible. For example, we can require that property φ is satisfied after each road segment along the path:
⟨a: (p ∧ ⟨(road; φ?)∗⟩b), {¬a: p, b: p}⟩
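To illustrate the semantics of Definition 3, the following Python sketch evaluates connections as binary relations over the nodes; the toy road network and all names are invented for the example:

def compose(r1, r2):
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}

def star(r, nodes):
    # Reflexive transitive closure of r.
    closure = {(n, n) for n in nodes} | set(r)
    while True:
        extended = closure | compose(closure, closure)
        if extended == closure:
            return closure
        closure = extended

def inverse(r):
    return {(m, n) for (n, m) in r}

def test(pred, nodes):
    # phi?: connect a node with itself iff the formula holds there.
    return {(n, n) for n in nodes if pred(n)}

nodes = {"a", "b", "c", "d"}
road = {("a", "b"), ("b", "c"), ("c", "d")}

print(("a", "d") in star(road, nodes))                 # True: d reachable from a
print(("d", "a") in star(inverse(road), nodes))        # True: via road^-1
safe = test(lambda n: n != "b", nodes)                 # every node but b is "safe"
print(("a", "d") in star(compose(road, safe), nodes))  # False: b blocks the path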
4 Conclusions
We have considered a language for planning in networks. The language is directly relevant to many application domains with network structure. The network model is even more general, allowing one to express interesting properties of benchmarks that at the surface level are not about networks. This is a consequence of many problems having a relational/graph representation and of the support of the language for expressing properties of graphs.
Acknowledgements The research was funded by the Australian Government's Department of Broadband, Communications and the Digital Economy and the Australian Research Council through NICTA and the SuperCom project.
REFERENCES
[1] P. Bertoli, A. Cimatti, J. K. Slaney, and S. Thiébaux, 'Solving power supply restoration problems with planning via symbolic model checking', in ECAI'02, pp. 576–580, (2002).
[2] B. Bonet and S. Thiébaux, 'GPT meets PSR', in ICAPS'03, pp. 102–112, (2003).
[3] Michael J. Fischer and Richard E. Ladner, 'Propositional dynamic logic of regular programs', J. Computer and System Sciences, 18(2), 194–211, (1979).
[4] M. Ghallab, A. Howe, C. Knoblock, D. McDermott, A. Ram, M. Veloso, D. Weld, and D. Wilkins, 'PDDL - the Planning Domain Definition Language, version 1.2', Technical report, Yale Center for Computational Vision and Control, Yale University, (1998).
[5] Gerald J. Holzmann, Design and Validation of Computer Protocols, Prentice Hall, 1991.
[6] Kenneth L. McMillan, Symbolic Model Checking, Kluwer, 1993.
[7] S. Thiébaux and M.-O. Cordier, 'Supply restoration in power distribution systems - a benchmark for planning under uncertainty', in ECP'01. Springer, (2001).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-919
Detection of unsolvable temporal planning problems through the use of landmarks1 E. Marzal and L. Sebastia and E. Onaindia2 Abstract. Deadline constraints have been recently introduced in PDDL3.0. The results obtained in the constraint domains in the last Planning Competition show that planners are not yet fully competitive. When dealing with deadline constraints, the number of feasible solutions for a problem is reduced, and thus the ability to detect unsolvability is especially relevant. In this paper we present a new approach, based on the use of temporal landmarks, for the detection of unsolvable temporal planning problems.
1 Introduction
The last planning competition (IPC5) [3] introduced the new language PDDL3.0 [4] to allow the user to express strong and soft constraints about the structure of the plans. Deadline constraints, expressed through modal operators such as within, always-within or sometime-after [4], were extensively tested in several domains. Only two planners, MIPS-XXL [2] and SGPlan [6], participated in the time constraints track; MIPS-XXL could only solve a few problems of each domain, and SGPlan solved many more problems but returned worse quality solution plans. A new test of the last available version of SGPlan revealed that this planner was not able to identify unsolvable problems. Moreover, SGPlan returns the same output when it cannot find a solution and when the problem is actually unsolvable. This brings up the issue that handling deadline constraints introduces a major difficulty in planning, as propositions are now bounded to hold within a specific time interval. In this paper we present a preliminary approach, based on the extraction of landmarks, capable of determining whether a temporal planning problem with deadline constraints is unsolvable. The system builds a temporal landmarks graph which represents a skeletal plan of the solution. If the graph is not consistent then the system reports that the problem is unsolvable; otherwise, there is no guarantee that a satisfying plan exists. It is not generally possible to prove the unsolvability of problems, but the experiments will show that our approach was able to identify all the unsolvable problems that were tested. On the other hand, although the system does not yet compute a solution when the problem is solvable, the landmarks graph comprises a correct partial plan which can be further refined into an executable solution.
This work has been partially funded by Consolider Ingenio 2010 CSD200700022 project and by the Spanish Government TIN2005-08945-C06-06 project. 2 Universidad Politecnica of Valencia. e-mail: {emarzal, lstarin, onaindia}@dsic.upv.es
2 System overview
The input of our system is a PDDL3.0 problem which contains within, always-within and sometime-after constraints [4], and it returns a message in case the planning problem is unsolvable. First, we extract the set of landmarks [5] of the problem and we build a landmarks graph by adding causal relationships between them. Second, we associate some temporal intervals to each landmark. These intervals, together with the causal relationships, define a set of constraints that are inserted in an agenda. Then a CSP solver performs the consistency checking. If an inconsistency is found in the graph, the system will return a message saying that a feasible solution plan does not exist.
3 Temporal model
In a STRIPS context, a landmark [5] is a literal that must be true at some point in any solution plan. Our process for landmark extraction is similar to the method described in [8]. We establish two types of causal relationships between landmarks.
Definition 1 There is a dependency relationship between two landmarks li and lj (li ≺d lj) if for every temporal plan that achieves lj at time t (lj ∈ St) from state I there exists at least one state St′ prior to St which contains li.
Definition 2 Given two landmarks li and lj such that li ∈ St′, lj ∈ St and li ≺d lj, there is a necessary relationship between li and lj (li ≺n lj) if every temporal plan that achieves lj at time t from state St′ contains a single action a such that li ∈ Cond(a) ∧ lj ∈ AddEff(a).3
The set of landmarks and the relationships between them define a landmarks graph. Let g be a top-level goal that must be obtained at time T (specified by a within constraint) and let l be a landmark such that l ≺{d,n} ... ≺{d,n} g. We define the following intervals for l:
• the validity interval (denoted as [minv, maxv]) is the temporal interval when l will be true in the plan;
• the necessity interval (denoted as [minn, maxn]) is the set of time points when l must be true in order to satisfy other landmarks;
• the generation interval (denoted as [ming, maxg]), where ming and maxg represent respectively the earliest and latest start time of l in order to satisfy g at a time less than or equal to T.
3 A PDDL3.0 durative action a contains the following elements: Conditions Cond(a), which denote the set of conditions to be guaranteed over the execution of the action; Duration, a positive value represented by dur(a) ∈ R+; Effects Eff(a), classified into AddEff(a) as the set of all add effects and DelEff(a) as the set of all delete effects.
Initially, ming is set to the time of the earliest temporal state reached from I where landmark l appears (additionally, minv = minn = ming). maxg, maxv and maxn will be set to the temporal bound T of the corresponding top-level goal. All these values (except ming) will eventually be updated. The relationships between validity, necessity and generation intervals define a set of constraints between the two endpoints of an interval (for example, minv ≤ maxv) and constraints between the endpoints of different intervals:

ming ≤ minv ≤ minn    maxg ≤ maxv
minv ≤ maxg           maxn ≤ maxv
A causal relationship between two landmarks li and lj implicitly establishes a temporal constraint of the form endpoint(li) + distance ≤ endpoint(lj). If li ≺n lj, we calculate two temporal distances: the minimum distance between the first (respectively, the last) time instant when li is needed and the time instant when lj is generated by the set of actions {ai} such that li ∈ Cond(ai) ∧ lj ∈ AddEff(ai). If li ≺d lj, we calculate two distance values, namely DIS_F(li, lj) and DIS_L(li, lj), by recursively computing the minimum distance between all the literals in the path from a state that contains li to a state that contains lj. Therefore, a causal relationship between two landmarks establishes the following relations between the endpoints of the intervals:

∃ li ≺{d,n} lj → maxg(lj) ≥ maxg(li) + DIS_F(li, lj)
∃ li ≺{d,n} lj → minv(lj) ≥ minv(li) + DIS_F(li, lj)
minn(li) = min(minv(lj) − DIS_F(li, lj)), ∀ lj : ∃ li ≺n lj
maxn(li) = max(maxg(lj) − DIS_L(li, lj)), ∀ lj : ∃ li ≺n lj

Our system is capable of handling within, always-within and sometime-after constraints. The introduction of these deadline constraints modifies the landmarks intervals in the following way:

Constraint               In our model
within t l               maxg(l) ≤ t
always-within t li lj    minv(li) ≤ maxv(lj) − t
sometime-after li lj     maxg(li) ≤ maxg(lj), minv(li) ≤ minv(lj)
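All of the endpoint constraints above have the difference form x − y ≤ c, so one simple way to realize the consistency check is shortest-path propagation over the constraint graph. The following is a minimal Bellman-Ford-style Python sketch with an invented toy agenda (a landmark with earliest generation time 4, a goal at least 5 time units later, and a deadline of 8, which is infeasible); the paper's actual CSP solver may of course proceed differently:

def consistent(variables, constraints):
    dist = {v: 0.0 for v in variables}   # relax from a virtual source
    for _ in range(len(variables) + 1):
        changed = False
        for x, y, c in constraints:      # each constraint reads: x - y <= c
            if dist[y] + c < dist[x]:
                dist[x] = dist[y] + c
                changed = True
        if not changed:
            return True                  # fixed point reached: consistent
    return False                         # still relaxing: negative cycle

agenda = [
    ("z", "t_l", -4),    # t_l >= 4        (z - t_l <= -4, z marks time 0)
    ("t_l", "t_g", -5),  # t_g >= t_l + 5  (t_l - t_g <= -5)
    ("t_g", "z", 8),     # t_g <= 8        (t_g - z <= 8)
]
print(consistent(["z", "t_l", "t_g"], agenda))  # False: no feasible schedule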
Additionally, we also take into account inconsistencies between landmarks. If li and lj are mutex [1] at time t, we can find two different situations. If there exists a causal relationship between both landmarks, we set maxv(li) to min(maxg(li), maxg(lj) − DIS_L(li, lj)) and, consequently, maxn(li) = min(maxv(li), maxn(li)). Then we propagate this new information to the rest of the graph and insert the constraint maxv(li) ≤ minv(lj) in the agenda. In case there is no causal relationship between li and lj, we only insert the disjunctive constraint maxv(li) ≤ minv(lj) ∨ maxv(lj) ≤ minv(li) in the agenda. A CSP solver is invoked to perform consistency checking. This process will help us restrict the landmarks intervals. In case an inconsistency is found (a constraint that cannot be satisfied), the algorithm will display the message "No solution exists". As the experiments will show, our system returned the message "No solution exists" for all the unsolvable problems we tested. When no inconsistency is found in the landmarks graph, there is no guarantee the problem is solvable (or unsolvable), as it is not generally possible to prove the unsolvability of problems. However, we have observed the landmarks graph is informative enough to identify unsolvability in all the cases we have tested; thus we can
affirm the landmarks graph allows us to experimentally detect unsolvability. On the other hand, the graph will comprise a correct partial solution in case the problem is solvable, and this solution can be further expanded until a complete plan is obtained.
4 Experiments
We have tested our approach on problems from the Pipesworld domain and a modified version of the Driverlog domain including within, always-within and sometime-after constraints [3, 4]. We have compared our results with the last available version of SGPlan [7].
Pipesworld. SGPlan did not solve any problem from this domain at IPC5. In contrast, our model processed all tested problems and found no indication of unsolvability in any of them. We then changed the time limits in the within and always-within constraints and ran the first ten problems again. SGPlan could only solve the problems that contained very loose deadlines. This indicates that SGPlan fails at finding solutions for problems with very restrictive temporal constraints and few feasible solution plans.
Driverlog. We ran both SGPlan and our model on ten problems from this domain; four out of the ten problems were solvable and the remaining six problems were unsolvable. Our model identified the six unsolvable problems and it did not flag the four remaining problems as unsolvable. However, SGPlan only returned solution plans for two of the solvable problems. For the remaining eight problems, SGPlan did not provide any response, neither a solution plan nor "Solution not found" nor "No solution exists".
5 Conclusions
In this paper, we have presented a preliminary but promising approach to deal with temporal planning problems with deadline constraints. Our model allocates landmarks in time and, through the calculation of causal relationships and other constraints between them, it draws a temporal picture of the problem to detect unsolvability. This approach is very appropriate for planning problems with very restrictive deadline constraints. Currently, we are studying the properties of the landmarks graph in order to detect solvability. Our principal focus in this research is to show that if the algorithm obtains a complete and conflict-free landmarks graph and the agenda does not contain any disjunctive constraint, then solvability can be ensured.
REFERENCES [1] A. Blum and M. Furst, ‘Fast planning through planning graph analysis’, Artificial Intelligence, 90(1-2), 281–300, (1997). [2] S. Edelkamp, S. Jabbar, and M. Nazih, ‘Large-scale optimal PDDL3 planning with MIPS-XXL’, in ICAPS-2006 – Fifth International Planning Competition, pp. 28–30, (2006). [3] A. Gerevini and D. Long. ICAPS-2006 Fifth International Planning Competition, 2006. http://zeus.ing.unibs.it/ipc-5/. [4] A. Gerevini and D. Long, ‘Plan constraints and preferences in PDDL3’, in ICAPS-2006 – Fifth International Planning Competition, pp. 7–13, (2006). [5] J. Hoffmann, J. Porteous, and L. Sebastia, ‘Ordered landmarks’, Journal of Artificial Intelligence Research, 22, 215–287, (2004). [6] C. W. Hsu, B. W. Wah, R. Huang, and Y. X. Chen, ‘New features in SGPlan for handling preferences and constraints in PDDL3.0’, in ICAPS2006 – Fifth International Planning Competition, pp. 39–41, (2006). [7] C. W. Hsu, B. W. Wah, R. Huang, and Y. X. Chen. The SGPlan planner, 2007. http://manip.crhc.uiuc.edu/programs/SGPlan/index.html. [8] L. Zhu and R. Givan, ‘Landmark Extraction via Planning Graph Propagation’, in In Printed Notes of ICAPS’03 Doctoral Consortium, (2003). Trento, Italy.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-921
A Planning Graph Heuristic for Forward-Chaining Adversarial Planning Pascal Bercher and Robert Mattmüller1
Abstract. In contrast to classical planning, in adversarial planning, the planning agent has to face an adversary trying to prevent him from reaching his goals. In this paper, we investigate a forwardchaining approach to adversarial planning based on the AO* algorithm. The exploration of the underlying AND/OR graph is guided by a heuristic evaluation function, inspired by the relaxed planning graph heuristic used in the FF planner. Unlike FF, our heuristic uses an adversarial planning graph with distinct proposition and action layers for the protagonist and antagonist. First results suggest that in certain planning domains, our approach yields results competitive with the state of the art.
1 Introduction
In many planning problems, the environment in which the agent acts is not static. The exogenous dynamics can be caused by “nature” or by one or more other agents sharing the same environment. Other agents can behave neutrally (simply following their own independent agenda or otherwise acting unpredictably), adversarially, or cooperatively with respect to the protagonist’s goals. Here, we focus on adversarial problems. We assume complete observability, i.e., a plan will be a mapping from physical states to applicable actions. A usual approach to conditional (adversarial) planning is planning as model checking [5], whereas planning as heuristic search [3] tends to yield best results for static, deterministic problems. Both approaches are also used in general game playing [7]. Related work includes the dynamic programming approach by Hansen and Zilberstein [8], and, for partially observable problems, heuristic search in the belief space as implemented in the POND planner by Bryce et al. [4].
2 Adversarial Planning
We consider discrete adversarial planning problems under full observability with alternating turns. More formally, similar to STRIPS problems [6], an adversarial planning problem is given by a set of states S = 2^P over a finite set of propositions P, an initial state I ⊆ P, two finite sets of operators Op and Oa (controlled by the protagonist p and antagonist a, respectively), and a goal condition G ⊆ P. Operators have the form o = ⟨pre, add, del⟩, where pre ⊆ P is the precondition and add, del ⊆ P are the add and delete lists of o. An operator o is applicable in a state s ⊆ P iff pre ⊆ s, and if applied, leads to the successor state s′ = (s \ del) ∪ add. A state s is a goal state iff G ⊆ s. The players take alternating turns, starting with
1 University of Freiburg, Germany, {bercherp,mattmuel}@informatik.uni-freiburg.de. This work was partly supported by the German Research Council (DFG) as part of the Transregional Collaborative Research Center "Automatic Verification and Analysis of Complex Systems" (SFB/TR 14 AVACS). See www.avacs.org for more information.
the protagonist controlling Op. We assume that the player to move is known in each state. The protagonist tries to reach a goal state in a finite number of steps, whereas the antagonist tries to prevent him from doing so. A winning strategy for the protagonist is a function mapping states in which he is to move to applicable operators, such that, against each possible strategy of the antagonist, a goal state will be reached in a finite number of steps. Such an adversarial planning problem naturally corresponds to the problem of evaluating an AND/OR graph over the state space. OR (AND) nodes correspond to states where the protagonist (antagonist) is to move and arcs correspond to operator applications. The relevant part of a winning strategy for the protagonist corresponds to an acyclic subgraph containing (a) the initial state, (b) for each contained non-goal AND node all outgoing arcs and their target nodes, (c) for each contained non-goal OR node exactly one outgoing arc and its target node, and no further nodes or arcs, such that all leaf nodes are goal states.
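A minimal Python sketch of the state model just defined, with states as sets of propositions and operator application s′ = (s \ del) ∪ add (the toy operator is adapted from the rocket example that follows):

# States are frozensets of propositions; operators are (pre, add, delete) triples.
def applicable(state, op):
    pre, add, delete = op
    return pre <= state

def apply_op(state, op):
    pre, add, delete = op
    return (state - delete) | add

def is_goal(state, goal):
    return goal <= state

# Toy operator: fly from London to Paris, consuming the fuel.
state = frozenset({"inAL", "full"})
fly = (frozenset({"inAL", "full"}), frozenset({"inAP"}), frozenset({"inAL", "full"}))
if applicable(state, fly):
    state = apply_op(state, fly)
print(is_goal(state, frozenset({"inAP"})))  # True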
[Figure 1. Cargo transport from London to Paris. The initial state is depicted on the upper left hand side, goal states are doubly framed. The protagonist moves in elliptic, the antagonist in rectangular nodes; arcs are labeled with the actions load, unload, fly, fuel, and nop.]
Consider for example a modified version of the Simple Rocket domain [2] with one airplane/rocket whose tank can be either full or empty, a set of cities, and a set of cargo packages which can be loaded and unloaded. Possible actions are flying from one city to another if the tank is full, loading a package into the plane, unloading a package from the plane unless the same package has just been loaded without an intervening flying action, fueling the plane if necessary, and performing no-ops. Flying and loading can only be done by the protagonist, fueling only by the antagonist, and unloading and no-ops by both, with the antagonist being barred from two consecutive no-ops without a flight in between. The goal of the protagonist is to transport the packages to specified target cities. The agents take turns, starting with the protagonist.
| cities | pack's | BFS time | BFS mem | BFS nodes | AO*+FF time | AO*+FF mem | AO*+FF nodes | AO*+adv. FF time | AO*+adv. FF mem | AO*+adv. FF nodes | MBP time | MBP BDD nodes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 0.014 | 1 | 44 | 0.025 | 1 | 37 | 0.026 | 1 | 37 | 0.000 | 6601 |
| 2 | 2 | 0.048 | 2 | 152 | 0.071 | 1 | 88 | 0.072 | 1 | 78 | 0.016 | 84424 |
| 3 | 3 | 0.354 | 6 | 2106 | 0.202 | 6 | 625 | 0.260 | 7 | 628 | 0.380 | 23068 |
| 3 | 4 | 0.870 | 49 | 8211 | 0.463 | 28 | 1871 | 0.232 | 17 | 605 | 1.780 | 165718 |
| 3 | 5 | 5.556 | 159 | 43785 | 1.437 | 98 | 6917 | 0.321 | 23 | 794 | 9.041 | 365272 |
| 3 | 6 | 87.691 | 987 | 237264 | 16.323 | 397 | 63498 | 1.157 | 25 | 4164 | 44.287 | 546666 |
| 4 | 6 | — | 3098 | 722750 | 76.718 | 698 | 169349 | 82.701 | 642 | 194304 | 130.064 | 834704 |
| 4 | 7 | — | 2192 | 771629 | 373.553 | 1840 | 510738 | 99.639 | 1487 | 225544 | — | — |
| 4 | 8 | — | 3889 | 912816 | — | 3356 | 738520 | — | 5440 | 914602 | — | — |

Figure 2. Experimental results for the transportation benchmark problems. We used a Java implementation, running on a machine with two Quad Xeon processors, 2.66 GHz, and a memory limitation of 16 GB RAM. The time-out, indicated by dashes, was set to ten minutes. Times are given in seconds, memory usage in MB. Memory usage and node counts in case of time-outs are the current values when the time-out occurred.
Assume two cities, Paris and London, one package to be transported from London (atCL) to Paris (atCP), and the plane initially in London (inAL) with its tank empty (¬full). The variable “nop” is true iff the adversary has already performed a no-op since the last flight. A winning strategy for the protagonist is depicted in Figure 1.
3 Search Algorithm and Heuristic
As search algorithm, we used AO* [10] with maximization of cost estimates at AND nodes. The performance of the AO* algorithm depends on the choice of the evaluation function applied to the fringe nodes. To compute this function, we used an adaption of the graphplan-based [2] distance heuristic used in the FF planning system [9]. Just like the heuristic of the FF planning system, to which we will refer as FF heuristic, the adversarial FF heuristic uses relaxed operators, which we get by ignoring delete lists. For each agent ag ∈ {p, a}, let O+_ag be the set of relaxed operators he controls. Fig. 3 shows the pseudocode of the adversarial FF heuristic. Lines 1 to 3 are equal to the forward step of the FF heuristic, except that there is not only one set of relaxed operators, but two distinct sets O+_ag that belong to the two agents ag. Lines 4 to 11 correspond to the backward step of the FF heuristic. In addition, in line 12, the selected operators are put in two distinct sets SO+_ag, one for each agent. After these two sets have been completely computed, in line 13 the value of the adversarial FF heuristic is calculated as follows: since both agents move in turn, the number of moves needed to execute the plan is at most twice the number of operators contained in the larger one of the sets SO+_ag, which we call SO+_max. First, we calculate how many operators have to be applied by agent max ∈ {p, a}, which is r := |SO+_max| − |SO+_max ∩ O+_max̄|, where O+_max̄ is the set of relaxed operators agent max̄ ∈ {p, a} \ {max} controls. The value of the heuristic can then be calculated as max{2r, |SO+_p| + |SO+_a|}.
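The final value computation fits in a few lines. The sketch below is our own reconstruction (with assumed set-based encodings), taking the two selected-operator sets after the balancing step of line 13 of Fig. 3:

```python
# Sketch of the adversarial FF heuristic value (our reconstruction): given
# the balanced selected-operator sets SO+_p and SO+_a and the relaxed
# operator sets O+_p and O+_a, return max{2r, |SO+_p| + |SO+_a|}.
def adversarial_ff_value(so_p: set, so_a: set, o_p: set, o_a: set) -> int:
    if len(so_p) >= len(so_a):              # the larger set defines agent max
        so_max, o_other = so_p, o_a
    else:
        so_max, o_other = so_a, o_p
    # r: operators in SO+_max not controlled by the other agent
    r = len(so_max) - len(so_max & o_other)
    return max(2 * r, len(so_p) + len(so_a))
```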
4 Experimental Results
We experimented with solvable problems from the example domain described above with varying numbers of cities and packages. We compared running times, memory usage and node creations for uninformed breadth-first search, AO* search with the FF heuristic under the assumption of full cooperation, and AO* search with the adversarial FF heuristic. In addition, we encoded the same tasks as conditional planning problems under full observability in NuPDDL and solved them using MBP [5]. The results are summarized in Fig. 2.
5 Conclusion
The results in Fig. 2 suggest that in domains where the antagonist controls operators that may contribute to a plan, AO* search with the adversarial FF heuristic often outperforms AO* search with the FF heuristic and uninformed search. It is competitive with the symbolic approach used in MBP.
1  while G is not contained in the current layer i do
2      Let S[i] be the set of all state variables in layer i.
3      Let O+[i] be the set of all relaxed operators that are applicable in layer i and that belong to agent ag ∈ {p, a}, who is to move in layer i. Increment i.
4  Let G[m] be G.
5  for layer j := m − 1 to 0 do
6      foreach state variable g ∈ G[j + 1] do
7          if g ∈ S[j] then
8              Put g into G[j].
9          else
10             Put a relaxed operator o+ into SO+[j] that is in O+[j] and that creates g.
11             Put the precondition pre of o+ into G[j].
12     Put all selected operators of SO+[j] into SO+_ag, the set of all selected operators of agent ag ∈ {p, a} who is to move in layer j.
13 If possible, shift operators from SO+_p to SO+_a (or vice versa) to ensure that the difference between |SO+_p| and |SO+_a| is as small as possible. Calculate and return the least number of moves that will be needed to apply all operators of the two rearranged sets.

Figure 3. Adversarial FF heuristic.
REFERENCES
[1] Pascal Bercher and Robert Mattmüller, ‘A Planning Graph Heuristic for Forward-Chaining Adversarial Planning’, Technical Report 238, Albert-Ludwigs-Universität Freiburg, Institut für Informatik, (2008).
[2] Avrim L. Blum and Merrick L. Furst, ‘Fast Planning Through Planning Graph Analysis’, in Proc. of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI’95), pp. 1636–1642, (1995).
[3] Blai Bonet and Héctor Geffner, ‘Planning as Heuristic Search’, Artificial Intelligence, 129(1–2), 5–33, (2001).
[4] Daniel Bryce, Subbarao Kambhampati, and David E. Smith, ‘Planning Graph Heuristics for Belief Space Search’, Journal of Artificial Intelligence Research, 26, 35–99, (2006).
[5] Alessandro Cimatti, Marco Pistore, Marco Roveri, and Paolo Traverso, ‘Weak, Strong, and Strong Cyclic Planning via Symbolic Model Checking’, Artificial Intelligence, 147(1–2), 35–84, (2003).
[6] Richard E. Fikes and Nils J. Nilsson, ‘STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving’, Artificial Intelligence, 2(3–4), 189–208, (1971).
[7] Michael R. Genesereth, Nathaniel Love, and Barney Pell, ‘General Game Playing: Overview of the AAAI Competition’, AI Magazine, 26(2), 62–72, (2005).
[8] Eric A. Hansen and Shlomo Zilberstein, ‘Heuristic Search in Cyclic AND/OR Graphs’, in Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI’98), pp. 412–418, (1998).
[9] Jörg Hoffmann and Bernhard Nebel, ‘The FF Planning System: Fast Plan Generation Through Heuristic Search’, Journal of Artificial Intelligence Research, 14, 253–302, (2001).
[10] Nils J. Nilsson, Principles of Artificial Intelligence, Springer, 1980.
10. Perception, Sensing and Cognitive Robotics
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-925
Vector Valued Markov Decision Process for robot platooning
Matthieu Boussard and Maroua Bouzid and Abdel-Illah Mouaddib1

1 Introduction
Many approaches have been dedicated to the coordination of the movements of situated agents, such as flocking, where the global behaviour of the agents is controlled by three simple rules (separation, alignment and cohesion), or platoon formation, where agents steer towards a position following a leader. The common property of all these group movements is that the global behaviour emerges from local behaviours. Our general problem is how to formalise the local behaviours as local decision making processes and how the local interactions may lead to a coherent global behaviour. In this paper, we focus our discussion on platoon formation, for which many approaches based on longitudinal control techniques [8] have been proposed. By using local sensing, the agents can maintain a certain distance from the closest agent. In systems using longitudinal control techniques, it is necessary to have a platoon leader allowing the group to move towards the goal. But those systems consider neither the uncertainty in an explicit way nor other possible interactions between agents, such as precedence constraints [2] or common resources to share with others. In [7], a framework has been developed to formalize the impact of a local decision on the group by considering two new criteria based respectively on positive and negative effects. The local decision process of each agent considers the individual and the group interests, and the framework is based on vector-valued MDPs [5] to manage those two criteria. This approach considers all the possible interactions at any state, which leads to a framework with high complexity. To reduce this complexity, in [3] only local interactions have been considered, where the impact of a local decision on the perceived agents is assessed. In this approach, an agent perceives agents in its neighbourhood and develops on-line a 2V-MDP to derive a policy to behave in this neighbourhood. The algorithm satisfies the following constraints: (Dynamic change) the approach is suitable for world changes since only the neighbourhood is considered to make a decision. (Scalability) the approach is applicable in a real system because each agent constructs a small 2V-MDP with limited space, since it formalizes only the decision process problem in a limited area. Also, the underlying DEC-MDP is considered as a set of separate MDPs where the expected value is augmented to consider the interactions with the other MDPs. (Local coordination) an agent can interact with a limited number of agents, and in a limited number of locations; this reduces significantly the complexity of the problem. (Optimal when possible) the behaviour of an agent is optimal when it is in an “easy situation”.
1 GREYC, Université de Caen, France, email: {mboussar, bouzid, mouaddib}@info.unicaen.fr
2 Background and related works
The problem addressed here can be seen as a problem of collective decision making and multi-agent planning [10]. Platooning is a kind of flocking problem, where each agent should maintain the cohesion of the group. For this purpose many formalisms based on flocking approaches have been developed [9]. The basic idea is to maintain a global shape of the group of agents, while each agent perceives only its local environment and its close neighbours. The main benefits of these approaches are strong scalability and the possibility to manage a huge number of agents. However, the drawbacks of these approaches are the lack of an optimality proof and the lack of expressiveness to consider different kinds of interactions. Longitudinal control techniques are also used in platooning approaches [8]. These techniques aim to keep a safety distance between the platoon leader and the closest neighbour, based on local coordination. The platoon leader can also give orders to the rest of the group. In these approaches, the platoon must have a leader, which is the unique member that has an explicit goal. However, the large number of messages exchanged between the platoon members may be a limitation. Planning approaches allow a precise description of the target goal and of the environment. The DEC-MDP [1] has been proposed in order to support decentralized applications, but in the general case the complexity is too high for real applications. However, recent works have shown the possibility of using this framework for large-scale applications [2] by considering some specific and local interactions. Our approach is similar, but it offers a rich model of local interactions, uses a precise description of the environment, and, before selecting an action, the agents adapt their behaviours according to their local perceptions.
3 The Platooning problem
Let a group of agents be in a start area; from there, they have to reach a goal area as fast as possible. There are no requirements, neither for the arrival order, nor for the position in the group on the way. The safety of all agents is the most important constraint. Due to the uncertainty, agents have to respect a safety inter-agent spacing. If this spacing is too large, it can worsen the quality of the solution. We define a platoon of a group of agents as a set of mobile robots, each trying to manage its distance from its nearest neighbour, so that the group can reach the goal efficiently. We assume that the agents are evolving in a dynamic, fully observable, discrete-space environment, and that each agent perceives only a limited number of agents in its neighbourhood. All the agents use the same world model, and they all know the same goal area. There is a limited number of actions. We suppose that those actions are non-deterministic; the outcomes
of each action are ruled by a probability distribution. Because communication is sometimes impossible (e.g., in a hostile area), we consider that agents do not use communication to coordinate their activities. We also do not need an explicit platoon leader.
3.1 The platooning problem as a 2V-DEC-MDP
Because, as mentioned before, the world is fully observable, we formalize the planning problem of a single agent as an MDP ⟨S, A, T, R⟩, which allows it to compute (off-line) its optimal mono-agent policy. The coordination with the other agents is made on-line. The agents know locally the exact positions of the other agents, so the state s contains the exact position of agent i and also the exact positions of the neighbouring agents. Before each decision, the impacts of the actions are computed. Once those interactions are determined, they are used by an agent to assess the expected value of its decision. Indeed, to each decision a ∈ A in state s ∈ S a vector of values (ER(s, a), JER(s, a), JEPenalty(s, a)) is assigned. This vector represents, respectively, the individual expected value ER(s, a) (Expected Reward), the expected gain of the group JER(s, a) (Joint Expected Reward) due to the local decision a (positive impact), and the expected opportunity cost of this decision JEPenalty(s, a) (negative impact). More details are given in [3].
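As a rough illustration, an agent's on-line choice could look like the following sketch. This is our own code with a simple additive scalarization of the three criteria; the actual combination rule of the 2V-DEC-MDP is given in [3] and may differ (e.g., a lexicographic ordering).

```python
# Hypothetical sketch of vector-valued action selection: rank actions by the
# individual expected reward plus the group's expected gain minus the
# expected opportunity cost imposed on the neighbourhood.
def action_value(er: float, jer: float, jep: float,
                 w_self: float = 1.0, w_group: float = 1.0) -> float:
    # One plausible scalarization; lexicographic orderings are another option.
    return w_self * er + w_group * (jer - jep)

def best_action(s, actions, ER, JER, JEP):
    """ER, JER, JEP: mappings from (state, action) to the three criteria."""
    return max(actions, key=lambda a: action_value(ER[s, a], JER[s, a], JEP[s, a]))
```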
4 First evaluations
This simulation shows the emergence and the dynamic change of the leader. In Figure 1c, the exit gets closed, and the agents have to take another way to reach the goal. The leader changes and takes the lead until the goal. In Figure 1d, the platoon exits the tunnel and goes towards the goal. The online part of the algorithm is linear with respect to the number of neighbour agents. For 50 agents, it takes around 10 ms to select the action.

Figure 1. Emergence of a new leader: (a) start; (b) formation of the platoon; (c) re-planning, new leader; (d) movement of the platoon.

5 Discussion and theoretical consideration

Emergence of the leader. As mentioned before, in this approach we have never selected an agent as the leader of the platoon. It appears, however, that this leader actually exists. The analysis of the leader's emergence is one of the main issues of this work. This issue has two parts. The first is how to identify a leader in a group of agents. The second is to find which action will make an agent a leader. Once the leader is identified, we could apply a technique as in [6] to improve the global behaviour. The leader's change should also be studied.

Long-term impact of actions. The 2V-DEC-MDP allows us to express short-term impacts of actions. But it appears that a few actions of some agents affect the behaviour of the whole group. At the end of the platoon move, agents reach an equilibrium. We are studying the type of equilibrium that arises. We are interested in three kinds of equilibria: Nash, Pareto, and Stackelberg.

6 Conclusion

A coordination framework, based on local observations, has been presented in this paper. It uses the MDP framework to describe the planning problem, and we showed how to express the local relations in the 2V-DEC-MDP. It allows platoon emergence without explicit leader designation. Furthermore, if some changes appear in the world, a new leader can emerge. When a whole group of agents is blocked, only a small number of agents may be the cause of the blockage. The first extension of this work is to add a reinforcement learning algorithm [4]. We would like this learning algorithm to detect deadlocks and, by learning the behaviour of the other agents, to solve them. We will use equilibria from game theory to detect the agent that should learn. The second part of this work will be the analysis of the kind of equilibria this algorithm attains.

REFERENCES
[1] Daniel S. Bernstein, Shlomo Zilberstein, and Neil Immerman, ‘The complexity of decentralized control of Markov decision processes’, in UAI ’00: Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp. 32–37. Morgan Kaufmann Publishers Inc., (2000).
[2] Aurélie Beynier and Abdel-Illah Mouaddib, ‘An iterative algorithm for solving constrained decentralized Markov decision processes’, in AAAI, (2006).
[3] Matthieu Boussard, Maroua Bouzid, and Abdel-Illah Mouaddib, ‘Multi-criteria decision making for local coordination in multi-agent systems’, in Proceedings of the International Conference on Tools with Artificial Intelligence (ICTAI’07), (October 2007).
[4] Jérôme Chapelle, Olivier Simonin, and Jacques Ferber, ‘How situated agents can learn to cooperate by monitoring their neighbors’ satisfaction’, in ECAI, pp. 68–72, (2002).
[5] Kazuyoshi Wakuta, ‘Vector valued Markov decision processes with average reward criterion’, Probability in the Engineering and Informational Sciences, 14, 533–548, Cambridge, 2000.
[6] Ville Könönen, ‘Asymmetric multiagent reinforcement learning’, in IAT ’03: Proceedings of the IEEE/WIC International Conference on Intelligent Agent Technology, p. 336, Washington, DC, USA, (2003). IEEE Computer Society.
[7] Abdel-Illah Mouaddib, Matthieu Boussard, and Maroua Bouzid, ‘Towards a framework for multi-objective multiagent planning’, in AAMAS, (2007).
[8] David J. Naffin, Gaurav S. Sukhatme, and Mehmet Akar, ‘Lateral and longitudinal stability for decentralized formation control’, in Proceedings of the International Symposium on Distributed Autonomous Robotic Systems, pp. 421–430, (June 2004).
[9] Craig W. Reynolds, ‘Steering behaviors for autonomous characters’, in Proceedings of the Game Developers Conference, pp. 763–782, (1999).
[10] David H. Wolpert and Kagan Tumer, ‘Collective intelligence, data routing and Braess’ paradox’, in JAIR, (2002).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-927
Learning to Select Object Recognition Methods for Autonomous Mobile Robots
Reinaldo A. C. Bianchi1,2 and Arnau Ramisa2 and Ramón López de Mántaras2

Abstract. Selecting which algorithms should be used by a mobile robot computer vision system is a decision that is usually made a priori by the system developer, based on past experience and intuition, without systematically taking into account information that can be found in the images and in the visual process itself to learn, at execution time, which algorithm should be used. This paper presents a method that uses Reinforcement Learning to decide which algorithm should be used to recognize objects seen by a mobile robot in an indoor environment, based on simple attributes extracted on-line from the images, such as mean intensity and intensity deviation. Two state-of-the-art object recognition algorithms can be selected: the constellation method proposed by Lowe, together with its interest point detector and descriptor, the Scale-Invariant Feature Transform (SIFT), and a bag-of-features approach. A set of empirical evaluations was conducted using an image database acquired with a household mobile robot, and the results obtained show that the approach adopted here is very promising.
1 INTRODUCTION
Reinforcement Learning (RL) [7] is concerned with the problem of learning from interaction to achieve a goal, for example, an autonomous agent interacting with its environment via perception and action. On each interaction step the agent senses the current state s of the environment and chooses an action a to perform. The action a alters the state s of the environment, and a scalar reinforcement signal r (a reward or penalty) is provided to the agent to indicate the desirability of the resulting state. The policy π is some function that tells the agent which actions should be chosen, and it is learned through trial-and-error interactions of the agent with its environment. Several algorithms have been proposed as strategies to learn an optimal policy π∗ when the model (T and R) is not known in advance, for example, the Q-learning [8] and SARSA [6] algorithms. Some researchers have used RL as a technique to optimize image segmentation and object recognition algorithms. For example, Peng et al. used RL to learn, from input images, to adapt the image segmentation parameters of a specific algorithm to changing environmental conditions, in a closed-loop manner [1, 5], and Draper et al. modeled the object recognition problem as a Markov Decision Problem and proposed a method to learn sequences of image processing operators for detecting houses in aerial images [2]. To allow a robotic agent to decide which object recognition method should be used during on-line world exploration, we propose to use RL to learn a policy that minimizes computing time, discarding an image if it is not suitable for analysis or choosing between two well-known algorithms, described in the following section.
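As a reminder of the mechanics assumed throughout the paper, the tabular Q-learning update is a one-liner. The following is a generic sketch; the state encoding and rewards used in this paper appear in Section 3.

```python
# Standard tabular Q-learning update [8]:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Usage: Q = defaultdict(float); since Q-learning is off-policy, the pairs
# (s, a) may be sampled randomly during training, as done in this paper.
```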
1 Centro Universitário da FEI, São Bernardo do Campo, Brazil. 2 Artificial Intelligence Research Institute (IIIA-CSIC), Bellaterra, Spain.
2 TWO OBJECT RECOGNITION METHODS
Two successful general object recognition approaches that have been widely used are the constellation method proposed by Lowe, together with its interest point detector and descriptor SIFT [3], and a bag-of-features approach [4].

The first approach is a single-view object detection and recognition system with some interesting characteristics for mobile robots, the most significant of which are the ability to detect and recognize objects at the same time in an unsegmented image and the use of an algorithm for approximate fast matching. In this approach, individual descriptors of the features detected in a test image are initially matched to the ones stored in the object database using the Euclidean distance. False matches are rejected if the distance of the first nearest neighbor is not distinctive enough when compared with that of the second. Once a set of matches is found, the generalized Hough transform and Iteratively Reweighted Least Squares are used to cluster each match and to estimate the most probable affine transformation for every hypothesis.

The Bag of Features (BoF) approach to object classification comes from the text categorization domain, where the occurrence of certain words in documents is recorded and used to train classifiers that can later recognize the subject of new texts. This technique has been adapted to visual object classification by substituting the words with local descriptors such as SIFT. The descriptor space is discretized into a codebook created by applying hierarchical k-means to a dataset of descriptors. A histogram of descriptor occurrences is built to characterize an image. Next, a multi-class classifier – the k-NN in this implementation – is trained with the histograms of local descriptor counts. The class of the object in the image is determined as the dominant one among the k nearest neighbors.

Although both object recognition methods have proved their reliability in real-world applications, they have their limitations: Lowe’s method performs poorly when recognizing sparsely textured objects or objects with repetitive textures, while the Bag of Features needs an accurate segmentation stage prior to classification, which can be very time consuming. Furthermore, the method depends on the quality of that segmentation stage to provide good results.
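The BoF pipeline just described reduces to a few steps. The sketch below is our own illustration (it uses flat nearest-neighbour quantization instead of the hierarchical k-means codebook of the paper, which only changes assignment speed):

```python
import numpy as np

def bof_histogram(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Quantize local descriptors against a codebook and return a
    normalized histogram of visual-word occurrences."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    h = np.bincount(words, minlength=len(codebook)).astype(float)
    return h / max(h.sum(), 1.0)

def knn_classify(h, train_hists, train_labels, k=5):
    """Class of the test histogram = dominant class among its k nearest
    training histograms."""
    nearest = np.argsort(np.linalg.norm(train_hists - h, axis=1))[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```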
3 EXPERIMENTS AND RESULTS
In order to decide which algorithm should be used by the agent, the RL problem was defined as a 2-stage MDP, with 2 possible actions in each stage. In the first stage, the agent must decide whether the image contains an object, and thus must be recognized, or whether the image does not contain objects and can be discarded, saving processing time. In the second stage, the agent must decide which object recognition algorithm should be used: Lowe’s or Bag of Features.
At each stage the agent chooses a system state s, composed of the stage the agent is at plus a combination of simple attributes extracted on-line from the images, for example, mean image intensity and standard deviation. Then, it selects an action to be executed, computes the reward and updates the value function. The RL algorithm used is Q-learning [8], because it directly approximates the optimal policy independently of the policy being followed (it is an off-policy method), allowing the state and the action to be executed by the agent to be selected randomly. The rewards used during the learning phase are computed using a set of training images. If the state in which the agent is corresponds to a training image, and the action taken results in a correct classification, the agent receives a reward; otherwise it is zero. For example, if we have a training image that does not contain an object, with a mean intensity value of 50 and a standard deviation of 10, the reward given to Q(stage = 1, mean = 50, std = 10, action = discard) is 100.

Several experiments were executed using a dataset consisting of approximately 150 images of objects occurring in typical household environments plus 30 background images. The objects, which can be textured, untextured or with repetitive textures, are mugs, books, trashcans, chairs and computer monitors (Figure 1). The images include occlusions, illumination changes, blur and other typical nuisances that can be encountered while navigating with a mobile robot.

To evaluate the result of the learning process, the statistical validation method called Leave-One-Out was used. Six different experiments were conducted, using three different combinations of image attributes as state space and two different image sizes (the original size and a 10 by 10 pixels reduced-size image). The combinations of image attributes used as state space are: mean and standard deviation of the image intensity (MS); mean and standard deviation of the image intensity plus entropy of the image (MSE); and mean and standard deviation of the image intensity plus the number of interest points detected by the Difference of Gaussians operator (MSI). The parameters used in the experiments were: learning rate α = 0.1 and discount factor γ = 0.9. Values in the Q table were randomly initialized.

Tables 1 and 2 present the results obtained. The first line of Table 1 shows the percentage of times that the agent correctly chose to discard a background image, and the second line shows the percentage of times the agent correctly chose to use the Lowe algorithm instead of the BoF. The columns in this table present the results for the six experiments, the first three using the original image and, from the fourth to the sixth column, showing the results for the reduced-size image. The last column shows the percentage of times a human expert takes the correct action. Table 2 is similar to Table 1, but shows the classification error. The first line shows the percentage of images discarded as background when they should have been analyzed, and line two presents the percentage of times the Lowe algorithm is chosen when the correct one is the BoF.

Figure 1. Images from the dataset.

Table 1. Correctly classified images (percentage).

|      | MS (Full Img) | MSE (Full Img) | MSI (Full Img) | MS (Small Img) | MSE (Small Img) | MSI (Small Img) | Expert |
|------|------|------|------|------|------|------|------|
| Back | 80.4 | 100.0 | 100.0 | 82.6 | 100.0 | 100.0 | 100.0 |
| Lowe | 52.3 | 93.2 | 22.7 | 63.6 | 93.2 | 11.4 | 93.2 |

Table 2. Incorrect classification (percentage).

|      | MS (Full Img) | MSE (Full Img) | MSI (Full Img) | MS (Small Img) | MSE (Small Img) | MSI (Small Img) | Expert |
|------|------|------|------|------|------|------|------|
| Back | 4.8 | 0.0 | 1.4 | 3.4 | 0.7 | 1.4 | 8.2 |
| Lowe | 25.5 | 0.0 | 6.9 | 18.6 | 0.0 | 6.9 | 10.8 |

These results show that the use of the MSE combination presented very good results, for original-size images as well as reduced-size ones. On the other hand, the use of the number of interest points detected by the Difference of Gaussians operator as state space did not produce good results.
4 CONCLUSION
The results obtained show that using Reinforcement Learning to decide which algorithm should be used to recognize objects yields good results, performing better than a human expert in some cases. To the best of our knowledge, there is no similar approach using automatic selection of algorithms for object recognition. Future work includes testing other image attributes that can be used as the system’s state and other RL algorithms, and applying RL techniques to the image segmentation problem.
ACKNOWLEDGEMENTS
This work has been partially funded by the FI grant and the BE grant from the AGAUR, the 2005-SGR-00093 project, supported by the Generalitat de Catalunya, the MID-CBR project grant TIN 2006-15140-C03-01 and FEDER funds. Reinaldo Bianchi is supported by CNPq grant 201591/2007-3.
REFERENCES
[1] B. Bhanu, Y. Lin, G. Jones, and J. Peng, ‘Adaptive target recognition’, Machine Vision and Applications, 11(6), 289–299, (2000).
[2] B. A. Draper, J. Bins, and K. Baek, ‘ADORE: Adaptive object recognition’, in International Conference on Vision Systems, pp. 522–537, (1999).
[3] D. Lowe, ‘Distinctive image features from scale-invariant keypoints’, International Journal of Computer Vision, 60(2), 91–110, (2004).
[4] D. Nister and H. Stewenius, ‘Scalable recognition with a vocabulary tree’, in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pp. 2161–2168, (2006).
[5] J. Peng and B. Bhanu, ‘Closed-loop object recognition using reinforcement learning’, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(2), 139–154, (1998).
[6] G. A. Rummery and M. Niranjan, ‘On-line Q-learning using connectionist systems’, Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, (1994).
[7] R.S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[8] C. J. C. H. Watkins, Learning from Delayed Rewards, PhD Thesis, University of Cambridge, 1989.
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-929
Robust Reservation-Based Multi-Agent Routing
Adriaan ter Mors and Xiaoyu Mao1 and Jonne Zutt and Cees Witteveen2 and Nico Roos3

1 Problem description

In a multi-agent routing problem, agents must find the shortest-time path from source to destination while avoiding deadlocks and collisions with other agents. Agents travel over an infrastructure of resources (such as intersections and road segments between intersections), and each resource r has (i) a capacity C(r), which is the maximum number of agents that may occupy the resource at the same time, and (ii) a minimum travel time D(r). Example 1 illustrates this multi-agent routing problem.
The multi-agent routing problem often occurs in application domains of Automated Guided Vehicles (AGVs), such as transportation of goods in warehouses, or loading and unloading of ships at container terminals. The multi-agent routing problem is also relevant in taxiway planning on airports. The quality of a routing method is judged not only on the basis of its efficiency (i.e., in terms of the time required by the agents to reach their destinations), but also on the basis of its ability to deal with changing circumstances and unexpected incidents. Examples of incidents in the application domains mentioned above are human interference with AGVs (e.g. by people stepping in the path of an AGV) in a warehouse scenario, or the delay of a ship (or aircraft) at the (air)port.
2 Reservation-based multi-agent routing
Figure 1: Infrastructure of unit-capacity resources.
Example 1. Figure 1 shows an example of a multi-agent routing problem. There is an infrastructure of 14 resources: resources r1 to r7 represent intersections or interesting locations, whereas resources r8 to r14 represent lanes between the intersections. All resources have a capacity of one and the same minimum travel time. Suppose we have two agents, A1 that wants to go from r1 to r7, and agent A2 that wants to go from r5 to r2. The optimal individual plans for these agents are p1 and p2 respectively:

p1 = (r1, 1), (r8, 2), (r3, 3), (r10, 4), (r4, 5), (r14, 6), (r7, 7)
p2 = (r5, 1), (r11, 2), (r4, 3), (r10, 4), (r3, 5), (r9, 6), (r2, 7)
In the above, (r1, 1) in p1 means that during time unit 1, agent A1 is travelling on resource r1. These two plans cannot both be put into action, as they are in conflict with each other: both agents plan to travel on resource r10 during time unit 4, but this is not possible, since each resource can hold at most one agent at the same time.
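The conflict is easy to detect mechanically. The following sketch (an assumed plan encoding, not the authors' implementation) flags every resource/time pair whose capacity is exceeded:

```python
from collections import defaultdict

def find_conflicts(plans, capacity):
    """plans: dict agent -> list of (resource, time); capacity: r -> C(r)."""
    usage = defaultdict(list)
    for agent, plan in plans.items():
        for resource, t in plan:
            usage[(resource, t)].append(agent)
    return {rt: ags for rt, ags in usage.items() if len(ags) > capacity[rt[0]]}

p1 = [("r1", 1), ("r8", 2), ("r3", 3), ("r10", 4), ("r4", 5), ("r14", 6), ("r7", 7)]
p2 = [("r5", 1), ("r11", 2), ("r4", 3), ("r10", 4), ("r3", 5), ("r9", 6), ("r2", 7)]
cap = defaultdict(lambda: 1)                      # unit-capacity resources
print(find_conflicts({"A1": p1, "A2": p2}, cap))  # {('r10', 4): ['A1', 'A2']}
```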
There are two ways to solve this conflict. The first is that either A1 or A2 (but not both) does not make use of r10 , by making a detour along r6 . The second solution is that one of the agents waits until the other has passed. If, as we assume in this paper, we only optimize for time (as opposed to e.g. distance travelled), then the first solution is the best.
1 Almende BV, {adriaan,xiaoyu}@almende.org
2 Delft University of Technology, {j.zutt,c.witteveen}@tudelft.nl
3 Maastricht University, roos@micc.unimaas.nl
In a reservation-based approach to multi-agent routing, agents plan their route by reserving time intervals on resources; these reservations should be made in such a way that an agent’s plan specifies its location (i.e., the resources) at each point in time. Furthermore, the routing method should ensure that reservations of different agents are not in conflict with each other. The definition of a conflict we employ here is simply that the capacity of a resource may not be exceeded at any point in time.⁴ We have developed a routing algorithm that a single agent can use to find a conflict-free plan, given a set of reservations previously made by other agents. Our algorithm is optimal, in the sense that it finds the shortest-time conflict-free plan for this agent. The algorithm can be described as a shortest path search through the free time window graph, where a free time window is a time interval associated with a resource, in which the resource can accommodate at least one more agent. Our approach is similar to that of Kim and Tanchoco [1]. However, their distinction between lanes and intersections (as opposed to only having resources), combined with explicitly checking for conflicts, results in a computational complexity of O(A⁴R²), whereas our algorithm has a complexity of O(AR log(AR) + AR²). The full algorithm, the proof of its correctness, and the analysis of its worst-case complexity can be found in [4].

⁴ A more advanced conflict definition, in which agents are not allowed to overtake each other (catching-up conflicts) or pass each other by (head-on conflicts), can also be modeled in our framework, but we do not show this here.
Example 2. We have the same two agents: A1 with source and destination r1 and r7, and A2 with r5 and r2. Suppose they have the following plans, in which A2 plans to wait (in resource r11) until A1 has passed:

p1 = (r1, 1), (r8, 2), (r3, 3), (r10, 4), (r4, 5), (r14, 6), (r7, 7)
p2 = (r5, 1), (r11, 2), (r4, 6), (r10, 7), (r3, 8), (r9, 9), (r2, 10)
3 Dealing with incidents
It stands to reason that carefully crafted plans, which detail all actions of all agents at each point in time, may be obsoleted even by minor incidents in the environment. In their survey paper on design and control of AGV systems, Le-Anh and De Koster [2] wrote “a small change in the schedule may destroy it completely”, referring to the reservation-based routing method of Kim and Tanchoco [1]. The truth of this statement depends on the existence and quality of mechanisms that can repair route plans. The quality of a repair mechanism depends on (i) the cost of the repaired plan in relation to the cost of the original plan, (ii) the similarity between the original and the repaired plan (the more similar the better), and (iii) the computational effort required to perform the repairs. Maza and Castagna [3] proposed a repair mechanism designed to prevent deadlocks that is both computationally inexpensive and adheres closely to the original plan. In Section 4, we investigate the cost of repaired plans when combining Maza’s mechanism with our route planning algorithm. To see how a delay of one agent can create a deadlock, consider again the infrastructure of Figure 1.
Suppose that in the execution of his plan, A1 is delayed in resource r3 until time 7. To resume his journey, A1 wants to go to r10, but that resource is occupied by A2; similarly, A2 is also stuck, since the next resource in his plan, r3, is occupied by A1.

The idea of Maza and Castagna is to determine for each resource which agent will enter the resource first, second, etc. This information can be derived from the plans of the agents. Then, during the execution of the plans, an agent is only allowed to enter a resource when it is his turn. In our example, A2 is the second agent to enter r4, so it will wait in resource r11 until its turn has come, which is after A1 has exited r4.
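Our reading of this mechanism fits in a few lines. The sketch below (assumed data structures, not the authors' implementation) derives per-resource entry queues from the plans and lets an agent enter a resource only when it is at the head of the queue:

```python
from collections import defaultdict, deque

def entry_orders(plans):
    """plans: dict agent -> list of (resource, planned_entry_time)."""
    entries = defaultdict(list)
    for agent, plan in plans.items():
        for resource, t in plan:
            entries[resource].append((t, agent))
    # Per resource: agents in the order in which they plan to enter it.
    return {r: deque(a for _, a in sorted(ev)) for r, ev in entries.items()}

def may_enter(order, agent, resource):
    return bool(order[resource]) and order[resource][0] == agent

def enter(order, agent, resource):
    assert may_enter(order, agent, resource)
    order[resource].popleft()   # the next agent in line gets its turn afterwards
```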
4 Evaluating robustness

We evaluate the ability of our routing method to deal with change (its robustness) by measuring the delay caused by the deadlock-prevention mechanism. This mechanism delay is the time agents have to wait before they are allowed to enter a resource (or the time they have to wait behind other agents that are waiting for clearance to enter a resource, as we do not allow overtaking in our experiments). We had the following experimental setup: first, all agents make a route plan for their (randomly chosen) start and destination locations. Then, in a simulation environment, the agents try to execute their plans. If these (reservation-based) plans are executed perfectly, then no conflicts will occur and all agents will arrive at their destinations on time. However, we generate random incidents that cause agents to stop for a fixed duration, potentially blocking other agents behind them. Over different experiment runs, we varied the following parameters: (i) the infrastructure: we used random networks, small-world networks, lattice networks, and a map of an actual airport; (ii) the number of agents in the system; (iii) the frequency and duration of incidents. The frequency is a value p that represents, for every resource in the agent’s plan, a chance of p of having an incident.

Figure 2: Mechanism delay for Amsterdam Airport Schiphol infrastructure, plotting relative mechanism delay (%) against the number of agents. Incident parameters: HH = (p=0.1, duration=120s); HL = (p=0.1, duration=30s); LH = (p=0.01, duration=120s); LL = (p=0.01, duration=30s).

Figure 2 shows the mechanism delay, averaged over all agents, as a percentage of the agents’ planned travel time. At least three noteworthy conclusions can be drawn from this figure: first, as the number of agents increases, the relative mechanism delay decreases. It turns out that the increased congestion in the system is more important than any increase in complexity that might result in more mechanism delay. Second, even for a high frequency (p = 0.1) of long incidents (duration = 120 s), the mechanism delay is never more than 15% of planned travel time. Third, for a low frequency (p = 0.01) of short incidents (duration = 30 s), there is no discernible impact on plan quality. Experiments conducted on other types of infrastructures produced figures similar to Figure 2: on lattice networks and small-world networks, mechanism delays were slightly smaller (maximum relative mechanism delays around 10%), whereas for random networks they were higher, with a maximum relative mechanism delay of 30%.

ACKNOWLEDGEMENTS
This research is supported by NWO (Netherlands Organization for Scientific Research), Grant No. CSI4006.

REFERENCES
[1] Chang W. Kim and J.M.A. Tanchoco, ‘Conflict-free shortest-time bidirectional AGV routeing’, International Journal of Production Research, 29(1), 2377–2391, (1991).
[2] Tuan Le-Anh and M.B.M. De Koster, ‘A review of design and control of automated guided vehicle systems’, European Journal of Operational Research, 171(1), 1–23, (May 2006).
[3] Samia Maza and Pierre Castagna, ‘A performance-based structural policy for conflict-free routing of bi-directional automated guided vehicles’, Computers in Industry, 56(7), 719–733, (2005).
[4] Adriaan W. ter Mors, Jonne Zutt, and Cees Witteveen, ‘Context-aware logistic routing and scheduling’, in Proceedings of the Seventeenth International Conference on Automated Planning and Scheduling, (2007).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © Government of Canada 2008. All rights reserved. doi:10.3233/978-1-58603-891-5-931
Automatic Animation Generation of a Teleoperated Robot Arm
Khaled Belghith and Benjamin Auder and Froduald Kabanza and Philipe Bellefeuille1 and Leo Hartman2

Abstract. In this paper we describe the Automatic Task Demonstration Generator (ATDG), a system implemented in a software prototype for teaching the operation of a robot manipulator deployed on the International Space Station (ISS). The ATDG combines path planning and camera planning to take into account the complexity of the manipulator, the limited direct view of the ISS exterior, and the unpredictability of lighting conditions in the workspace. The path-planning algorithm not only avoids obstacles in the workspace, as is normal for a path planner, but in addition takes into account the position of corridors for safe operations and the placement of cameras on the ISS. The camera planner is then invoked to find the right arrangement of cameras to follow the manipulator on its trajectory. This allows the on-the-fly production of useful and pedagogical task demonstrations to help the student carry out tasks involving the manipulation of the robot on the ISS. Even though the system has been developed for robotic manipulations, it could be used for any application involving the filming of unpredictable complex scenes.
1 Introduction
The Space Station Remote Manipulator System (SSRMS) is an articulated robot arm mounted on the International Space Station (ISS). The SSRMS is a key component of the ISS, used in the assembly, maintenance and repair of the station, and also for moving payloads from visiting shuttles. Astronauts operate the SSRMS through a workstation located inside one of the ISS compartments. The workstation has an interface with three monitors, each connected to a camera placed at a strategic location on the ISS. There are a total of 14 cameras on the ISS. Making the right camera choices for each of the three monitors available in the robotic workstation is essential for the operator to have a good awareness of the space when manoeuvering the arm. Operators manipulating the SSRMS in orbit receive support from ground operations. Part of this support consists in visualizing and validating manoeuvres before they are actually carried out. In order to improve the ground support operations on the SSRMS, we have developed the Automatic Task Demonstration Generator (ATDG), which generates 3D animations that demonstrate how to perform a given task with the SSRMS. The ATDG is integrated within the RObot MANipulation Tutoring System (Roman Tutor) [5], a simulator for the command and control of the SSRMS (Figure 1).
1 University of Sherbrooke, Canada, email: {khaled.belghith, benjamin.auder, kabanza, philipe.bellefeuille}@usherbrooke.ca
Figure 1. Roman Tutor Student Interface
Filming a trajectory of the SSRMS is a particular case of the problem of automatic movie generation. Previous approaches can be generally classified into constraint satisfaction methods and idiom-based approaches. Constraint-satisfaction methods [2] work at the level of the frame. Given a set of constraints about the objects to appear in the frame, they find the camera parameters that best satisfy these constraints. Idiom-based approaches [4] are based on cinematography principles. They establish a formalization of these principles to reduce the large search space produced by the many degrees of freedom the camera has in each frame of the animation. A key difference between these applications and ours is that in their case, they have a detailed script of the animation at the design phase, with well identified scenes and corresponding semantics. Hence, constraints for the placement of objects and the types of camera shots for different scenes are specified off-line at the design phase. In our case, the trajectory for the SSRMS has to be generated online, depending on the task at hand; we do not have a script specifying beforehand all the scenes of interest and how they should be filmed. Our main contribution is to actually explain how idiombased approaches can be adapted to filming complex robot arm trajectories by integrating an automated segmentation of the trajectory into scenes depending on some spatial and cognitive task specifications. Another difference between previous approaches and ours deals with the nature of the domain. A number of general-purpose rules have been developed in the literature constraining the types of camera shots used for filming people or animated characters. These rules do not apply when the object being filmed is an articulated arm, so we had to introduce more appropriate ones.
2 ATDG - Automatic Task Demonstration Generator
The ATDG system takes as input a start and a goal configuration for the SSRMS. It generates a movie demonstrating how to move the SSRMS from the start to the goal configuration. The ATDG algorithm sequentially performs the following steps:

1. Calls the path-planner to compute the trajectory from the start to the goal configuration
2. Segments the trajectory into scenes
3. Calls the camera planner to plan the shots on the scenes

The path-planner implements the FADPRM algorithm introduced by Belghith et al. [3], which takes into account collisions and visibility constraints. Collisions are treated as hard constraints on trajectories that must be avoided at any cost, whereas visibility constraints are handled as preferences among desirable trajectories. This approach generates safe collision-free trajectories such that the robot is visible at all times from one or more of the cameras.

In order to categorize the movements performed by the SSRMS, decompose them into scenes, and shoot them correctly using specific idioms, it was necessary to add new information to the trajectory provided by the FADPRM path planner. This new information takes the form of a geometric decomposition of the workspace. The trajectory found by the path-planner is mapped within these geometric decompositions to produce a series of corridors. Each of these corridors corresponds to a specific scene category. A list of idioms is associated with each category of scene, as in normal idiom-based animation generation. This geometric decomposition and the choice of idioms for each scene category are done manually by a domain expert. We plan to construct a complete module that will automatically generate these decompositions from the actual state of the ISS including, among others, the geometry of the workplace, visibility constraints and luminosity.

The trajectory mapped within the succession of corridors is then passed to the camera planner. For each portion of the path in a single corridor, the camera planner will try to select the best suitable idiom. The selection of the best idiom in each corridor depends on the quality of the rendering and takes into account the cinematic principles guaranteeing continuity between shots and thus consistency of the final movie.

In ATDG, each shot in an idiom is distinguished by three key attributes: shot type, camera placement mode, and camera zooming mode.

Shot types. Five shot types are currently defined in the ATDG system: Static, GoBy, Pan, Track and Pov. A Static shot, for example, is done from a static camera when the robot is in a constant position or moving slowly, whereas in a Track shot, a camera follows the robot and keeps a constant distance from it.

Camera placements. For each shot type, the camera can be placed in five different ways according to some given line of interest: External, Parallel, Internal, Apex and External II. Currently, we take the trajectory of the robot’s center of gravity as the line of interest, which allows filming of a number of typical manoeuvres. For larger coverage of manoeuvres, additional lines of interest will be added later.

Zoom modes. For each shot type and camera placement, the zoom of the camera can be in five different modes: Extreme Close up, Close up, Medium View, Full View and Long View.
Figure 2. Idiom to film the SSRMS anchoring a component on the ISS
Figure 2 shows an idiom illustrating the anchoring of a new component on the ISS. It starts with a Track shot following the robot while moving on the truss, then another Track shot showing the rotation of one joint on the robot to align with the ISS structure, and finally a Static shot focusing on the anchoring operation. In DCCL [4], idioms are specified using planning operators, so that the sequence of shots is generated by a planner. We follow a similar approach but use a different planner and another idiom specification language. In our case, we specify idioms in the Planning Domain Definition Language (PDDL 3.0) and use the TLPlan system [1]. Intuitively, a PDDL operator specifies preferences about shot types in time and in space depending on the robot manoeuvre. Parsing the trajectory of the robot mapped within the corridors designating the successive scenes, the planner tries to find a succession of shots that captures the best possible idioms. The planner also takes into account the cinematic principles to ensure consistency of the resulting movie. Idioms and cinematic principles are in fact encoded in the form of temporal logic formulas within the planner.
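Abstracting away the temporal logic encoding, the corridor-by-corridor selection can be pictured as follows. This is a simplified sketch with hypothetical structures, not the actual TLPlan encoding; the `score` and `continuity_ok` callables stand in for the rendering-quality estimate and the cinematic continuity rules.

```python
from typing import List, NamedTuple

class Shot(NamedTuple):
    kind: str        # Static, GoBy, Pan, Track or Pov
    placement: str   # External, Parallel, Internal, Apex or External II
    zoom: str        # Extreme Close up ... Long View

class Idiom(NamedTuple):
    scene_category: str
    shots: List[Shot]

def select_idioms(corridors, idioms, score, continuity_ok):
    """Greedy stand-in for the planner: per corridor, pick the applicable
    idiom with the best score whose first shot keeps continuity with the
    previous shot (assumes at least one candidate always exists)."""
    movie, prev = [], None
    for category in corridors:
        candidates = [i for i in idioms if i.scene_category == category
                      and (prev is None or continuity_ok(prev, i.shots[0]))]
        best = max(candidates, key=score)
        movie.extend(best.shots)
        prev = best.shots[-1]
    return movie
```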
3 Conclusion and Future work
We have introduced a heuristic technique for segmenting the robot trajectory and an approach for defining idioms for robot manoeuvres, allowing us to adapt idiom-based approaches to the automatic filming of robot manipulations. As we are using the TLPlan system, this framework also opens up interesting avenues for developing efficient search control knowledge for this particular application domain, and possibly for learning it. There are widespread expectations that TLPlan and the planning techniques it incorporates are useful in real-world applications, and ATDG is one of the first examples.
REFERENCES
[1] F. Bacchus and F. Kabanza, ‘Using temporal logics to express search control knowledge for planning’, Artificial Intelligence, 116(1–2), 123–191, (2000).
[2] W.H. Bares, J.P. Gregoire, and J.C. Lester, ‘Real-time constraint-based cinematography for complex interactive 3D worlds’, in Association for the Advancement of Artificial Intelligence (AAAI/IAAI), pp. 1101–1106, (1998).
[3] K. Belghith, F. Kabanza, L. Hartman, and R. Nkambou, ‘Anytime dynamic path-planning with flexible probabilistic roadmaps’, in IEEE International Conference on Robotics and Automation (ICRA), pp. 2372–2377, (2006).
[4] D.B. Christianson, S.E. Anderson, L. He, D.H. Salesin, D.S. Weld, and M.F. Cohen, ‘Declarative camera control for automatic cinematography’, in Association for the Advancement of Artificial Intelligence (AAAI), pp. 148–155, (1996).
[5] F. Kabanza, R. Nkambou, and K. Belghith, ‘Path-planning for autonomous training on robot manipulators in space’, in International Joint Conference on Artificial Intelligence (IJCAI), pp. 1729–1731, (2005).
ECAI 2008 M. Ghallab et al. (Eds.) IOS Press, 2008 © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-933
Planning, Executing, and Monitoring Communication in a Logic-based Multi-agent System
Martin Magnusson and David Landén and Patrick Doherty1

1 Introduction
Imagine the chaotic aftermath of a natural disaster. Teams of rescue workers search the affected area for people in need of help, but they are hopelessly understaffed and time is short. Fortunately, they are aided by a small fleet of autonomous unmanned aerial vehicles (UAVs). The UAVs help in quickly locating the injured by scanning large parts of the area from above using infrared cameras and communicating the information to the command and control center (CCC) in charge of the emergency relief operation. An autonomous agent carrying out tasks in such dynamic environments must automatically construct plans of action adapted to the current situation and the other agents. Its multi-agent plans involve both physical actions, which affect the world, and communicative actions, which affect the other agents’ mental states. In addition, assumptions made during planning must be monitored during execution so that the agent can autonomously recover should its plans fail. The strong interdependency between these capabilities can be captured in a formal logic. We take advantage of this by building a multi-agent system that reasons directly with the logical specification using automated theorem proving. Our implementation and its integration with a physical robot platform, in the form of an autonomous helicopter, go some way towards demonstrating that this idea is not only theoretically interesting, but practically feasible.
2 Speech Acts in TAL
The system is based on automated reasoning in Temporal Action Logic (TAL) [1], a first-order logic for commonsense knowledge about action and change. Inspired by Morgenstern’s work [3], we extend TAL with syntactic operators for representing agents’ mental states and beliefs. A formula preceded by a quote is a regular first-order term that serves as a name of that formula. Alternatively, one may use a backquote, which facilitates quantifying-in by exposing variables inside the backquoted expression to binding by quantifiers. With quotation one may pass (names of) formulas as arguments to regular first-order predicates, without introducing modal operators. E.g., the fact that the UAV believes, at noon, that there were, at 11:45, five survivors in cell 2,3 in a coordinate grid of the disaster area can be expressed by:

(Believes uav 12:00 ’(= (value 11:45 (survivors (cell 2 3))) 5))
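The mechanism is easier to see with a concrete (if crude) encoding. In the sketch below (ours, not the authors' theorem prover), a formula is just an ordinary term, here a nested tuple, so a belief atom can take a name of a formula as a regular argument:

```python
# Quoted formulas as first-order terms: nested tuples stand in for terms.
Formula = tuple

def believes(agent: str, time: str, quoted: Formula) -> Formula:
    return ('Believes', agent, time, quoted)

fact = believes('uav', '12:00',
                ('=', ('value', '11:45', ('survivors', ('cell', 2, 3))), 5))
# A backquote would leave variables inside the quoted term exposed to
# quantifiers, e.g. a placeholder like ('Var', 'n') in place of 5.
```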
1 Linköping University, Sweden, email: {marma,davla,patdo}@ida.liu.se. This work is supported in part by a grant from the Swedish Research Council (VR), the National Aeronautics Research Program NFFP04 S4203, CENIIT, and the Strategic Research Center MOVIII, funded by the Swedish Foundation for Strategic Research, SSF.
This epistemic extension of TAL enables us to characterize communication in terms of actions that affect the mental states of others. Such speech acts form a basis for planning both physical and communicative actions in the same framework, as is done e.g. by Perrault, Allen, and Cohen [4]. Speech acts have also been adopted by research on agent communication languages (ACL) such as the widely used FIPA ACL², which establish standards that ensure interoperability between different multi-agent systems. With the help of quotation in TAL we formulate the FIPA inform, informRef, and request speech acts. These can be used by agents to communicate beliefs to, and to incur commitment in, other agents.
3 Planning
Planning with speech acts is, in our framework, the result of proving a goal while abductively assuming action occurrences that satisfy three kinds of preconditions. The action must be physically executable by an agent during some time interval (b e], the agent must have a belief that identifies the action, and the agent must be committed to the action occurring, at the start of the time interval:

(→ (∧ (Executable agent (b e] action)
      (Believes agent b `(ActionId `action `actionid))
      (Committed agent b `(Occurs agent (b e] action)))
   (Occurs agent (b e] action))

Executability preconditions are different for each action and are therefore part of the specification of an action. The belief preconditions are satisfied when the agent knows identifiers for the arguments of a primitive action [2]. The time point at which an action is executed is also critically important, but it seems overly restrictive to require that the agent holds beliefs that identify the action occurrence time points. Actions that do not depend on external circumstances can be executed whenever the agent so chooses, without deciding upon an identifiable clock time in advance. Actions that do depend on external circumstances can also be successfully executed, as long as the agent is sure to know the correct time point when it comes to pass. This is precisely what the concept of dynamic controllability captures. Following Vidal and Fargier [6], we denote time points controlled by the agent by b and time points over which the agent has no control by e. The temporal dependencies between actions form a simple temporal network with uncertainty (STNU) that can be checked for dynamic controllability to ensure an executable plan.
Finally, the commitment precondition can be satisfied in one of two ways. Either the agent adds the action to its own planned execution schedule (described below), or it uses the request speech act to delegate the action to another agent, thereby ensuring commitment.
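To make the temporal machinery concrete, here is a hedged sketch of an STNU as a set of constraint edges over controllable (b) and uncontrollable (e) time points. It only checks simple-temporal consistency via Bellman-Ford negative-cycle detection, which is a necessary condition; full dynamic controllability checking in the style of Vidal and Fargier is considerably more involved, and all names here are illustrative:

```python
class STNU:
    """Sketch of a simple temporal network with uncertainty (illustrative)."""

    def __init__(self):
        self.points = set()
        self.uncontrollable = set()
        self.edges = []  # distance-graph edges (u, v, w) meaning v - u <= w

    def add_constraint(self, u, v, lo, hi, contingent=False):
        """Require lo <= v - u <= hi; contingent marks v as uncontrollable."""
        self.points |= {u, v}
        if contingent:
            self.uncontrollable.add(v)
        # Distance-graph encoding: v - u <= hi and u - v <= -lo.
        self.edges += [(u, v, hi), (v, u, -lo)]

    def stn_consistent(self):
        """Necessary condition only: no negative cycle (Bellman-Ford)."""
        dist = {p: 0 for p in self.points}
        for _ in range(len(self.points)):
            for u, v, w in self.edges:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
        return all(dist[u] + w >= dist[v] for u, v, w in self.edges)

net = STNU()
net.add_constraint("fly_b", "fly_e", 5, 15, contingent=True)  # flight duration
net.add_constraint("fly_e", "scan_b", 0, 10)                  # scan after flying
print(net.stn_consistent())  # True: no negative cycle in the distance graph
```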
4 Execution
Scheduled actions are tied to the STNU through the explicit time points in the Occurs predicate. An STNU execution algorithm propagates time windows during which these time points need to occur. Executed time points are bound to the current clock time, and action occurrences scheduled at those time points are proved dispatched using the following axiom:

(→ (∧ (ActionId `action `id)
      (ProcedureCall agent (b e] id))
   (Dispatch agent (b e] action))

The axiom forces the theorem prover to find an action identifier with standardized arguments for the ProcedureCall predicate. This is the link between the automated reasoner and the execution sub-system, in that the predicate is proved by looking up the procedure associated with the given action and calling it. But the actions are often still too high-level to be passed directly to the low-level system. An example is the action of scanning a cell of the coordinate grid with the infrared camera. This involves using a scan pattern generator, flying the generated trajectory, and applying the image processing service to identify humans in the video footage. The assumption is that the scanning of a grid cell will always proceed in the manner just described, so there is no need to plan its sub-actions. Such macro-actions, and primitive physical actions, are realized (in simulation, so far) by an execution framework built using the Java agent development framework (JADE, http://jade.tilab.com/). It encapsulates the agent so that all communication is channeled through a standardized interface as FIPA ACL speech acts.
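A plausible reading of this reasoner-to-execution link is a dispatch table from action identifiers to procedures, as in the sketch below; the registration decorator, the stubbed UAV services, and the scan macro-action body are assumptions for illustration, not the system's actual API:

```python
# Sketch: ProcedureCall is "proved" by looking up the procedure registered
# under an action identifier and calling it (illustrative names throughout).
procedures = {}

def register(action_id):
    """Associate a procedure with an action identifier."""
    def wrap(fn):
        procedures[action_id] = fn
        return fn
    return wrap

def procedure_call(agent, interval, action_id, *args):
    """Dispatch an action occurrence by executing its registered procedure."""
    return procedures[action_id](agent, interval, *args)

# Hypothetical stubs standing in for the real low-level UAV services.
def generate_scan_pattern(cell): return ["wp1", "wp2", "wp3"]
def fly_trajectory(agent, trajectory): pass
def detect_humans_in_video(cell): return 5

@register("scan")
def scan(agent, interval, cell):
    # Macro-action: its sub-steps are fixed, so they need not be planned.
    trajectory = generate_scan_pattern(cell)
    fly_trajectory(agent, trajectory)
    return detect_humans_in_video(cell)

survivors = procedure_call("uav", ("b", "e"), "scan", ("cell", 2, 3))
```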
5 Monitoring
Executing the plan will satisfy the goal as long as the abduced assumptions hold up. But the real world is an unpredictable place, and unexpected events are sure to conspire to interfere with any non-trivial plan. To detect problems early, we continually evaluate the assumptions that are possible to monitor. E.g., the agent's plan might rely on some aspect of the environment to persist, in effect making a frame assumption. A failure of such an assumption produces a percept that is added to the agent's knowledge base. A simple truth maintenance system removes assumptions that are contradicted by observations and unchecks goals that were previously checked off as completed but that depended on the failed assumptions. This immediately gives rise to plan revision and failure recovery as the theorem prover tries to re-establish those goals. If the proof of the unchecked goals succeeds, the revision will have had minimal effect on the original plan. A failed proof means that the current sub-goal is no longer viable in the context of the execution failure, and the revision is extended by dropping the sub-goals one at a time. This process continues until a revision has been found or the main goal is dropped and the mission fails.
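The truth-maintenance step just described might look roughly like the following sketch, where the dependency bookkeeping (depends_on) and the contradiction test are our assumptions rather than the paper's data structures:

```python
from dataclasses import dataclass, field

@dataclass
class Goal:
    name: str
    done: bool = False
    depends_on: set = field(default_factory=set)  # assumptions the proof used

def contradicts(percept, assumption):
    # Illustrative: a percept ("not", P) contradicts the assumption P.
    return percept == ("not", assumption)

def maintain(percept, assumptions, goals):
    """Retract contradicted assumptions and uncheck goals that used them."""
    failed = {a for a in assumptions if contradicts(percept, a)}
    assumptions -= failed
    for g in goals:
        if g.done and g.depends_on & failed:
            g.done = False  # reopened: the prover tries to re-establish it
    return failed

assumptions = {("radio", "uav", "ccc"), ("location", "uav")}
goals = [Goal("report_survivors", True, {("radio", "uav", "ccc")})]
maintain(("not", ("radio", "uav", "ccc")), assumptions, goals)
assert not goals[0].done  # the reporting goal is unchecked for revision
```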
6 Multi-agent Scenario
The theory presented so far needs to be complemented with an automated reasoner. The current work utilizes a theorem prover named ANDI, which is based on Pollock's natural deduction system that makes use of unification [5]. Natural deduction is an interesting alternative to the widely used resolution method. The set of proof rules is extensible and easily accommodates special-purpose rules that make reasoning more efficient. ANDI incorporates specialized inference rules for reasoning with quoted expressions and beliefs according to the rules and axioms of TAL. This enables ANDI to process the following scenario in less than two seconds on a laptop with a 2.4GHz Intel Core 2 Duo Mobile T7700 processor.
Suppose that the CCC wants to know the number of survivors in grid cell 2,3 at 13:00. The UAV produces the following plan (in addition to an STNU that orders the actions in time):

(fly (cell 2 3))
(scan (cell 2 3))
(informRef ccc '(value 13:00 (survivors (cell 2 3))))

The success of the plan depends on two persistence assumptions that were made during planning and that are monitored during execution, namely that (location uav) is not affected between flying and scanning, and that (radio uav ccc) indicates a functioning radio communication link. There is also an assumption of the persistence of the survivor count, though this is impossible for our UAV to monitor since it cannot see the relevant area all at once. If one of the survivors leaves, then the plan revision process will take the resulting survivor count discrepancy into account when it is discovered.
Suppose, however, that due to the large distance and hostile geography of the area (or some other unknown error) the radio communication stops functioning while the UAV is scanning the area, before reporting the results. The UAV perceives that the fluent (radio uav ccc) was not persistent, and the truth maintenance system successively removes incompatible assumptions and sub-goals until a revised plan is found:

(informRef mob '(value 13:00 (survivors (cell 2 3))))
(request mob '(Occurs mob (b e] (informRef ccc '(value 13:00 (survivors (cell 2 3))))))

The new plan involves requesting help from another mobile agent (mob). By communicating the survivor count to this "middle man", and requesting it to pass on the information to the CCC, the UAV ensures that the CCC gets the requested information. Another set of assumptions now requires monitoring, namely (radio mob ccc) and (radio uav mob). While the UAV cannot monitor the other agent's radio communication, it will be monitored if that agent is also running our agent architecture. At this point let us assume that no further failures ensue, so that the knowledge gathering assignment is completed successfully within this paper's page limit.
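To tie the scenario to the earlier sketches, the plans and their monitored persistence assumptions can be written down as plain s-expression-style data; this encoding is hypothetical and only restates the scenario above:

```python
# The scenario's plans and monitored assumptions, as s-expression-style data.
survivor_query = ("value", "13:00", ("survivors", ("cell", 2, 3)))

plan = [
    ("fly", ("cell", 2, 3)),
    ("scan", ("cell", 2, 3)),
    ("informRef", "ccc", survivor_query),
]
monitored = {("location", "uav"), ("radio", "uav", "ccc")}

# After (radio uav ccc) fails, revision routes the answer via agent mob.
revised_plan = [
    ("informRef", "mob", survivor_query),
    ("request", "mob",
     ("Occurs", "mob", ("b", "e"), ("informRef", "ccc", survivor_query))),
]
monitored_after_revision = {("radio", "uav", "mob"), ("radio", "mob", "ccc")}
```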
REFERENCES
[1] Patrick Doherty and Jonas Kvarnström, 'Temporal action logics', in Handbook of Knowledge Representation, eds., Vladimir Lifschitz, Frank van Harmelen, and Bruce Porter, Elsevier, (2007).
[2] Robert Moore, 'Reasoning about knowledge and action', Technical Report 191, AI Center, SRI International, Menlo Park, CA, (1980).
[3] Leora Morgenstern, Foundations of a Logic of Knowledge, Action, and Communication, Ph.D. dissertation, New York, NY, USA, 1988.
[4] Raymond C. Perrault, James F. Allen, and Philip R. Cohen, 'Speech acts as a basis for understanding dialogue coherence', in TINLAP'78, pp. 125–132, (1978).
[5] John L. Pollock, 'Natural deduction', Technical report, Department of Philosophy, University of Arizona, (1999).
[6] Thierry Vidal and Hélène Fargier, 'Handling contingency in temporal constraint networks: From consistency to controllabilities', JETAI, 11(1), 23–45, (1999).