Lecture Notes in Artificial Intelligence Edited by R. Goebel, J. Siekmann, and W. Wahlster
Subseries of Lecture Notes in Computer Science
6040
Stasinos Konstantopoulos Stavros Perantonis Vangelis Karkaletsis Constantine D. Spyropoulos George Vouros (Eds.)
Artificial Intelligence: Theories, Models and Applications 6th Hellenic Conference on AI, SETN 2010 Athens, Greece, May 4-7, 2010 Proceedings
Series Editors Randy Goebel, University of Alberta, Edmonton, Canada Jörg Siekmann, University of Saarland, Saarbrücken, Germany Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany Volume Editors Stasinos Konstantopoulos Stavros Perantonis Vangelis Karkaletsis Constantine D. Spyropoulos Institute of Informatics and Telecommunications NCSR Demokritos Ag. Paraskevi 15310, Athens, Greece E-mail: {konstant, sper, vangelis, costass}@iit.demokritos.gr George Vouros Department of Information and Communication Systems Engineering University of the Aegean Karlovassi, Samos 83200, Greece E-mail:
[email protected]
Library of Congress Control Number: 2010925798
CR Subject Classification (1998): I.2, H.3, H.4, F.1, H.5, H.2.8
LNCS Sublibrary: SL 7 – Artificial Intelligence
ISSN: 0302-9743
ISBN-10: 3-642-12841-6 Springer Berlin Heidelberg New York
ISBN-13: 978-3-642-12841-7 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
Artificial intelligence (AI) is a dynamic field that is constantly expanding into new application areas, discovering new research challenges, and facilitating the development of innovative products. Today’s AI tools might not pass the Turing test, but they are invaluable aids in organizing and sorting the ever-increasing volume, complexity, and heterogeneity of knowledge available to us in our rapidly changing technological, economic, cultural, and social environment.

This volume aims at bringing to the reader all the latest developments in this exciting and challenging field, and contains papers selected for presentation at the 6th Hellenic Conference on Artificial Intelligence (SETN 2010), the official meeting of the Hellenic Society for Artificial Intelligence (EETN). SETN 2010 was organized by the Hellenic Society for Artificial Intelligence and the Institute of Informatics and Telecommunications, NCSR ‘Demokritos’, and took place in Athens during May 4–7. Previous conferences were held at the University of Piraeus (1996), at the Aristotle University of Thessaloniki (2002), at the University of the Aegean (Samos, 2004, and Syros, 2008), and jointly at the Foundation for Research and Technology–Hellas (FORTH) and the University of Crete (2006).

SETN conferences play an important role in disseminating innovative and high-quality scientific results by AI researchers, attracting not only EETN members but also scientists advancing and applying AI in many and diverse domains and from various Greek and international institutes. However, the most important aspect of SETN conferences is that they provide the context in which AI researchers meet and discuss their work, as well as an excellent opportunity for students to attend high-quality tutorials and get closer to AI results.

SETN 2010 continued this tradition of excellence, attracting submissions not only from Greece but also from numerous European countries, Asia, and the Americas, which underwent a thorough reviewing process on the basis of their relevance to AI, originality, significance, technical soundness, and presentation. The selection process was hard, with only 28 papers out of the 83 submitted being accepted as full papers and an additional 22 submissions accepted as short papers.

This proceedings volume also includes the abstracts of the invited talks presented at SETN 2010 by four internationally distinguished keynote speakers: Panos Constantopoulos, Michail Lagoudakis, Nikolaos Mavridis, and Demetri Terzopoulos. As yet another indication of the growing international influence and importance of the conference, the EVENTS international workshop on event recognition and tracking chose to be co-located with SETN 2010. And, finally, SETN 2010 hosted the first ever RoboCup event organized in Greece, with the participation of two teams from abroad and one from Greece.

The Area Chairs and members of the SETN 2010 Programme Committee and the additional reviewers did an enormous amount of work and deserve the
special gratitude of all participants. Our sincere thanks go to our sponsors for their generous financial support and to the Steering Committee for its assistance and support. The conference operations were supported in an excellent way by the ConfMaster conference management system; many thanks to Thomas Preuss for his prompt responses to all questions and requests. Special thanks go to Konstantinos Stamatakis for the design of the conference poster and the design and maintenance of the conference website. We also wish to thank the Organizing Committee and Be to Be Travel, the conference travel and organization agent, for implementing the conference schedule in a timely and flawless manner. Last but not least, we also thank Alfred Hofmann, Anna Kramer, Leonie Kunz, and the Springer team for their continuous help and support.

March 2010
Stasinos Konstantopoulos Stavros Perantonis Vangelis Karkaletsis Constantine D. Spyropoulos George Vouros
Organization
SETN 2010 was organized by the Institute of Informatics and Telecommunications, NCSR ‘Demokritos’, and EETN, the Hellenic Society for Artificial Intelligence.
Conference Chairs

Constantine D. Spyropoulos - NCSR ‘Demokritos’, Greece
Vangelis Karkaletsis - NCSR ‘Demokritos’, Greece
George Vouros - University of the Aegean, Greece
Steering Committee

Grigoris Antoniou - FORTH and University of Crete
John Darzentas - University of the Aegean
Nikos Fakotakis - University of Patras
Themistoklis Panayiotopoulos - University of Piraeus
Ioannis Vlahavas - Aristotle University
Organizing Committee

Alexandros Artikis
Vassilis Gatos
Pythagoras Karampiperis
Anastasios Kesidis
Anastasia Krithara
Georgios Petasis
Sergios Petridis
Ioannis Pratikakis
Konstantinos Stamatakis
Dimitrios Vogiatzis
Programme Committee Chairs

Stasinos Konstantopoulos - NCSR ‘Demokritos’
Stavros Perantonis - NCSR ‘Demokritos’
Programme Committee Area Chairs

Ion Androutsopoulos - Athens University of Economics and Business
Nick Bassiliades - Aristotle University of Thessaloniki
Ioannis Hatzilygeroudis - University of Patras
Ilias Maglogiannis - University of Central Greece
Georgios Paliouras - NCSR ‘Demokritos’
Ioannis Refanidis - University of Macedonia
Efstathios Stamatatos - University of the Aegean
Kostas Stergiou - University of the Aegean
Panos Trahanias - FORTH and University of Crete
Programme Committee Members

Dimitris Apostolou - University of Piraeus
Argyris Arnellos - University of the Aegean
Alexander Artikis - NCSR ‘Demokritos’
Grigorios Beligiannis - University of Ioannina
Basilis Boutsinas - University of Patras
Theodore Dalamagas - IMIS Institute/‘Athena’ Research Center
Yannis Dimopoulos - University of Cyprus
Christos Douligeris - University of Piraeus
George Dounias - University of the Aegean
Eleni Galiotou - TEI Athens
Todor Ganchev - University of Patras
Vassilis Gatos - NCSR ‘Demokritos’
Efstratios Georgopoulos - TEI Kalamata
Manolis Gergatsoulis - Ionian University
Nikos Hatziargyriou - National Technical University of Athens
Katerina Kabassi - TEI Ionian
Dimitris Kalles - Hellenic Open University
Kostas Karatzas - Aristotle University of Thessaloniki
Dimitrios Karras - TEI Chalkis
Petros Kefalas - City Liberal Studies
Stefanos Kollias - National Technical University of Athens
Yiannis Kompatsaris - CERTH
Dimitris Kosmopoulos - NCSR ‘Demokritos’
Constantine Kotropoulos - Aristotle University of Thessaloniki
Manolis Koubarakis - National and Kapodistrian University of Athens
Konstantinos Koutroumbas - National Observatory of Athens
Michail Lagoudakis - Technical University of Crete
Aristidis Likas - University of Ioannina
George Magoulas - Birkbeck College, University of London (UK)
Filia Makedon - University of Texas at Arlington (USA)
Manolis Maragoudakis - University of the Aegean
Vassilis Moustakis - Technical University of Crete
Christos Papatheodorou - Ionian University
Pavlos Peppas - University of Patras
Sergios Petridis - NCSR ‘Demokritos’
Stelios Piperidis - ILSP-Athena RC
Vassilis Plagianakos - University of Central Greece
Dimitris Plexousakis - FORTH and University of Crete
George Potamias - FORTH
Ioannis Pratikakis - NCSR ‘Demokritos’
Jim Prentzas - Democritus University of Thrace
Ilias Sakellariou - University of Macedonia
Kyriakos Sgarbas - University of Patras
John Soldatos - AIT
Panagiotis Stamatopoulos - National and Kapodistrian University of Athens
Giorgos Stoilos - Oxford University (UK)
Ioannis Tsamardinos - University of Crete and FORTH
George Tsichrintzis - University of Piraeus
Nikos Vasilas - TEI Athens
Michalis Vazirgiannis - Athens University of Economics and Business
Maria Virvou - University of Piraeus
Spyros Vosinakis - University of the Aegean
Dimitris Vrakas - Aristotle University of Thessaloniki
Additional Reviewers

Charalampos Doukas - University of the Aegean
Anastasios Doulamis - Technical University of Crete
Giorgos Flouris - FORTH
Theodoros Giannakopoulos - NCSR ‘Demokritos’
Katia Kermanidis - Ionian University
Otilia Kocsis - University of Patras
Eleytherios Koumakis - Technical University of Crete
Anastasia Krithara - NCSR ‘Demokritos’
Pavlos Moraitis - Paris Descartes University (France)
Nikolaos Pothitos - National and Kapodistrian University of Athens
Spyros Raptis - ILSP-Athena RC
Vassiliki Rentoumi - NCSR ‘Demokritos’
Evangelos Sakkopoulos - University of Patras
Themos Stafylakis - ILSP-Athena RC
Sophia Stamou - University of Patras
Andreas Symeonidis - Aristotle University of Thessaloniki
Vassilios Vassiliadis - University of the Aegean
Dimitrios Vogiatzis - NCSR ‘Demokritos’
Table of Contents
Invited Talks

Digital Curation and Digital Cultural Memory . . . . . . . . . . . . . . . . . . . . . . . Panos Constantopoulos
1
RoboCup: A Challenge Problem for Artificial Intelligence . . . . . . . . . . . . . Michail G. Lagoudakis
3
Robots, Natural Language, Social Networks, and Art . . . . . . . . . . . . . . . . . Nikolaos Mavridis
5
Artificial Life Simulation of Humans and Lower Animals: From Biomechanics to Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Demetri Terzopoulos
7
Full Papers

Prediction of Aircraft Aluminum Alloys Tensile Mechanical Properties Degradation Using Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . Nikolaos Ampazis and Nikolaos D. Alexopoulos
9
Mutual Information Measures for Subclass Error-Correcting Output Codes Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nikolaos Arvanitopoulos, Dimitrios Bouzas, and Anastasios Tefas
19
Conflict Directed Variable Selection Strategies for Constraint Satisfaction Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thanasis Balafoutis and Kostas Stergiou
29
A Feasibility Study on Low Level Techniques for Improving Parsing Accuracy for Spanish Using Maltparser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Miguel Ballesteros, Jes´ us Herrera, Virginia Francisco, and Pablo Gerv´ as A Hybrid Ant Colony Optimization Algorithm for Solving the Ring Arc-Loading Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anabela Moreira Bernardino, Eug´enia Moreira Bernardino, Juan Manuel S´ anchez-P´erez, Juan Antonio G´ omez-Pulido, and Miguel Angel Vega-Rodr´ıguez Trends and Issues in Description Logics Frameworks for Image Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stamatia Dasiopoulou and Ioannis Kompatsiaris
39
49
61
Unsupervised Recognition of ADLs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Todor Dimitrov, Josef Pauli, and Edwin Naroska Audio Features Selection for Automatic Height Estimation from Speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Todor Ganchev, Iosif Mporas, and Nikos Fakotakis Audio-Visual Fusion for Detecting Violent Scenes in Videos . . . . . . . . . . . Theodoros Giannakopoulos, Alexandros Makris, Dimitrios Kosmopoulos, Stavros Perantonis, and Sergios Theodoridis Experimental Study on a Hybrid Nature-Inspired Algorithm for Financial Portfolio Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Giorgos Giannakouris, Vassilios Vassiliadis, and George Dounias Associations between Constructive Models for Set Contraction . . . . . . . . . Vasilis Giannopoulos and Pavlos Peppas Semantic Awareness in Automated Web Service Composition through Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ourania Hatzi, Dimitris Vrakas, Nick Bassiliades, Dimosthenis Anagnostopoulos, and Ioannis Vlahavas Unsupervised Web Name Disambiguation Using Semantic Similarity and Single-Pass Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Elias Iosif
71
81 91
101 113
123
133
Time Does Not Always Buy Quality in Co-evolutionary Learning . . . . . . Dimitris Kalles and Ilias Fykouras
143
Visual Tracking by Adaptive Kalman Filtering and Mean Shift . . . . . . . . Vasileios Karavasilis, Christophoros Nikou, and Aristidis Likas
153
On the Approximation Capabilities of Hard Limiter Feedforward Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Konstantinos Koutroumbas and Yannis Bakopoulos
163
EMERALD: A Multi-Agent System for Knowledge-Based Reasoning Interoperability in the Semantic Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kalliopi Kravari, Efstratios Kontopoulos, and Nick Bassiliades
173
An Extension of the Aspect PLSA Model to Active and Semi-Supervised Learning for Text Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anastasia Krithara, Massih-Reza Amini, Cyril Goutte, and Jean-Michel Renders A Market-Affected Sealed-Bid Auction Protocol . . . . . . . . . . . . . . . . . . . . . Claudia Lindner
183
193
A Sparse Spatial Linear Regression Model for fMRI Data Analysis . . . . . Vangelis P. Oikonomou and Konstantinos Blekas
203
A Reasoning Framework for Ambient Intelligence . . . . . . . . . . . . . . . . . . . . Theodore Patkos, Ioannis Chrysakis, Antonis Bikakis, Dimitris Plexousakis, and Grigoris Antoniou
213
The Large Scale Artificial Intelligence Applications – An Analysis of AI-Supported Estimation of OS Software Projects . . . . . . . . . . . . . . . . . . . . Wieslaw Pietruszkiewicz and Dorota Dzega Towards the Discovery of Reliable Biomarkers from Gene-Expression Profiles: An Iterative Constraint Satisfaction Learning Approach . . . . . . . George Potamias, Lefteris Koumakis, Alexandros Kanterakis, and Vassilis Moustakis
223
233
Skin Lesions Characterisation Utilising Clustering Algorithms . . . . . . . . . Sotiris K. Tasoulis, Charalampos N. Doukas, Ilias Maglogiannis, and Vassilis P. Plagianakos
243
Mining for Mutually Exclusive Gene Expressions . . . . . . . . . . . . . . . . . . . . . George Tzanis and Ioannis Vlahavas
255
Task-Based Dependency Management for the Preservation of Digital Objects Using Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yannis Tzitzikas, Yannis Marketakis, and Grigoris Antoniou Designing Trading Agents for Real-World Auctions . . . . . . . . . . . . . . . . . . . Ioannis A. Vetsikas and Nicholas R. Jennings Scalable Semantic Annotation of Text Using Lexical and Web Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Elias Zavitsanos, George Tsatsaronis, Iraklis Varlamis, and Georgios Paliouras
265 275
287
Short Papers

A Gene Expression Programming Environment for Fatigue Modeling of Composite Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maria A. Antoniou, Efstratios F. Georgopoulos, Konstantinos A. Theofilatos, Anastasios P. Vassilopoulos, and Spiridon D. Likothanassis
297
303
Event Detection and Classification in Video Surveillance Sequences . . . . . Vasileios Chasanis and Aristidis Likas The Support of e-Learning Platform Management by the Extraction of Activity Features and Clustering Based Observation of Users . . . . . . . . . . Dorota Dzega and Wieslaw Pietruszkiewicz Mapping Cultural Metadata Schemas to CIDOC Conceptual Reference Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Manolis Gergatsoulis, Lina Bountouri, Panorea Gaitanou, and Christos Papatheodorou Genetic Algorithm Solution to Optimal Sizing Problem of Small Autonomous Hybrid Power Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yiannis A. Katsigiannis, Pavlos S. Georgilakis, and Emmanuel S. Karapidakis A WSDL Structure Based Approach for Semantic Categorization of Web Service Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dionisis D. Kehagias, Efthimia Mavridou, Konstantinos M. Giannoutakis, and Dimitrios Tzovaras Heuristic Rule Induction for Decision Making in Near-Deterministic Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stavros Korokithakis and Michail G. Lagoudakis Behavior Recognition from Multiple Views Using Fused Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dimitrios I. Kosmopoulos, Athanasios S. Voulodimos, and Theodora A. Varvarigou A Machine Learning-Based Evaluation Method for Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Katsunori Kotani and Takehiko Yoshimi Feature Selection for Improved Phone Duration Modeling of Greek Emotional Speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexandros Lazaridis, Todor Ganchev, Iosif Mporas, Theodoros Kostoulas, and Nikos Fakotakis A Stochastic Greek-to-Greeklish Transcriber Modeled by Real User Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dimitrios P. Lyras, Ilias Kotinas, Kyriakos Sgarbas, and Nikos Fakotakis Face Detection Using Particle Swarm Optimization and Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ermioni Marami and Anastasios Tefas
309
315
321
327
333
339
345
351
357
363
369
Reducing Impact of Conflicting Data in DDFS by Using Second Order Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Luca Marchetti and Luca Iocchi Towards Intelligent Management of a Student’s Time . . . . . . . . . . . . . . . . . Evangelia Moka and Ioannis Refanidis Virtual Simulation of Cultural Heritage Works Using Haptic Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Konstantinos Moustakas and Dimitrios Tzovaras Ethnicity as a Factor for the Estimation of the Risk for Preeclampsia: A Neural Network Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Costas Neocleous, Kypros Nicolaides, Kleanthis Neokleous, and Christos Schizas A Multi-class Method for Detecting Audio Events in News Broadcasts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sergios Petridis, Theodoros Giannakopoulos, and Stavros Perantonis
375 383
389
395
399
Flexible Management of Large-Scale Integer Domains in CSPs . . . . . . . . . Nikolaos Pothitos and Panagiotis Stamatopoulos
405
A Collaborative System for Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . Vassiliki Rentoumi, Stefanos Petrakis, Vangelis Karkaletsis, Manfred Klenner, and George A. Vouros
411
Minimax Search and Reinforcement Learning for Adversarial Tetris . . . . Maria Rovatsou and Michail G. Lagoudakis
417
A Multi-agent Simulation Framework for Emergency Evacuations Incorporating Personality and Emotions . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexia Zoumpoulaki, Nikos Avradinis, and Spyros Vosinakis
423
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
429
Digital Curation and Digital Cultural Memory Panos Constantopoulos Department of Informatics, Athens University of Economics and Business, Athens 10434, Greece and Digital Curation Unit, IMIS – ‘Athena’ Research Centre, Athens 11524, Greece
Abstract. The last two decades have witnessed an ever increasing penetration of digital media initially in the management and, subsequently, in the study of culture. From collections management, object documentation, domain knowledge representation and reasoning, to supporting the creative synthesis and re-interpretation of data in the framework of digital productions, significant progress has been achieved in the development of relevant knowledge and software tools. Developing a standard ontology for the cultural domain stands out as the most prominent such development. As a consequence of this progress, digital repositories are created that aim at serving as digital cultural memories, while a process of convergence has started among the different kinds of memory institutions, i.e., museums, archives, and libraries, in what concerns their information functions. The success of digital cultural memories will be decided against rivals with centuries-long tradition. The advantages offered by technology, mass storage, copying, and the ease of searching and quantitative analysis, will not suffice unless reliability, long-term preservation, and the ability to re-use, re-combine and re-interpret digital content are ensured. To this end digital curation is exercised. In this talk we will examine the development of digital cultural memories using digital curation. More specifically, we will discuss issues of knowledge representation and reasoning, we will present some examples of interesting research and development efforts, and will refer to certain current trends.
RoboCup: A Challenge Problem for Artificial Intelligence Michail G. Lagoudakis Intelligent Systems Laboratory Department of Electronic and Computer Engineering Technical University of Crete Chania 73100, Greece
Abstract. The RoboCup competition is the international robotic soccer world cup organized annually since 1997. The initial conception by Hiroaki Kitano in 1993 led to the formation of the RoboCup Federation with a bold vision: By the year 2050, to develop a team of fully autonomous humanoid robots that can win against the human world soccer champions! RoboCup poses a real-world challenge for Artificial Intelligence, which requires addressing simultaneously the core problems of perception, cognition, action, and coordination under real-time constraints. In this talk, I will outline the vision, the challenges, and the contribution of the RoboCup competition in its short history. I will also offer an overview of the research efforts of team Kouretes, the RoboCup team of the Technical University of Crete, on topics ranging from complex motion design, efficient visual recognition, and self-localization to robotic software engineering, distributed communication, skill learning, and coordinated game play. My motivation is to inspire researchers and students to form teams with the goal of participating in the various leagues of this exciting and challenging benchmark competition and ultimately contributing to the advancement of the state-of-the-art in Artificial Intelligence and Robotics.
Robots, Natural Language, Social Networks, and Art Nikolaos Mavridis Interactive Robots and Media Lab United Arab Emirates University Al Ain 17551, U.A.E.
Abstract. Creating robots that can fluidly converse in natural language, and cooperate and socialize with their human partners, is a goal that has always captured human imagination. Furthermore, it is a goal that requires truly interdisciplinary research: engineering, computer science, as well as the cognitive sciences are crucial towards its realization. Challenges and current progress towards this goal will be illustrated through two real-world robot examples: the conversational robot Ripley, and the FaceBots social robots which utilize and publish social information on the FaceBook website. Finally, a quick glimpse towards novel educational and artistic avenues opened by such robots will be provided, through the Interactive Theatre installation of the Ibn Sina robot.
Artificial Life Simulation of Humans and Lower Animals: From Biomechanics to Intelligence Demetri Terzopoulos Computer Science Department University of California, Los Angeles Los Angeles, CA 90095-1596, U.S.A.
Abstract. The confluence of virtual reality and artificial life, an emerging discipline that spans the computational and biological sciences, has yielded synthetic worlds inhabited by realistic artificial flora and fauna. The latter are complex synthetic organisms with functional, biomechanically simulated bodies, sensors, and brains with locomotion, perception, behavior, learning, and cognition centers. These biomimetic autonomous agents in their realistic virtual worlds foster deeper computationally oriented insights into natural living systems. Virtual humans and lower animals are of great interest in computer graphics because they are self-animating graphical characters poised to dramatically advance the motion picture and interactive game industries. Furthermore, they engender interesting new applications in computer vision, medical imaging, sensor networks, archaeology, and many other domains.
Prediction of Aircraft Aluminum Alloys Tensile Mechanical Properties Degradation Using Support Vector Machines Nikolaos Ampazis and Nikolaos D. Alexopoulos Department of Financial and Management Engineering University of the Aegean 82100 Chios, Greece
[email protected],
[email protected]
Abstract. In this paper we utilize Support Vector Machines to predict the degradation of the mechanical properties, due to surface corrosion, of the Al 2024-T3 aluminum alloy used in the aircraft industry. Precorroded surfaces from Al 2024-T3 tensile specimens for various exposure times to EXCO solution were scanned and analyzed using image processing techniques. The generated pitting morphology and individual characteristics were measured and quantified for the different exposure times of the alloy. The pre-corroded specimens were then tensile tested and the residual mechanical properties were evaluated. Several pitting characteristics were directly correlated to the degree of degradation of the tensile mechanical properties. The support vector machine models were trained by taking as inputs all the pitting characteristics of each corroded surface to predict the residual mechanical properties of the 2024-T3 alloy. The results indicate that the proposed approach constitutes a robust methodology for accurately predicting the degradation of the mechanical properties of the material. Keywords: Material Science, Corrosion Prediction, Machine Learning, Support Vector Machines.
1 Introduction
One of the most widely used aluminum alloys in the aircraft industry is the damage-tolerant Al 2024-T3 alloy, currently used in the skin and the wings of many civil aircraft. The main problems facing design and inspection engineers are the fatigue, corrosion and impact damage that the fuselage and wing skins are subjected to. Corrosion damage of the material is also critical to the structural integrity of the aircraft. It has been calculated that, for aircraft that have been in service for more than a decade, about 40% of the maintenance repairs were associated with corrosion damage. Figure 1 shows typical surface corrosion damage produced at the wings of an in-service aircraft. Since the material of a component is subjected to corrosion, it is expected that its critical mechanical properties
Fig. 1. Photograph showing corrosion products formed at the lower surface of an in-service aircraft wing. Source: Hellenic Aerospace Industry S.A.
might vary with increasing service time and thus, must be taken into account for the structural integrity calculation of the component. The effect of corrosion damage on the reference alloy has been studied in various works. The exposure of the alloy 2024-T3 on various accelerated, laboratory environments, e.g. [1,2,3,4], resulted in the formation of large pits and micro-cracks on the sub-surface of the specimens, that lead to exfoliation of the alloy with increasing exposure time. This has a deleterious impact on the residual mechanical properties, especially in the tensile ductility. Alexopoulos and Papanikos [3] noticed that after the exposure for only 2h (hours), the ductility of the 2024-T3 decreased by almost 20%. The decrease of all mechanical properties and for all the spectra of exposure to the corrosive solution was attributed to the pitting that was formed on the surface of the specimens and their induced cracks to the cross-section of the specimen. In a number of publications, e.g. [5,6] it was shown that machine learning methods can be used in the wider field of materials science and, more specifically, to predict mechanical properties of aluminium alloys. In [5] it was demonstrated that Least Squares Support Vector Machines (LSSVM) are quite applicable for simulation and monitoring of the ageing process optimization of AlZnMgCu series alloys. In [6] Artificial Neural Networks (ANNs) were used for the estimation of flow stress of AA5083 with regard to dynamic strain ageing that occurs in certain deformation conditions. The input variables were selected to be strain rate, temperature and strain, and the prediction variable was the flow stress. However the use of ANNs in coupled fields of corrosion / material science and mechanics is still limited. Some literature publications can be found for the exploitation of ANNs to the corrosion of steels and Ti alloys, e.g. [7,8,9]. In these cases, different chloride concentrations, pH and temperature were used to model and predict the surface pitting corrosion behaviour. Additionally in [9] various polarized corrosion data were used to predict the future maximum pit depth with good agreements between estimation/prediction and experimental data.
The prediction of surface corrosion of aluminium alloys with the exploitation of ANNs has also been attempted in the literature. Leifer [10] attempted to predict via neural networks the pit depth of aluminium alloy 1100 when subjected to natural water corrosion. The trained model was found capable of predicting the expected pit depth as a function of water pH, the concentrations of carbonate (CO3^2-), copper (Cu^2+) and chloride (Cl^-) ions, as well as storage time. Pidaparti et al. [11] trained an ANN on 2024-T3 to predict the degradation of chemical elements obtained from Energy Dispersive X-ray Spectrometry (EDS) on corroded specimens. Input parameters to the ANN model were the alloy composition, electrochemical parameters and corrosion time. Though the trained models worked in all the above cases, they provide no information regarding the residual mechanical properties of the corroded materials, which is needed in order to calculate the structural health of a corroded structure. This was first attempted in the case of aluminium alloys in [12,13,14], where neural network models were trained to predict the fatigue performance of pre-corroded specimens. The inputs of the models were maximum corrosion depth, fatigue performance, corrosion temperature and corrosion time. The models were trained with the back propagation learning algorithm, in order to predict the maximum corrosion depth and fatigue performance of prior-corroded aluminium alloys.

All existing models in the case of corrosion of aluminium alloys take different parameters as inputs, such as the composition of the alloy, the maximum pit depth and the pitting density of the surface. In order to train an ANN model to predict the residual tensile mechanical behaviour of pre-corroded aluminium alloys, the input parameters in many cases are too few and the available training patterns do not usually exceed one hundred data points. In addition, in all cases only the value of the maximum pit depth generated by the surface corrosion of an alloy has been taken into account within the ANN models. However, this is not always the critical parameter to be utilized, and usually other pit characteristics are neglected. Jones and Hoeppner [15] demonstrated that the shape and size of a pit are major factors affecting the fatigue life of pre-corroded 2024-T3 specimens. The microcracking network is a precursor of the nucleating fatigue crack that seriously degrades the fatigue life of the specimen. Van der Walde and Hillberry [16] also showed that in approximately 60% of the cases the fatigue crack of the same alloy initiates at the maximum-depth pit. Hence, it is of imperative importance to characterize the whole corroded surface area of the alloy and correlate the findings on the corrosion-induced pits with the residual mechanical properties.

In the present work, the corroded surfaces were analyzed by employing image analysis techniques in order to extract meaningful training features. Specific areas at the tensile specimens' gauge length locations were scanned before tensile testing. Any formation of corrosion-induced pits was characterized and quantified as a function of the material's exposure time to the corrosive environment. In each case study, the number and the morphology of the corrosion-induced pits were correlated with the residual tensile mechanical properties of the 2024-T3 alloy specimens. Support vector machines were then trained as regressors with the
resulting features in order to predict the degradation of a number of mechanical properties for different exposure times.
2 Support Vector Machines
Support Vector Machines (SVM) were first introduced as a new class of machine learning techniques by Vapnik [17] and are based on the structural risk minimization principle. An SVM seeks a decision surface to separate the training data points into two classes and makes decisions based on the support vectors that are selected as the only effective elements from the training set. The goal of SVM learning is to find the optimal separating hyper-plane (OSH) that has the maximal margin to both sides of the data classes. This can be formulated as:

\[
\min_{w,b} \; \frac{1}{2} w^T w \quad \text{subject to} \quad y_i (w x_i + b) \ge 1 \qquad (1)
\]
where $y_i \in \{-1, +1\}$ is the decision of the SVM for pattern $x_i$ and $b$ is the bias of the separating hyperplane. After the OSH has been determined, the SVM makes decisions based on the globally optimized separating hyper-plane by finding out on which side of the OSH the pattern is located. This property makes SVM highly competitive with other traditional pattern recognition methods in terms of predictive accuracy and efficiency. Support Vector Machines may also be used for regression problems with the following simple modification:

\[
\min_{w,b,\xi,\hat{\xi}} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\xi_i + \hat{\xi}_i)
\quad \text{subject to} \quad (w x_i + b) - y_i \le \varepsilon + \xi_i \;\; \text{and} \;\; y_i - (w x_i + b) \le \varepsilon + \hat{\xi}_i \qquad (2)
\]

where $\xi_i$ is a slack variable introduced for exceeding the target value by more than $\varepsilon$, and $\hat{\xi}_i$ a slack variable for being more than $\varepsilon$ below the target value [18]. The idea of the Support Vector Machine is to find a model which guarantees the lowest classification or regression error by controlling the model complexity (VC-dimension) based on the structural risk minimization principle. This avoids over-fitting, which is the main problem for other learning algorithms.
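As an illustration of formulation (2), the sketch below fits an ε-insensitive SVM regressor with scikit-learn; the toy data, kernel and parameter values are assumptions for demonstration only, not the configuration used in this work.

```python
# Minimal sketch: epsilon-insensitive SVM regression as in formulation (2),
# using scikit-learn's SVR. Data, kernel and parameter values are illustrative.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))            # toy feature vectors x_i
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.randn(200)    # toy targets y_i

# C weights the slack terms (xi, xi_hat); epsilon is the width of the insensitive tube.
model = SVR(kernel="linear", C=1.0, epsilon=0.1)
model.fit(X, y)

print("number of support vectors:", model.support_vectors_.shape[0])
print("prediction for first sample:", model.predict(X[:1])[0])
```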
3 Material Data and Experimental Procedure
The material used was a wrought aluminum alloy 2024-T3 which was received in sheet form with nominal thickness of 3.2 mm. The surfaces of the tensile specimens were cleaned with acetone and then they were exposed to the laboratory exfoliation corrosion environment (hereafter called EXCO solution) according to specification ASTM G34. The specimens were exposed to the EXCO solution
Table 1. Image analysis measuring parameters and their physical interpretation in the corrosion-induced pitting problem

Feature: Measurement and physical interpretation
Area: Area of each individual object (pit) - does not include holes that have the same color as the matrix
Density (mean): Average optical density (or intensity) of the object - an indication of the mean depth of each pit
Axis (major): Length of the major axis of a fitted ellipse - maximum length of a pit in one axis
Axis (minor): Length of the minor axis of a fitted ellipse - maximum length of a pit in the transverse axis
Diameter (max): Length of the longest line joining two points of the object's outline and passing through the centroid - calculation of the maximum diameter of each pit
Per-Area: Ratio of the area of the object to the total investigated area
for a number of different exposure times. More details regarding the corrosion procedure can be seen in the respective specification as well as in [3,4]. After the exposure, the corroded specimens were cleaned with running water to remove any surface corrosion products, e.g. salt deposits. The reduced cross-section area (gauge length) of the specimens was scanned in individual images and in grayscale format. Only this part of the tensile specimen was examined for surface corrosion pits, as it can be directly correlated with the relative mechanical properties of the same specimen. Image analysis was performed by using the ImagePro image processing, enhancement, and analysis software [19]. The same surface area of the approximate value of 500 mm2 was analyzed for each testing specimen and for various corrosion exposure durations, namely for 2h, 12h, 24h, 48h, and 96h. Individual characterization of each formed corrosion-induced surface pit was made and statistical values of the generated pits were calculated. The selected parameters for the quantification of the corrosion-induced surface pits as well as their physical interpretation are summarized in Table 1.
A number of different parameters were chosen to quantify the geometry of the pits, e.g. major and minor axis, aspect ratio and diameter of the pits. In addition, the number, area and perimeter of the pits were measured and used to calculate the pitting coverage area of the total investigated area. After the corrosion exposure for each of the aforementioned durations, the testing specimens were subjected to mechanical testing. Details regarding the mechanical testing part can be found elsewhere [3,4]. Evaluated properties which were later predicted by the SVMs were: yield strength Rp (0.2% proof stress), tensile strength Rm , elongation to fracture Af , and strain energy density W .
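The measurements of Table 1 were obtained with the commercial ImagePro package; purely as an illustration, the sketch below shows how comparable per-pit features could be computed with open-source tools (scikit-image). The file name, binarization threshold and pixel units are assumptions.

```python
# Sketch of extracting Table 1-style pit features from a scanned grayscale image.
# This open-source equivalent of the commercial workflow is only an illustration;
# the threshold, file name and pixel scale are assumed.
import numpy as np
from skimage import io, measure

gray = io.imread("specimen_gauge_area.png", as_gray=True)  # hypothetical scan
pits = gray < 0.5                  # pits appear as dark dots on a bright matrix
labels = measure.label(pits)       # connected components: one label per pit
total_area = gray.size             # investigated area in pixels

rows = []
for p in measure.regionprops(labels, intensity_image=gray):
    rows.append({
        "area": p.area,                        # pit area
        "density_mean": p.mean_intensity,      # mean intensity ~ mean pit depth
        "axis_major": p.major_axis_length,     # major axis of the fitted ellipse
        "axis_minor": p.minor_axis_length,     # minor axis of the fitted ellipse
        "diameter_max": p.feret_diameter_max,  # longest extent of the pit outline
        "per_area": p.area / total_area,       # fraction of the investigated area
    })

print("number of pits:", len(rows))
print("pitting coverage area [%]:",
      100.0 * sum(r["area"] for r in rows) / total_area)
```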
4 Results and Discussion

4.1 Corrosion Analysis
Figure 2 shows the scanned images of the surfaces of four different 2024-T3 specimens after their exposure to different times to EXCO solution, (a) to (d), respectively. As can be seen in the figure, the pits are starting to be seen as small gray/black dots in the white surface of the reference, uncorroded specimen. With increasing exposure time, the pits seen in the specimen’s surface are increasing their coverage over the total investigated area. To the best of our knowledge their increase seems to follow an unknown rule.
Fig. 2. Pit surfaces after exposure for: (a) 2 hours, (b) 12 hours, (c) 48 hours, (d) 96 hours to EXCO solution
Quantitative data of the corrosion-induced surface pits, after using image analysis can be seen in Figure 3. The number of pits is continuously increasing with increasing exposure time to the solution; their total number almost reaches 15,000 by using an exponential decreasing fitting curve. The number of recorded pits for each exposure duration is shown in Table 2. Since the number of pits alone is not informative enough, a more representative parameter to denote the effect of corrosion is the pitting coverage area; it is calculated as the percentage fraction of the total area of the pits to the investigated area of the specimen. The results can also be seen in Figure 3, where also
an exponential decrease curve fitting is proposed to simulate this phenomenon. Besides, it seems that up to 24h exposure the increase is almost linear with continuously increasing exposure.

Table 2. Number of corrosion-induced surface pits at different exposure durations

Exposure Duration (hours)    Number of pits
2                            2199
12                           3696
24                           11205
48                           12363
96                           14699

Fig. 3. Statistical quantitative analysis of (a) number of pits and pitting coverage area and (b) the aspect ratio of the formed pits. (Both panels are plotted against the alloy exposure time to the EXCO solution [h]; panel (b) also shows the mean values of the pits' major and minor axes [mm], and panel (a) includes the fitting curves for the pitting coverage area and the number of pits.)
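The exact functional form of the fitting curves is not stated here; assuming a saturating exponential and using the pit counts of Table 2, a fit of the kind discussed above could be obtained as follows (SciPy is an assumption, not necessarily the tool used by the authors).

```python
# Sketch of an exponential fit for the number of pits versus exposure time.
# The saturating-exponential form is an assumption; the data points are Table 2.
import numpy as np
from scipy.optimize import curve_fit

t = np.array([2.0, 12.0, 24.0, 48.0, 96.0])            # exposure time [h]
n_pits = np.array([2199, 3696, 11205, 12363, 14699])   # Table 2

def saturating_exp(t, a, tau):
    # a: asymptotic number of pits, tau: characteristic time constant [h]
    return a * (1.0 - np.exp(-t / tau))

params, _ = curve_fit(saturating_exp, t, n_pits, p0=[15000.0, 20.0])
a_fit, tau_fit = params
print(f"asymptote ~ {a_fit:.0f} pits, time constant ~ {tau_fit:.1f} h")
print("fitted pit counts:", saturating_exp(t, *params).round(0))
```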
4.2 SVM Prediction Results
We trained four different SVMs for predicting four tensile mechanical properties (namely yield strength Rp, tensile strength Rm, elongation to fracture Af and strain energy density W) of the pre-corroded specimens, taking into account their initial values for the reference (uncorroded) specimens. As training patterns we used the various pit features corresponding to pits formed at the 2h, 12h, 48h, and 96h exposure durations. This resulted in a total of 32957 training points for each SVM. The performance of each SVM was evaluated on the prediction of the mechanical property residuals for the set of pits appearing at the 24h exposure (11205 testing points). This particular exposure time was selected since in [3] it was shown that at 24h the hydrogen embrittlement degradation mechanism of the mechanical properties is saturated. As a performance measure for the accuracy of the SVMs we used the Root Mean Square Error (RMSE) criterion between the actual and predicted values. For training the SVMs we used the SVMlight package [20] compiled with the Intel C/C++ Compiler Professional Edition for Linux. Training of the SVMs was run on a 2.5GHz Quad Core Pentium CPU with 4G RAM running the Ubuntu 9.10 desktop x86 64 (Karmic Koala) operating system. The total running time of each SVM training was approximately 5 to 10 seconds.

In our experiments we linearly scaled each feature to the range [-1, +1]. Scaling training data before applying SVM is very important. The main advantage is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation. Because kernel values usually depend on the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel, large attribute values might cause numerical problems [21]. With the same method, testing data features were scaled to the training data ranges before testing. The training target outputs were also scaled to [0, +1] and the output of each SVM was then transformed back from the [0, +1] range to its original target value in order to calculate the RMSE for each mechanical property residual. The prediction accuracy of the trained SVMs is summarized in Table 3. Standard deviation values of the mechanical properties of pre-corroded specimens appearing in the table have been previously calculated based on three different experiments in order to obtain reliable statistical values.

Table 3. RMSE of trained SVMs and standard deviation (calculated from real measurements)

Mechanical Property    RMSE    Std. Dev. (Real Measurements)
Rp                     1.5     2.5
Rm                     0.5     2.0
Af                     0.45    0.23
W                      1.35    1.16

As can be seen, all predicted mechanical properties for pre-corrosion of 2024-T3 for 24 hours to EXCO
solution are very close to the actually measured properties. The RMSE values are of the same order of magnitude as, or even lower than, the standard deviation values of the experiments. Hence, it is evident that the calculation of the residual mechanical properties of corroded specimens can be performed by quantitative analysis of the corroded surface and trained SVMs. This is of imperative importance according to the damage tolerance philosophy in aircraft structures, as the corroded part may still carry mechanical loads and shall not be replaced.
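A minimal sketch of the protocol described above (feature scaling to [-1, +1] on the training ranges, target scaling to [0, 1], SVM regression and RMSE on the 24h hold-out set) is given below; scikit-learn stands in for the SVMlight package actually used, the kernel and parameters are illustrative, and the data arrays are placeholders.

```python
# Sketch of the evaluation protocol: scale features with the training ranges,
# scale targets to [0, 1], train an SVM regressor, report RMSE on the hold-out set.
# X_train / X_test / y_* are placeholders for the pit-feature matrices and targets.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

def evaluate_property(X_train, y_train, X_test, y_test):
    x_scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
    y_scaler = MinMaxScaler(feature_range=(0, 1)).fit(y_train.reshape(-1, 1))

    model = SVR(kernel="rbf", C=10.0, epsilon=0.01)   # illustrative parameters
    model.fit(x_scaler.transform(X_train),
              y_scaler.transform(y_train.reshape(-1, 1)).ravel())

    # Transform predictions back to the original property range before scoring.
    y_pred = y_scaler.inverse_transform(
        model.predict(x_scaler.transform(X_test)).reshape(-1, 1)).ravel()
    return float(np.sqrt(np.mean((y_pred - y_test) ** 2)))   # RMSE

# One call per mechanical property (Rp, Rm, Af, W), e.g. (hypothetical arrays):
# rmse_rp = evaluate_property(X_train_2_12_48_96h, rp_train, X_test_24h, rp_test)
```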
5 Conclusions
Support Vector Machines were used to predict the effect of existing corrosion damage on the residual tensile properties of the Al 2024-T3 aluminum alloy. An extensive experimental preprocessing was performed in order to scan and analyze (with image processing techniques) different pre-corroded surfaces from tensile specimens to extract training features. The pre-corroded tensile specimens were then tensile tested and the residuals of four mechanical properties (yield strength, tensile strength, elongation to fracture, and strain energy density) were evaluated. Several pitting characteristics were directly correlated to the degree of decrease of the tensile mechanical properties. The results achieved by the SVMs show that the predicted values of the mechanical properties have been in very good agreement with the experimental data, and this can prove valuable for optimizing service time and inspection operational procedures. Finally, the prediction accuracy achieved is encouraging for the exploitation of the SVM models also for other alloys in use in engineering structural applications.
References 1. Pantelakis, S., Daglaras, P., Apostolopoulos, C.: Tensile and energy density properties of 2024, 6013, 8090 and 2091 aircraft aluminum alloy after corrosion exposure. Theoretical and Applied Fracture Mechanics 33, 117–134 (2000) 2. Kamoutsi, H., Haidemenopoulos, G., Bontozoglou, V., Pantelakis, S.: Corrosioninduced hydrogen embrittlement in aluminum alloy 2024. Corrosion Science 48, 1209–1224 (2006) 3. Alexopoulos, N., Papanikos, P.: Experimental and theoretical studies of corrosioninduced mechanical properties degradation of aircraft 2024 aluminum alloy. Materials Science and Engineering A A 498, 248–257 (2008) 4. Alexopoulos, N.: On the corrosion-induced mechanical degradation for different artificial aging conditions of 2024 aluminum alloy. Materials Science and Engineering A A520, 40–48 (2009) 5. Fang, S., Wanga, M., Song, M.: An approach for the aging process optimization of Al-Zn-Mg-Cu series alloys. Materials and Design 30, 2460–2467 (2009) 6. Sheikh, H., Serajzadeh, S.: Estimation of flow stress behavior of AA5083 using artificial neural networks with regard to dynamic strain ageing effect. Journal of Materials Processing Technology 196, 115–119 (2008) 7. Ramana, K., Anita, T., Mandal, S., Kaliappan, S., Shaikh, H.: Effect of different environmental parameters on pitting behavior of AISI type 316L stainless steel: Experimental studies and neural network modeling. Materials and Design 30, 3770– 3775 (2009)
8. Wang, H.T., Han, E.H., Ke, W.: Artificial neural network modeling for atmospheric corrosion of carbon steel and low alloy steel. Corrosion Science and Protection Technology 18, 144–147 (2006) 9. Kamrunnahar, M., Urquidi-Macdonald, M.: Prediction of corrosion behavior using neural network as a data mining tool. Corrosion Science (2009) (in press) 10. Leifer, J.: Prediction of aluminum pitting in natural waters via artificial neural network analysis. Corrosion 56, 563–571 (2000) 11. Pidaparti, R., Neblett, E.: Neural network mapping of corrosion induced chemical elements degradation in aircraft aluminum. Computers, Materials and Continua 5, 1–9 (2007) 12. Liu, Y., Zhong, Q., Zhang, Z.: Predictive model based on artificial neural network for fatigue performance of prior-corroded aluminum alloys. Acta Aeronautica et Astronautica Sinica 22, 135–139 (2001) 13. Fan, C., He, Y., Zhang, H., Li, H., Li, F.: Predictive model based on genetic algorithm-neural network for fatigue performances of pre-corroded aluminum alloys. Key Engineering Materials 353-358, 1029–1032 (2007) 14. Fan, C., He, Y., Li, H., Li, F.: Performance prediction of pre-corroded aluminum alloy using genetic algorithm-neural network and fuzzy neural network. Advanced Materials Research 33-37, 1283–1288 (2008) 15. Jones, K., Hoeppner, D.: Prior corrosion and fatigue of 2024-T3 aluminum alloy. Corrosion Science 48, 3109–3122 (2006) 16. Van der Walde, K., Hillbrry, B.: Initiation and shape development of corrosionnucleated fatigue cracking. International Journal of Fatigue 29, 1269–1281 (2007) 17. Vapnik, V.: The Nature of Statistical Learning Theory. Wiley, New York (1998) 18. Webb, A.R.: Statistical Pattern Recognition, 2nd edn. Wiley, Chichester (2002) 19. MediaCybernetics: Image pro web page., http://www.mediacy.com/index.aspx?page=IPP 20. Joachims, T.: SVM light (2002), http://svmlight.joachims.org 21. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~ cjlin/libsvm
Mutual Information Measures for Subclass Error-Correcting Output Codes Classification Nikolaos Arvanitopoulos, Dimitrios Bouzas, and Anastasios Tefas Aristotle University of Thessaloniki, Department of Informatics Artificial Intelligence & Information Analysis Laboratory {niarvani,dmpouzas}@csd.auth.gr,
[email protected]
Abstract. Error-Correcting Output Codes (ECOCs) reveal a common way to model multi-class classification problems. According to this state of the art technique, a multi-class problem is decomposed into several binary ones. Additionally, on the ECOC framework we can apply the subclasses technique (sub-ECOC), where by splitting the initial classes of the problem we aim to the creation of larger but easier to solve ECOC configurations. The multi-class problem’s decomposition is achieved via a searching procedure known as sequential forward floating search (SFFS). The SFFS algorithm in each step searches for the optimum binary separation of the classes that compose the multi-class problem. The separation decision is based on the maximization or minimization of a criterion function. The standard criterion used is the maximization of the mutual information (MI) between the bi-partitions created in each step of the SFFS. The materialization of the MI measure is achieved by a method called fast quadratic Mutual Information (FQMI). Although FQMI is quite accurate in modelling the MI, its computation is of high algorithmic complexity, which as a consequence makes the ECOC and sub-ECOC techniques applicable only on small datasets. In this paper we present some alternative separation criteria of reduced computational complexity that can be used in the SFFS algorithm. Furthermore, we compare the performance of these criteria over several multi-class classification problems. Keywords: Multi-class classification, Subclasses, Error-Correcting Output Codes, Support Vector Machines, Sequential Forward Floating Search, Mutual Information.
1 Introduction
In the literature one can find various binary classification techniques. However, in the real world the problems to be addressed are usually multi-class. In dealing with multi-class problems we must use the binary techniques as leverage. This can be achieved by defining a method that decomposes the multi-class problem into several binary ones, and combines their solutions to solve the initial multi-class problem [1]. In this context, the Error-Correcting Output Codes (ECOCs)
emerged. Based on the error correcting principles [2] and on its ability to correct the bias and variance errors of the base classifiers [3], this state of the art technique has been proved valuable in solving multi-class classification problems over a number of fields and applications. As proposed by Escalera et al. [4], on the ECOC framework we can apply the subclass technique. According to this technique, we use a guided problem dependent procedure to group the classes and split them into subsets with respect to the improvement we obtain in the training performance. Both the ECOC and sub-ECOC techniques can be applied independently to different types of classifiers. In our work we applied both of those techniques on Linear and RBF (Radial Basis Function) SVM (Support Vector Machine) classifiers with various configurations. SVMs are very powerful classifiers capable of materializing optimum classification surfaces that give improved results in the test domain. As mentioned earlier, the ECOC as well as the sub-ECOC techniques use the SFFS algorithm in order to decompose a multi-class problem into smaller binary ones. The problem's decomposition is based on a criterion function that maximizes or minimizes a certain quantity according to the nature of the criterion used. The common way is to maximize the MI (mutual information) in both the bi-partitions created by SFFS. As proposed by Torkkola [5], we can model the MI in the bi-partitions through the FQMI (Fast Quadratic Mutual Information) method. However, although the FQMI procedure is quite accurate in modelling the MI of a set of classes, it turns out to be computationally costly. In this paper we propose some novel MI measures of reduced computational complexity, which in certain classification problems yield better performance results than the FQMI. Furthermore, we compare these MI measures over a number of multi-class classification problems in the UCI machine learning repository [6].
1.1 Error Correcting Output Codes (ECOC)
Error Correcting Output Codes is a general framework to solve multi-class problems by decomposing them into several binary ones. This technique consists of two separate steps: a) the encoding and b) the decoding step [7].
a) In the encoding step, given a set of N classes, we assign a unique binary string called codeword¹ to each class. The length n of each codeword represents the number of bi-partitions (groups of classes) that are formed and, consequently, the number of binary problems to be trained. Each bit of the codeword represents the response of the corresponding binary classifier and it is coded by +1 or -1, according to its class membership. The next step is to arrange all these codewords as rows of a matrix obtaining the so-called coding matrix M, where M ∈ {−1, +1}^{N×n}. Each column of this matrix defines a partition of classes, while each row defines the membership of the corresponding class in the specific binary problem.
¹ The codeword is a sequence of bits of a code representing each class, where each bit identifies the membership of the class for a given binary classifier.
An extension of this standard ECOC approach was proposed by Allwein et al. [1] by adding a third symbol to the coding process. The new coding matrix M is now M ∈ {−1, 0, +1}^{N×n}. In this approach, the zero symbol means that a certain class is not considered by a specific binary classifier. As a result, this symbol increases the number of bi-partitions that can be created in the ternary ECOC framework. b) The decoding step of the ECOC approach consists of applying the n different binary classifiers to each data sample in the test set, in order to obtain a code for this sample. This code is then compared to all the codewords of the classes defined in the coding matrix M (each row of M defines a codeword) and the sample is assigned to the class with the closest codeword. The most frequently used decoding methods are the Hamming and the Euclidean decoding distances.
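As a concrete illustration of the decoding step, the following minimal Python sketch (our own example, not code from the paper) applies Hamming decoding to a ternary coding matrix; treating the zero symbol as a fixed cost of 0.5 per bit is one common convention and an assumption here.

```python
import numpy as np

def hamming_decode(M, code):
    """Assign a test sample to the class whose codeword is closest in
    Hamming distance. M is the N x n coding matrix with entries in
    {-1, 0, +1}; code is the length-n vector of binary classifier
    outputs (+1/-1) for the sample."""
    # matching bits cost 0, mismatching bits cost 1, zero entries cost 0.5
    dists = np.sum((1 - np.sign(M * code)) / 2.0, axis=1)
    return int(np.argmin(dists))

# toy example: 3 classes, 3 binary problems (one-versus-one style ternary coding)
M = np.array([[+1, +1,  0],
              [-1,  0, +1],
              [ 0, -1, -1]])
print(hamming_decode(M, np.array([+1, +1, -1])))  # -> class 0
```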
1.2 Sub-ECOC
Escalera et al. [4] proposed that, from an initial set of classes C of a given multi-class problem, we can define a new set of classes C′ whose cardinality is greater than that of C, that is |C′| > |C|. The new set of binary problems that is created improves the training performance of the resulting classifiers. In addition to the ECOC framework, Pujol et al. [8] proposed a ternary problem-dependent design of ECOC, called discriminant ECOC (DECOC), where, given a number of N classes, we can achieve high classification performance by training only N − 1 binary classifiers. The combination of the above-mentioned methods results in a new classification procedure called sub-ECOC. The procedure is based on the creation of discriminant tree structures which depend on the problem domain. These binary trees are built by choosing the problem partitioning that maximizes the MI between the samples and their respective class labels. The structure as a whole describes the decomposition of the initial multi-class problem into an assembly of smaller binary sub-problems. Each node of the tree represents a pair consisting of a specific binary sub-problem and its respective classifier. The construction of the tree's nodes is achieved through an evaluation procedure described in Escalera et al. [4]. According to this procedure, we can split the bi-partitions that constitute the current sub-problem under examination. Splitting can be achieved using K-means or some other clustering method. After splitting, we form two new problems that can be examined separately. On each of the new problems created, we repeat the SFFS procedure independently in order to form two new separate sub-problem domains that are easier to solve. Next, we evaluate the two new problem configurations against three user-defined thresholds {θp, θs, θi} described below. If the thresholds are satisfied, the newly created pair of sub-problems is accepted along with their newly created binary classifiers; otherwise they are rejected and we keep the initial configuration with its respective binary classifier (a sketch of this acceptance test follows the threshold list below).
– θp : performance of the created classifier for the newly created problem (after splitting).
– θs : minimum cluster size.
– θi : performance improvement of the classifier for the newly created problem over the previous classifier (before splitting).
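A minimal sketch of the acceptance test, under the assumption that each candidate sub-problem exposes its training error, its clusters and the error of the configuration before splitting (all names here are hypothetical):

```python
def accept_split(new_problems, theta_p, theta_s, theta_i):
    """Keep a split of a bi-partition (e.g. obtained with 2-means) only if
    every newly created sub-problem satisfies the three thresholds."""
    for prob in new_problems:
        if prob.error > theta_p:                          # training error too high
            return False
        if min(len(c) for c in prob.clusters) < theta_s:  # a cluster is too small
            return False
        if prob.parent_error - prob.error < theta_i:      # not enough improvement
            return False
    return True
```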
1.3 Loss Weighted Decoding Algorithm
In the decoding process of the sub-ECOC approach we use the Loss Weighted Decoding algorithm [7]. As already mentioned, the 0 symbol in the coding matrix allows us to increase the number of binary problems created and, as a result, the number of different binary classifiers to be trained. Standard decoding techniques, such as the Euclidean or the Hamming distance, do not consider this third symbol and often produce non-robust results. So, in order to solve the problems produced by the standard decoding algorithms, loss-weighted decoding was proposed. The main objective is to define a weighting matrix MW that weights a loss function in order to adjust the decisions of the classifiers. In order to obtain the matrix MW, a hypothesis matrix H is constructed first. The elements H(i, j) of this matrix are continuous values that correspond to the accuracy of the binary classifier hj in classifying the samples of class i. The matrix H has zero values in the positions which correspond to unconsidered classes, since these positions do not contain any representative information. The next step is the normalization of the rows of matrix H. This is done so that the matrix MW can be considered as a discrete probability density function, which is important since we assume that the probability of considering each class for the final classification is the same. Finally, we decode by computing, for each class, the sum of the losses between the binary classifiers' outputs and the corresponding row of the coding matrix M, weighted by the weighting matrix MW, and we assign the test sample to the class that attains the minimum decoding value.
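The following Python sketch illustrates this procedure under our own assumptions: a linear loss L(t) = −t and row-normalisation of H into MW. It is an illustration rather than the exact implementation used by the authors.

```python
import numpy as np

def loss_weighted_decode(M, H, outputs):
    """M: N x n coding matrix in {-1, 0, +1}; H: accuracy of each binary
    classifier on each class (zero where a class is not considered);
    outputs: the n classifier decisions for one test sample."""
    MW = H / H.sum(axis=1, keepdims=True)   # row-normalise H into a discrete pdf
    losses = -(M * outputs)                 # linear loss of every codeword bit
    decoding = (MW * losses).sum(axis=1)    # weighted decoding value per class
    return int(np.argmin(decoding))
```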
1.4 Sequential Forward Floating Search
The floating search methods are a family of suboptimal sequential search methods that were developed as an alternative to the more computationally costly exhaustive search methods. These methods allow the search criterion to be non-monotonic. They are also able to counteract the nesting effect by considering conditional inclusion and exclusion of features, controlled by the value of the criterion itself. In our approach we use a variation of the Sequential Forward Floating Search (SFFS) [9] algorithm. We modified the algorithm so that it can handle criterion functions evaluated using subsets of classes. We apply a number of backward steps after each forward step, as long as the resulting subsets are better than the previously evaluated ones at that level. Consequently, there are no backward steps at all if the performance cannot be improved. Thus, backtracking in this algorithm is controlled dynamically and, as a consequence, no parameter setting is needed. The SFFS method is described in Algorithm 1.
Algorithm 1. SFFS for Classes
 1: Input: Y = {y_j | j = 1, ..., Nc}   // available classes
 2: Output: disjoint subsets with maximum MI between the features and their class labels:
 3:   X_k = {x_j | j = 1, ..., k, x_j ∈ Y}, k = 0, 1, ..., Nc
 4:   X′_k′ = {x_j | j = 1, ..., k′, x_j ∈ Y}, k′ = 0, 1, ..., Nc
 5: Initialization:
 6:   X_0 := ∅, X′_Nc := Y
 7:   k := 0, k′ := Nc   // k and k′ denote the number of classes in each subset
 8: Termination:
 9:   stop when k = Nc and k′ = 0
10: Step 1 (Inclusion)
11:   x+ := arg max_{x ∈ Y − X_k} J(X_k + x, X′_k′ − x)   // the most significant class with respect to the group {X_k, X′_k′}
12:   X_{k+1} := X_k + x+;  X′_{k′−1} := X′_k′ − x+;  k := k + 1, k′ := k′ − 1
13: Step 2 (Conditional exclusion)
14:   x− := arg max_{x ∈ X_k} J(X_k − x, X′_k′ + x)   // the least significant class with respect to the group {X_k, X′_k′}
15:   if J(X_k − x−, X′_k′ + x−) > J(X_{k−1}, X′_{k′+1}) then
16:     X_{k−1} := X_k − x−;  X′_{k′+1} := X′_k′ + x−;  k := k − 1, k′ := k′ + 1
17:     go to Step 2
18:   else
19:     go to Step 1
20:   end if
1.5 Fast Quadratic Mutual Information (FQMI)
Consider two random vectors x_1 and x_2 and let p(x_1) and p(x_2) be their probability density functions, respectively. Then the MI of x_1 and x_2 can be regarded as a measure of the dependence between them and is defined as follows:

I(x_1, x_2) = \int\!\!\int p(x_1, x_2) \log \frac{p(x_1, x_2)}{p(x_1)\,p(x_2)} \, dx_1 \, dx_2    (1)

Note that when the random vectors x_1 and x_2 are stochastically independent, it holds that p(x_1, x_2) = p(x_1)p(x_2). It is of great importance to mention that (1) can be interpreted as a Kullback–Leibler divergence, defined as follows:

K(f_1, f_2) = \int f_1(x) \log \frac{f_1(x)}{f_2(x)} \, dx    (2)

where f_1(x) = p(x_1, x_2) and f_2(x) = p(x_1)p(x_2). According to Kapur and Kesavan [10], if we seek the distribution that maximizes or, alternatively, minimizes the divergence, several axioms can be relaxed and it can be proven that K(f_1, f_2) is analogically related to D(f_1, f_2) = \int (f_1(x) − f_2(x))^2 \, dx. Consequently, maximization of K(f_1, f_2) leads to maximization of D(f_1, f_2) and vice versa. Considering the above, we can define the quadratic mutual information as follows:

I_Q(x_1, x_2) = \int\!\!\int (p(x_1, x_2) − p(x_1)p(x_2))^2 \, dx_1 \, dx_2    (3)
Using Parzen window estimators we can estimate the probability density functions in (3), and in combination with Gaussian kernels the following property is applicable. Let N(x, Σ) be an n-dimensional Gaussian function; it can be shown that

\int N(x − a_1, Σ_1) \, N(x − a_2, Σ_2) \, dx = N(a_1 − a_2, Σ_1 + Σ_2)    (4)

and by the use of this property we avoid one integration. In our case, we calculate the amount of mutual information between the random vector x of the features and the discrete random variable y associated with the class labels created for a given partition. The practical implementation of this computation is defined as follows. Let N be the number of pattern samples in the entire data set, J_p the number of samples of class p, Nc the number of classes in the entire data set, x_l the l-th feature vector of the data set, and x_{pj} the j-th feature vector of class p. Consequently, p(y = y_p), p(x | y = y_p) and p(x), where 1 ≤ p ≤ Nc, can be written as:

p(y = y_p) = \frac{J_p}{N},

p(x | y = y_p) = \frac{1}{J_p} \sum_{j=1}^{J_p} N(x − x_{pj}, σ^2 I),

p(x) = \frac{1}{N} \sum_{j=1}^{N} N(x − x_j, σ^2 I).
By expanding (3), while using a Parzen estimator with a symmetrical kernel of width σ, we get the following equation:

I_Q(x, y) = V_{IN} + V_{ALL} − 2 V_{BTW},    (5)

where

V_{IN} = \sum_y \int_x p(x, y)^2 \, dx = \frac{1}{N^2} \sum_{p=1}^{N_c} \sum_{l=1}^{J_p} \sum_{k=1}^{J_p} N(x_{pl} − x_{pk}, 2σ^2 I),    (6)

V_{ALL} = \sum_y \int_x p(x)^2 p(y)^2 \, dx = \frac{1}{N^2} \sum_{p=1}^{N_c} \left( \frac{J_p}{N} \right)^2 \sum_{l=1}^{N} \sum_{k=1}^{N} N(x_l − x_k, 2σ^2 I),    (7)

V_{BTW} = \sum_y \int_x p(x, y)\,p(x)\,p(y) \, dx = \frac{1}{N^2} \sum_{p=1}^{N_c} \frac{J_p}{N} \sum_{l=1}^{N} \sum_{k=1}^{J_p} N(x_l − x_{pk}, 2σ^2 I).    (8)
The computational complexity of (5) comprises the computational complexities of (6)–(8) and is given in Table 1. Furthermore, it is known that FQMI requires many samples to be accurately computed by Parzen window estimation. Thus, we can assume that when the number of samples N is much greater than their dimensionality, that is, N >> d, the complexity of V_{ALL}, which is quadratic with respect to N, dominates in equation (5).
Table 1. Computational complexity of the terms V_{IN}, V_{ALL}, V_{BTW} [Nc = number of classes, N = number of samples, J_p = number of samples in class p, d = sample dimension]

FQMI term | Computational complexity
V_{IN}    | O(Nc · J_p^2 · d^2)
V_{ALL}   | O(Nc · N^2 · d^2)
V_{BTW}   | O(Nc · N · J_p^2 · d^2)
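A direct NumPy sketch of equations (5)–(8) (our own illustration; the function and variable names are ours) makes the quadratic cost in N explicit.

```python
import numpy as np

def fqmi(X, y, sigma):
    """Quadratic mutual information I_Q(x, y) of eq. (5), computed from
    eqs. (6)-(8) with spherical Gaussian Parzen kernels of width sigma.
    X is an (N, d) sample matrix and y the vector of class labels."""
    N, d = X.shape
    classes = np.unique(y)
    # pairwise Gaussian kernel values N(x_l - x_k, 2*sigma^2*I)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    G = np.exp(-sq / (4 * sigma ** 2)) / ((4 * np.pi * sigma ** 2) ** (d / 2))
    v_in = sum(G[np.ix_(y == c, y == c)].sum() for c in classes) / N ** 2
    v_all = G.sum() * sum(((y == c).sum() / N) ** 2 for c in classes) / N ** 2
    v_btw = sum(((y == c).sum() / N) * G[:, y == c].sum() for c in classes) / N ** 2
    return v_in + v_all - 2 * v_btw
```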
2 Separation Criteria
The standard separation criterion for use in the SFFS algorithm, as proposed by Escalera et al. [4], is the maximization of the mutual information between the two created bi-partitions of classes and their respective class labels. That is, in each iteration of the SFFS algorithm two partitions of classes are constructed, with labels {−1, +1} respectively. As already mentioned, this procedure is computationally costly because the FQMI computation in each step of SFFS is applied to all the samples of the considered bi-partitions. We can reduce the computational cost if we avoid the computation of FQMI for both bi-partitions and apply it only to one of them in each step of SFFS. As can be seen in Table 1, another possibility is to avoid computing the term V_{ALL}, which is of quadratic complexity with respect to the number of samples N. By discarding the computation of the V_{ALL} term in the FQMI procedure and considering a Fisher-like ratio with the remaining terms V_{IN} and V_{BTW}, which are of lower complexity, we can reduce the running time significantly. Finally, we can further reduce the running time if, in the Fisher-like ratio mentioned, we consider only a representative subset of the classes' samples. Based on these ideas we propose three different variations of the standard criterion, {C1, C2, C3}, which are outlined below:
– Criterion C1: In criterion C1 we apply the standard FQMI computation only to the current subset of classes examined by SFFS in each iteration step. That is, we do not consider in the computation the remaining set of classes that do not belong to the current subset. In this case our goal is to minimize the above measure. In particular, the criterion J(X, X′) in lines 11, 14 and 15 of the SFFS algorithm reduces to the criterion J(X). Here, FQMI is evaluated between the subset X and the original class labels of the samples that constitute it. The computational complexity of this variation remains quadratic with respect to the number of samples of the group in which FQMI is evaluated. The evaluation, though, is done using much less data and consequently the running time is lower than in the original approach.
– Criterion C2: In criterion C2 we consider the maximization of the ratio

C_2 = \frac{V_{IN}}{V_{BTW}}
where V_{IN} and V_{BTW} are computed as in equations (6) and (8). Here we omit the costly computation of the quantity V_{ALL}. The resulting computational complexity, as can be seen from Table 1, is quadratic in the number of samples J_p of each binary group, p ∈ {−1, +1}.
– Criterion C3: The computational cost of FQMI is mostly attributed to the number of samples N. Thus, if we reduce the number of samples we can achieve a drastic reduction of the computational complexity. To this end we can represent each class by only one sample. This sample can be a location estimator such as the mean or the median. We propose the use of the mean vector as the only representative of each class, and criterion C2 then reduces to the minimization of V_{BTW}, which in this case is given by:

V_{BTW} = \frac{1}{N_c^2} \sum_{i=1}^{N_c} \sum_{j=1}^{N_c} N(\tilde{x}_i − \tilde{x}_j, 2σ^2 I)

where \tilde{x}_i is the mean vector of class i. The new variation has quadratic complexity with respect to the number of classes Nc of the bi-partition, since the computation of the mean vectors takes linear time with respect to the number of samples in each class J_p.
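A short sketch of criterion C3 under these definitions (our own code; the kernel normalisation is assumed to be the standard spherical Gaussian one):

```python
import numpy as np

def criterion_c3(X, y, sigma):
    """Criterion C3 (to be minimised): V_BTW evaluated on the class mean
    vectors only, so the cost is quadratic in the number of classes
    rather than in the number of samples."""
    means = np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])
    nc, d = means.shape
    sq = ((means[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    G = np.exp(-sq / (4 * sigma ** 2)) / ((4 * np.pi * sigma ** 2) ** (d / 2))
    return G.sum() / nc ** 2
```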
3 Experimental Results
Datasets. We compared the proposed criteria using eight datasets of the UCI Machine Learning Repository. The characteristics of each dataset are given in Table 2. All the features of each dataset were scaled to the interval [−1, +1]. To evaluate the test error in the different experiments, we used 10-fold cross-validation. Sub-class ECOC configuration. The set of parameters θ = {θp, θs, θi} in the subclass approach was fixed for each dataset to the following values:
– θp = 0%, i.e. split the classes if the classifier does not attain zero training error.
– θs = |J|/50, the minimum number of samples in each constructed cluster, where |J| is the number of features in each dataset.
– θi = 5%, the required improvement of the newly constructed binary problems after splitting.
Furthermore, as a clustering method we used the K-means algorithm with the number of clusters K = 2. As stated by Escalera et al. [4], the K-means algorithm obtains results similar to those of more sophisticated clustering algorithms, such as hierarchical and graph-cut clustering, but with much lower computational cost. In Tables 3 and 4 we present the results of our experiments on the UCI datasets using the DECOC and sub-ECOC approaches. Each column reports the corresponding 10-fold cross-validation performance and, in the case of the sub-ECOC method, the mean number of rows × mean number of columns of the encoding matrices formed in each fold.
Table 2. UCI Machine Learning Repository data sets characteristics

Database | Samples | Attributes | Classes
Iris     | 150  | 4  | 3
Ecoli    | 336  | 8  | 8
Wine     | 178  | 13 | 3
Glass    | 214  | 9  | 7
Thyroid  | 215  | 5  | 3
Vowel    | 990  | 10 | 11
Balance  | 625  | 4  | 3
Yeast    | 1484 | 8  | 10
Table 3. UCI Repository Experiments for linear SVM, C = 100

Database | FQMI ECOC | FQMI sub-ECOC | Criterion 1 ECOC | Criterion 1 sub-ECOC | Criterion 2 ECOC | Criterion 2 sub-ECOC | Criterion 3 ECOC | Criterion 3 sub-ECOC
Iris    | 97.33% | 97.33% (3.3×2.3)   | 97.33% | 97.33% (3.3×2.3)   | 97.33% | 97.33% (3.3×2.3)  | 97.33% | 97.33% (3.3×2.3)
Ecoli   | 82.98% | 80.71% (10.2×10.6) | 84.85% | 84.85% (8.2×7.2)   | 78.21% | 78.21% (8×7)      | 83.01% | 80.63% (8.4×7.6)
Wine    | 96.07% | 96.07% (3×2)       | 96.07% | 96.07% (3×2)       | 96.73% | 96.73% (3×2)      | 96.07% | 96.07% (3×2)
Glass   | 63.16% | 66.01% (13×14.3)   | 60.58% | 63.64% (7.1×6.1)   | 61.07% | 59.78% (7×6)      | 60.97% | 62.85% (9.4×8.8)
Thyroid | 96.77% | 96.77% (3.3×2.6)   | 96.77% | 96.77% (6×7.1)     | 90.26% | 94.89% (5.9×7.6)  | 96.77% | 96.77% (3×2)
Vowel   | 73.94% | 77.47% (27.2×29)   | 50.91% | 52.73% (18.1×16.9) | 46.26% | 45.35% (15.1×14)  | 72.73% | 86.57% (23.1×22)
Balance | 91.7%  | 83.56% (54.3×64.6) | 91.7%  | 89.31% (26.4×27)   | 91.7%  | 75.71% (416×508)  | 91.7%  | 88.65% (9.5×8.4)
Yeast   | 56.6%  | 53.49% (29.5×36.7) | 39.36% | 39.36% (10×9)      | 42.37% | 42.63% (10.2×9.2) | 47.18% | 36.23% (15.7×17)
SVM configuration. As the standard classifier for our experiments we used the LIBSVM [11] implementation of the Support Vector Machine with linear and RBF kernels. For both the linear and the RBF SVM we fixed the cost parameter C to 100, and for the RBF SVM we fixed the σ parameter to 1.

Table 4. UCI Repository Experiments for RBF SVM, C = 100, σ = 1

Database | FQMI ECOC | FQMI sub-ECOC | Criterion 1 ECOC | Criterion 1 sub-ECOC | Criterion 2 ECOC | Criterion 2 sub-ECOC | Criterion 3 ECOC | Criterion 3 sub-ECOC
Iris    | 96%    | 96% (3×2)          | 96%    | 96% (3×2)        | 96%    | 96% (3×2)         | 96%    | 96% (3×2)
Ecoli   | 82.83% | 82.56% (13.1×16)   | 85.10% | 85.13% (8.6×7.6) | 84.08% | 84.08% (8.1×7.1)  | 85.04% | 85.04% (8.1×7.1)
Wine    | 97.74% | 97.74% (3×2)       | 97.74% | 97.74% (3×2)     | 97.18% | 97.18% (3×2)      | 97.74% | 97.74% (3×2)
Glass   | 69.39% | 70.78% (7.9×7.6)   | 69.39% | 69.39% (6×5)     | 64.77% | 64.77% (6×5)      | 68.48% | 68.48% (6×5)
Thyroid | 95.35% | 95.35% (3.2×2.4)   | 95.35% | 95.82% (3.8×3.4) | 97.21% | 95.32% (5×5.4)    | 95.35% | 95.35% (3×2)
Vowel   | 99.09% | 99.09% (11×10)     | 99.09% | 99.09% (11×10)   | 98.59% | 98.59% (11×10)    | 98.99% | 98.99% (11×10)
Balance | 95.04% | 95.04% (3×2)       | 95.04% | 95.04% (3×2)     | 95.51% | 95.51% (3×2)      | 95.04% | 95.04% (3×2)
Yeast   | 58.6%  | 55.44% (27.3×33.4) | 56.66% | 56.66% (10×9)    | 54.95% | 52.75% (10.5×9.5) | 56.18% | 52.04% (20.7×22.1)
From the experiments it is obvious that the proposed criteria attain performance similar to that of the FQMI criterion in most cases, whereas in terms of computational speed we found that, for the tested databases, criteria C1 and C2 run approximately 4 times faster and criterion C3 runs approximately 100 times faster. Moreover, FQMI cannot be applied to databases having a great number of
samples. However, the proposed criterion C3 can be used in very large databases arising in applications such as Data Mining.
4 Conclusion
Although FQMI is a quite accurate method for modeling the MI between classes, its computational complexity makes it impractical for real-life classification problems. FQMI's inability to handle large datasets also makes the ECOC and sub-ECOC methods built on it impractical. As illustrated in this paper, we can substitute FQMI with other MI measures of lower computational complexity and attain similar, or in quite a few cases even better, classification results. The proposed novel MI measures make the ECOC and sub-ECOC methods applicable to large real-life datasets.
References
1. Allwein, E.L., Schapire, R.E., Singer, Y.: Reducing multi-class to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research 1, 113–141 (2002)
2. Dietterich, T.G., Bakiri, G.: Solving multi-class learning problems via error-correcting output codes. Journal of Machine Learning Research 2, 263–282 (1995)
3. Kong, E., Dietterich, T.: Error-correcting output coding corrects bias and variance. In: Proc. 12th Int'l Conf. Machine Learning, pp. 313–321 (1995)
4. Escalera, S., Tax, D.M., Pujol, O., Radeva, P., Duin, R.P.: Subclass problem-dependent design for error-correcting output codes. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(6), 1041–1054 (2008)
5. Torkkola, K.: Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research 3, 1415–1438 (2003)
6. Asuncion, A., Newman, D.: UCI machine learning repository (2007)
7. Escalera, S., Pujol, O., Radeva, P.: Loss-weighted decoding for error-correcting output coding. In: Proc. Int'l Conf. Computer Vision Theory and Applications, vol. 2, pp. 117–122 (June 2008)
8. Pujol, O., Radeva, P., Vitria, J.: Discriminant ECOC: A heuristic method for application dependent design of error correcting output codes. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 1001–1007 (2006)
9. Pudil, P., Ferri, F., Novovicova, J., Kittler, J.: Floating search methods for feature selection with non-monotonic criterion functions. In: Proc. Int'l Conf. Pattern Recognition, vol. 3, pp. 279–283 (March 1994)
10. Kapur, J., Kesavan, H.: Entropy Optimization Principles with Applications (1992)
11. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001)
Conflict Directed Variable Selection Strategies for Constraint Satisfaction Problems

Thanasis Balafoutis and Kostas Stergiou
Department of Information and Communication Systems Engineering, University of the Aegean, Samos, Greece
{abalafoutis,konsterg}@aegean.gr
Abstract. It is well known that the order in which variables are instantiated by a backtracking search algorithm can make an enormous difference to the search effort in solving CSPs. Among the plethora of heuristics that have been proposed in the literature to efficiently order variables during search, a significant recently proposed class uses the learning-from-failure approach. Prime examples of such heuristics are the wdeg and dom/wdeg heuristics of Boussemart et al., which store and exploit information about failures in the form of constraint weights. The efficiency of all the proposed conflict-directed heuristics is due to their ability to learn through conflicts encountered during search. As a result, they can guide search towards hard parts of the problem and identify contentious constraints. Such heuristics are now considered the most efficient general-purpose variable ordering heuristics for CSPs. In this paper we show how information about constraint weights can be used in order to create several new variants of the wdeg and dom/wdeg heuristics. The proposed conflict-driven variable ordering heuristics have been tested over a wide range of benchmarks. Experimental results show that they are quite competitive compared to existing ones and in some cases they can increase efficiency. Keywords: Constraint Satisfaction, Variable Ordering Heuristics, Search.
1 Introduction
Constraint satisfaction problems (CSPs) and propositional satisfiability (SAT) are two automated reasoning technologies that have a lot in common regarding the approaches and algorithms they use for solving combinatorial problems. Most complete algorithms from both paradigms use constraint propagation methods together with variable ordering heuristics to improve search efficiency. Learning from failure has become a key component in solving combinatorial problems in the SAT community, through literal learning and weighting, e.g. as implemented in the Chaff solver [7]. This approach is based on learning new literals through conflict analysis and assigning weights to literals based on the number of times they cause a failure during search. This information can then be exploited by
the variable ordering heuristic to efficiently choose the variable to assign at each choice point. In the CSP community, learning from failure has followed a similar direction in recent years, in particular with respect to novel variable ordering heuristics. Boussemart et al. were the first to introduce SAT-influenced heuristics that learn from conflicts encountered during search [3]. In their approach, constraint weights are used as a metric to guide the variable ordering heuristic towards hard parts of the problem. Constraint weights are continuously updated during search using information learned from failures. The advantage of these heuristics is that they use previous search states as guidance, while most formerly proposed heuristics use either the initial or the current state. The heuristics of [3], called wdeg and dom/wdeg, are now probably considered the most efficient general-purpose variable ordering heuristics for CSPs. Subsequently, a number of alternative heuristics based on learning during search were proposed [8,4,6]. As discussed by Grimes and Wallace, heuristics based on constraint weights can be conceived in terms of an overall strategy that, apart from the standard Fail-First Principle, also obeys the Contention Principle, which states that variables directly related to conflicts are more likely to cause a failure if they are chosen instead of other variables [6]. In this paper we focus on conflict-driven variable ordering heuristics based on constraint weights. We concentrate on an investigation of new general-purpose variants of conflict-driven heuristics. These variants differ from wdeg and dom/wdeg in the way they assign weights to constraints. First we propose three new variants of the wdeg and dom/wdeg heuristics that record the constraint that is responsible for any value deletion during search. These heuristics then exploit this information to update constraint weights upon detection of failure. We also examine a SAT-influenced weight aging strategy that gives greater importance to recent conflicts. Finally, we propose a new heuristic that tries to better identify contentious constraints by detecting all the possible conflicts after a failure. Experimental results from various random, academic and real-world problems show that some of the proposed heuristics are quite competitive compared to existing ones and in some cases they can increase efficiency. The rest of the paper is organized as follows. Section 2 gives the necessary background material and an overview of the existing conflict-driven variable ordering heuristics. In Section 3 we propose several new general-purpose variants of conflict-driven variable ordering heuristics. In Section 4 we experimentally compare the proposed heuristics to dom/wdeg on a variety of real, academic and random problems. Finally, conclusions are presented in Section 5.
2 Background
A Constraint Satisfaction Problem (CSP) is a tuple (X, D, C ), where X is a set containing n variables {x1 , x2 , ..., xn }; D is a set of domains {D(x1 ), D(x2 ),..., D(xn )} for those variables, with each D(xi ) consisting of the possible values which xi may take; and C is a set of constraints {c1 , c2 , ..., ck } between variables
in subsets of X. Each ci ∈ C expresses a relation defining which variable assignment combinations are allowed for the variables vars(ci) in the scope of the constraint. Two variables are said to be neighbors if they share a constraint. The arity of a constraint is the number of variables in the scope of the constraint. The degree of a variable xi, denoted by Γ(xi), is the number of constraints in which xi participates. A binary constraint between variables xi and xj will be denoted by cij. A partial assignment is a set of tuple pairs, each tuple consisting of an instantiated variable and the value that is assigned to it in the current search state. A full assignment is one containing all n variables. A solution to a CSP is a full assignment such that no constraint is violated. An arc is a pair (c, xi) where xi ∈ vars(c). Any arc (cij, xi) will alternatively be denoted by the pair of variables (xi, xj), where xj ∈ vars(cij). That is, xj is the other variable involved in cij. An arc (xi, xj) is arc consistent (AC) iff for every value a ∈ D(xi) there exists at least one value b ∈ D(xj) such that the pair (a, b) satisfies cij. In this case we say that b is a support of a on arc (xi, xj). Accordingly, a is a support of b on arc (xj, xi). A problem is AC iff there are no empty domains and all arcs are AC. The application of AC on a problem results in the removal of all non-supported values from the domains of the variables. The definition of arc consistency for non-binary constraints, usually called generalized arc consistency (GAC), is a direct extension of the definition of AC. A support check (consistency check) is a test to find out if two values support each other. The revision of an arc (xi, xj) using AC verifies whether all values in D(xi) have supports in D(xj). A domain wipeout (DWO) revision is one that causes a DWO, that is, it results in an empty domain. In the following we will use MAC (maintaining arc consistency) [9,1] as our search algorithm. In MAC the problem is made arc consistent after every assignment, i.e. all values which are arc inconsistent given that assignment are removed from the current domains of their variables. If during this process a DWO occurs, then the last value selected is removed from the current domain of its variable and a new value is assigned to the variable. If no new value exists then the algorithm backtracks.
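To make the revision step concrete, here is a minimal sketch of the revise operation used by a coarse-grained AC algorithm such as AC-3; this is our own illustration, and the data structures (dictionaries of domains and allowed tuples) are assumptions rather than the solver actually used later in the paper.

```python
def revise(xi, xj, domains, allowed):
    """Remove from D(xi) every value with no support in D(xj).
    allowed[(xi, xj)] is assumed to hold the set of value pairs that
    satisfy constraint c_ij. Returns True if D(xi) was changed."""
    removed = False
    for a in list(domains[xi]):
        if not any((a, b) in allowed[(xi, xj)] for b in domains[xj]):
            domains[xi].remove(a)   # a has no support on arc (xi, xj)
            removed = True
    return removed
```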
2.1 Overview of Existing Conflict-Driven Variable Ordering Heuristics
The order in which variables are assigned by a backtracking search algorithm has been understood for a long time to be of primary importance. A variable ordering can be either static, where the ordering is fixed and determined prior to search, or dynamic, where the ordering is determined as the search progresses. Dynamic variable orderings are considerably more efficient and have thus received much attention in the literature. One common dynamic variable ordering strategy, known as “fail-first”, is to select as the next variable the one likely to fail as quickly as possible. All other factors being equal, the variable with the smallest number of viable values in its (current) domain will have the fewest subtrees
rooted at those values and therefore, if none of these contains a solution, the search can quickly return to a path that leads to a solution. Recent years have seen the emergence of numerous modern heuristics for choosing variables during CSP search. The so-called conflict-driven heuristics exploit information about failures gathered throughout search and recorded in the form of constraint weights. Boussemart et al. [3] proposed the first conflict-directed variable ordering heuristics. In these heuristics, every time a constraint causes a failure (i.e. a domain wipeout) during search, its weight is incremented by one. Each variable has a weighted degree, which is the sum of the weights over all constraints in which this variable participates. The weighted degree heuristic (wdeg) selects the variable with the largest weighted degree. The current domain of the variable can also be incorporated to give the domain-over-weighted-degree heuristic (dom/wdeg), which selects the variable with the minimum ratio between current domain size and weighted degree (a simple sketch of this selection rule is given after the list below). Both of these heuristics (especially dom/wdeg) have been shown to be extremely effective on a wide range of problems. Grimes and Wallace [6] proposed alternative conflict-driven heuristics that consider value deletions as the basic propagation events associated with constraint weights. That is, the weight of a constraint is incremented each time the constraint causes one or more value deletions. They also used a sampling technique called random probing with which they can uncover cases of global contention, i.e. contention that holds across the entire search space. The three heuristics of [6] work as follows:
1. constraint weights are increased by the size of the domain reduction leading to a DWO;
2. whenever a domain is reduced in size during constraint propagation, the weight of the constraint involved is incremented by 1;
3. whenever a domain is reduced in size, the constraint weights are increased by the size of the domain reduction (allDel heuristic).
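The dom/wdeg selection rule mentioned above can be sketched as follows (our own Python illustration; the data structures are hypothetical):

```python
def select_dom_wdeg(unassigned, domains, weight, constraints_of):
    """Pick the unassigned variable minimising |current domain| divided by
    its weighted degree, i.e. the sum of the weights of the constraints
    in which the variable participates."""
    def wdeg(x):
        return sum(weight[c] for c in constraints_of[x]) or 1
    return min(unassigned, key=lambda x: len(domains[x]) / wdeg(x))
```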
3 Heuristics Based on Weighting Constraints
As stated in the previous section, the wdeg and dom/wdeg heuristics associate a counter, called weight, with each constraint of a problem. These counters are updated during search whenever a DWO occurs. Although experimentally it has been shown that these heuristics are extremely effective on a wide range of problems, in theory it seems quite plausible that they may not always assign weights to constraints in an accurate way. To better illustrate our conjecture about the accuracy in assigning weights to constraints, we give the following example. Example 1. Assume we are using MAC-3 (i.e. MAC with AC-3) to solve a CSP (X, D, C) where X includes, among others, the three variables {xi , xj , xk }, all having the same domain {a, b, c, d, e}, and C includes, among others, the two binary constraints cij , cik . Also assume that a conflict-driven variable ordering heuristic (e.g. dom/wdeg) is used, and that at some point during search AC tries
to revise variable xi. That is, it tries to find supports for the values in D(xi) in the constraints where xi participates. Suppose that when xi is revised against cij, values {a, b, c, d} are removed from D(xi) (i.e. they do not have a support in D(xj)). Also suppose that when xi is revised against cik, value {e} is removed from D(xi) and hence a DWO occurs. Then, the dom/wdeg heuristic will increase the weight of constraint cik by one but it will not change the weight of cij. It is obvious from this example that although constraint cij removes more values from D(xi) than cik, its important indirect contribution to the DWO is ignored by the heuristic. A second point regarding potential inefficiencies of wdeg and dom/wdeg has to do with the order in which revisions are made by the AC algorithm used. Coarse-grained AC algorithms, like AC-3, use a revision list to propagate the effects of variable assignments. It has been shown that the order in which the elements of the list are selected for revision affects the overall cost of search. Hence a number of revision ordering heuristics have been proposed [10,2]. In general, revision ordering and variable ordering heuristics have different tasks to perform when used in a search algorithm like MAC. Before the appearance of conflict-driven heuristics there was no way for the two to interact, i.e. the order in which the revision list was organized during the application of AC could not affect the decision of which variable to select next (and vice versa). The contribution of revision ordering heuristics to the solver's efficiency was limited to the reduction of list operations and constraint checks. However, when a conflict-driven variable ordering heuristic like dom/wdeg is used, there are cases where the decision of which arc (or variable) to revise first can affect the variable selection. To better illustrate this interaction we give the following example. Example 2. Assume that we want to solve a CSP (X, D, C) using a conflict-driven variable ordering heuristic (e.g. dom/wdeg), and that at some point during search the following AC revision list is formed: Q = {(x1), (x3), (x5)}. Suppose that revising x1 against constraint c12 leads to the DWO of D(x1), i.e. the remaining values of x1 have no support in D(x2). Suppose also that the revision of x5 against constraint c56 leads to the DWO of D(x5), i.e. the remaining values of x5 have no support in D(x6). Depending on the order in which revisions are performed, one or the other of the two possible DWOs will be detected. If a revision ordering heuristic R1 selects x1 first then the DWO of D(x1) will be detected and the weight of constraint c12 will be increased by 1. If some other revision ordering heuristic R2 selects x5 first then the DWO of D(x5) will be detected, but this time the weight of a different constraint (c56) will be increased by 1. Although the revision list includes two variables (x1, x5) that can cause a DWO, and consequently two constraint weights can be increased (c12, c56), dom/wdeg will increase the weight of only one constraint, depending on the choice of the revision heuristic. Since constraint weights affect the choices of the variable ordering heuristic, R1 and R2 can lead to different future decisions for variable instantiation. Thus, R1 and R2 may guide search to different parts of the search space.
From the above example it becomes clear that known heuristics based on constraint weights are quite sensitive to revision orderings and their performance can be affected by them. In order to overcome the above-described weaknesses that the weighted degree heuristics seem to have, we next describe a number of new variable ordering heuristics which can be seen as variants of wdeg and dom/wdeg. All the proposed heuristics are lightweight, as they affect the overall complexity only by a constant factor.
3.1 Constraints Responsible for Value Deletions
The first enhancement to wdeg and dom/wdeg tries to alleviate the problem illustrated in Example 1. To achieve this, we propose to record the constraint which is responsible for each value deletion from any variable in the problem. In this way, once a DWO occurs during search we know which constraints have, not only directly but also indirectly, contributed to the DWO. Based on this idea, when a DWO occurs in a variable xi, constraint weights can be updated in the following three alternative ways:
– Heuristic H1: for every constraint that is responsible for any value deletion from D(xi), we increase its weight by one.
– Heuristic H2: for every constraint that is responsible for any value deletion from D(xi), we increase its weight by the number of value deletions.
– Heuristic H3: for every constraint that is responsible for any value deletion from D(xi), we increase its weight by the normalized number of value deletions, that is, by the ratio between the number of value deletions and the size of D(xi).
The way in which the new heuristics update constraint weights is displayed in the following example. Example 3. Assume that when solving a CSP (X, D, C), the domain of some variable, e.g. x1, is wiped out. Suppose that D(x1) initially was {a, b, c, d, e} and each of the values was deleted because of constraints {c12, c12, c13, c12, c13} respectively. The proposed heuristics will assign constraint weights as follows: H1 (weightH1[c12] = weightH1[c13] = 1), H2 (weightH2[c12] = 3, weightH2[c13] = 2) and H3 (weightH3[c12] = 3/5, weightH3[c13] = 2/5). Heuristics H1, H2, H3 are closely related to the three heuristics proposed by Grimes and Wallace [6]. The last two heuristics in [6] record constraints responsible for value deletions and use this information to increase weights. However, in [6] the weights are increased during constraint propagation on each value deletion for all variables. Our proposed heuristics differ by increasing constraint weights only when a DWO occurs. As discussed in [6], DWOs seem to be particularly important events in helping identify hard parts of the problem. Hence we focus on information derived from DWOs and not just any value deletion.
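A minimal sketch of these three updates, under the assumption that propagation records, for the wiped-out variable, how many values each constraint deleted (all names here are ours):

```python
def update_weights_on_dwo(deletions, weight, variant="H1"):
    """`deletions` maps each constraint to the number of values it removed
    from the wiped-out domain D(x_i); it is consulted only when the DWO
    is detected."""
    domain_size = sum(deletions.values())      # all of D(x_i) has been deleted
    for c, ndel in deletions.items():
        if variant == "H1":
            weight[c] += 1                     # one unit per contributing constraint
        elif variant == "H2":
            weight[c] += ndel                  # proportional to its deletions
        else:                                  # H3: normalised by |D(x_i)|
            weight[c] += ndel / domain_size
```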
3.2 Constraint Weight Aging
Most clause-learning SAT solvers, like BerkMin [5] and Chaff [7], use a weight "aging" strategy. In such solvers, each variable is assigned a counter that stores the number of clauses responsible for at least one conflict. The value of this counter is updated during search. As soon as a new clause responsible for the current conflict is derived, the counters of the variables whose literals are in this clause are incremented by one. The values of all counters are periodically divided by a small constant greater than 1. This constant is equal to 2 for Chaff and 4 for BerkMin. In this way, the influence of "aged" clauses is decreased and preference is given to recently deduced clauses. Inspired by SAT solvers, we propose here to periodically "age" constraint weights. As in SAT, constraint weights can be "aged" by periodically dividing their current value by a constant greater than 1. The period of the divisions can be set according to a specified number of backtracks during search. With such a strategy we give greater importance to recently discovered conflicts. The following example illustrates the improvement that weight "aging" can contribute to the solver's performance. Example 4. Assume that in a CSP (X, D, C) with D = {0,1,2}, we have a ternary constraint c123 ∈ C over variables x1, x2, x3 with disallowed tuples {(0,0,0), (0,0,1), (0,1,1), (0,2,2)}. When variable x1 is set to a value different from 0 during search, constraint c123 is not involved in a conflict and hence its weight will not increase. However, in a branch that includes the assignment x1 = 0, constraint c123 becomes highly "active" and a possible DWO in variable x2 or x3 should increase the importance of constraint c123 (more than a simple increment of its weight by one). We need a mechanism to quickly adapt to changes in the problem caused by a value assignment. This can be done by "aging" the weights of the other, previously active constraints.
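A sketch of this aging scheme is given below; the period and division factor match the ones used in the experiments of Section 4, while the function itself is our own illustration.

```python
def age_weights(weight, backtracks, period=20, factor=2.0):
    """Every `period` backtracks, divide all constraint weights by a
    constant greater than 1 so that recent conflicts dominate older
    ones (here: factor 2 every 20 backtracks)."""
    if backtracks % period == 0:
        for c in weight:
            weight[c] /= factor
```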
3.3 Fully Assigned Weights
When arc consistency is maintained during search using a coarse-grained algorithm like AC-3, a revision list is created after each variable assignment. The variables that have been inserted into the list are removed and revised in turn. We observed that, for the same revision list, different revision ordering heuristics can lead to the DWOs of different variables. To better illustrate this, we give the following example. Example 5. Assume that we use two different revision ordering heuristics R1 and R2 to solve a CSP (X, D, C), and that at some point during search the following AC revision lists are formed for R1 and R2: R1: {X1, X2}, R2: {X2, X1}. We also assume the following: a) The revision of X1 deletes some values from the domain of X1 and causes the addition of variable X3 to the revision list. b) The revision of X2 deletes some values from the domain of X2 and causes the addition of variable X4 to the revision list. c) The revision of X3 deletes some values from the domain of X1.
d) The revision of X4 deletes some values from the domain of X2. e) A DWO occurs after a sequential revision of X3 and X1. f) A DWO occurs after a sequential revision of X4 and X2. Considering the R1 list, the revision of X1 is fruitful and adds X3 to the list (R1: {X3, X1}). The sequential revision of X3 and X1 leads to the DWO of X1. Considering the R2 list, the revision of X2 is fruitful and adds X4 to the list (R2: {X4, X2}). The sequential revision of X4 and X2 leads to the DWO of X2. From the above example it is clear that although only one DWO is identified in a revision list, both X1 and X2 can be responsible for it. In R1, where X1 is the DWO variable, we can say that X2 is also a "potential" DWO variable, i.e. it would be a DWO variable if the R2 revision ordering were used. The question that arises here is: how can we identify the "potential" DWO variables that exist in a revision list? A first observation that can be helpful in answering this question is that "potential" DWO variables are among the variables that participate in fruitful revisions. Based on this observation, we propose here a new conflict-driven variable ordering heuristic that takes into account the "potential" DWO variables. This heuristic increases by one the weights of constraints that are responsible for a DWO (as the wdeg heuristic does) and also, only for revision lists that lead to a DWO, increases by one the weights of constraints that participate in fruitful revisions. Hence, to implement this heuristic we record all constraints whose revisions delete at least one value during the application of AC; if a DWO is detected, we increase the weights of all these constraints. An interesting direction for future work is a more selective identification of "potential" DWO variables.
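A short sketch of this update (the fullyAssigned heuristic of the experiments), with hypothetical bookkeeping structures of our own:

```python
def fully_assigned_update(dwo_constraint, fruitful_constraints, weight):
    """When processing a revision list ends in a DWO: the constraint that
    caused the DWO gets +1 (as in wdeg) and so does every constraint that
    took part in a fruitful revision of this list, i.e. the 'potential'
    DWO candidates."""
    weight[dwo_constraint] += 1
    for c in set(fruitful_constraints):
        if c != dwo_constraint:
            weight[c] += 1
```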
4 Experiments and Results
In this section we experimentally investigate the behavior of the newly proposed variable ordering heuristics on several classes of real, academic and random problems. All benchmarks are taken from C. Lecoutre's web page¹, where the reader can find additional details about the description and the formulation of all the tested benchmarks. We compare the new heuristics with dom/wdeg and allDel. Regarding the heuristics of Section 3.1, we only show results for dom/wdegH1, dom/wdegH2 and dom/wdegH3, denoted as H1, H2 and H3 for simplicity, which are more efficient than the corresponding versions that do not take the domain size into account. In our tests we have used the following measures of performance: CPU time in seconds (t) and number of visited nodes (n). The solver we used applies lexicographic value ordering and employs restarts. Concerning the restart policy, the initial number of allowed backtracks for the first run has been set to 10, and at each new run the number of allowed backtracks increases by a factor of 1.5. Regarding the aging heuristic, we have selected to periodically decrease all constraint weights by a factor of 2, with the period set to 20 backtracks.
¹ http://www.cril.univ-artois.fr/~lecoutre/benchmarks.html
Table 1. Averaged values for CPU times (t) and nodes (n) from 6 different problem classes. Best CPU time is in bold.

Problem class                 |   | dom/wdeg | H1    | H2    | H3    | aged dom/wdeg | fullyAssigned | allDel
RLFAP scensMod (13 instances) | t | 1,9      | 2     | 2,2   | 2,3   | 1,7           | 2,2           | 2,2
                              | n | 734      | 768   | 824   | 873   | 646           | 738           | 809
RLFAP graphMod (12 instances) | t | 9,1      | 5,2   | 6,1   | 5,5   | 12,9          | 13,4          | 11,1
                              | n | 6168     | 3448  | 4111  | 3295  | 8478          | 11108         | 9346
Driver (11 instances)         | t | 22,4     | 7     | 7,8   | 11,6  | 6,4           | 18,8          | 20
                              | n | 10866    | 2986  | 3604  | 5829  | 1654          | 4746          | 4568
Interval Series (10 instances)| t | 34       | 19,4  | 23,4  | 13,3  | 6,5           | 66,4          | 17,4
                              | n | 32091    | 18751 | 23644 | 13334 | 5860          | 74310         | 26127
Golomb Ruler (6 instances)    | t | 274,9    | 321,4 | 173,1 | 143,4 | 342,1         | 208,3         | 154,4
                              | n | 7728     | 10337 | 4480  | 3782  | 7863          | 6815          | 3841
geo50-20-d4-75 (10 instances) | t | 62,8     | 174,1 | 72,1  | 95    | 69            | 57,6          | 76
                              | n | 15087    | 36949 | 16970 | 23562 | 15031         | 12508         | 18094
frb30-15 (10 instances)       | t | 37,3     | 35,1  | 45,8  | 57,2  | 42,3          | 32,9          | 26,1
                              | n | 20176    | 18672 | 24326 | 30027 | 21759         | 17717         | 14608
Our search algorithm is MGAC-3, denoting MAC with GAC-3. Experiments were run on an Intel T4200 @ 2.00 GHz with 3 GB RAM. Table 1 shows results from six different problem classes. The first two classes are from the real-world Radio Link Frequency Assignment Problem (RLFAP). For the scensMod class we ran 13 instances, and in this table we present the averaged values for CPU time and nodes visited. Since these instances are quite easy to solve, all the heuristics have almost the same behavior. The aged version of the dom/wdeg heuristic has a slightly better performance. For the graphMod class we ran 12 instances. Here the heuristics H1, H2 and H3, which record the constraint responsible for each value deletion, display better performance. The third problem class is from another real-world problem, called Driver. On these 11 instances the aged dom/wdeg heuristic has, on average, the best behavior. The next 10 instances are from the non-binary academic problem "All Interval Series" and have a maximum constraint arity of 3. We must note here that the aged dom/wdeg heuristic, which has the best performance, is five times faster than dom/wdeg. This good performance of the aged dom/wdeg heuristic is not consistent across problem classes, as can be seen in the next academic problem class (the well-known Golomb Ruler problem), where the aged dom/wdeg heuristic has the worst performance. The last two classes are the "geo" quasi-random instances (random problems which contain some structure) and the "frb" pure random instances that are forced to be satisfiable. Here, although on average the fullyAssigned and allDel heuristics have the best performance, within each class we observed a large variation in CPU time among all the tested heuristics. A possible explanation for this diversity is the lack of structure of random instances. Finally, we must also note that, interestingly, the dom/wdeg heuristic does not achieve a single win in all the tested experiments. As a general comment, we can say that, experimentally, all the proposed heuristics are competitive with dom/wdeg and in many benchmarks a notable improvement is observed.
5 Conclusions
In this paper several new general-purpose variable ordering heuristics are proposed. These heuristics follow the learning-from-failure approach, in which information regarding failures is stored in the form of constraint weights. By recording the constraints that are responsible for any value deletion, we derive three new heuristics that use this information to spread constraint weights in a different way compared to the heuristics of Boussemart et al. We also explore a SAT-inspired constraint weight aging strategy that gives greater importance to recent conflicts. Finally, we proposed a new heuristic that tries to better identify contentious constraints by recording all the potential conflicts upon detection of failure. The proposed conflict-driven variable ordering heuristics have been tested over a wide range of benchmarks. Experimental results show that they are quite competitive compared to existing ones and in some cases they can increase efficiency.
References
1. Bessière, C., Régin, J.C.: MAC and combined heuristics: two reasons to forsake FC (and CBJ?). In: Freuder, E.C. (ed.) CP 1996. LNCS, vol. 1118, pp. 61–75. Springer, Heidelberg (1996)
2. Boussemart, F., Hemery, F., Lecoutre, C.: Revision ordering heuristics for the Constraint Satisfaction Problem. In: Proceedings of the CP 2004 Workshop on Constraint Propagation and Implementation, Toronto, Canada, pp. 29–43 (2004)
3. Boussemart, F., Hemery, F., Lecoutre, C., Sais, L.: Boosting systematic search by weighting constraints. In: Proceedings of the 16th European Conference on Artificial Intelligence (ECAI 2004), Valencia, Spain, pp. 146–150 (2004)
4. Cambazard, H., Jussien, N.: Identifying and Exploiting Problem Structures Using Explanation-based Constraint Programming. Constraints 11, 295–313 (2006)
5. Goldberg, E., Novikov, Y.: BerkMin: a Fast and Robust Sat-Solver. In: Proceedings of DATE 2002, pp. 142–149 (2002)
6. Grimes, D., Wallace, R.J.: Sampling strategies and variable selection in weighted degree heuristics. In: Bessière, C. (ed.) CP 2007. LNCS, vol. 4741, pp. 831–838. Springer, Heidelberg (2007)
7. Moskewicz, M., Madigan, C., Malik, S.: Chaff: Engineering an efficient SAT solver. In: Proceedings of the Design Automation Conference, pp. 530–535 (2001)
8. Refalo, P.: Impact-based search strategies for constraint programming. In: Wallace, M. (ed.) CP 2004. LNCS, vol. 3258, pp. 556–571. Springer, Heidelberg (2004)
9. Sabin, D., Freuder, E.C.: Contradicting conventional wisdom in constraint satisfaction. In: Proceedings of the 2nd Workshop on Principles and Practice of Constraint Programming (CP 1994), pp. 10–20 (1994)
10. Wallace, R., Freuder, E.: Ordering heuristics for arc consistency algorithms. In: AI/GI/VI, Vancouver, British Columbia, Canada, pp. 163–169 (1992)
A Feasibility Study on Low Level Techniques for Improving Parsing Accuracy for Spanish Using Maltparser

Miguel Ballesteros¹, Jesús Herrera¹, Virginia Francisco², and Pablo Gervás²
¹ Departamento de Ingeniería del Software e Inteligencia Artificial
² Instituto de Tecnología del Conocimiento
Universidad Complutense de Madrid, C/ Profesor José García Santesmases, s/n, E–28040 Madrid, Spain
{miballes,jesus.herrera,virginia}@fdi.ucm.es,
[email protected]
Abstract. In recent years dependency parsing has been carried out by machine learning–based systems showing great accuracy but usually under 90% Labelled Attachment Score (LAS). Maltparser is one such system. Machine learning makes it possible to obtain parsers for every language that has an adequate training corpus. Since such systems generally cannot be modified, the following question arises: can we beat this 90% LAS by using better training corpora? Some previous work points out that high level techniques are not sufficient for building more accurate training corpora. Thus, by analyzing the words that are most frequently incorrectly attached or labelled, we study the feasibility of some low level techniques, based on n-version parsing models, in order to obtain better parsing accuracy.
1 Introduction
In the 10th edition of the Conference of Computational Natural Language Learning (CoNLL) a first Shared Task on Multilingual Dependency Parsing was carried out [1]. Thirteen different languages, including Spanish, were involved. Participants had to implement a parsing system that could be trained for all these languages. Maltparser achieved great results in this task, in which Spanish was proposed for parsing. The goal of the present work was to study the feasibility of low level techniques to obtain better parsing performance when the parsing system (based on machine learning) cannot be modified. A 90% Labelled Attachment Score seems to be a de facto limit for contemporary dependency parsers. Some previous works [2] have addressed how to improve dependency parsing by applying high level techniques to obtain better training corpora. The conclusion of these works is that overall accuracy cannot be enhanced by modifying the training corpus' size or its sentences' lengths. In addition, local accuracy is important too, but it has not been solved yet. N-version parsers could be the way to obtain better overall
accuracies by obtaining better local accuracies. N-version parsers consist of n specifically trained models, each one able to parse one kind or a small range of kinds of sentences. Thus, an n-version parser should select the specific model that would best parse the sentence that is used as input. Each specific model would improve the parsing accuracy of the sentences for which it is specialized, producing a better overall parsing accuracy. After selecting a small number of words that are most frequently incorrectly attached or labelled, we started a thorough analysis of the parses that contained those words. We selected the two most frequently incorrectly attached or labelled words, i.e., the conjunction and ("y" or "e" in Spanish) and the preposition to ("a" in Spanish). These words led us to develop preliminary work on low level techniques useful for reaching better parsing accuracy by improving attachment and labelling. Maltparser 0.4 is the publicly available software of the system presented by Nivre's group at the CoNLL–X Shared Task. Since Spanish was the language for which we decided to develop the present work and we have already carried out some previous work on dependency parsing using Maltparser [3,4,5], we used Maltparser 0.4 to carry out our experiments. The paper is organized as follows: Section 2 describes the CoNLL–X Shared Task, focusing on the Spanish participation; we also show our results when replicating the participation of Nivre's group. Section 3 presents our considerations about local parsing accuracy. Section 4 shows two case studies in which the conjunction and the preposition "a" are used to evaluate the feasibility of low level techniques oriented towards obtaining better local parsing results. Finally, Section 5 shows the conclusions of the presented work and suggests some future work.
2 The CoNLL–X Shared Task
Each year the Conference of Computational Natural Language Learning (CoNLL) features a shared task; the 10th CoNLL Shared Task was multilingual dependency parsing [1]. The goal of this Shared Task was to label dependency structures by means of a fully automatic dependency parser. The task provided a benchmark for evaluating the parsers presented to it across 13 languages, among which is Spanish. Systems were scored by computing their Labelled Attachment Score (LAS), i.e. the percentage of "scoring" tokens for which the system had predicted the correct head and dependency label [6]. Unlabelled Attachment Score (UAS) and Label Accuracy (LA) were also computed. UAS is the percentage of "scoring" tokens for which the system had predicted the correct head [7]. LA is the percentage of "scoring" tokens for which the system had predicted the correct dependency label [8]. Our research is focused on Spanish parsing. For this language, results across the 19 participants ranged from 47.0% to 82.3% LAS, with an average of 73.5%. The Spanish treebank used was AnCora [9,10], a 95,028-wordform corpus containing open-domain texts annotated with their dependency analyses. AnCora
was developed by the Clic group at Barcelona University. The results for Spanish parsing were around the average. The two participant groups with the highest total score for Spanish were McDonald's group [11] (82.3% LAS) and Nivre's group [12] (81.3% LAS). We are especially interested in Nivre's group's research because we used their system (Maltparser 0.4) for the experiments presented in this paper. Other participants that used the Nivre algorithm in the CoNLL–X Shared Task were Johansson's group [13] and Wu's group [14]. Their scores on Spanish parsing were 78.2% (7th place) and 73.2% (13th place), respectively. The evaluation shows that Nivre's approach gives competitive parsing accuracy for the languages studied. More specifically, Spanish parsing scored 81.3% LAS, only 1 point under the best one [11], which did not use the Nivre algorithm but Eisner's bottom–up span algorithm in order to compute maximum spanning trees. In our work, the first step was to replicate the participation of Nivre's group in the CoNLL–X Shared Task for Spanish. We trained Maltparser 0.4 with the section of AnCora that was provided as training corpus in the CoNLL–X Shared Task (89,334 wordforms) and the system was set as described by Nivre's group in [12]. Once a model was obtained, we used it to parse the section of AnCora that was provided as test set in the CoNLL–X Shared Task (5,694 wordforms). We obtained the same results as Nivre's group in the Shared Task, i.e., LAS = 81.30%, UAS = 84.67% and LA = 90.06%. These results serve as a baseline for our work, which is presented in the following sections.
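To make the three scores concrete, the following sketch (ours, not part of the CoNLL–X evaluation software) computes LAS, UAS and LA from aligned gold and predicted (head, label) pairs for the scoring tokens:

    def attachment_scores(gold, predicted):
        """gold, predicted: equal-length lists of (head, label) pairs, one per scoring token."""
        total = len(gold)
        correct_head = correct_label = correct_both = 0
        for (g_head, g_label), (p_head, p_label) in zip(gold, predicted):
            if g_head == p_head:
                correct_head += 1                      # counts towards UAS
            if g_label == p_label:
                correct_label += 1                     # counts towards LA
            if g_head == p_head and g_label == p_label:
                correct_both += 1                      # counts towards LAS
        return (100.0 * correct_both / total,          # LAS
                100.0 * correct_head / total,          # UAS
                100.0 * correct_label / total)         # LA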
3   Local Parsing Accuracy
Considering the baseline experiment described in Section 2, despite a high overall parsing accuracy, only 358 wordforms of the test corpus obtain 100% LAS, UAS and LA in all parsed sentences, i.e., only 6.3% of the wordforms. If we consider sentences, only 38 sentences of the test corpus (18.4% of them) were parsed without errors. An end user would usually expect a high local parsing accuracy (at the sentence level) rather than a high overall parsing accuracy. But nowadays a remarkable percentage of sentences in Spanish shows at least one error when parsed by Maltparser. Our hypothesis is that by enhancing local accuracy, not only should overall accuracy be enhanced, but end user satisfaction should also be increased. We found that there is a small set of words that show an incorrect attachment, labelling or both. These words are the prepositions "a" (to), "de" (of), "en" (in), "con" (with), "por" (for), the conjunction and, which has two wordings, "y" or "e", and the nexus "que" (that). All these words sometimes cause errors in the dependency, in the head tag, or in both tags. For instance, there are only 20 sentences (340 wordforms) in the test corpus presented in Section 2 with only one error after parsing. That is 9.7% of the corpus' sentences and 5.98% of its wordforms. We found that in 10 of these 20 sentences the only failure is caused by one of the words listed above.
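The kind of analysis behind this observation can be reproduced with a simple tally of attachment and labelling errors per wordform; the sketch below assumes tokens are given as (form, gold head, gold label, predicted head, predicted label) tuples (the field layout is ours):

    from collections import Counter

    def most_frequent_error_words(tokens, top=10):
        """Count, per wordform, how often the predicted head or label is wrong."""
        errors = Counter()
        for form, g_head, g_label, p_head, p_label in tokens:
            if g_head != p_head or g_label != p_label:
                errors[form.lower()] += 1
        return errors.most_common(top)   # expected to surface words such as "a", "de", "en", "y"/"e", "que"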
4   Case Studies: The Conjunction and the Preposition "a"
The conjunction and the preposition "a" are the words that most frequently caused a parsing error. This is why we selected them as cases to study to determine whether low level techniques are feasible for increasing parsing accuracy. We started experimenting with these techniques on the conjunction. The study of the errors obtained when parsing conjunctions began with a manual analysis of AnCora. Thus, we extracted from AnCora every sentence containing a conjunction ("y" or "e"). There are 1,586 sentences with at least one conjunction in the whole of AnCora. We inspected these sentences to find labelling patterns and in doing so we obtained a list of patterns that depend on the conjunction's function. For instance, one pattern is given when the conjunction acts as a nexus in a coordinated copulative sentence and another pattern is given when it acts as the last nexus in a list of nouns. For example, the following sentence matches these two patterns: Los activos en divisas en poder del Banco Central y el Ministerio de Finanzas se calculan en dólares estadounidenses y su valor depende del cambio oficial rublo–dólar que establece el Banco Central (The foreign exchange assets held by the Central Bank and the Ministry of Finance are calculated in U.S. dollars and their value depends on the ruble–dollar official exchange rate established by the Central Bank). In this example the first y is a nexus between the proper nouns Banco Central and Ministerio de Finanzas and the second y acts as a coordinated copulative nexus. These patterns guided the experiments described below.
4.1   The Conjunction
In this subsection we present two different approaches we have applied to the conjunction.

First Approach to the Conjunction. The first approach that we studied was an n–version parsing model. Our idea was to determine if some kinds of "difficult" sentences could be successfully parsed by specific parsers while a general parser would parse the non-problematic sentences. The first specific parser that we tried to obtain was supposed to accurately parse quoted sentence sections containing conjunctions. This situation arises quite commonly and corresponds to one of the labelling patterns that we had identified as problematic. Thus we trained a parsing model with Maltparser 0.4 for sentences that contain conjunctions. The system was set as in Nivre's group's participation in the CoNLL–X Shared Task. The training corpus consisted only of quoted sentence sections containing conjunctions. These sentence sections were obtained from the section of AnCora provided as training corpus for Spanish in the CoNLL–X Shared Task. It consisted of 22 sentence sections starting and finishing with a quotation mark and containing conjunctions. The test corpus was obtained in a similar way from the section of AnCora provided as test corpus for Spanish in the CoNLL–X Shared Task. This test corpus contained 7 sentences. To analyse this approach, we incrementally built a training corpus and we evaluated the
parsing performance for every trained model. The method we followed to build this corpus is described below:
– First of all, we selected the longest sentence of the training subcorpus of quoted sentence sections, and this was the first subcorpus added to the incremental training corpus.
– Then we iterated until every sentence section was added to the incremental training corpus. In each iteration we did the following:
  • Maltparser 0.4 was trained with the incremental corpus.
  • The trained model was tested by parsing the test subcorpus with it.
  • The remaining longest sentence section was added to the incremental corpus.
The results of this experiment are shown in Figure 1, in which we plotted LAS, UAS and LA for every iteration. The x axis represents the number of sentences contained in the incremental training corpus in every iteration and the y axis the values for LAS, UAS and LA.
Fig. 1. LAS, UAS and LA when training a parsing model incrementally with quoted sentence sections containing conjunctions from AnCora
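A sketch of this incremental procedure is given below; train_model and evaluate are caller-supplied wrappers around the external Maltparser 0.4 training and parsing runs (their names are ours), with evaluate returning an (LAS, UAS, LA) tuple:

    def incremental_training(quoted_sections, test_sections, train_model, evaluate):
        """Grow the training corpus longest-section-first and score every trained model."""
        remaining = sorted(quoted_sections, key=len, reverse=True)   # longest first
        corpus = [remaining.pop(0)]                                  # seed with the longest section
        scores = []
        while True:
            model = train_model(corpus)                  # external Maltparser 0.4 training
            scores.append(evaluate(model, test_sections))
            if not remaining:
                break
            corpus.append(remaining.pop(0))              # add the next longest remaining section
        return scores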
If we take only conjunction parsing into consideration the results are quite good. In the first iteration 3 conjunctions were incorrectly parsed, but in the second and the other iterations only 1 conjunction was incorrectly parsed. But as seen in Figure 1 the overall results did worse than those obtained by the general parser. Therefore, despite the improvement in local accuracy this approach does not seem to be realistic. This is because the number of available samples is not sufficient to train a specific model. This model should not only be able to obtain good results for parsing conjunctions but also for all the words of the whole quoted sentence. This led us to investigate another approach which is explained in the next section. A more Complex Approach. In this section we study the feasibility of a more complex n–version parsing model. As seen in Section 4.1, specific models can be
trained to obtain highly accurate parsings for a specific word, but these models cannot deal with the whole sentence in which the specific word is contained. This is what inspired this new approach. The idea is to obtain several specific models, each one able to accurately parse a single word in a specific context. Thus, the word would be one of the words that are most frequently incorrectly parsed and the context would be one of the labelling patterns referred to at the beginning of Section 4. For instance, one of these words is the conjunction "y" and one of the contexts in which it can be found is the one presented in Subsection 4.1, i.e., quoted sentence sections. This way, after parsing a sentence with a general model (such as the one presented in Section 2) a program should decide if the parsed sentence contains a word that must be parsed by a specific model. In that case the program should choose the appropriate specific model for this word in the context in which it appears. Once the sentence is parsed with the specific model, the result for the "problematic" word is replaced in the result obtained by the general model. This way the best of both parsings can be obtained. In the case of the conjunction, the labelling given to it by the specific parser is cut from this parsing and pasted into the parsing given by the general model, replacing the labelling given to the conjunction by the general parser. This simple solution is possible because the conjunction is always a leaf of the parsing tree and its labellings can be changed without affecting the rest of the parsing. To study if this n–version parsing model could be useful to get more accurate parsings we developed the experiments described below. For the first experiment we trained a specific model for coordinated copulative sentences. We built a specific training corpus with the set of unambiguous coordinated copulative sentences contained in the section of AnCora that was provided as training corpus in the CoNLL–X Shared Task. This specific training corpus contains 361 sentences (10,561 wordforms). Then we parsed all the coordinated copulative sentences contained in the section of AnCora that was provided as test corpus in the CoNLL–X Shared Task (16 sentences, 549 wordforms). MaltParser uses history–based feature models for predicting the next action in the deterministic derivation of a dependency structure, which means that it uses features of the partially built dependency structure together with features of the (tagged) input string. More precisely, features are defined in terms of the wordform (LEX), part–of–speech (POS) or dependency type (DEP) of a token defined relative to one of the data structures STACK, INPUT and CONTEXT. A feature model is defined in an external feature specification1. We set the experiments described above with the same feature model that Nivre's group used in its participation in the CoNLL–X Shared Task. We also used this feature model in the present experiment and we found that the conjunction was incorrectly parsed 8 times (in a test set containing 16 conjunctions). This fact led us to investigate other feature models. After a few failed attempts we found a feature model where 12 of the 16 conjunctions were parsed correctly. This feature model is shown in Figure 2.
1 An in–depth description of these feature models can be found in http://w3.msi.vxu.se/∼nivre/research/MaltParser.htmlfeatures
Fig. 2. Feature model for coordinated copulative sentences
Although the results were enhanced by using the new feature model, the general parsing model (obtained in Section 2) correctly parses 13 of these 16 conjunctions. This could mean that specific models are not feasible for our objectives. Since the accuracies reached by both models were very similar, we developed some other experiments to confirm or reject this hypothesis. Thus, we tried new specific parsers for other conjunction–context combinations. For the second experiment we developed a specific parser for conjunctions acting as a nexus in a list of proper nouns. We built a specific training corpus with the set of unambiguous sentences containing conjunctions acting as a nexus in lists of proper nouns, from the section of AnCora that was provided as training corpus in the CoNLL–X Shared Task. This specific training corpus contains 59 sentences (1,741 wordforms). After the training we parsed all the sentences containing conjunctions acting as a nexus in lists of proper nouns, from the section of AnCora that was provided as test corpus in the CoNLL–X Shared Task (5 sentences, 121 wordforms). We set this training with the same feature model that Nivre's group used in its participation in the CoNLL–X Shared Task. This specific model parsed all 5 conjunctions of the test set successfully, while the general model parsed only 4 of these conjunctions successfully. We developed a third experiment to evaluate a specific model for parsing conjunctions acting as a nexus in lists of common nouns. We built a specific training corpus with the set of unambiguous sentences containing conjunctions acting as a nexus in lists of common nouns, from the section of AnCora that was provided as training corpus in the CoNLL–X Shared Task. This specific training corpus contains 266 sentences (8,327 wordforms). After the training we parsed all the sentences containing conjunctions acting as a nexus in lists of common nouns, from the section of AnCora that was provided as test corpus in the CoNLL–X Shared Task (15 sentences, 480 wordforms). Once again the best feature model was the one that Nivre's group used in its participation in the CoNLL–X Shared Task. This specific model parsed 12 of the 15 conjunctions of the test set successfully, while the general model parsed only 10 of these conjunctions successfully. A last experiment was carried out to find more evidence for the feasibility of this n–version parsing model. To do this, we developed a specific model for parsing conjunctions acting as a nexus in lists of adjectives or constructions acting as adjectives. We built a specific training corpus with the set of unambiguous
sentences containing conjunctions acting as a nexus in lists of adjectives or constructions acting as adjectives, from the section of AnCora that was provided as training corpus in the CoNLL–X Shared Task. This specific training corpus contains 59 sentences (3,155 wordforms). After the training we parsed all the sentences containing conjunctions acting as a nexus in lists of adjectives, from the section of AnCora that was provided as test corpus in the CoNLL–X Shared Task (5 sentences, 113 wordforms). The feature model that Nivre's group used in its participation in the CoNLL–X Shared Task gave the best results again. This specific model parsed all 5 conjunctions of the test set successfully, while the general model parsed 4 of these conjunctions successfully. The parsings given by the general parsing model to the conjunctions involved in the previous four experiments were replaced by the parsings given by the specific models. This way we combined both parsings as described above in this section. Then we recomputed LAS, UAS and LA for this combined parsing, obtaining the following values: LAS = 81.92%, UAS = 85.31% and LA = 90.06%. The results show a slight enhancement with respect to the results given by the general parsing model presented in Section 2. In addition, in the combined parsing the conjunction no longer belongs to the set of words that are more frequently incorrectly parsed. This improvement seems to indicate that this n–version parsing model is feasible and that overall accuracy could be improved via local accuracy improvement.
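The cut-and-paste combination used here can be sketched as follows; a parse is assumed to be a list of (form, head, label) triples, and only the target tokens' head and label are taken from the specific model, which is safe because the conjunction is always a leaf of the parsing tree:

    def combine_parses(general_parse, specific_parse, target_positions):
        """Overwrite the head/label of the target tokens with the specific model's output."""
        combined = list(general_parse)
        for i in target_positions:                 # positions of the "problematic" word
            form, _, _ = combined[i]
            _, spec_head, spec_label = specific_parse[i]
            combined[i] = (form, spec_head, spec_label)
        return combined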
4.2   The Preposition "a"
Once we found the promising approach presented in Section 4.1 we applied it to the next word in the list of the most frequently wrongly parsed words. We followed the steps stated previously. We started looking for the different ways in which the preposition "a" is attached and labelled. Six cases were found, as shown in Table 1. A specific parser was trained for each case using Maltparser 0.4 set as in the CoNLL–X Shared Task. We used the feature model proposed in the Shared Task, except for case number 1, for which we used an m3.par model. This model was chosen empirically because the one proposed in the Shared Task was not suitable for tackling case number 1. In all the cases, except for case number 5, the quality of the labelling and the attachment of the word "a" were clearly improved, as shown in Table 1. Case number 5 is very challenging because we had only 8 sentences containing it in the training set and 1 sentence in the test set. Perhaps the problem is in the small number of sentences used for training. Since case number 5 is infrequent we did not make any particular effort to solve it in such a preliminary work. Nevertheless, it remains a very interesting case study for future work. Once again the improvement in local accuracy is beneficial to the overall accuracy. When applying the labellings and attachments given by all the specific parsers presented in Sections 4.1 and 4.2, we obtain the following new overall values for the test set: LAS = 82.17%, UAS = 85.51% and LA = 90.32%.

Table 1. Attachment and labelling of the preposition "a" in AnCora. Found cases and LAS only for the preposition "a", before and after the application of our method.

Case           #1       #2       #3       #4       #5       #6
Label          CD  CI  CC  CREG
Attached to    a verb  a noun
LASa before    62.5%    42.9%    60.0%    25.0%    0.0%     50.0%
LASa after     87.5%    100%     100%     75.0%    0.0%     100%
5   Conclusions and Future Work
Previous work shows that high level techniques, such as controlling the training corpus size or its sentences' lengths, are not sufficient for improving parsing accuracy when using machine learning–based systems that cannot be modified. This led us to investigate low level techniques, based on the detailed study of the words that are most frequently incorrectly parsed. In this work we study the feasibility of these low level techniques to reach better parsing accuracy. The idea presented in this paper is to develop n–version parsing models. Each parsing model is trained to accurately parse a specific kind of word in a specific context. This way, errors made by general parsers are avoided and local accuracy is enhanced. Therefore, if a sentence contains one of the words that are most frequently incorrectly parsed by general parsers, it is simultaneously sent to a specific parser and to a general parser. After this, both parsings are combined in order to make the best of them. This work relies on two case studies: the conjunction and the preposition "a", because these are the parts of speech most frequently incorrectly parsed. These preliminary experiments show that these kinds of low level techniques are promising for improving parsing accuracy under the circumstances described in this paper. A good deal of promising future work is encouraged by the present one. This future work includes not only similar studies on the rest of the words that are most frequently incorrectly parsed, but also the development of programs that accurately send each sentence to the adequate specific parsers, when necessary. Also, some effects that could arise in this kind of work, such as overfitting, should be studied. This work focused on Maltparser 0.4 and Spanish, but similar analyses could be carried out to study other languages and/or parsers, complementing the present one.
Acknowledgments

This work has been partially funded by Banco Santander Central Hispano and Universidad Complutense de Madrid under the Creación y Consolidación de Grupos de Investigación program, Ref. 921332–953.
References
1. Buchholz, S., Marsi, E.: CoNLL–X Shared Task on Multilingual Dependency Parsing. In: Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL–X), pp. 149–164 (2006)
2. Ballesteros, M., Herrera, J., Francisco, V., Gervás, P.: Improving Parsing Accuracy for Spanish using Maltparser. Journal of the Spanish Society for Natural Language Processing (SEPLN) 44 (in press, 2010)
3. Herrera, J., Gervás, P.: Towards a Dependency Parser for Greek Using a Small Training Data Set. Journal of the Spanish Society for Natural Language Processing (SEPLN) 41, 29–36 (2008)
4. Herrera, J., Gervás, P., Moriano, P.J., Moreno, A., Romero, L.: Building Corpora for the Development of a Dependency Parser for Spanish Using Maltparser. Journal of the Spanish Society for Natural Language Processing (SEPLN) 39, 181–186 (2007)
5. Herrera, J., Gervás, P., Moriano, P.J., Moreno, A., Romero, L.: JBeaver: un Analizador de Dependencias para el Español Basado en Aprendizaje. In: Borrajo, D., Castillo, L., Corchado, J.M. (eds.) CAEPIA 2007. LNCS (LNAI), vol. 4788, pp. 211–220. Springer, Heidelberg (2007)
6. Nivre, J., Hall, J., Nilsson, J.: Memory–based Dependency Parsing. In: Proceedings of CoNLL–2004, Boston, MA, USA, pp. 49–56 (2004)
7. Eisner, J.: Three New Probabilistic Models for Dependency Parsing: An Exploration. In: Proceedings of the 16th International Conference on Computational Linguistics (COLING 1996), Copenhagen, pp. 340–345 (1996)
8. Yamada, H., Matsumoto, Y.: Statistical Dependency Analysis with Support Vector Machines. In: Proceedings of International Workshop of Parsing Technologies (IWPT 2003), pp. 195–206 (2003)
9. Palomar, M., Civit, M., Díaz, A., Moreno, L., Bisbal, E., Aranzabe, M., Ageno, A., Martí, M.A., Navarro, B.: 3LB: Construcción de una Base de Datos de Árboles Sintáctico–Semánticos para el Catalán, Euskera y Español. In: Proceedings of the XX Conference of the Spanish Society for Natural Language Processing (SEPLN), Sociedad Española para el Procesamiento del Lenguaje Natural, pp. 81–88 (2004)
10. Taulé, M., Martí, M., Recasens, M.: AnCora: Multilevel Annotated Corpora for Catalan and Spanish. In: Proceedings of 6th International Conference on Language Resources and Evaluation (2008)
11. McDonald, R., Lerman, K., Pereira, F.: Multilingual Dependency Analysis with a Two-Stage Discriminative Parser. In: Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL–X), pp. 216–220 (2006)
12. Nivre, J., Hall, J., Nilsson, J., Eryiğit, G., Marinov, S.: Labeled Pseudo–Projective Dependency Parsing with Support Vector Machines. In: Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL–X), pp. 221–225 (2006)
13. Johansson, R., Nugues, P.: Investigating Multilingual Dependency Parsing. In: Proceedings of the Conference on Computational Natural Language Learning, CoNLL–X (2006)
14. Wu, Y., Lee, Y., Yang, J.: The Exploration of Deterministic and Efficient Dependency Parsing. In: Proceedings of the Conference on Computational Natural Language Learning, CoNLL–X (2006)
A Hybrid Ant Colony Optimization Algorithm for Solving the Ring Arc-Loading Problem Anabela Moreira Bernardino1, Eugénia Moreira Bernardino1, Juan Manuel Sánchez-Pérez2, Juan Antonio Gómez-Pulido2, and Miguel Angel Vega-Rodríguez2 1
Research Center for Informatics and Communications, Department of Computer Science, School of Technology and Management, Polytechnic Institute of Leiria, 2411 Leiria, Portugal {anabela.bernardino,eugenia.bernardino}@ipleiria.pt 2 Department of Technologies of Computers and Communications, Polytechnic School, University of Extremadura, 10071 Cáceres, Spain {sanperez,jangomez,mavega}@unex.es
Abstract. The past two decades have witnessed tremendous research activities in optimization methods for communication networks. One important problem in communication networks is the Weighted Ring Arc-Loading Problem (combinatorial optimization NP-complete problem). This problem arises in engineering and planning of the Resilient Packet Ring (RPR) systems. Specifically, for a given set of non-split and uni-directional point-to-point demands (weights), the objective is to find the routing for each demand (i.e., assignment of the demand to either clockwise or counter-clockwise ring) so that the maximum arc load is minimised. In this paper, we propose a Hybrid Ant Colony Optimization Algorithm to solve this problem. We compare our results with the results obtained by the standard Genetic Algorithm and Particle Swarm Optimization, used in literature. Keywords: Communication Networks, Optimization Algorithms, Ant Colony Optimization Algorithm, Weighted Ring Arc-Loading Problem.
1   Introduction
Resilient Packet Ring (RPR), also known as IEEE 802.17, is a standard designed to optimise the transport of data traffic through optical fiber ring networks [1]. The RPR aims to combine the appealing functionalities of Synchronous Optical Network/Synchronous Digital Hierarchy (SONET/SDH) networks with the advantages of Ethernet networks. It is a ring-based architecture that consists of two counter-directional optical fiber rings. The bandwidth utilisation in RPR is further increased by means of spatial reuse. Spatial reuse is achieved in RPR through the so-called destination stripping, which means that the destination node takes a transmitted packet off the fiber ring. Thus, a given transmission traverses only the ring segment from the source node to the destination node, allowing other nodes on the ring segment between the destination node and the source node to exchange transmissions at the same
time on the same fiber ring. Furthermore, the RPR provides fairness and allows the full ring bandwidth to be used under normal operation conditions. To effectively use the RPR's potential, namely the spatial reuse, the statistical multiplexing and the bi-directionality, it is necessary to route the demands efficiently. Given a network and a set D of communication requests, a fundamental problem is to design a transmission route (direct path) for each request, to avoid high load on the arcs, where an arc is an edge endowed with a direction (clockwise or counter-clockwise). The load of an arc is defined as the total weight of those requests that are routed through the arc in its direction. In general each request is associated with a non-negative integer weight. Practically, the weight of a request can be interpreted as a traffic demand or as the size of the data to be transmitted. The Weighted Ring Arc-Loading Problem (WRALP) can be classified into two formulations: with demand splitting (WRALP) or without demand splitting (non-split WRALP). The split loading allows the splitting of a demand into two portions to be carried out in both directions, while in a non-split loading each demand must be entirely carried out in either the clockwise or the counter-clockwise direction. For the research on the non-split formulation, Cosares and Saniee [2], and Dell'Amico et al. [3] studied the problem on SONET rings. Cosares and Saniee [2] proved that this formulation is NP-complete. This means that we cannot guarantee to find the best solution in a reasonable amount of time. Recent studies apply evolutionary algorithms to solve the non-split formulation [4][5]. For the split problem, various approaches are summarised by Schrijver et al. [6], and their algorithms are compared in Myung and Kim [7] and Wang [8]. The non-split WRALP considered in this paper is identical to the one described by Kubat and Smith [9] (non-split WRALP), Cho et al. [10] (non-split WRALP and WRALP) and Yuan and Zhou [11] (WRALP). They try to find approximate solutions in a reduced amount of time. Our purpose is different: we want to compare the performance of our algorithm with others in the achievement of the best-known solution. Using the same principle, Bernardino et al. [12] presented four hybrid Particle Swarm Optimization (PSO) algorithms to solve the non-split WRALP. An Ant Colony Optimization algorithm (ACO) is essentially a system based on agents which simulate the natural behaviour of ants, including mechanisms of cooperation and adaptation. This metaheuristic has been shown to be both robust and versatile. The ACO algorithm has been successfully applied to a range of different combinatorial optimization problems [13]. In this paper we present an ACO algorithm coupled with a local search (HACO), applied to the WRALP. Our algorithm is based on the algorithm proposed by Gambardella et al. [14] to solve the quadratic assignment problem. The HACO uses pheromone trail information to perform modifications on WRALP solutions, unlike the more traditional ant systems that use pheromone trail information to construct complete solutions. We compare the performance of HACO with the standard Genetic Algorithm (GA) and Local Search - Probability Binary PSO (LS-PBPSO), used in the literature. The paper is structured as follows. In Section 2 we describe the WRALP; in Section 3 we describe the implemented HACO algorithm; in Section 4 we present the studied examples and we discuss the computational results obtained; and in Section 5 we report the conclusions.
2   Problem Definition
Let Rn be an n-node bidirectional ring with nodes {n1, n2, …, nn} labelled clockwise. Each edge {ek, ek+1} of Rn, 1 ≤ k ≤ n, is taken as two arcs with opposite directions, in which the data streams can be transmitted in either direction: ak+ = (ek, ek+1), ak− = (ek+1, ek). A communication request on Rn is an ordered pair (s,t) of distinct nodes, where s is the source and t is the destination. We assume that data can be transmitted clockwise or counter-clockwise on the ring, without splitting. We use P+(s,t) to indicate the directed (s,t) path clockwise around Rn, and P−(s,t) to indicate the directed (s,t) path counter-clockwise around Rn. A request (s,t) is often associated with an integer weight w >= 0; we denote this weighted request by (s,t; w). Let D = {(s1,t1; w1), (s2,t2; w2), ..., (sm,tm; wm)} be a set of integrally weighted requests on Rn. For each request/pair (si,ti) we need to design a directed path Pi of Rn from si to ti. A set P = {Pi: i=1, 2, ..., m} of such directed paths is called a routing for D.

Table 1. Solution representation
Pair (s,t)      Demand   Direction (C – clockwise, CC – counter-clockwise)
1: (1, 2)   →   15       C
2: (1, 3)   →   3        CC
3: (1, 4)   →   6        CC
4: (2, 3)   →   15       C
5: (2, 4)   →   6        CC
6: (3, 4)   →   14       C

Representation (V):
Pair1   Pair2   Pair3   Pair4   Pair5   Pair6
1       0       0       1       0       1
In this work, the solutions are represented using binary vectors (Table 1). For each i, 1 ≤ i ≤ m: if Vi = 1, the total amount of data of pair i is transmitted along P+(si,ti); if Vi = 0, it is transmitted along P−(si,ti). The vector V = (V1, V2, …, Vm) determines a routing scheme for D.
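Under this representation, the load of every arc and the maximum arc load (the quantity the WRALP minimises, formalised in Section 3) can be computed directly from V; the sketch below is illustrative only and reuses the routing of Table 1:

    def max_arc_load(n, demands, V):
        """demands: list of (s, t, w) with 1-based node labels; V[i] = 1 routes
        demand i clockwise (P+), V[i] = 0 counter-clockwise (P-)."""
        cw = [0] * n    # cw[k]: load of the clockwise arc leaving node k (0-based)
        ccw = [0] * n   # ccw[k]: load of the counter-clockwise arc entering node k (0-based)
        for (s, t, w), direction in zip(demands, V):
            node, target = s - 1, t - 1
            if direction == 1:                         # clockwise path P+(s,t)
                while node != target:
                    cw[node] += w
                    node = (node + 1) % n
            else:                                      # counter-clockwise path P-(s,t)
                while node != target:
                    ccw[(node - 1) % n] += w
                    node = (node - 1) % n
        return max(max(cw), max(ccw))

    # Routing of Table 1 on a 4-node ring, V = (1, 0, 0, 1, 0, 1):
    demands = [(1, 2, 15), (1, 3, 3), (1, 4, 6), (2, 3, 15), (2, 4, 6), (3, 4, 14)]
    print(max_arc_load(4, demands, [1, 0, 0, 1, 0, 1]))   # maximum arc load of 15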
3   Proposed Hybrid Ant Colony Optimization
ACO is a population-based optimization method to solve hard combinatorial optimization problems. The first ACO algorithm was presented by Dorigo, Maniezzo and Colorni [15][16] and Dorigo [17], and since then many diverse variants of the basic principle have been reported in the literature [13]. In real life, ants indirectly communicate with each other by depositing pheromone trails on the ground, influencing the decision processes of other ants. This simple form of communication between individual ants causes complex behaviours and capabilities of the colony as a whole. The real ants' behaviour is transposed into an algorithm by making an analogy between: the real ants' search and the set of feasible solutions to the problem; the amount of food in a source and the fitness function; the pheromone trail and an adaptive memory [14].
The pheromone trails in ACO serve as distributed, numerical information which the ants use to probabilistically construct solutions to the problem to be solved and which they adapt during the algorithm execution to reflect their search experience. Gambardella et al. [14] present a Hybrid Ant Colony System coupled with a local search (HAS_QAP) that uses pheromone trail information to perform modifications on QAP solutions. The simplest way to exploit the ants' search experience is to make the pheromone updating process a function of the solution quality achieved by each particular ant. In HACO, only the best solution found during the search process contributes to pheromone trail updating. This makes the search more aggressive and requires less time to reach good solutions [14]. Moreover, this has been strengthened by an intensification mechanism that allows it to return to previous best solutions [14]. The algorithm proposed by Gambardella et al. [14] also performs a diversification mechanism after a predefined number of S iterations without improving the best solution found so far. We verified that in our algorithm the diversification mechanism does not produce better solutions, mainly due to the LS method used. The main steps of the HACO algorithm are given below:

    Initialize Parameters
    Initialize Solutions (ants)
    Evaluate Solutions
    Apply Local Search Procedure
    Evaluate Solutions
    Initialize Pheromone Trails
    WHILE TerminationCriterion()
        FOR each Solution in Population
            Modify Solution using Pheromone Trails
            Evaluate Solution
            Apply Local Search Procedure
            Evaluate Solution
            Apply Intensification Mechanism
        Update Pheromone Trails
Initialisation of parameters. The following parameters must be defined by the user: number of ants (NA); maximum number of iterations (MI); value used to initialise the pheromone trails (Q); exploration/exploitation probability (q); pheromone evaporation rate (x1); pheromone influence rate (x2); and number of modifications (R).

Initialisation of solutions. The initial solutions can be created randomly or in a deterministic form. The deterministic form is based on a Shortest-Path Algorithm (SPA). The SPA is a simple traffic demand assignment rule in which each demand traverses the smallest number of segments.

Evaluation of solutions. The fitness function is responsible for performing the evaluation and returning a positive number (fitness value) that reflects how optimal the solution is. To evaluate the solutions, we use the following fitness function:
w1, …, wm: demands of the pairs (s1,t1), …, (sm,tm)                         (1a)
V1, …, Vm: Vi = 0 → P−(si,ti);  Vi = 1 → P+(si,ti)                          (1b)

Load on arcs, ∀ k = 1,…,n; ∀ i = 1,…,m:
L(V, ak+) = Σ wi over all i such that ak+ ∈ P+(si,ti)                       (2a)
L(V, ak−) = Σ wi over all i such that ak− ∈ P−(si,ti)                       (2b)

Fitness function:
max{ maxk L(V, ak+), maxk L(V, ak−) }                                       (3)
The fitness function is based on the following constraints: (1) between each node pair (si,ti) there is a demand value >= 0 and each positive demand value is routed in either the clockwise (C) or the counter-clockwise (CC) direction; (2) for an arc, the load is the sum of the weights wi routed in the clockwise or counter-clockwise direction between nodes ek and ek+1. The purpose is to minimise the maximum load on the arcs of the ring (3).

Initialisation of pheromone trails. For the WRALP, the set of pheromone trails is maintained in a matrix T of size 2*m, where each Tij measures the desirability of assigning the direction i to the pair j. All pheromone trails Tij are set to the same value T0 = 1/(Q*Fitness(G)) [14]. G is the best solution found so far and Q a parameter.

Modification of solutions. The algorithm performs R modifications. A modification consists of assigning a direction d to a pair p. First a pair p is randomly chosen (between 1 and m) and then a direction d is chosen (clockwise or counter-clockwise). A random number x is generated between 0 and 1. If x is smaller than q (parameter), the best direction d is chosen such that Tdp is maximal. This policy consists in exploiting the pheromone trail. If x is higher than q, the direction d is chosen with a probability proportional to the values contained in the pheromone trail. This consists in exploring the solution space.

Local Search. The LS algorithm applies a partial neighbourhood examination. Some pairs of the solution are selected and their directions are exchanged (partial search). This method can be summarised in the following pseudo-code steps [12]:

    For t = 0 to numberNodesRing/4
        P1 = random(number of pairs)
        P2 = random(number of pairs)
        N = neighbourhood of ACTUAL-SOLUTION (each neighbour results from
            interchanging the direction of P1 and/or P2)
        SOLUTION = FindBest(N)
        If ACTUAL-SOLUTION is worse than SOLUTION
            ACTUAL-SOLUTION = SOLUTION
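Returning to the modification step described above, the exploitation/exploration choice can be sketched as follows (a rough sketch; the 2 x m trail matrix T and the 0/1 direction encoding follow the conventions of this section, while the function name is ours):

    import random

    def modify_solution(solution, T, q, R):
        """Apply R pheromone-guided modifications to a binary routing vector.
        T[d][p] is the desirability of assigning direction d (1 = C, 0 = CC) to pair p."""
        m = len(solution)
        for _ in range(R):
            p = random.randrange(m)                          # pick a pair at random
            if random.random() < q:                          # exploitation
                d = 1 if T[1][p] >= T[0][p] else 0           # direction with the highest trail
            else:                                            # exploration
                total = T[0][p] + T[1][p]
                d = 1 if random.random() < T[1][p] / total else 0
            solution[p] = d
        return solution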
Intensification mechanism. The intensification mechanism allows exploring the neighbourhood in a more complete way and allows returning to the previous best solutions. If the intensification is active and the solution V at the beginning of the iteration is better, the ant comes back to the initial solution V. The intensification is activated when the best solution found
so far has been improved and remains active while at least one ant succeeds in improving its solution during the iteration.

Pheromone trails update. To speed up the convergence, the pheromone trails are updated by taking into account only the best solution found so far [14]. The pheromone trails are updated by setting: Tij = (1-x1)*Tij, with 0 < x1 < 1, where x1 is a parameter that controls the evaporation of the pheromone trail; and TGi,i = TGi,i + x2/fitness(G), with 0 < x2 < 1, where x2 is a parameter that controls the influence of the best solution G on the pheromone trail.

Termination criterion. The algorithm stops when a maximum number of iterations (MI) is reached. More information about ACO can be found on the ACO website [13].
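The two update rules can be written, with the same notation (T[d][p] for the trail of direction d on pair p, G for the best solution found so far), as the following sketch:

    def update_pheromone_trails(T, G, fitness_G, x1, x2):
        """Evaporate all trails, then reinforce the entries used by the best solution G."""
        m = len(G)
        for d in (0, 1):
            for p in range(m):
                T[d][p] *= (1.0 - x1)          # evaporation: Tij = (1 - x1) * Tij
        for p, d in enumerate(G):
            T[d][p] += x2 / fitness_G          # reinforcement along the best solution
        return T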
4   Results
We evaluate the utility of the algorithms using the same examples produced by Bernardino et al. [12]. They consider six different ring sizes (5, 10, 15, 20, 25 and 30) and four demand cases: (i) a complete set of demands between 5 and 100 with uniform distribution; (ii) half of the demands in (i) set to zero; (iii) 75% of the demands in (i) set to zero; and (iv) a complete set of demands between 1 and 500 with uniform distribution. The last case was only used for the 30-node ring. For convenience, the instances used are labelled Cij, where 1
0.6 and x2>0.7, Q=100, q >0.7 (Fig. 1) and number of ants={30,40}. The correct combinations of parameters proved to be good and robust for the problems tested.
Fig. 1. Influence of parameters – Problem C41 – 100 iterations
In our experiments we use a growing number of ants. The number of ants was set to {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}. We studied the impact on the execution time, the average fitness and the number of best solutions using all parameter combination. A higher number of ants significantly increases the algorithm execution time (Fig. 2a).
Fig. 2. Number of Ants – Execution Time (a) – Number of Best Solutions (b)
Fig. 3. Number of Ants – Average Fitness (a) – Convergence (b)
Fig. 4. Number of modifications – Number of best solutions – Execution time
With 30 and 40 ants the algorithm can reach, in a reasonable amount of time, a good number of best solutions (Fig. 2b) and a good average fitness (Fig. 3a). With a higher number of ants the algorithm can reach a better average fitness (Fig. 3a), but it is more time consuming. We also observe that a small number of ants allows a faster initial convergence, but a worse final result, with an increased number of suboptimal values (Fig. 3b). This can be explained because the quality of the initial best-located solution depends highly on the population size. A larger population diversity is necessary to avoid premature stagnation. For parameter R, the number of swaps executed using pheromone trail information, values of R between [m/5, m/4] have been shown experimentally to be more efficient (Fig. 4). In our experiments, R was set to {15, 20, 25, 30, 35, 40}. In the case of a high R, the resulting permutation tends to be closer to the best solution used to perform global pheromone trail updating, which makes it more difficult to generate new improving solutions. A high R also has a significant impact on the execution time (Fig. 4). On the other hand, a small R did not allow the system to escape from local minima, because after the local search the resulting solution was in most cases the same as the starting permutation.
A large number of experiments and considerations were made to define the other parameters. In general, experiments have shown that the proposed parameter setting is very robust. To compare our results we consider the results produced with the standard GA [18] and the LS-PBPSO proposed by Bernardino et al. [12]. Suggestions from the literature helped to guide our choice of parameter values for the GA algorithm [18] and the LS-PBPSO algorithm [12]. The GA was applied to populations of 200 individuals; it uses the "One-point" method for recombination, the "Change Direction" method for mutation and the "Tournament" method for selection. For the GA, we consider a crossover probability in the range [0.6,0.9] and a mutation probability in the range [0.5,0.8]. The LS-PBPSO was applied to populations of 40 particles and we consider the value 1.49 for the parameters C1 and C2, and for the inertia velocity (W) values in the range [0.6,0.8]. For the HACO we consider populations of 40 individuals, 30 modifications, Q=100, x1 in the range [0.6,0.8], x2 in the range [0.7,0.8] and q in the range [0.7,0.8]. The algorithms have been executed on an Intel Quad Core Q9450 processor and the initial solutions of all algorithms were created randomly. For the problem C64 we used the SPA to create the initial populations. Table 2 presents the best results obtained by Bernardino et al. [12]. The first column represents the number of the instance (Instance), the second and the third columns show the number of nodes (Nodes) and the number of pairs (Pairs), and the fourth column shows the minimum fitness values obtained.

Table 2. Best obtained results

Instance   Nodes   Pairs   Best Fitness
C11        5       10      161
C12        5       8       116
C13        5       6       116
C21        10      45      525
C22        10      23      243
C23        10      12      141
C31        15      105     1574
C32        15      50      941
C33        15      25      563
C41        20      190     2581
C42        20      93      1482
C43        20      40      612
C51        25      300     4265
C52        25      150     2323
C53        25      61      912
C61        30      435     5762
C62        30      201     2696
C63        30      92      1453
C64        30      435     27779
Table 3 presents the best results obtained with GA, LS-PBPSO and HACO. The first column represents the number of the problem (Prob), the second column demonstrates the number of iterations used to test each instance and the remaining columns show the results obtained (Time – Run Times, IT - Iterations) by the three algorithms. The results have been computed based on 100 different executions for each test instance, using the best combination of parameters found and different seeds. Table 3 considers only the 30 best executions. All the algorithms reach the best solutions before the run times and number of iterations presented. Table 4 presents the average fitness and the average time obtained with GA, LSPBPSO and HACO using a limited number of iterations for the problems C41, C51
and C61 (harder problems). The first column represents the number of the problem (Prob), the second column demonstrates the number of iterations used to test each instance and the remaining columns show the results obtained (AvgF – Average Fitness, AvgT – Average Time) by the three algorithms. The results have been computed based on 100 different executions for each test instance using the best combination of parameters found and different seeds.

Table 3. Results – run times and number of iterations

Prob   Number of    GA                LS-PBPSO          HACO
       Iterations   Time      IT      Time      IT      Time      IT
C11    25           <0.001    2       <0.001    2       <0.001    2
C12    10           <0.001    2       <0.001    2       <0.001    2
C13    10           <0.001    1       <0.001    1       <0.001    1
C21    50           <0.001    25      <0.001    15      <0.001    20
C22    25           <0.001    5       <0.001    3       <0.001    3
C23    10           <0.001    3       <0.001    3       <0.001    3
C31    100          0.1       40      0.08      20      0.06      30
C32    50           <0.001    15      <0.001    8       <0.001    10
C33    25           <0.001    5       <0.001    5       <0.001    5
C41    150          0.15      100     0.1       40      0.08      50
C42    75           0.075     35      0.05      20      0.03      25
C43    25           <0.001    10      <0.001    5       <0.001    5
C51    250          0.8       150     0.75      100     0.6       100
C52    150          0.15      50      0.1       25      0.1       30
C53    75           0.02      35      0.01      15      0.01      20
C61    500          2.3       300     2         140     1.75      150
C62    250          0.6       120     0.4       50      0.3       60
C63    100          0.08      30      0.075     15      0.075     20
C64    100          0.5       5       0.75      40      0.5       5
Table 4. Results – Average Time / Average Fitness

Problem   Number of    GA                  LS-PBPSO            HACO
          iterations   AvgF       AvgT     AvgF       AvgT     AvgF       AvgT
C41       50           2675.50    0.13     2612.25    0.20     2605.34    0.18
C51       100          4384.50    0.47     4288.22    0.84     4279.49    0.76
C61       150          6010.45    1.16     5814.89    2.33     5793.68    2.23
The HACO algorithm obtains a better average fitness in less time (Table 3 and Table 4). GA is the fastest algorithm per iteration (Table 4); however, it obtains the best solutions in higher times (Table 3). We have tried to apply the LS used in LS-PBPSO and in HACO to GA, but that led to poor performance, mainly due to the size of the GA population. When using the SPA for creating the initial solutions, the times and number of iterations decrease (problem C64, Table 3). This problem is computationally harder than C61; however, the best solution is obtained faster. To improve the solutions we consider it more efficient to initially apply the SPA and then the metaheuristic to improve the solutions.
5   Conclusions
In this paper we present a HACO algorithm to solve the non-split WRALP. The performance of our algorithm is compared with a classical GA and a PSO algorithm. Experimental results demonstrate that the proposed HACO algorithm is an effective and competitive approach, producing fairly satisfactory results with respect to solution quality and execution time for the WRALP. HACO provides a higher number of best solutions for larger problems and a better average fitness. In the literature the application of HACO to this problem is nonexistent, and for that reason this article shows its applicability to the resolution of this problem. When using the SPA to create the initial solutions, the best solution is obtained faster. The continuation of this work will be the search for and implementation of new methods to speed up the optimization process.
References 1. RPR Alliance: A Summary and Overview of the IEEE 802.17 Resilient Packet Ring Standard (2004) 2. Cosares, S., Saniee, I.: An optimization problem related to balancing loads on SONET rings. Telecommunication Systems 3(2), 165–181 (1994) 3. Dell’Amico, M., Labbé, M., Maffioli, F.: Exact solution of the SONET Ring Loading Problem. Oper. Res. Lett. 25(3), 119–129 (1999) 4. Bernardino, A.M., Bernardino, E.M., Sánchez-Pérez, J.M., Vega-Rodríguez, M.A., Gómez-Pulido, J.A.: Solving the Ring Loading Problem using Genetic Algorithms with intelligent multiple operators. In: International Symposium on Distributed Computing and Artificial Intelligence 2008 (DCAI 2008), pp. 235–244. Springer, Heidelberg (2008) 5. Bernardino, A.M., Bernardino, E.M., Sánchez-Pérez, J.M., Vega-Rodríguez, M.A., Gómez-Pulido, J.A.: Solving the weighted ring edge-loading problem without demand splitting using a Hybrid Differential Evolution Algorithm. In: The 34th IEEE Conference on Local Computer Networks. IEEE Press, Los Alamitos (2009) 6. Schrijver, A., Seymour, P., Winkler, P.: The ring loading problem. SIAM Journal of Discrete Mathematics 11, 1–14 (1998) 7. Myung, Y.S., Kim, H.G.: On the ring loading problem with demand splitting. Operations Research Letters 32(2), 167–173 (2004) 8. Wang, B.F.: Linear time algorithms for the ring loading problem with demand splitting. Journal of Algorithms 54(1), 45–57 (2005) 9. Kubat, P., Smith, J.M.: Balancing traffic flows in resilient packet rings. In: Girard, A., et al. (eds.) Performance evaluation and planning methods for the next generation internet, series 6, pp. 125–140. Springer, GERAD 25th Anniversary (2005) 10. Cho, K.S., Joo, U.G., Lee, H.S., Kim, B.T., Lee, W.D.: Efficient Load Balancing Algorithms for a Resilient Packet Ring. ETRI Journal 27(1), 110–113 (2005) 11. Yuan, J., Zhou, S.: Polynomial Time Solvability Of The Weighted Ring Arc-Loading Problem With Integer Splitting. Journal of Interconnection Networks 5(2), 193–200 (2004) 12. Bernardino, A.M., Bernardino, E.M., Sánchez-Pérez, J.M., Vega-Rodríguez, M.A., Gómez-Pulido, J.A.: Solving the non-split weighted ring arc-loading problem in a Resilient Packet Ring using Particle Swarm Optimization. In: International Conference in Evolutionary Computation (2009)
13. Ant Colony Optimization HomePage, http://iridia.ulb.ac.be/dorigo/ACO/ACO.html 14. Gambardella, L.M., Taillard, E.D., Dorigo, M.: Ant colonies for the quadratic assignment problem. Journal of the Operational Research Society 50(2), 167–176 (1999) 15. Dorigo, M., Maniezzo, V., Colorni, A.: Positive feedback as a search strategy. Technical Report 91-016, Dipartimento di Elettronica e Informazione, Politecnico di Milano, Italy (Springer, GERAD 25th Anniversary) (1991) 16. Dorigo, M., Maniezzo, V., Colorni, A.: The ant system: Optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics 26, 29–41 (1996) 17. Dorigo, M.: Ottimizzazione, apprendimento automatico, ed algoritmi basati su metafora naturale (Optimisation, learning and natural algorithms). Doctoral dissertation, Dipartimento di Elettronica e Informazione, Politecnico di Milano, Italy (1991) 18. Eiben, A.E., Smith, J.E.: Introduction to Evolutionary Computing. Springer, Berlin (2003)
Trends and Issues in Description Logics Frameworks for Image Interpretation Stamatia Dasiopoulou and Ioannis Kompatsiaris Informatics and Telematics Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
Abstract. Description Logics have recently attracted significant interest as the underlying formalism for conceptual modelling in the context of high-level image interpretation. Differences in the formulation of image interpretation semantics have resulted in varying configurations with respect to the adopted modelling paradigm, the utilised form of reasoning, and the way imprecision is managed. In this paper, we examine the relevant literature, outlining the corresponding strengths and weaknesses, and argue that although coming up with a complete solution is hard to envisage any time soon, there are a number of key considerations that may serve as guidelines towards this direction.
1   Introduction
Research in cognitive computer vision is intertwined with the use of symbolic knowledge and reasoning in the pursuit of endowing computational systems with the notion of educated, in terms of background knowledge driven, perception. In the last couple of years, and under the influence of the modelling paradigm embodied in the W3C recommended Semantic Web languages, Description Logics [1] have attracted significant interest as the underlying formalism for conceptual modelling in the context of image interpretation. The low-level information made available by means of typical image analysis is encoded in the form of ABox A assertions, while an appropriately constructed TBox T admits the "reasonable" interpretations that are relevant to the domain of discourse (Fig. 1). At this point, one would assume that given the standard inference services provided by DLs, the proposed interpretation configurations would differ only with respect to knowledge engineering considerations, such as the type and granularity of the knowledge employed, knowledge acquisition methodologies, and so forth. As a matter of fact though, certain traits that are intrinsic to image interpretation have induced, and in a sense reinforced, the formulation of different interpretation configurations. Ambiguity manifested in the form of incomplete and conflicting assertions (r4 in Fig. 1 for example is identified both as building and vegetation by the classification algorithm), constitutes a prominent factor that has induced differing interpretation configurations. Imprecision manifested in the form of degrees of uncertainty (or truth) is another such factor. Interrelated are the different premises made with respect to the semantics of computational perception per se, which determine how the available knowledge
Fig. 1. Abstract architecture of Description Logics image interpretation framework
(axioms and assertions) is to be construed. In effect, the individual viewpoints regarding the transition from low-level representations to high-level semantic interpretations have a direct impact on which "parts" of the provided semantics are actually deployed and in which ways. As indicated by the undertaken study, it is not uncommon to adopt the closed domain model of the Datalog paradigm rather than the Classical logic paradigm [2]. In this paper, we consider the use of Description Logics in image interpretation and discuss the proposed frameworks with respect to three key conceptual dimensions, namely
– the type of modelling assumption followed, i.e. whether unstated facts are left open
– the form of reasoning followed, and
– the management of imprecision
Section 2 discusses how incompleteness in image interpretation relates to the open world semantics. Section 3 discusses the use of abductive reasoning to better model ambiguity, and Section 4 examines the use of probabilistic and fuzzy extensions for the purpose of handling imprecision. Examining the proposed image interpretation frameworks with respect to these three dimensions, useful considerations emerge with respect to possible directions and guidelines for further research. Section 5 concludes the paper, summarising the main observations.
2   Open vs. Closed World Semantics
Description Logics fall under the Classical logics paradigm. The domain is abstractly represented in terms of sets of objects (concepts) and relationships (roles) between them. Appropriate statements (axioms) capture the conditions that need to be met by the "reasonable" states (interpretations) of the domain. There can be many such interpretations, in accordance with all possible ways in
which objects can be related through the defined relationships in a manner consistent with the defined axioms (open world assumption). Hence, all reasonable interpretations are admitted, but which one is the actual situation is left open. For example, let us assume that Seaside(image), contains(image, region1), contains(image, region2). In the presence of an axiom stating that seaside images contain at least one region depicting sea, possible interpretations (in the absence of any other information) include that sea is depicted by region1, by region2, by both, or by some other region, for which we happen to have no available information1. However, in a significant share of the proposed DLs based approaches, image interpretation translates to augmenting the explicitly asserted data, made available via image analysis, with additional ones that are derived through the application of inference over only the known objects and relationships. Indicative approaches include amongst others the works presented in [3,4,5,6,7]. The underlying assumption is that the analysis-provided descriptions correspond to all relevant information. Such treatment is closer to the closed world assumption adopted in Datalog-related logics, where the only objects and relationships assumed to exist are the explicitly asserted ones, rather than to the semantics underlying DLs. Given the modelling differences between the two paradigms [2], such a choice invokes considerations regarding the modelling of image interpretation. In the ideal case, image interpretation would be modelled as mapping a set of meaningfully partitioned regions with well-defined perceptual characteristics to conceptual descriptions, whose further aggregation entails semantic descriptions of higher abstraction. However, the real case is far from the afore-described closed and highly structured setting, which rests on hardly pragmatic assumptions with respect to both image processing and analysis, as well as the rendering of semantics in terms of perceptual manifestations. Automatically segmented image regions tend to enclose more than one object (or constituent parts of multiple objects), as in the case of region r4 in Fig. 1 that encloses part of the building and the surrounding vegetation. As a result, even if accurate and robust perceptual models (in terms of numerical feature values or qualitative representations) are available for the relevant objects and relationships, it is not possible to derive satisfactory interpretations based only on the explicitly asserted regions and their respective descriptions. Similar considerations apply when image analysis provides conceptual descriptions in the form of, for example, object and scene classifications. Unless all regions correspond to distinct objects (or parts of objects) and all classifications are accurate, the explicitly asserted data comprise an incomplete, partial view of the actual image content. The aforementioned example, though representing only a small fraction of the intricacies and challenges that comprise the so-called semantic gap [8,9], outlines
Of course, it could be the case that we explicitly admitted interpretations to include exactly two regions, thus ruling out the last possibility. The point made is that the absence of information does not necessarily translate to negative information.
acutely how the kind of incompleteness that pertains to the open world assumption permeates image interpretation. Incorporating expectation feedback strategies, where the possible interpretations derived via reasoning serve as cues for subsequent analysis cycles by selectively activating and tuning image processing algorithms, as in [10,11], is part of the way towards coping with the open semantics of image interpretation. Modelling the background knowledge appropriately so as to capture the multiplicity of interpretations due to such incompleteness is another part, essential to the formalisation of image interpretation in a well-defined manner2. A final remark concerns the use of rules. Adopting a hybrid representation scheme, ontologies are used to represent domain- and media-specific notions, while rules embody the closed domain semantics and provide the mappings between analysis descriptions and semantic ones [3,12,13]. However, the additional expressivity that rules provide, for example the representation of triangular relations, as in the case of properties propagating across part-whole relations, seems to have been poorly explored. In the majority of the cases, ontology axioms could have been used in the place of rules as well, suggesting that higher familiarity with the rule paradigm may have played a role. Given the aforementioned issues regarding incompleteness, and the continuous efforts towards the formalisation of hybrid representations, a shift towards the effective utilisation of the two formalisms in image interpretation frameworks is to be expected.
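To make the open-world reading of the seaside example above concrete, the following toy sketch (written in Python purely for illustration; it is not a DL reasoner, and the region names follow the example) enumerates the interpretations that remain admissible when only the two containment assertions and the axiom "a seaside image contains at least one sea region" are known:

from itertools import product

# Known assertions: Seaside(image), contains(image, region1), contains(image, region2).
# Under the open world assumption, further unnamed regions may exist; 'extra'
# stands for one such region about which nothing has been asserted.
candidate_regions = ["region1", "region2", "extra"]

admissible = []
for assignment in product([False, True], repeat=len(candidate_regions)):
    sea_regions = {r for r, is_sea in zip(candidate_regions, assignment) if is_sea}
    # Axiom: a Seaside image contains at least one region depicting Sea.
    if sea_regions:
        admissible.append(sea_regions)

for interpretation in admissible:
    print("Sea is depicted by:", sorted(interpretation))
# Seven interpretations survive; which one reflects the actual image is left open.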
3
Interpretation as Logical Inference
Standard inference in Description Logics amounts to deductive reasoning. If Σ is a logical theory (e.g. background knowledge regarding shape and colour attributes of architectural artifacts) and α a set of facts (e.g. analysis-extracted descriptions from images of such artifacts), deduction verifies whether ϕ (e.g. a building facade) is logically entailed, that is whether Σ, α |= ϕ. The majority of the proposed DL-based approaches configure image interpretation along this line of reasoning [14,15,7,16]. Whether adopting closed or open world semantics, the higher level interpretations are derived via logical entailment over the available set of analysis-provided assertions. The extraction and understanding of image semantics, though, encompasses a high degree of ambiguity, which, as already mentioned in the introductory section, may be manifested in the form of incomplete as well as contradictory information. Ambiguity may as well refer to subjective views and interpretations attributed by different persons, but considerations of this type are beyond the scope of the current discussion. Unlike the kind of incompleteness, discussed in the previous section, that comes from the open world assumption, in this context incomplete information amounts to missing information. Let us consider as an example the opposite case
For example, assuming a region at the top of an image that is asserted as sea, reasoning should be able to derive as a possible interpretation one that construes the upper part of this region as sky and the lower as sea.
of that described previously, and assume an image that has been partitioned into regions that are meaningful from the perspective of the conveyed semantics. Although there have been significant advances that support capturing, in a generic fashion, associations between automatically extracted perceptual features and conceptual descriptions, the accuracy of classification remains highly variable and tends to deteriorate rather severely as the number of conceptual notions increases. The situation is further aggravated by serious discrepancies often observed between the intended perceptual-to-symbolic mappings and the actually acquired ones [17]. As a result, image analysis fails to provide the complete set of expected descriptions (e.g. sky, building), producing instead either false negatives, where the existence of a concept is ignored, or false positives, where it is mistakenly attributed as a different concept. Keeping in mind that meaningful image segmentations are hard to obtain automatically, the aforementioned clearly indicate that, when adopting a purely deductive form of reasoning, higher level interpretations cannot be derived in a satisfactory and robust manner. Towards this end, modelling image interpretation as inference to the best explanation using abductive reasoning has been proposed [18,19,20]. Given Σ and ϕ, abduction consists in finding “explanations” α so that the entailment Σ, α |= ϕ is true. The duality between abduction and deduction (Σ, α |= ϕ iff Σ, ¬ϕ |= ¬α), though, is rather misleading regarding the formal apparatus available for abduction [21]. Abduction is not mere deduction in reverse [22] and many questions still remain open with respect to the formalisation of minimality criteria and preference metrics driving the generation of explanations3. This is reflected in the considerably small number of approaches investigating abductive interpretation frameworks, as well as in the adoption of rather ad hoc approaches to the implementation of abductive reasoning. In [20] for example, Description Logics are used in combination with rules, and abduction is implemented in the form of backward-chaining over the rules. The typical definition of the abduction problem is modified into Σ, ϕ1, α |= ϕ2, by splitting ϕ into bona fide assertions (ϕ1) that are considered true by default, and bona fiat ones (ϕ2) that need to be explained. This division is arbitrary and, in the proposed framework, ϕ2 corresponds to the set of spatial relationship assertions. Preference over the possible explanations is determined in terms of the number of (new) individuals that need to be hypothesized (as part of α) and the number of ϕ2 assertions that get explained. The abductive framework for robot perception presented in [10] is based on a more generic treatment of abduction, but no details are given on the actual computation methodology. Yet, the work presented in [10] and, later, in [19] provides several insights regarding the use of abduction in image interpretation that are not present in [20]. Noise and abnormality terms are introduced as part of the background knowledge Σ, so as to formally account for conflicting
Abduction in the context of logic programming is in conflict with the open world semantics underlying the problem of image interpretation, and therefore is not considered here.
assertions as well as for assertions inconsistent with respect to Σ. The latter is crucial, as inconsistencies, due to the implications of the semantic gap, are hardly uncommon in image analysis and, by consequence, in tasks related to interpretation. Strictly speaking, the presence of inconsistencies already introduces a deviation from the formal abduction problem formulation, as Σ |= ¬ϕ becomes true. Though the inhibitory implications of inconsistencies clearly affect deductive reasoning as well, it is quite interesting that even in the more straightforward cases of deductive reasoning, the majority of the proposed DL-based image interpretation frameworks overlook this issue. Instead, the tacit assumption of semantic coherency is made for the analysis-provided assertions. Exceptions include [16], where, following a reverse tableau-like procedure, inconsistencies are tracked and resolved, the aforementioned work of [10], and partially [20], as not all assertions belonging to ϕ2 need necessarily be considered by an interpretation.
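As an illustration of the abductive reading Σ, α |= ϕ discussed above, the following minimal Python sketch (the rule, the concept names and the preference criterion are simplified stand-ins, not taken from the cited systems) hypothesises the body atoms of a matching rule that are not already asserted, preferring the explanation that introduces the fewest new assertions:

# Background knowledge Sigma: one toy rule of the form head <- body atoms.
rules = [
    ("Facade", ["Building", "HasWindows", "HasDoor"]),  # Facade(x) <- Building(x), HasWindows(x), HasDoor(x)
]
# Asserted facts (analysis-provided descriptions).
facts = {("Building", "obj1"), ("HasDoor", "obj1")}

def abduce(observation):
    """Return a smallest set of hypothesised atoms alpha such that Sigma, alpha |= observation."""
    concept, individual = observation
    candidates = []
    for head, body in rules:
        if head != concept:
            continue
        missing = [(b, individual) for b in body if (b, individual) not in facts]
        candidates.append(missing)
    # Prefer the explanation that hypothesises the fewest new assertions.
    return min(candidates, key=len) if candidates else None

print(abduce(("Facade", "obj1")))   # -> [('HasWindows', 'obj1')]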
4
Representing and Handling Imprecision
Besides providing the means to deal with incomplete, missing and contradictory information, as discussed in the previous two sections, image interpretation needs to allow for a certain degree of vagueness. Imprecision is present in the extraction of features, the identification of shapes, the matching of textures, colours, etc., and pervades the translation from perceptual to symbolic representations addressed by image analysis. Already in [23], where a first, preliminary proposal of a (crisp) Description Logic language is presented for the recognition of two-dimensional objects, the need for approximate inference is highlighted. The wide adoption of statistical models, such as Support Vector Machines (SVMs) [24], Hidden Markov Models [25] and Bayesian Networks (BNs) [26], currently forming the state of the art in image analysis and retrieval frameworks, further urges the investigation of appropriate means to model and handle imprecision in the proposed formal configurations of image interpretation. Yet, a significant share of the proposed DL-based image interpretation frameworks either presume crisp assertions or adopt ad hoc approaches to deal with the degrees of plausibility/vagueness that come with the analysis-provided assertions. In [14,27,20,11] for example, the assertions over which inference is invoked are by definition crisp. A pseudo-fuzzy extension is adopted in [4] to allow the definition of conceptual objects in terms of the minimum and maximum accepted values of perceptual features, while in [3,28] threshold values can be set for individual attributes or features. Clearly though, as membership in perceptual categories is not an all-or-nothing affair, approaches along the aforementioned lines do not capture the pragmatics of image interpretation and miss significant pieces of information. In contrast, approaches that deal with imprecision in a more formal and structured fashion embrace either probability theory [29] or fuzzy set theory [30]. The choice between the two viewpoints reflects the espoused nature of imprecision, and is in accordance with the semantics embodied in the considered image
analysis techniques. Bayesian- and Markov-based image analysis frameworks effect a probabilistic interpretation, where the degrees accompanying the generated assertions are interpreted as degrees of uncertainty. SVMs and analogous approaches deploying similarity-based metrics translate into degrees of truth. Concerning probabilistic extensions, related efforts include P-SHOQ(D) [31,32], which is among the most expressive probabilistic description logics that have been investigated, and Pronto [33], a non-monotonic probabilistic reasoner built on top of Pellet [34], which however supports only terminological probabilistic knowledge. The lack of implementations corresponding to these theoretical advances is reflected in the proposed DL-based image interpretation frameworks, which instead explore notions from Bayesian network theory [35,36]. Efforts related to this direction include, among others, PR-OWL [37], which combines first order logic with Bayesian networks, and BayesOWL [38], which provides a set of rules for the translation of an ontology into an “equivalent” Bayesian network. Again however, although following similar notions, the proposed image interpretation frameworks do not directly use these results. Image interpretation frameworks that address vagueness, on the contrary, exhibit a higher uptake of the corresponding fuzzy extensions to DLs [39,40,41,42], an attitude that may be attributed (at least partially) to the availability of respective implementations such as fuzzyDL [43], FiRE [44] and Delorean [45]. In [46], fuzzy DL reasoning is proposed to support the refinement of an initial set of over-segmented image regions and their classifications, in terms of region merging and the update of classification degrees based on those of their neighboring regions. In [16], a fuzzy DL-based reasoning framework is proposed to integrate possibly complementary, overlapping or conflicting classifications at object and scene level into a semantically coherent final interpretation. A fuzzy spatial relation ontology for image interpretation is presented in [7], yet in the current implementation only crisp reasoning has been used. The aforementioned outline, on the one hand, the increasing awareness regarding the effective handling of the imprecision involved in image interpretation, and, on the other hand, the availability of conducive and active research activities concerning the management of uncertainty and vagueness in Description Logics (for a comprehensive overview the reader is referred to [47]). The different nature of the semantics pertaining to the two types of uncertainty [48], and their complementary implications in the context of image interpretation, roughly sketched in the experiments conducted in [10], render the investigation of formal frameworks that couple fuzzy with probabilistic knowledge of particular interest.
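As a small illustration of how fuzzy degrees attached to analysis assertions can be aggregated, the following Python sketch combines overlapping, partially conflicting region classifications with the standard Gödel operators (min for conjunction, max for existential quantification). It is a generic illustration, not the behaviour of any of the cited reasoners, and all degrees and concept names are invented for the example:

# Fuzzy membership degrees produced by region classification; region2 carries
# overlapping, partially conflicting classifications.
object_degrees = {
    ("region1", "Sky"): 0.8,
    ("region2", "Sea"): 0.6,
    ("region2", "Sky"): 0.3,
}

def degree(region, concept):
    return object_degrees.get((region, concept), 0.0)

regions = ["region1", "region2"]
# Scene-level rule: Seaside holds to the degree of (some region is Sky) AND (some region is Sea).
some_sky = max(degree(r, "Sky") for r in regions)   # existential quantification -> max
some_sea = max(degree(r, "Sea") for r in regions)
seaside_degree = min(some_sky, some_sea)            # conjunction -> min (Goedel t-norm)

print(f"Seaside(image) holds to degree {seaside_degree:.2f}")   # 0.60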
5
Conclusions
Description Logics have recently gained significant popularity as the underlying formalism for conceptual modelling in formal image interpretation frameworks; a fact that is not surprising, given the high expressivity and well-defined inference services they come with. Achieving robust and accurate interpretations, though, still confronts serious challenges and open issues that need to be addressed in order to bring forth the full potential
of incorporating knowledge in image interpretation. The open world modelling of the classical logic paradigm on which Description Logics are based matches closely the incompleteness encountered in image interpretation, urging the investigation of configurations that effectively exploit it, instead of considering only explicitly asserted facts. Conducive towards this end is also the extension of purely deductive reasoning schemes with abductive services, which, as outlined, appears to be a promising direction for further research. Yet, in order to acquire truly pragmatic interpretation frameworks that model reliably the semantics of both the available facts and the way in which they should be construed, it is mandatory to effectively introduce and handle imprecision in the configured image interpretation models. A promising pursuit for the future is to investigate the coupling of fuzzy and probabilistic reasoning, while preserving clean semantics.
Acknowledgements This work was partially supported by the European Commission under contracts FP7-215453 WeKnowIt and FP6-026978 X-Media.
References
1. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.: The description logic handbook: Theory, implementation, and applications. In: Description Logic Handbook. Cambridge University Press, Cambridge (2003)
2. Patel-Schneider, P., Horrocks, I.: A comparison of two modelling paradigms in the semantic web. J. Web Sem. 5(4), 240–250 (2007)
3. Little, S., Hunter, J.: Rules-by-example - a novel approach to semantic indexing and querying of images. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 534–548. Springer, Heidelberg (2004)
4. Schober, J.P., Hermes, T., Herzog, O.: Content-based image retrieval by ontology-based object recognition. In: KI 2004 Workshop on Applications of Description Logics (ADL), Ulm, Germany, September 24, pp. 1–10 (2004)
5. Neumann, B., Möller, R.: On scene interpretation with description logics. FBI-B-257/04 (2004)
6. Simou, N., Athanasiadis, T., Tzouvaras, V., Kollias, S.: Multimedia reasoning with f-shin. In: 2nd International Workshop on Semantic Media Adaptation and Personalization, London, UK, pp. 413–420 (2007)
7. Hudelot, C., Atif, J., Bloch, I.: Fuzzy spatial relation ontology for image interpretation. Fuzzy Sets and Systems 159(15), 1929–1951 (2008)
8. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell. 22(12), 1349–1380 (2000)
9. Hanjalic, A., Lienhart, R., Ma, W., Smith, J.: The holy grail of multimedia information retrieval: So close or yet so far away. IEEE Proceedings, Special Issue on Multimedia Information Retrieval 96(4), 541–547 (2008)
10. Shanahan, M.: A logical account of perception incorporating feedback and expectation. In: International Conference on Principles and Knowledge Representation and Reasoning (KR 2002), Toulouse, France, April 22-25, pp. 3–13 (2002)
11. Hotz, L., Neumann, B., Terzic, K.: High-level expectations for low-level image processing. In: Dengel, A.R., Berns, K., Breuel, T.M., Bomarius, F., Roth-Berghofer, T.R. (eds.) KI 2008. LNCS (LNAI), vol. 5243, pp. 87–94. Springer, Heidelberg (2008)
12. Dasiopoulou, S., Mezaris, V., Kompatsiaris, I., Papastathis, V., Strintzis, M.: Knowledge-assisted semantic video object detection. IEEE Trans. Circuits Syst. Video Techn. 15(10), 1210–1224 (2005)
13. Bagdanov, A., Bertini, M., Del Bimbo, A., Serra, G., Torniai, C.: Semantic annotation and retrieval of video events using multimedia ontologies. In: IEEE International Conference on Semantic Computing (ICSC), Irvine, CA, USA, pp. 713–720 (2007)
14. Möller, R., Neumann, B., Wessel, M.: Towards computer vision with description logics: Some recent progress. In: Workshop on Integration of Speech and Image Understanding, Corfu, Greece, pp. 101–115 (1999)
15. Hunter, J., Drennan, J., Little, S.: Realizing the hydrogen economy through semantic web technologies. IEEE Intelligent Systems Journal - Special Issue on eScience 19, 40–47 (2004)
16. Dasiopoulou, S., Kompatsiaris, I., Strintzis, M.: Applying fuzzy dls in the extraction of image semantics. J. Data Semantics 14, 105–132 (2009)
17. Snoek, C., Huurnink, B., Hollink, L., Rijke, M., Schreiber, G., Worring, M.: Adding semantics to detectors for video retrieval. IEEE Transactions on Multimedia 9(5), 975–986 (2007)
18. Shanahan, M.: Robotics and the common sense informatic situation. In: European Conference on Artificial Intelligence (ECAI), Budapest, Hungary, August 11-16, pp. 684–688 (1996)
19. Shanahan, M.: Perception as abduction: Turning sensor data into meaningful representation. Cognitive Science 29(1), 103–134 (2005)
20. Espinosa, S., Kaya, A., Melzer, S., Möller, R., Wessel, M.: Multimedia interpretation as abduction. In: International Workshop on Description Logics (DL), Brixen-Bressanone, Italy, June 8-10, pp. 323–331 (2007)
21. Elsenbroich, C., Kutz, O., Sattler, U.: A case for abductive reasoning over ontologies. In: Workshop on OWL: Experiences and Directions (OWLED), Athens, Georgia, USA, November 10-11 (2006)
22. Mayer, M., Pirri, F.: First order abduction via tableau and sequent calculi. Logic Journal of the IGPL 1(1), 99–117 (1993)
23. Sciascio, E.D., Donini, F.: Description logics for image recognition: a preliminary proposal. In: International Workshop on Description Logics (DL), Linköping, Sweden, July 30 - August 1 (1999)
24. Burges, C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2(2), 121–167 (1998)
25. Rabiner, L., Juang, B.: An introduction to hidden markov models. ASSP Magazine, IEEE [see also IEEE Signal Processing Magazine] 3(1), 4–16 (1986)
26. Heckerman, D.: A tutorial on learning with bayesian networks. Learning in Graphical Models, 301–354 (1998)
27. Neumann, B., Weiss, T.: Navigating through logic-based scene models for high-level scene interpretations. In: ICVS, pp. 212–222 (2003)
28. Hollink, L., Little, S., Hunter, J.: Evaluating the application of semantic inferencing rules to image annotation. In: International Conference on Knowledge Capture (KCAP), Banff, Alberta, Canada, October 2-5, pp. 91–98 (2005)
29. Nilsson, N.: Probabilistic logic. Artif. Intell. 28(1), 71–87 (1986)
30. Klir, G., Yuan, B.: Fuzzy sets and fuzzy logic: Theory and applications. Prentice-Hall, Englewood Cliffs (1995)
31. Giugno, R., Lukasiewicz, T.: P-shoq(d): A probabilistic extension of shoq(d) for probabilistic ontologies in the semantic web. In: European Conference on Logics in Artificial Intelligence (JELIA), Cosenza, Italy, September 23-26, pp. 86–97 (2002)
32. Lukasiewicz, T.: Expressive probabilistic description logics. Artif. Intell. 172(6-7), 852–883 (2008)
33. Klinov, P.: Pronto: A non-monotonic probabilistic description logic reasoner. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 822–826. Springer, Heidelberg (2008)
34. Sirin, E., Parsia, B., Grau, B., Kalyanpur, A., Katz, Y.: Pellet: A practical owl-dl reasoner. J. Web Sem. 5(2), 51–53 (2007)
35. Town, C., Sinclair, D.: A self-referential perceptual inference framework for video interpretation. In: International Conference on Computer Vision Systems (ICVS), Graz, Austria, pp. 54–67 (2003)
36. Neumann, B., Möller, R.: On scene interpretation with description logics. Image Vision Comput. 26(1), 82–101 (2008)
37. da Costa, P., Laskey, K., Laskey, K.: Pr-owl: A bayesian ontology language for the semantic web. In: URSW. LNCS, pp. 88–107 (2008)
38. Ding, Z.: BayesOWL: A Probabilistic Framework for Semantic Web. PhD thesis, University of Maryland, Baltimore County (December 2005)
39. Yen, J.: Generalizing term subsumption languages to fuzzy logic. In: 12th International Joint Conference on Artificial Intelligence (IJCAI), Sydney, Australia, August 24-30, pp. 472–477 (1991)
40. Straccia, U.: Reasoning within fuzzy description logics. J. Artif. Intell. Res. (JAIR) 14, 137–166 (2001)
41. Straccia, U.: Transforming fuzzy description logics into classical description logics. In: European Conference on Logics in Artificial Intelligence (JELIA), Lisbon, Portugal, September 27-30, pp. 385–399 (2004)
42. Stoilos, G., Stamou, G., Tzouvaras, V., Pan, J., Horrocks, I.: The fuzzy description logic f-SHIN. In: International Workshop on Uncertainty Reasoning For the Semantic Web (URSW), Galway, Ireland, November 7, pp. 67–76 (2005)
43. Bobillo, F., Straccia, U.: fuzzyDL: An expressive fuzzy description logic reasoner. In: International Conference on Fuzzy Systems (FUZZ), June 1-6, pp. 923–930. IEEE Computer Society, Hong Kong (2008)
44. Simou, N., Kollias, S.: FiRE: A fuzzy reasoning engine for imprecise knowledge. In: K-Space PhD Students Workshop, Berlin, Germany, September 14 (2007)
45. Bobillo, F., Delgado, M., Gómez-Romero, J.: Delorean: A reasoner for fuzzy owl 1.1. In: International Workshop on Uncertainty Reasoning for the Semantic Web (URSW), Karlsruhe, Germany, October 26 (2008)
46. Simou, N., Athanasiadis, T., Stoilos, G., Kollias, S.: Image indexing and retrieval using expressive fuzzy description logics. Signal, Image and Video Processing 2(4), 321–335 (2008)
47. Straccia, U.: Managing uncertainty and vagueness in description logics, logic programs and description logic programs. In: Tutorial Lectures, Reasoning Web, 4th International Summer School, Venice, Italy, pp. 54–103 (2008)
48. Dubois, D., Prade, H.: Possibility theory, probability theory and multiple-valued logics: A clarification. Annals of Mathematics and Artificial Intelligence 32(1-4), 35–66 (2001)
Unsupervised Recognition of ADLs Todor Dimitrov1, Josef Pauli1, and Edwin Naroska2 1 Universität Duisburg-Essen [email protected], [email protected] 2 Fachhochschule Ingolstadt [email protected]
Abstract. In this paper we present an approach to the unsupervised recognition of activities of daily living (ADLs) in the context of smart environments. The developed system utilizes background domain knowledge about the user activities and the environment in combination with probabilistic reasoning methods in order to build best possible explanation of the observed stream of sensor events. The main advantage over traditional methods, e.g. dynamic Bayesian models, lies in the ability to deploy the solution in different environments without needing to undergo a training phase. To demonstrate this, tests with recorded data sets from two ambient intelligence labs have been conducted. The results show that even using basic semantic modeling of how the user behaves and how his/her behavior is reflected in the environment, it is possible to draw conclusions about the certainty and the frequencies with which certain activities are performed.
1
Introduction
Being able to monitor the activities of people is the key enabling factor for the development of assistive technologies for elderly people and dementia patients. Over the past several years, the research community has shown that in certain scenarios, using ubiquitously distributed sensors, it is possible for a software agent to discern what the inhabitant is currently doing. Most of the approaches rely on the fact that the system undergoes a training phase, in which the user is prompted to log his/her current activities. This poses two major problems when trying to reuse the trained models in real-world scenarios. First, the approaches do not account for the specifics of the employed hardware. As a simple example, one can imagine that a presence detector would generate a different stream of sensor events as compared to a motion detector. Second, the models do not represent a generalization of human behavior - they are merely a probabilistic mapping between the observed system state and the way the current user performs the activities. In this paper, we present a different approach to the detection of ADLs. Instead of relying on user input during a training phase, our system uses a self-updating probabilistic knowledge base. This means that, even immediately after deployment, it is capable of making activity predictions. As sensor observations
are processed, the predictions are refined without needing an intervention from the user. To go even further, the inhabitant could be optionally queried by the system in order to confirm or refute a given hypothesis. This guarantees a nice user experience by underlining the much desired unobtrusive nature of ambient intelligence solutions. To deal with the diversity of sensors and environments, the aforementioned knowledge base is built from the ground up on top of semantic models. Thus, as the results confirm, it is possible to deploy the system in different home infrastructures without needing to undergo a supervised training phase. In the following, an overview of the related work is given. Subsequently, the used approach is detailed and the obtained results using data from two different labs are presented. The paper concludes with discussion and outlook on possible system improvements.
2
State of the Art
There are basically two distinct methodologies when it comes to detecting in-home activities. The first one uses sensors that are deployed in the environment, e.g. motion detectors, light switches, household appliances, etc., and the second one uses tagged objects that are accessed by the user when he/she performs a certain activity, e.g. tooth brush, kitchen utensils, etc. For the latter, an additional reader (e.g. RFID glove) is usually needed. In both cases, there is an ongoing effort to utilize unsupervised learning techniques. The authors of [1] present an automated method for the construction of an activity model (ontology), which is later used to determine the probability with which certain objects are likely to be used. Similarly, [4] describes a method of constructing generic hidden Markov models (HMM) from mined web data. These generic models are then used as the basis for learning a customized HMM for a given person from the object traces created by that person. [3] goes one step further by using a “large scale common-sense” probabilistic model to reason not only about the activity that the user is currently performing, but also about the overall state of the world (e.g. user is hungry). In all three cases, the results show that the unsupervised recognition of ADLs from given object usage traces is quite feasible. The problem is that the developed methods are not easily applicable to the more general case where home-automation sensors are used. The reason is that there is no way to know in advance what types of sensors will be used and how these sensors operate, e.g. the coverage area of a motion detector. On the other hand, the common approach to detecting activities using sensors in the environment is to resort to the supervised learning of probabilistic models. [2] discusses the performance of a Naïve Bayes classifier when trained from observation logs and corresponding activity labels. The authors of [5] and [6] compare HMM and CRF (conditional random fields), while the work presented in [8] demonstrates the use of a particle filter for the simultaneous tracking and activity recognition of multiple occupants. All these methods deliver a probabilistic model, which is only applicable to the environment and the person
for which the training data has been recorded. An approach towards the application of learnt models in other contexts is discussed in [7]. The authors make use of transfer learning to train an HMM from the sensor logs and activity data from a sample environment 1 together with only the sensor logs of a similar environment 2. To do so, a mapping between the corresponding sensors in both environments has to be established. The main shortcoming of the approach is that both environments should use the same types of sensors.
3
The Proposed Semantic-Probabilistic Approach
As previously mentioned, our approach is based on a self-updating probabilistic knowledge base. A detailed overview of the overall architecture of the system as well as the mechanisms for constructing and training the knowledge base are beyond the scope of this paper. The interested reader is referred to [10] and [11]. However, in the following a brief description is provided. The semantic components that are specific to the activity detection scenarios are then discussed. The section concludes with a discussion of how the information in the knowledge base can be utilized towards an unsupervised activity detection application. The basic idea behind the generic probabilistic reasoning framework is to mine for inter-component dependencies within the household. This is done with the help of semantic models of the environment, the deployed devices and the user activities. From the semantic models, with the help of rules (e.g. SWRL rules), the so-called information facets are extracted. An information facet represents a semantic relation between one or several components (e.g. proximity, causal influence, etc.), which is backed by a probabilistic model. For the latter, we use Bayesian networks. To be able to train the model, each information facet also provides a so-called learning context. It is used to filter the world state such that only those sensor events which are consistent with the semantic meaning of the relation are processed during training. As an example, consider several presence detectors deployed in the kitchen. One possible information facet could be “inSensingArea”. It is used to learn the probability that, for example, if the user interacts with a given kitchen appliance, he or she is in the sensing area of each of the presence detectors. For training the probabilistic model, the learning context would consider only the events and/or states of the presence detectors and the kitchen appliance. Which events and/or states are taken into consideration, on the other hand, is dependent on the used semantic models of the components. In the current implementation, two approaches are taken in the extraction of possible inter-component dependencies. First, a common-sense rule base is applied to the current system configuration. To give an example, consider again the scenario from above. The presence detector measures the “user presence” property in the kitchen, and from the semantic model of the kitchen appliance we know that it can only be operated when the user is in close proximity. Therefore, the system infers that the presence detectors and the kitchen appliance are somehow related. Another example is the dependency between the lights in a given room. Since all lights influence the luminosity in the room, they are
also somehow related. The second approach is the usage of an ADL ontology. The everyday activities performed by the user represent recurring tasks, which leave a discernible “fingerprint” in the sensor logs. Thus, for example, the process of cooking reveals possible dependencies between the reed switches in the cupboards (e.g. temporary correlations) and the process of going to bed defines a possible relation between the motion detector in the bedroom and the light switch. It is important to notice that the types and number of dependencies are entirely dependent on the environment in which the system is running, e.g. the layout of the house and the kinds of sensors and actuators used. One can easily imagine that having a probabilistic knowledge base with various kinds of semantic relations between the components is useful in many application domains. Thus, a system monitoring tool can keep track of unfulfilled dependencies (e.g. in a causal relation with high probability, one sensor fires and the other does not) and over time build a hypothesis of which devices may be malfunctioning. Similarly, an activity recognition application can use the fulfilled (detected) dependencies as features that are fed into a simple classifier. This is the main idea behind our unsupervised recognition approach. In the following, the structure of the used ADL ontology is presented.
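The kitchen example above can be sketched in Python as follows; the class and device names are hypothetical, and the backing probabilistic model is reduced here to a simple conditional frequency estimate rather than the full Bayesian network used by the framework:

class InSensingAreaFacet:
    """Information facet relating a kitchen appliance to a presence detector."""
    def __init__(self, appliance_id, detector_id):
        self.devices = {appliance_id, detector_id}
        self.appliance_id = appliance_id
        self.detector_id = detector_id
        self.joint = 0   # appliance used while the detector reports presence
        self.total = 0   # appliance used at all

    def learning_context(self, event_window):
        """Keep only the events of the two related devices (the learning context)."""
        return [e for e in event_window if e["device"] in self.devices]

    def train(self, event_window):
        events = self.learning_context(event_window)
        appliance_used = any(e["device"] == self.appliance_id for e in events)
        presence = any(e["device"] == self.detector_id and e["value"] == "present"
                       for e in events)
        if appliance_used:
            self.total += 1
            self.joint += int(presence)

    def probability(self):
        """P(user in sensing area | appliance used), estimated from counts."""
        return self.joint / self.total if self.total else 0.5

facet = InSensingAreaFacet("stove", "kitchen_presence")
facet.train([{"device": "stove", "value": "on"},
             {"device": "kitchen_presence", "value": "present"}])
print(facet.probability())   # 1.0 after this single training window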
3.1 Semantic Modeling of ADLs
To be able to represent the user activities in an environment-independent fashion, we have taken a rather abstract view on how the user actions are reflected in his/her surroundings. For this purpose, the user’s intents during the execution of a given activity have been modeled. This is achieved by specifying a set of mandatory and optional requirements that have to be fulfilled in order for the activity to be recognized as being performed. Figure 1 shows the basic concepts and relations in the ADL ontology. As one can see, apart from the requirements model, we make use of a detailed execution model. Each complex activity (e.g. cooking, bathing, etc.) has a single instance of the execution concept. It defines the logical steps taken by the user when performing the activity. The execution steps can further have execution instances of their own, such that they can be used to build a complex hierarchical structure, i.e. an execution tree. The leaves of this tree are represented by the activity steps. Each activity step takes place at a given absolute or relative location, e.g. in the vicinity of a refrigerator. Thus, using this model, we are able to establish the spatial and temporal frame in which inter-component dependencies are expected to happen. As an example, consider the “getting a drink” activity. It has an execution instance of type “ExecutionSet”, which in turn comprises two activity steps that can be performed in random order. The first step is to get the beverage and the second, non-compulsory step is to get a cup. Each activity step defines one or several requirements towards the home environment. Again, using the model, it is possible to build a complex hierarchical structure, at whose leaves the object and physical property requirements are located. The former specify an object or a group of objects which are accessed by the user during a given activity step, and the latter put some constraints on the
Fig. 1. ADL Ontology. 1 - Requirements Model; 2 - Execution Model.
physical parameters of the environment, e.g. loudness, luminosity, etc. Both the object and the physical property requirements are bound to a given location. For example, when reading a book, it is preferable to have high luminosity at the user’s current location. Or, to come back to the “getting a drink” example, the user would normally get a cold beverage out of any of the refrigerators in the household. Thus, using the requirements and execution models, we are able to tell in which order the separate activity steps are performed by the user, which objects are going to be accessed and how the environment is going to change. In addition, it is possible to determine the probable points of user interaction with the system, e.g. while reading, the user would most probably control the lights in the room from a switch in the vicinity of a chair or a couch. This allows us not only to determine which devices are used during an activity but also how they are dependent on one another. Still, to account for the specifics of the different types of devices that could be present in the different environments, we also need rather abstract semantic device models. It is only the combination of the ADL ontology and the device ontology that delivers the proper information facets for a given setup. In the following section, an overview of the used semantic models for the devices is presented.
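As a concrete illustration, the “getting a drink” activity discussed above could be encoded against these concepts roughly as follows. This is a Python sketch with hypothetical stand-ins for the ontology classes, not the actual OWL model used by the system:

from dataclasses import dataclass
from typing import List

@dataclass
class ObjectRequirement:
    object_types: List[str]      # any instance of these types can satisfy it
    location: str                # absolute or relative location

@dataclass
class ActivityStep:
    name: str
    requirements: List[ObjectRequirement]
    optional: bool = False

@dataclass
class ExecutionSet:              # steps may be performed in random order
    steps: List[ActivityStep]

@dataclass
class Activity:
    name: str
    execution: ExecutionSet

getting_a_drink = Activity(
    name="GettingADrink",
    execution=ExecutionSet(steps=[
        ActivityStep("GetBeverage",
                     [ObjectRequirement(["Refrigerator"], location="kitchen")]),
        ActivityStep("GetCup",
                     [ObjectRequirement(["Cupboard"], location="kitchen")],
                     optional=True),
    ]),
)
print(getting_a_drink.execution.steps[1].optional)   # True: getting a cup is non-compulsory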
3.2 Semantic Modeling of Devices
Similar to the activities, we had to develop a generic device ontology for specifying the usage scenarios for certain device types as well as for defining how these devices influence the properties of the environment. On the basic level, each device has one or several states, which in turn are composed of one or several device parameters. In each state, a device is either capable of influencing (i.e. an actuator)
or sensing (i.e. a sensor) a certain physical or object property. For example, when a lamp is in the “on” state, it influences the luminosity in the room in which it is located. Likewise, a motion detector senses the presence of the user and a door contact detects the “open” property of the door to which it is attached. The grade of influence or the value of a measured physical property is represented by one or several of the parameters of the corresponding device state. This rather simple device model together with the ADL ontology allows us to infer what types of dependencies between the different devices in the home environment “should” be observable. Moreover, the semantic abstraction guarantees that different types of sensors can be automatically utilized without needing to alter the description of the user activities. In the next step, we use the inferred dependencies as features in a basic Bayesian network classifier.
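A matching sketch of the device side (again with hypothetical Python stand-ins for the ontology concepts) could look as follows; each state declares which property it senses or influences, which is what allows dependencies such as “lamp influences luminosity” or “motion detector senses user presence” to be inferred:

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class DeviceState:
    name: str
    senses: Optional[str] = None       # property measured in this state (sensor role)
    influences: Optional[str] = None   # property influenced in this state (actuator role)
    parameters: Dict[str, float] = field(default_factory=dict)

@dataclass
class Device:
    device_id: str
    location: str
    states: Dict[str, DeviceState]

lamp = Device("lamp_living_room", "living_room", {
    "on":  DeviceState("on", influences="luminosity", parameters={"level": 1.0}),
    "off": DeviceState("off", influences="luminosity", parameters={"level": 0.0}),
})
motion = Device("motion_bedroom", "bedroom", {
    "active": DeviceState("active", senses="user_presence"),
})
print(lamp.states["on"].influences)   # luminosity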
3.3 Probabilistic Model for the Detection of ADLs
According to the used ADL and device ontologies as well as the current system setup, each activity accounts for a set of detectable device dependencies, which we map to binary features. Figure 2 shows the structure of the recognition Bayesian network for a single activity. Every node in the network is a binary node, with the activity node representing the probability that a given activity (e.g. cooking) is currently being performed. The probability tables of the activity and the feature-presence nodes are set to the uniform distribution. The probability table of the feature nodes is defined as follows:
\[
P(F_i \mid Act, F_i^{Pr}) =
\begin{cases}
w_i & \text{for } F_i = 1,\ Act = 1,\ F_i^{Pr} = 1 \\
1 - w_i & \text{for } F_i = 0,\ Act = 1,\ F_i^{Pr} = 1 \\
0.5 & \text{otherwise}
\end{cases}
\]
where w_i is the weight of the feature. In the current implementation, all features are equally weighted, such that the values of all w_i are set to 1. During the recognition phase, the system proceeds as follows: every time a new device event is received, all information facets that can be mapped to this event are queried in order to obtain the probability values of the corresponding device dependencies. Whenever a device dependency is “sensed” (i.e. its probability value is greater than 0), the value of the presence node for the corresponding feature
Fig. 2. Activity Bayesian Network
is set to 1. In addition, virtual evidence (see [12]) for the binary feature is created and its value is set to the probability value of the device dependency. As an example, consider a temporary correlation between the events fired by the sensors attached to two different cupboards. In case the event from the second sensor arrives 10 seconds after the event from the first sensor, we query the probabilistic knowledge base for P_after(S2 | S1, ΔT = 10s). By setting the value of the virtual evidence to the result of this query, the relevance of the feature is established. The reasoning behind this is that if the correlation has often been observed in the past, it is most probably the result of the user performing a recurring task. After obtaining the values for the feature-presence nodes as well as the virtual evidence for each feature, we enter this information into the Bayesian network and then query the probability of the activity node. In order to determine for how long a given feature is valid, we use a sliding window approach. This means that a feature remains present for the typical duration of the activity. If its value changes, we update the corresponding virtual evidence as well as the feature’s temporal position in the sliding window. Thus, it is guaranteed that the system retains a short-term memory of the detected features and builds a hypothesis incrementally.
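The inference step just described can be sketched compactly in plain Python by enumerating the two values of the activity node instead of using a Bayesian network library; the CPT follows the definition given above, the detected dependencies enter as virtual (soft) evidence, and all weights w_i are set to 1 as in the current implementation (the numeric values below are invented for the example):

def feature_cpt(f, act, present, w=1.0):
    """P(F=f | Act=act, F_Pr=present) as defined above."""
    if act == 1 and present == 1:
        return w if f == 1 else 1.0 - w
    return 0.5

def activity_posterior(detected, w=1.0):
    """detected: dependency probabilities queried from the knowledge base for the
    features currently in the sliding window; each acts as virtual evidence."""
    score = {}
    for act in (0, 1):
        p = 0.5                                   # uniform prior on the activity node
        for strength in detected:
            # Sum out the soft-evidenced feature node: lambda(F=1) = strength.
            p *= (feature_cpt(1, act, 1, w) * strength +
                  feature_cpt(0, act, 1, w) * (1.0 - strength))
        score[act] = p
    return score[1] / (score[0] + score[1])

# Two detected dependencies, e.g. P_after(S2 | S1, 10s) = 0.9 and another at 0.7:
print(round(activity_posterior([0.9, 0.7]), 3))   # ~0.716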
4
Evaluation Using Recorded Data
For evaluation purposes, we have tested the approach using recorded data from two experiments conducted at MIT [2] and the University of Amsterdam [6]. To be able to compare the results, we have selected a subset of the recorded activities that are common to both environments. For each of the selected activities, a semantic model compliant with the previously discussed ADL ontology has been created. The same models have then been applied to the setups of the two apartments in order to create the two probabilistic knowledge bases and the corresponding activity recognition Bayesian networks. To train the knowledge bases, we have created two training sets for each of the environments - one containing data from the first recorded day and one with data from the first five days. For this purpose, the logs have been linearized. Further, no use of the ground truth activity labels has been made. During the recognition phase, all recorded activity instances have been tested.
4.1 Results
Figure 3 shows the confusion tables for both apartments and both training data sets. The MIT logs do not provide any recorded instances of the “Go to bed” activity and therefore the corresponding values are left blank. In addition, preparing breakfast, lunch and dinner have been summarized into a single “preparing food” activity. The numbers represent the percentage of cases in which a given activity has been either correctly detected or confused with some other activity. Thus, the value of 1 indicates that for all instances of a given
Fig. 3. Confusion tables delivered by our approach for the two environments when using different amounts of training data. (a) Data from [2]; (b) data from [6].
activity the system always makes one and the same prediction. This, however, does not mean that for every recorded instance the detection probability is the same. For comparison purposes, each cell in the confusion tables contains three numbers. The first two are obtained by applying the proposed approach with one and five days of training data, respectively. The third number is taken from the confusion matrices in [2]1 and [6]. When compared to the models utilized in the two experiments, our approach delivers results with similar, or in some cases even superior, detection rates. Moreover, due to the semantic nature of the probabilistic knowledge base, the confusion tables are much clearer, i.e. confusion occurs only between activities that are semantically related.
4.2 Discussion
Looking at the results, two trends can be observed. First, as the amount of training data increases, the performance of the system improves. This is due to the fact that as more variations of the activities are performed by the user, more device dependencies can be learnt and thus more features can be used during the detection process. Even though the system knows which dependencies might
The values have been manually calculated using the “Activity Detected at Least Once” confusion matrix for subject one.
occur for each of the activities, these dependencies appear in the probabilistic knowledge base, and thus can be queried, only after training data becomes available. The second trend is that the actually performed activity is mostly confused with activities that share the same features. For example, “preparing food” and “getting a drink” are both performed in the kitchen and both involve accessing some of the cupboards as well as the refrigerator or freezer. Thus, in some of the cases where confusion has occurred, for a short period of time the “getting a drink” activity has been a better explanation of the current event stream than the “preparing food” activity. Should we, however, consider the duration for which a given prediction has sustained a high probability value, then the “preparing food” activity would have been the “winner” in those cases. It can also be observed that for certain recorded activity instances the system is not able to make any predictions. One of the reasons could be that the used ADL ontology is not expressive enough to account for all possible variations of a given activity. The other reason could be the number and, even more important, the type of the used sensors. Having a contact sensor attached to the bathroom door does not necessarily provide any information on whether the user is actually in the bathroom. A presence detector would be a more appropriate choice.
5
Outlook
In this paper, we present an approach to the unsupervised detection of ADLs. Using semantic and probabilistic models, we are able to determine the set of features (inter-component dependencies) that could occur during the execution of the daily activities. By combining those features, the developed system builds explanations for the currently observed events. When doing a direct comparison to other, supervised techniques, several pros and cons should be pointed out. The major advantage of our approach is that there is no need for a training phase. As the evaluation shows, even immediately after the system is deployed, it is capable of making valid predictions in some of the cases. As more data becomes available, the detection rates improve. The biggest disadvantage is probably the configuration overhead for specifying mainly the locations and the types of the used devices and sensors. There are certainly many areas in which the system could be improved. First and foremost, the expressiveness of the ADL and device ontologies is crucial to the overall system performance. Second, it is currently not possible to indicate or learn which features are more indicative of a given activity. By engaging the user during the recognition phase, the system could obtain the ground truth of the actually performed activity and use this information for updating the strengths of the features, i.e. the w_i.
References 1. Munguia-Tapia, E., Choudhury, T., Philipose, M.: Building Reliable Activity Models Using Hierarchical Shrinkage and Mined Ontology. In: Proceedings of Pervasive 2006, Dublin (May 2006)
2. Munguia-Tapia, E.: Activity Recognition in the Home Setting Using Simple and Ubiquitous Sensors. Master Thesis, MIT (2003)
3. Pentney, W., Popescu, A., Wang, S., Kautz, H., Philipose, M.: Sensor-Based Understanding of Daily Life via Large-Scale Use of Common Sense. In: Proceedings of the National Conference on Artificial Intelligence (2006)
4. Wyatt, D., Philipose, M., Choudhury, T.: Unsupervised Activity Recognition Using Automatically Mined Common Sense. In: Proceedings of the National Conference on Artificial Intelligence (2005)
5. van Kasteren, T.L.M., Kröse, B.J.A.: A probabilistic approach to automatic health monitoring for elderly. In: Proceedings of Advanced School of Computing and Imaging Conference (ASCI 2007), Heijen, The Netherlands (2007)
6. van Kasteren, T.L.M., Noulas, A.K., Englebienne, G., Kröse, B.J.A.: Accurate Activity Recognition in a Home Setting. In: ACM Tenth International Conference on Ubiquitous Computing (Ubicomp 2008), Seoul, South Korea (2008)
7. van Kasteren, T.L.M., Englebienne, G., Kröse, B.J.A.: Recognizing Activities in Multiple Contexts using Transfer Learning. In: AAAI Fall 2008 Symposium: AI in Eldercare (2008)
8. Wilson, D.: Assistive Intelligent Environments for Automatic Health Monitoring. PhD Thesis, Carnegie Mellon University (2005)
9. Oliver, N., Horvitz, E.: A Comparison of HMMs and Dynamic Bayesian Networks for Recognizing Office Activities. User Modeling (2005)
10. Dimitrov, T., Pauli, J., Naroska, E.: A probabilistic reasoning framework for smart homes. In: Proceedings of the 5th International Workshop on Middleware for Pervasive and Ad-Hoc Computing, held at the ACM/IFIP/USENIX 8th International Middleware Conference
11. Dimitrov, T., Pauli, J., Naroska, E., Ressel, C.: Structured Learning of Component Dependencies in AmI Systems. In: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (2008)
12. Huang, C., Darwiche, A.: Inference in Belief Networks: A Procedural Guide. International Journal of Approximate Reasoning (1996)
Audio Features Selection for Automatic Height Estimation from Speech Todor Ganchev, Iosif Mporas, and Nikos Fakotakis Artificial Intelligence Group, Wire Communications Laboratory, Dept. of Electrical and Computer Engineering, University of Patras, 26500 Rion-Patras, Greece [email protected], {imporas,fakotaki}@upatras.gr
Abstract. Aiming at the automatic estimation of the height of a person from speech, we investigate the applicability of various subsets of speech features, which were formed on the basis of ranking the relevance and the individual quality of numerous audio features. Specifically, based on the relevance ranking of the large set of openSMILE audio descriptors, we performed selection of subsets of different sizes and evaluated them on the height estimation task. In brief, during the speech parameterization process, every input utterance is converted to a single feature vector, which consists of 6552 parameters. Next, a subset of this feature vector is fed to a support vector machine (SVM)-based regression model, which aims at the direct estimation of the height of an unknown speaker. The experimental evaluation performed on the TIMIT database demonstrated that: (i) the feature vector composed of the top-50 ranked parameters provides a good trade-off between computational demands and accuracy, and that (ii) the best accuracy, in terms of mean absolute error and root mean square error, is observed for the top-200 subset. Keywords: height estimation from speech, speech parameterization, feature ranking, feature selection, SVM regression models.
1 Introduction
Studies with X-ray and magnetic resonance imaging (MRI) have shown that the vocal tract length and the height of a speaker are strongly correlated [1]. Other studies on the correlations between the height of a speaker and specific audio features, such as the formants [2-6], the fundamental frequency of speech [2,5-8], the glottal-pulse rate [9], the energy of speech [2], the Mel frequency cepstral coefficients (MFCCs) and the linear prediction coefficients (LPCs) [10], have been reported. Although the speech production theory assumes correlation between the vocal tract length and the formant frequencies [11], the formants have been observed to be weakly correlated with the speaker’s height [2,4,5]. The only exception is [6], where it was shown that the fourth formant and the height of male speakers are somehow correlated for the vowel schwa [3]. Furthermore, a number of studies reported that the fundamental frequency of speech is not significantly correlated with the height of a speaker [2,5-8], and also that the energy below 1 kHz is not correlated with the speaker’s height [2].
In [10], Dusan evaluated the MFCCs and the LPCs, computed as in [12], and the highest correlation with human height was observed for the seventh MFCC. In summary, the main research efforts in previous related work were focused on correlation measurements and investigation of the appropriateness of small sets of well-known and widely-used basic audio parameters, such as the formants, the fundamental frequency of speech, the energy of speech, the LPC parameters, the MFCCs, etc., and, to some extent, their time-derivatives. Recently a new paradigm for speech and audio parameterization, referred to as openSMILE audio parameterization, was introduced [13]. In contrast to the above mentioned speech parameters (fundamental frequency, MFCCs, LPCs, etc.), which are computed over short quasi-stationary segments of speech, the openSMILE audio parameterization computes a single feature vector, consisting of 6552 audio parameters, for the entire speech recording. Among these parameters are the averaged (over the duration of the entire utterance) values of the basic audio descriptors, such as the signal energy, pseudo loudness, Mel-spectra, MFCCs, fundamental frequency of speech, voice quality, etc., and numerous statistical functional parameters, such as means, extremes, moments, segments, peaks, linear and quadratic regression, percentiles, durations, onsets, DCT coefficients, etc. Although there are numerous studies on the correlation of basic speech parameters with the human height, such as [2-10] and others, there is no extensive study devoted to the investigation of various subsets of parameters or to the effectiveness of different combinations of basic speech features. To the best of our knowledge, there is no existing related work that investigates the applicability of statistical functional parameters, computed over the basic speech parameters, to the task of automatic height estimation. In the present work, we perform ranking of the relative importance of the large set of 6552 audio parameters, which are computed through the openSMILE audio parameterization. The ranking results are further utilized for the selection of subsets of features with various sizes, which consist of the top-n ranked audio descriptors, with n ∈ {1, 2,…, 9, 10, 20, …, 90, 100, 200, …, 1000}. We evaluate the appropriateness of these subsets through measuring the performance of an SVM-based regression model, which is trained to provide a direct estimation of the height of an unknown speaker. Here, unknown refers to the fact that we aim at the height estimation of speakers whose speech was not used during the training of the models.
2 Feature Ranking and Selection for Automatic Height Estimation
Automatic height estimation is a supervised learning task that aims at the creation of regression-based height-estimation models from a given set of labelled fixed-length feature vectors. These models are subsequently used for estimating the height of unknown speakers, given a spoken utterance. Previous studies on human height estimation from speech have shown that various speech parameters are correlated (either moderately or weakly) with the speaker’s height. Since there is no single speech feature which would permit an accurate estimation of the human height, an automatic height estimator has to rely on a set of speech descriptors, which hopefully will be complementary to each other and, when combined, would contribute to increasing the overall estimation accuracy.
An intuitive belief is that large feature vectors constituted by the aggregation of various relevant speech features would be more advantageous than, or at least as beneficial as, the involved individual speech features or smaller subsets. However, the use of multidimensional feature vectors that consist of a large number of speech features is costly in terms of training and operational complexity, memory demands, required training data, etc. Therefore, one would like to know if all speech features in the feature vector are indeed relevant and complementary to each other. Furthermore, ranking the speech features according to their relevance to the height estimation task hints at reasonable trade-offs between the feasible estimation accuracy and the demands on resources. Following this line of thought, in the present work we investigate the possibility to obtain improved accuracy of automatic height estimation by combining numerous speech parameters, while still preserving reasonable computational complexity and the practical feasibility of this solution. In brief, the automatic height estimation strategy followed here is illustrated in Fig. 1, where the upper part of the figure summarizes the off-line training of the height estimation models, and the lower part illustrates the operational phase, during which the estimation of the height of a speaker from a spoken input is performed. Specifically, the training phase involves:
• a training dataset with known (ground truth) height labels for all speech utterances,
• a speech parameterization stage that computes a large set of speech parameters,
• feature ranking and feature selection stages that evaluate the relevance of these speech parameters and select subsets that form the feature vector, and finally,
• a regression model training stage that utilizes the feature vector devised at the output of the feature selection stage and the corresponding ground truth labels, in order to build appropriate height estimation models.
In the present work, we consider the general case, where a single height estimation model, which is common for all speakers, is used. The outcome of the training phase is the list of indexes of the selected speech features and the trained height-estimation model. In the operational phase, these are utilized in the estimation of the height of a speaker from a given speech utterance. Specifically, as shown in the figure, initially the unlabeled input utterance is parameterized, and then a subset of the computed speech features is picked out in accordance with the list of indexes selected during training. The feature vector obtained to this end is fed to the regression model created during training, which provides a direct estimation of the speaker’s height. Here, we made use of the openSMILE audio parameterization scheme [13], which computes a total of 6552 audio descriptors for each speech utterance, after applying an energy-based voice activity detector [14] and excluding the non-speech intervals. Among these parameters are basic audio descriptors, such as: the root mean square (RMS) frame energy, the zero-crossing rate (ZCR) from the time-domain signal, the harmonics-to-noise ratio (HNR) by autocorrelation function, the pseudo loudness, the Mel-spectra, the twelve MFCCs computed as in the HTK setup [14], the fundamental frequency of speech normalized to 500 Hz, the voice quality, etc. Additionally, for each of these static parameters, the first and second time-derivatives were computed.
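A minimal sketch of the training and operational phases of Fig. 1 is given below, assuming the openSMILE feature vectors have already been extracted into a matrix of size (number of utterances) × 6552; the ranking step uses a simple absolute-correlation score as a stand-in for the Regression Relief-F algorithm used in this work, scikit-learn’s NuSVR plays the role of the ν-SVR model, and all data are randomly generated placeholders:

import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 6552))            # placeholder openSMILE vectors
h_train = 150 + 40 * rng.random(200)              # placeholder speaker heights in cm

# --- training phase: rank features and keep the top-n indexes -------------
scores = np.abs([np.corrcoef(X_train[:, j], h_train)[0, 1] for j in range(X_train.shape[1])])
top_n = 50
selected = np.argsort(scores)[::-1][:top_n]       # list of indexes of the selected features

model = NuSVR(nu=0.5, C=1.0, kernel="rbf", gamma="scale")
model.fit(X_train[:, selected], h_train)

# --- operational phase: the same indexes are applied to an unknown speaker -
x_test = rng.normal(size=(1, 6552))               # one parameterized utterance
print("estimated height (cm):", model.predict(x_test[:, selected])[0])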
Fig. 1. Architecture for automatic height estimation from speech. Top: the speech parameterization, the feature ranking and selection, and the training of the regression models. Bottom: the operational phase, in which the height of a speaker is estimated from a given spoken utterance, using a predefined subset of features and the pre-trained regression model.
Furthermore, twelve statistical functional parameters were applied on a chunk basis, i.e. per audio recording. These are the mean, standard deviation, kurtosis, skewness and higher-order moments, segments, extreme values (minimum, maximum, relative position and range), linear and quadratic regression coefficients (offset, slope, mean square error, etc.), percentiles, durations, onsets, DCT coefficients, etc. Further details about the openSMILE audio features are available in [13]. Although the individual properties of most of the basic speech parameters have been studied in the literature, the relevance of the statistical functional parameters to the height of the speaker has not been investigated so far, and thus their contribution is unknown. Here, we aim at identifying subsets of complementary speech features which might be beneficial in terms of overall performance when compared to the full set. For that purpose, the full set of 6552 audio descriptors, computed for each utterance of the training dataset, was subjected to a feature ranking process. In the present study, we rely on the Regression Relief-F feature-ranking algorithm [15], which we summarize in the following. In brief, the Regression Relief-F algorithm evaluates the quality of an attribute A, i.e. how well this attribute distinguishes instances that are near to each other, utilizing the value of A for a given instance and the values of its k nearest instances. An exponential correction factor, which decreases with the distance of the nearest
neighbour from the considered instance, is used to balance the influence of each nearest neighbour. The quality of each attribute A is quantified by an index, denoted by W(A), which takes values in the interval [-1, 1]. Positive values of W(A) indicate that A is correlated with the ground truth labels (the higher the W(A), the stronger the correlation), while negative values of W(A) indicate that A is not correlated with the ground truth labels. The results of the feature ranking, i.e. the ordered list of attributes, were further utilized for the selection of subsets of various sizes, which consist of the top-n ranked audio descriptors, with n ∈ {1, 2, …, 9, 10, 20, …, 90, 100, 200, …, 1000}. In the present study, we evaluate the appropriateness of these subsets by measuring the accuracy of height estimation for an SVM-based regression model, which is trained to provide a direct estimate of the height of a speaker, given a speech utterance.
The SVM-based regression model employed here is based on the ν-SVR algorithm [16], which was preferred due to its ability to adjust the ε-insensitive cost parameter automatically. Given the set of training data {V(i), h_real(i)} for each speaker i, with V(i) = [V_1(i), V_2(i), …, V_k(i), …, V_K(i)]^T, a function φ maps the input feature vector to a higher-dimensional space. The primal problem of ν-SVR is

\arg\min_{\mathbf{w},\,\varepsilon,\,\xi_i,\,\xi_i^*} \left\{ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\left(\nu\varepsilon + \frac{1}{k}\sum_{i=1}^{k}(\xi_i + \xi_i^*)\right) \right\},   (1)

subject to the following restrictions:

(\mathbf{w}^T\phi(x_i) + \beta) - h_{real}(i) \le \varepsilon + \xi_i,
h_{real}(i) - (\mathbf{w}^T\phi(x_i) + \beta) \le \varepsilon + \xi_i^*,
\xi_i, \xi_i^* \ge 0,

with \mathbf{w} \in \mathbb{R}^N, \beta \in \mathbb{R}, i ∈ [0, N] and ε ≥ 0. Here, ξ_i and ξ_i^* are the slack variables for exceeding the target value by more or less than ε, respectively, and C is the penalty parameter. The value of ν affects the number of support vectors and the training errors. In the present work we rely on the radial basis kernel function K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), due to its excellent local properties.
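For concreteness, a ν-SVR with an RBF kernel can be instantiated as in the following sketch. It uses scikit-learn's NuSVR (the paper relies on the Weka SVM implementation instead), and the toy data as well as the ν, C and γ values are purely illustrative.

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))             # 200 utterances x 50 selected features (toy data)
y = 1.755 + 0.095 * rng.normal(size=200)   # speaker heights in meters (toy data)

# nu bounds the fraction of support vectors and of training errors, C is the
# penalty parameter, and gamma is the width of the RBF kernel K(x_i, x_j).
model = NuSVR(nu=0.5, C=1.0, kernel="rbf", gamma=0.02)
model.fit(X, y)
print(model.predict(X[:3]))                # direct height estimates in meters
```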
3 Experimental Setup

In the following, we describe the speech database and the experimental protocol used in the present study. In brief, we utilized the TIMIT database [17], which is an American-English database containing microphone-quality prompted speech. Each recording is a speech utterance sampled at 16 kHz, with a resolution of 16 bits per sample. We followed the standard division into Train and Test subsets defined in the TIMIT documentation. In brief, the Train subset consists of recordings from 462 speakers, including 326 males and 136 females, and the Test subset consists of the recordings of 168 speakers, including 112 males and 56 females. In both subsets, each speaker utters 10 utterances. TIMIT covers seven major dialectal regions of the United States of America, plus a set of speakers who moved across the states. The reported height of each speaker is provided in the database documentation in inches. In the present study we converted all units to meters, in accordance with the International System of Units (SI). Thus, in the entire database the speaker heights
range between 1.448 and 2.032 meters, with mean and standard deviation values equal to 1.755 and 0.095 meters, respectively. As in previous related work [10, 18], we excluded speaker MCTW0 from the Test subset, since his height of 2.032 meters is out of the range of heights represented in the Train subset, h_trn ∈ [1.448, 1.981]. In all experiments, we followed a common experimental protocol. The Train subset was used for the feature ranking and for the training of the regression models, while the accuracy of automatic height estimation was measured on the Test subset. This resulted in 4620 and 1670 speech utterances in the Train and Test subsets, respectively. There is no overlap between the speakers in the Train and Test datasets, and therefore in the present study we consider the case of height estimation for unknown speakers. For the needs of feature ranking, we used the Weka [19] implementation of the Regression Relief-F algorithm. The number of nearest neighbours for attribute quality estimation was set to ten, without weighting the nearest neighbours by their distance. All instances from the Train subset were used for the attribute quality estimation, and the ranking list included all 6552 audio features computed through openSMILE. Likewise, for the training and the evaluation of the regression models described in Section 2, we made use of the SVM implementation of Weka. The parameter values of the regression models were determined empirically, utilizing two utterances from each speaker of the Train subset, i.e. 924 utterances in total. The criteria for measuring the height estimation accuracy were the mean absolute error (MAE) and the root mean square error (RMSE).
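A simplified version of the attribute quality estimate W(A) can be written down directly from the accumulator form of Regression Relief-F [15]. The sketch below uses equal neighbour weights, matching the "no distance weighting" setting mentioned above; it is a didactic reimplementation, not the Weka code used in the paper.

```python
import numpy as np

def rrelieff_quality(X, y, k=10):
    """Simplified Regression Relief-F: W(A) per attribute from the accumulators
    N_dC, N_dA and N_dC&dA, using k nearest neighbours with equal weights."""
    n, d = X.shape
    Xn = (X - X.min(0)) / (np.ptp(X, axis=0) + 1e-12)   # attributes scaled to [0, 1]
    yn = (y - y.min()) / (np.ptp(y) + 1e-12)            # target (height) scaled to [0, 1]
    N_dC, N_dA, N_dCdA = 0.0, np.zeros(d), np.zeros(d)
    for i in range(n):
        dist = np.linalg.norm(Xn - Xn[i], axis=1)
        dist[i] = np.inf
        for j in np.argsort(dist)[:k]:
            d_y = abs(yn[i] - yn[j])                    # difference of the labels
            d_a = np.abs(Xn[i] - Xn[j])                 # per-attribute differences
            N_dC += d_y
            N_dA += d_a
            N_dCdA += d_y * d_a
    return N_dCdA / N_dC - (N_dA - N_dCdA) / (n * k - N_dC)
```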
4 Experimental Results

The ranking results for the top-50 audio features are presented in Table 1. As seen in the table, multiple audio features related to the fundamental frequency (F0) and the MFCCs were found relevant to the height of the speaker. The fact that the mean value of the fundamental frequency, F0_mean, is not ranked in the top-50 audio features is in agreement with previous research [2,5-8], where it was reported that the fundamental frequency of speech is weakly correlated with the speaker's height. However, it is to some degree surprising that a number of time-derivatives and statistical parameters computed over the fundamental frequency of speech were found important for estimating the height of a person. As the ranking results demonstrate, there are three F0-related parameters (percentile 95, linear regression, standard deviation) in the top-10 list and sixteen in the top-50 ranked audio features, which makes the F0-based statistical parameters quite well represented when compared to other individual features. The good relevance of the MFCCs, reported in earlier work [10], was also confirmed, since twenty-five out of the top-50 parameters are statistical functional parameters derived from the MFCCs. In total, 3029 out of the 6552 audio features were found to be to some degree relevant to the height estimation problem, i.e. they demonstrated a positive attribute quality value, W(A)>0. Another fifty-eight audio features obtained a quality value W(A)=0, and the remaining 3465 obtained negative values, W(A)<0, which means that they are not relevant attributes with respect to the height estimation problem.
Table 1. Ranking results for the top-50 openSMILE audio features

rank  W(A)     Audio feature           rank  W(A)     Audio feature
1     0.00673  mfcc[9]_perc95          26    0.00519  voiceProb_iqr1-3
2     0.00671  F0_linregerrA           27    0.00518  mfcc[9]_quartile3
3     0.00654  F0_perc95               28    0.00518  F0_de_zcr
4     0.00632  mfcc[9]_perc98          29    0.00513  F0env_de_zcr
5     0.00620  voiceProb_stddev        30    0.00503  voiceProb_linregerrA
6     0.00595  mfcc[9]_amean           31    0.00495  mfcc[9]_skewness
7     0.00587  voiceProb_perc95        32    0.00493  mfcc[9]_peakMean
8     0.00584  F0_stddev               33    0.00487  F0_de_perc98
9     0.00582  voiceProb_variance      34    0.00484  voiceProb_quartile3
10    0.00562  mfcc[10]_amean          35    0.00480  F0_zcr
11    0.00561  mfcc[8]_perc98          36    0.00468  F0_linregerrQ
12    0.00559  mfcc[11]_amean          37    0.00461  F0_perc98
13    0.00558  mfcc[10]_quartile1      38    0.00460  mfcc[10]_peakMean
14    0.00549  F0_de_de_zcr            39    0.00456  mfcc[12]_range
15    0.00548  F0env_de_de_zcr         40    0.00455  mfcc[9]_quartile1
16    0.00547  F0_quartile3            41    0.00454  mfcc[8]_amean
17    0.00547  F0_iqr1-3               42    0.00453  F0_variance
18    0.00539  mfcc[8]_perc95          43    0.00440  mfcc[11]_quartile1
19    0.00538  mfcc[10]_perc98         44    0.00435  mfcc[11]_perc95
20    0.00535  voiceProb_linregerrQ    45    0.00429  mfcc[12]_minameandist
21    0.00535  F0_de_perc95            46    0.00427  mfcc[11]_quartile2
22    0.00528  voiceProb_perc98        47    0.00421  mfcc[10]_quartile2
23    0.00527  mfcc[10]_perc95         48    0.00413  mfcc[10]_quartile3
24    0.00525  F0_de_de_perc95         49    0.00411  mfcc[3]_amean
25    0.00519  mfcc[7]_skewness        50    0.00411  pcm_LOGenergy_range
Since the focus of the present study is on identifying subsets of audio features that facilitate more accurate automatic human height estimation, we experimented with multiple subsets of features, keeping the top-n ranked audio descriptors, with n ∈ {1, 2, …, 9, 10, 20, …, 90, 100, 200, …, 1000}, and involving them in the height estimation problem. Figure 2 summarizes the experimental results in terms of mean absolute error and root mean square error. As can be seen in the figure, there is a significant drop in error when the second-best audio parameter, F0_linregerrA, is appended to the best one, mfcc[9]_perc95. Specifically, we observed a drop of the MAE from 0.073 to 0.056 meters, which corresponds to a relative reduction of the error rate by approximately 23%, and a drop of the RMSE from 0.088 to 0.070, which is a relative reduction of approximately 20%. When appending more features to the top-2 subset we observed only a slight drop of the MAE and RMSE up to the top-9, but the addition of further features contributed to a more noticeable reduction of the error rates. Specifically, we observed that the top-50 audio parameters led to the best accuracy in terms of MAE, equal to 0.053 meters, with an RMSE equal to 0.068 meters, and the top-200 features reached the best RMSE of 0.067 meters for the same MAE
Fig. 2. Height estimation error in meters for various subsets of audio features: in terms of mean absolute error (MAE) and in terms of root mean square error (RMSE)
as the one for the top-50. Thus, one may consider the top-50 a reasonable trade-off between computational complexity and performance. The slight reduction of the RMSE by 0.001 meters, which corresponds to a relative reduction of the RMSE by 1.5%, means that the estimation results for the top-200 subset contain fewer gross errors; however, this comes at the price of a manifold increase in the computational and memory demands. Finally, appending more audio features beyond the top-200 did not contribute any further improvement of the height estimation accuracy. Although recognized as relevant, i.e. having quality value W(A)>0, these additional audio parameters do not carry complementary information beyond what is already available in the attributes with higher quality values, and thus they appear as redundant parameters in the feature vector. Although SVM-based modelling is known to cope well with high-dimensional feature vectors, and to some degree to tolerate redundant features, the extensive amount of redundant attributes in the feature vectors for the subsets from top-400 to top-1000 led to a steady increase of the MAE and RMSE. This increase of the error rates gives evidence of increased sensitivity to noise when the number of redundant parameters is high. We deem that this justifies the extra effort of feature ranking and selection, as opposed to using the full feature set, as a necessary step towards achieving higher performance and efficient trade-offs between accuracy and computational demands. Further effort for identifying the most efficient make-up of the feature vector should involve a processing step aiming at the elimination of any redundancy among the selected audio parameters. For instance, the redundancy elimination can be performed via sequential forward or backward selection methods, or via floating search methods [20].
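The evaluation over the top-n subsets can be sketched as follows; the regressor, the subset sizes and the metric computations follow scikit-learn conventions and are illustrative, not the exact Weka-based setup used in the paper.

```python
import numpy as np
from sklearn.svm import NuSVR
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate_subsets(X_tr, y_tr, X_te, y_te, ranking, sizes=(1, 2, 5, 10, 50, 200)):
    results = {}
    for n in sizes:
        idx = ranking[:n]                                  # top-n ranked audio features
        model = NuSVR(kernel="rbf").fit(X_tr[:, idx], y_tr)
        pred = model.predict(X_te[:, idx])
        mae = mean_absolute_error(y_te, pred)              # meters
        rmse = mean_squared_error(y_te, pred) ** 0.5       # meters
        results[n] = (mae, rmse)
    return results
```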
5 Conclusion

In this work, we studied the applicability of the recently proposed audio parameterization paradigm known as openSMILE to the task of automatic human height estimation from speech. Investigating the relevance of the 6552 openSMILE audio features computed for each speech recording, we ranked them with respect to their quality value computed through the Regression Relief-F algorithm. The ranking results were then utilized for the selection of numerous subsets of audio parameters, which were used in an SVM-based scheme for automatic height estimation. The experimental results presented in Section 4 confirmed the importance of the MFCC features as being well correlated with the height of a person. We also observed that audio features related to the statistics of the fundamental frequency of speech are relatively well correlated with the height of a speaker, but the fundamental frequency itself is not. The experimental results demonstrated that (i) the feature subset composed of the top-50 ranked audio parameters provides a reasonable trade-off between computational demands and accuracy, and that (ii) the best accuracy, in terms of mean absolute error and root mean square error, is observed for the top-200 subset. However, the accuracy gain for the top-200 subset is negligible and merely amounts to a relative reduction of the root mean square error by approximately 1.5%. Finally, the experimental results justified the necessity of feature ranking, since increased error rates were observed for subsets that incorporate many redundant attributes. Automatic height estimation from speech can play an important role in automatic scene analysis and autonomous surveillance applications, since it offers complementary information that facilitates the estimation of the distance to objects through video and thermal imaging sensors, and offers an additional clue for person re-identification after he/she reappears in the field of view of the optical sensor.

Acknowledgments. This work was partially supported by the Prometheus project (FP7-ICT-214901) "Prediction and Interpretation of human behaviour based on probabilistic models and heterogeneous sensors", co-funded by the European Commission under the Seventh Framework Programme.
References 1. Fitch, W.T., Giedd, J.: Morphology and development of human vocal tract: a study using magnetic resonance imaging. Journal of Acoustical Society of America 106(3), 1511–1522 (1999) 2. van Dommelen, W.A., Moxness, B.H.: Acoustic parameters in speaker height and weight identification: sex-specific behaviour. Language and Speech 38, 267–287 (1995) 3. van Oostendorp, M.: Schwa in phonological theory. GLOT International 3, 3–8 (1998) 4. Collins, S.A.: Men’s voices and women’s choices. Animal Behaviour 60, 773–780 (2000) 5. Gonzalez, J.: Estimation of speaker’s weight and height from speech: a re-analysis of data from multiple studies by Lass and colleagues. Perceptual and Motor Skills 96, 297–304 (2003) 6. Rendall, D., Kollias, S., Ney, C.: Pitch (F0) and formant profiles of human vowels and vowel-like baboon grunts: the role of vocalizer body size and voice-acoustic allometry. Journal of Acoustical Society of America 117(2), 1–12 (2005)
7. Lass, N.J., Brown, W.S.: Correlation study of speaker’s heights, weights, body surface areas, and speaking fundamental frequencies. Journal of Acoustical Society of America 63(4), 700–703 (1978) 8. Künzel, H.J.: How well does average fundamental frequency correlate with speaker height and weight? Phonetica 46, 117–125 (1989) 9. Smith, D.R.R., Patterson, R.D., Turner, R., Kawahara, H., Irino, T.: The processing and perception of size information in speech sounds. Journal of Acoustical Society of America 117(1), 305–318 (2005) 10. Dusan, S.: Estimation of speaker’s height and vocal tract length from speech signal. In: Proc. of the 9th European Conference on Speech Communication and Technology (Interspeech 2005), pp. 1989–1992 (2005) 11. Fant, G.: Acoustic Theory of Speech Production. Mouton, The Hague (1960) 12. Davis, S.B., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing 28(4), 357–366 (1980) 13. Eyben, F., Wöllmer, M., Schüller, B.: openEAR – introducing the Munich open-source emotion and affect recognition toolkit. In: Proc. 4th International HUMAINE Association Conference on Affective Computing and Intelligent Interaction 2009 (ACII 2009), September 10-12. IEEE, Amsterdam (2009) 14. Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P.: The HTK Book (for HTK Version 3.4). Cambridge University Engineering Department, Cambridge (2006) 15. Robnik-Šikonja, M., Kononenko, I.: An adaptation of Relief for attribute estimation in regression. In: Fourteenth International Conference on Machine Learning, pp. 296–304 (1997) 16. Scholkopf, B., Smola, A., Williamson, R., Bartlett, P.L.: New support vector algorithms. Neural Computation 12(5), 1207–1245 (2000) 17. Garofolo, J.: Getting started with the DARPA-TIMIT CD-ROM: an acoustic phonetic continuous speech database. National Institute of Standards and Technology (NIST), Gaithersburgh, MD, USA (1988) 18. Pellom, B.L., Hansen, J.H.L.: Voice analysis in adverse conditions: the centennial Olympic park bombing 911 call. In: Proc. of the 40th Midwest Symposium on Circuits and Systems (MWSCAS 1997), vol. 2, pp. 873–876 (1997) 19. Witten, H.I., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishing, San Francisco (2005) 20. Theodoridis, S., Koutroumbas, K.: Pattern Recognition, 4th edn. Academic Press, London (2009)
Audio-Visual Fusion for Detecting Violent Scenes in Videos

Theodoros Giannakopoulos¹, Alexandros Makris¹, Dimitrios Kosmopoulos¹, Stavros Perantonis¹, and Sergios Theodoridis²

¹ Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, National Center of Scientific Research Demokritos, Greece
[email protected], {amakris,dkosmo,sper}@iit.demokritos.gr
² Department of Informatics and Telecommunications, University of Athens, Greece
[email protected]
Abstract. In this paper we present our research towards the detection of violent scenes in movies, employing fusion methodologies, based on learning. Towards this goal, a multi-step approach is followed: initially, automated auditory and visual processing and analysis is performed in order to estimate probabilistic measures regarding particular audio and visual related classes. At a second stage, a meta-classification architecture is adopted, which combines the audio and visual information, in order to classify mid-term video segments as “violent” or “non-violent”. The proposed scheme has been evaluated on a real dataset from 10 films. Keywords: Violence detection, multi-modal video classification.
1 Introduction
During the last decade, a huge increase of video data has occurred, mainly due to the existence of several file-sharing web communities and new facilities regarding digital television. The provided multimedia content is therefore becoming easily accessible to large portions of the population, while only limited control over the content exists. Due to this vast amount of multimedia content, manual annotation is obviously a hard task. The need to protect sensitive social groups (e.g. children) using automatic content-based classification techniques is therefore imperative. In this paper, we present a method for automatic violence detection in films, based on audio-visual information. There are not many works in the literature which attempt to detect violent scenes using visual features. Most of the methods concern surveillance cameras and use background subtraction techniques to detect the people in the scene [1], [2]. These approaches, however, are not suitable for movies, where the camera moves abruptly and there are many shot changes. In [3], a generic approach to determine the presence of violence is presented. Two features are used, which measure the average activity and the average shot length. Experiments with
movie trailers show that the features are able to discriminate violent from non-violent movies. However, no attempt is made to characterize the specific segments of the movie which contain the violence. In [4], three visual features are used, measuring the level of activity, the presence of gunfires/explosions and the presence of blood. In video data, most violent scenes are characterized by specific audio events (e.g. explosions). The literature related to the detection of violent content is limited and usually examines only visual features ([5], [6]). In [7] a simple audio feature, in particular the energy entropy, is used as additional information to the visual data. In [8], a film classification method is proposed that is mainly based on visual cues; the only audio feature adopted in that paper is the signal's energy. A more detailed examination of the audio features for discriminating between violent and non-violent sounds was presented in [9]. In particular, seven audio features, both from the time and the frequency domain, were used, while the binary classification task (violent vs. non-violent) was accomplished via Support Vector Machines. In [10], a multi-class classification algorithm for audio segments from movies was proposed. Bayesian networks along with the one-vs-all architecture were used, while the definition of the classes was violence-oriented (three violent classes were adopted). In this work we use a variant of the classifier proposed in [10], on a segment basis, in order to generate a sequence of audio class labels. As far as the visual part is concerned, we employ motion and person-activity related features in order to derive three visual-related classes. The two individual modules are combined in a meta-classification stage, which is responsible for classifying video segments into two classes, i.e., "Violence" and "Non-Violence".
2 Audio Classifier
2.1 Audio Class Definition
In order to create an audio-based characterization scheme, we have defined seven audio classes, three of which are violent and four non-violent. The class definitions have been motivated by the nature of the audio signals met in most movies. The non-violent classes are: Music, Speech, Others1, and Others2. Classes Others1 and Others2 correspond to environmental sounds met in movies. Others1 contains environmental sounds of low energy and almost stable signal level (e.g. silence, background noise, etc.). Others2 is populated by environmental sounds with abrupt signal changes, e.g. a door closing, thunders, etc. The violence-related classes are: Shots, Fights (beatings) and Screams.
2.2 Audio Feature Extraction
12 audio features are extracted for each segment on a short-term basis. In particular, each segment is broken into a sequence of non-overlapping short-term
windows (frames). For each frame 12 feature values are calculated. This process leads to 12 feature sequences, for the whole segment. In the sequel, a statistic (e.g. standard deviation, or average value) is calculated for each sequence, leading to a 12-D feature vector for each audio segment. The features, the statistics and the window lengths adopted are presented in Table 1. For more detailed descriptions of those features, please refer to [10].

Table 1. Window sizes and statistics for each of the adopted features

     Feature           Statistic  Window (msecs)
1    Spectrogram       σ²         20
2    Chroma 1          μ          100
3    Chroma 2          median     20 (mid term: 200)
4    Energy Entropy    max        20
5    MFCC 2            σ²         20
6    MFCC 1            max        20
7    ZCR               μ          20
8    Sp. RollOff       median     20
9    Zero Pitch Ratio  −          20
10   MFCC 1            max/μ      20
11   Spectrogram       max        20
12   MFCC 3            median     20
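Two of the twelve features can serve as a sketch of this frame-then-statistic computation. The ZCR and energy entropy implementations below are common textbook forms and the framing parameters are illustrative, so they should not be read as the exact extractor used by the authors.

```python
import numpy as np

def frames_of(x, frame_len):
    n = len(x) // frame_len
    return x[:n * frame_len].reshape(n, frame_len)           # non-overlapping frames

def zcr(frames):
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def energy_entropy(frames, n_sub=10):
    sub = frames.reshape(frames.shape[0], n_sub, -1)          # sub-frames per frame
    e = np.sum(sub ** 2, axis=2)
    p = e / (np.sum(e, axis=1, keepdims=True) + 1e-12)
    return -np.sum(p * np.log2(p + 1e-12), axis=1)

def segment_features(x, fs=16000, frame_ms=20):
    f = frames_of(np.asarray(x, dtype=float), int(fs * frame_ms / 1000))
    # mid-term statistics over the short-term sequences, as in Table 1:
    # mean of the ZCR sequence and maximum of the Energy Entropy sequence.
    return np.array([np.mean(zcr(f)), np.max(energy_entropy(f))])
```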
The selection of the particular audio features, and of the respective statistics, was a result of extensive experimentation. However, most of the adopted features have a physical meaning for the task of classifying an audio sample into the particular seven classes. For example, in Figure 1 an example of an Energy Entropy sequence is presented for an audio stream that contains: classical music, gunshots, speech and punk-rock music. Also, the maximum value and the σ²/μ ratio statistics are presented. It can be seen that the maximum value of the energy entropy sequence is higher for gunshots and speech. This is expected, since the energy entropy feature ([10]) takes higher values for audio signals with abrupt energy changes (such as gunshots).

Fig. 1. Example of an Energy Entropy sequence (feature value vs. frame index) for an audio stream containing classical music (max = 0.054, σ²/μ = 0.42), gunshots (max = 0.094, σ²/μ = 1.28), speech (max = 0.100, σ²/μ = 0.62) and punk-rock music (max = 0.076, σ²/μ = 0.24)
2.3 Class Probability Estimation
In order to achieve multi-class classification, the "One-vs-All" (OVA) classification scheme has been adopted. This method is based on decomposing the K-class classification problem into K binary sub-problems ([11]). In particular, K binary classifiers are used, each one trained to distinguish the samples of a single class from the samples of the remaining classes. In the current work, we have chosen to use Bayesian Networks (BNs) for building these binary classifiers. At a first step, the 12 feature values described in Section 2.2 are grouped into three separate 4-D feature vectors (feature sub-spaces). In the sequel, for each one of the 7 binary sub-problems, three k-Nearest Neighbor classifiers are trained on the respective feature sub-spaces. This process leads to three binary decisions for each binary classification problem. Thus, a 7×3 kNN decision matrix R is computed. R_{i,j} is 1 if the input sample is classified in class i given the j-th feature sub-vector, and it is equal to 0 if the sample is classified in class "not i". In order to decide to which class the input sample will be classified, according to R, BNs have been adopted: each binary sub-problem has been modelled via a BN which combines the individual kNN decisions to produce the final decision. In order to classify the input sample to a specific class, the kNN binary decisions of each sub-problem (i.e. the rows of matrix R) are fed as input to a separate BN i, i = 1, …, 7, which produces the following probabilistic measure for each class, for each input sample k: P_i(k) = P(Y_i(k) = 1 | R^{(k)}_{i,1}, R^{(k)}_{i,2}, R^{(k)}_{i,3}). This is the probability that the input sample's true class label is i, given the results of the individual kNN classifiers. After the probabilities P_i(k), i = 1, …, 7, are calculated for all binary sub-problems, the input sample k is classified to the class with the largest probability, i.e.: WinnerClass(k) = arg max_i P_i(k). This combination scheme can be used as a classifier, though in this work we use it as a probability estimator for each one of the seven classes. For more details on this classification scheme, please refer to [10].
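The OVA decomposition over the three feature sub-spaces can be sketched as below. For simplicity, the Bayesian-network combiner is replaced here by a plain average of the three binary kNN votes, so the resulting probabilities are only a crude stand-in for the P_i(k) of [10]; sub-space boundaries and k are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

SUBSPACES = [slice(0, 4), slice(4, 8), slice(8, 12)]    # three 4-D feature groups

def train_ova_knn(X, labels, n_classes=7, k=5):
    models = []
    for i in range(n_classes):
        binary_y = (labels == i).astype(int)            # class i vs. "not i"
        models.append([KNeighborsClassifier(k).fit(X[:, s], binary_y)
                       for s in SUBSPACES])
    return models

def class_probabilities(models, x):
    # Decision matrix R (7 x 3) and a naive combination of its rows.
    R = np.array([[clf.predict(x[s].reshape(1, -1))[0]
                   for clf, s in zip(row, SUBSPACES)] for row in models])
    P = R.mean(axis=1)                                  # stand-in for P_i(k)
    return P, int(np.argmax(P))                         # probabilities, winner class
```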
3 Visual Classifier
The problem of violence detection in videos is challenging because of the large variability of violent scenes and the unconstrained camera motion. It is impossible to accurately model the scene and the objects within it. Instead, we define classes that represent the amount of human activity in the scene and use features that can discriminate the video segments between these classes. The amount of activity is strongly correlated with the existence of violence, as can be seen from Figure 2.
3.1 Video Class Definition
For the video-based characterization we define three activity classes. These classes are defined by the amount of human activity in the scene, as no, normal, or high activity. The first class contains scenes that do not show humans or that
Fig. 2. Correlation between Violence and Activity classes: The first set of bars represents the percentage of segments with no, normal, and high activity which are labeled as non-violent whereas the second set contains the violent. As can be seen most of the segments that contain high activity are violent and vice-versa. The plots are derived from a randomly selected dataset of hand-labeled movie segments.
show humans that are completely inactive. The second class contains scenes in which one or more persons perform an activity that does not include erratic motion (e.g. walking, joking, chatting). The third class, which is strongly correlated with violence, contains scenes with people in erratic motion (e.g. fighting, falling).
3.2 Video Features
The used features can be split into two categories: the first contains simple motion-related features, the second contains higher-level features which originate from detecting semantic visual objects in the scene. The motivation for the motion features is that most scenes with high activity contain fast, erratic motion. Additionally, the detectors are used to determine the presence of people in the scene and an estimate of their trajectories. The visual signal is split into 1-second mid-term segments. Each segment is comprised of several video frames, and the visual features are calculated on every frame. We average over the values of each frame to derive the value of the feature which characterizes the whole segment.

Motion Features
– Average Motion (AM): This is the average motion of the frame. The frame is split in blocks, for which we calculate the motion vectors using the previous frame as reference. The feature is derived by averaging the motion vectors' lengths. The motion vector lengths and the block size are determined as fractions of the frame size, so that the feature is invariant to frame scale changes. The feature is defined by AM = (1/N_b) \sum_{i=1}^{N_b} v_i, where N_b is the number of blocks and v_i is the length of the motion vector of the i-th block.
– Motion Orientation Variance (MOV): This feature measures the variance of the motion vector orientations. The mean orientation is derived by calculating the mean motion vector. Then the variance is calculated as MOV = (1/N_b) \sum_{i=1}^{N_b} d_a(a_i, a_μ)^2, where a_i is the orientation of the i-th motion vector, a_μ is the orientation of the mean motion vector and d_a() denotes the difference between the two orientations in (−π, π).

Detection Features. Using an object recognition method we detect the presence of people or faces in the scene. The detected visual objects are tracked to establish the correspondences between them in consecutive frames. The tracked objects are used to derive a feature that measures their motion in the scene. We used the object detection algorithm of [12], which exploits shape information to detect people in the scene. The detector uses a set of Haar-like features and a boosted classifier. We recognize frontal faces, profile faces, full body and upper body. The algorithm results in bounding boxes containing the detected objects in each frame. The detector runs at regular intervals, and each time the produced bounding boxes are used to initialize the trackers. When the next interval starts, we measure the degree of overlap between a tracked object and each detected object of the same type (e.g. face). If there is a significant overlap, then the detected object is linked to the tracked object. This way, the final output of the system is a set of object trajectories, from the frame where an object is first detected up to the frame where it has been lost by the tracker or where the position of the tracker does not match any new detection. The tracking algorithm used is the one described in [13]; it uses the spot and blob tracking models. The tracked objects are used to derive a metric for their motion. The metric is calculated as the average degree of overlap, over all tracked objects, between two consecutive frames. The degree of overlap for a tracked object is defined as:

OTD = \frac{1}{2}\left( \frac{I^t(b) - I^t_{int}(b)}{I^t(b)} + \frac{I^{t-1}(b) - I^t_{int}(b)}{I^{t-1}(b)} \right)   (1)

where I^t(b) is the number of pixels of the b-th tracked object's bounding box at time t, and I^t_{int}(b) is the number of pixels of the intersection between the bounding boxes of the b-th tracked object at times t and t − 1.
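The three quantities above (AM, MOV and the per-object overlap term of Eq. 1) can be computed as in the following sketch. It assumes block motion vectors stored as an N_b x 2 array and bounding boxes as (x0, y0, x1, y1) tuples, which are our conventions rather than the authors'.

```python
import numpy as np

def average_motion(vectors):
    """AM: mean length of the block motion vectors of a frame."""
    return float(np.mean(np.linalg.norm(vectors, axis=1)))

def motion_orientation_variance(vectors):
    """MOV: mean squared angular difference to the mean motion vector's orientation."""
    angles = np.arctan2(vectors[:, 1], vectors[:, 0])
    mean_angle = np.arctan2(*np.mean(vectors, axis=0)[::-1])
    d = np.angle(np.exp(1j * (angles - mean_angle)))      # wrap into (-pi, pi]
    return float(np.mean(d ** 2))

def otd_term(box_t, box_prev):
    """Per-object term of Eq. (1); OTD is this value averaged over tracked objects."""
    ix0, iy0 = max(box_t[0], box_prev[0]), max(box_t[1], box_prev[1])
    ix1, iy1 = min(box_t[2], box_prev[2]), min(box_t[3], box_prev[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_t = (box_t[2] - box_t[0]) * (box_t[3] - box_t[1])
    area_p = (box_prev[2] - box_prev[0]) * (box_prev[3] - box_prev[1])
    return 0.5 * ((area_t - inter) / area_t + (area_p - inter) / area_p)
```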
3.3 Video Class Probability Estimation
The classification of the mid-term segments in the activity classes is performed using a weighted kNN classifier using the three features described in Section 3.2. The classifier is trained using a dataset of hand-labelled scenes containing no, normal, or high human activity. As described in Section 4.1, the individual audio and visual decisions are fused, using a meta-classifier. Therefore, it was necessary to use the weighted kNN algorithm as a class probability estimator. In particular, the probabilities of the “normal activity” and the “high activity” classes are estimated, using the weighted kNN algorithm, for each segment.
4 Fusion Approaches towards Violence Detection
4.1 Multi-modal Fusion
The seven audio class probabilities described in Section 2, along with the two visual-based class probabilities described in Section 3, are combined in order to extract a final binary decision. This process is executed on a mid-term basis, i.e. on a sequence of successive segments from the original stream. In particular, for each mid-term window of 1 sec length, a 10-D feature vector is created, whose elements are the seven audio probabilities described in Section 2.3, the label of the winner audio class, and the two visual-based classification decisions. The combined 10-D feature vector is used by a k-Nearest Neighbor classifier, which extracts the final binary (violence vs. non-violence) decision for the respective mid-term segment. The same process is repeated for all 1-sec segments of the whole video stream. In Figure 3 a scheme of this process is presented.
Fig. 3. Multi-modal fusion process
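The meta-classification step amounts to assembling the 10-D vector and feeding it to a kNN classifier; the sketch below assumes precomputed per-segment audio and visual probabilities, and the value of k is illustrative, not the one tuned in the paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fusion_vector(audio_probs, winner_audio_class, visual_probs):
    """10-D mid-term descriptor: 7 audio class probabilities, the winner audio
    class label, and the 2 visual activity class probabilities."""
    return np.concatenate([audio_probs, [winner_audio_class], visual_probs])

def train_meta_classifier(fusion_vectors, violence_labels, k=7):
    # violence_labels: 1 = violent segment, 0 = non-violent segment
    return KNeighborsClassifier(n_neighbors=k).fit(fusion_vectors, violence_labels)

def classify_segment(meta, audio_probs, winner_audio_class, visual_probs):
    v = fusion_vector(audio_probs, winner_audio_class, visual_probs)
    return int(meta.predict(v.reshape(1, -1))[0])        # decision for one 1-sec segment
```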
For comparison purposes, apart from the fused classifier, two individual kNN classifiers, an audio-based and a visual-based, have been trained, in order to distinguish between violence and non-violence, on a mid-term basis. In other words, these two individual classifiers have been trained on the 8D feature sub-space (audio-related) and on the 2D feature sub-space (visual-related) respectively. In Figure 4, an example of the violence detection algorithm is presented, when using a) only audio features b) only visual features and c) the fused feature vector. The gray line corresponds to the true class labels over time. Also, for each case, the precision and recall rates are presented. It is obvious that the fused approach performs significantly better than both audio and visual based approaches.
5 Experimental Evaluation
5.1 Scenario and Setup
For training and evaluation purposes, 50 videos have been ripped from 10 different films. The overall duration of the data is 2.5 hours. The video streams
Fig. 4. Violence detection example for a movie audio stream (violent vs. non-violent decisions over time): audio mid-term decision (Recall: 54.12, Precision: 76.67), visual mid-term decision (Recall: 10.59, Precision: 90.00), and fused mid-term decision (Recall: 60.00, Precision: 91.07)
have been manually annotated by three humans. In particular, the humans annotated the parts of the videos that contained violent content, and the results of this annotation have been used as ground truth for training and evaluating the proposed methods. According to the manual annotations, 19.4% of the data was of violent content. Almost 9000 mid-term segments were, in total, available for training and testing. However, the evaluation (as described in the following section) was carried out on a video-stream basis.
5.2 Classification and Detection Results
In this section, the results of the proposed binary classification method are presented. The performance of the fused classification process is compared to the individual performances obtained when only the audio or only the visual features are used. In all three methods, the "Leave One Out" evaluation method has been used on a video-file basis, i.e., in each cross-validation loop the mid-term segments of a single video file have been used for evaluation, while the rest of the data has been used for training. The following types of performance measures have been computed:
1. Classification Precision (P): This is the proportion of mid-term segments that have been classified as violent and were indeed violent.
2. Classification Recall (R): This is the proportion of mid-term violent segments that were finally classified as violent.
3. Classification F1 measure.
4. Detection Precision (Pd): This is the number of detected violent segments that were indeed violent, divided by the total number of detected violent segments.
5. Detection Recall (Rd): This is the number of correctly detected violent segments divided by the total number of true violent segments.
6. Detection Fd1 measure.
Performance measures P, R and F1 are associated with the classification performance of the algorithm on a mid-term (1-second) basis, while the measures Pd, Rd and Fd1 are related to the event detection performance of the algorithm. Note that a violent segment is correctly detected if it overlaps with a true violent segment. In addition, for comparison purposes, we have computed the same performance measures for the random mid-term classifier. The performance results are displayed in Table 2.

Table 2. Classification and Detection Performance Measures

Classification Performance Measures   Recall   Precision   F1
Audio-based classification            63.2%    45.2%       52.7%
Visual-based classification           65.1%    40.7%       50.1%
Random classification                 19%      50%         28%
Fused classification                  60.1%    47%         52.8%

Detection Performance Measures        Recall   Precision   F1
Audio-based detection                 82.9%    38.9%       53%
Visual-based detection                75.6%    34%         46.9%
Fused detection                       83%      45.2%       58.5%
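The distinction between the segment-level measures (P, R, F1) and the event-level ones (Pd, Rd, Fd1) can be made explicit with a short sketch; the overlap test and the handling of empty inputs are our own simplifications of the criteria defined above.

```python
import numpy as np

def classification_measures(pred, truth):
    """Per-1-second-segment precision, recall and F1 (label 1 = violent)."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    tp = np.sum((pred == 1) & (truth == 1))
    P = tp / max(np.sum(pred == 1), 1)
    R = tp / max(np.sum(truth == 1), 1)
    return P, R, 2 * P * R / max(P + R, 1e-12)

def detection_measures(detected, true_events):
    """Event-level precision/recall: a detection counts if it overlaps a true event.
    Events are (start, end) intervals in seconds."""
    overlaps = lambda a, b: a[0] < b[1] and b[0] < a[1]
    if not detected or not true_events:
        return 0.0, 0.0
    Pd = np.mean([any(overlaps(d, t) for t in true_events) for d in detected])
    Rd = np.mean([any(overlaps(d, t) for d in detected) for t in true_events])
    return float(Pd), float(Rd)
```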
6 Conclusions
We have presented a method for detecting violence in video streams from movies. Both audio- and visual-based classes have been defined, and respective soft-output classifiers have been trained. Then, a simple meta-classifier has been adopted in order to solve the binary classification task: Violence vs. Non-Violence. Experimentation has been carried out on a real film dataset. The experiments indicated that audio-based classification and detection were, respectively, 1.6% and 6.1% better than the visual-based method. Furthermore, the fused meta-classification boosted the overall performance compared to the best individual method (i.e., the audio-based method). Finally, the overall event detection performance indicated that only 17% of the violent events are not detected, while almost 1 out of 2 detected events are indeed violent ones.
Acknowledgment. This work has been supported by the Greek Secretariat for Research and Technology, in the framework of the PENED program, grant number TP698.
References 1. Datta, A., Shah, M., da Vitoria Lobo, N.: Person-on-person violence detection in video data. In: ICPR, vol. 1, pp. 433–438 (2002) 2. Zajdel, W., Krijnders, J., Andringa, T., Gavrila, D.: Cassandra: audio-video sensor fusion for aggression detection, pp. 200–205 (2007) 3. Vasconcelos, N., Lippman, A.: Towards semantically meaningful feature spaces for the characterization of video content. In: ICIP 1997: Proceedings of the 1997 International Conference on Image Processing (ICIP 1997), Washington, DC, USA, 3-Volume Set-Volume 1, p. 25. IEEE Computer Society, Los Alamitos (1997) 4. Nam, J., Alghoniemy, M., Tewfik, A.H.: Audio-visual content-based violent scene characterization. In: ICIP(1), pp. 353–357 (1998) 5. Vasconcelos, N., Lippman, A.: Towards semantically meaningful feature spaces for the characterization of video content. In: International Conference on Image Processing, pp. 25–28 (1997) 6. Datta, A., Shah, M., Lobo, N.V.: Person-on-person violence detection in video data. In: IEEE International Conference on Pattern Recognition, Canada (2002) 7. Nam, J., Tewfik, A.H.: Event-driven video abstraction and visualization. Multimedia Tools Appl. 16(1-2), 55–77 (2002) 8. Rasheed, Z., Shah, M.: Movie genre classification by exploiting audio-visual features of previews. In: Proceedings 16th International Conference on Pattern Recognition, pp. 1086–1089 (2002) 9. Giannakopoulos, T., Kosmopoulos, D., Aristidou, A., Theodoridis, S.: Violence content classification using audio features. In: Antoniou, G., Potamias, G., Spyropoulos, C., Plexousakis, D. (eds.) SETN 2006. LNCS (LNAI), vol. 3955, pp. 502–507. Springer, Heidelberg (2006) 10. Giannakopoulos, T., Pikrakis, A., Theodoridis, S.: A multi-class audio classification method with respect to violent content in movies, using bayesian networks. In: IEEE International Workshop on Multimedia Signal Processing, MMSP 2007 (2007) 11. Rifkin, R., Klautau, A.: In defense of one-vs-all classification. J. Mach. Learn. Res. 5, 101–141 (2004) 12. Lienhart, R., Maydt, J.: An extended set of haar-like features for rapid object detection. In: ICIP(1), pp. 900–903 (2002) 13. Makris, A., Kosmopoulos, D., Perantonis, S., Theodoridis, S.: Hierarchical feature fusion for visual tracking. In: IEEE International Conference on Image Processing 2007, September 16 -October 19, vol. 6, pp. VI –289–VI –292 (2007)
Experimental Study on a Hybrid Nature-Inspired Algorithm for Financial Portfolio Optimization

Giorgos Giannakouris, Vassilios Vassiliadis, and George Dounias

Management and Decision Engineering Laboratory, Department of Financial and Management Engineering, University of the Aegean, 41 Kountouriotou Str., GR-82100, Greece
[email protected], [email protected], [email protected]
Abstract. Hybrid intelligent schemes have proven their efficiency in solving NP-hard optimization problems. Portfolio optimization refers to the problem of finding the optimal combination of assets, and their corresponding weights, which satisfies a specific investment goal and various constraints. In this study, a hybrid intelligent metaheuristic, which combines the Ant Colony Optimization algorithm and the Firefly algorithm, is proposed for tackling a complex formulation of the portfolio management problem. The objective function under consideration is the maximization of a financial ratio which combines factors of risk and return. At the same time, a hard constraint, which refers to the tracking ability of the constructed portfolio towards a benchmark stock index, is imposed. The aim of this computational study is twofold. Firstly, the efficiency of the hybrid scheme is highlighted. Secondly, comparative results between alternative mechanisms, which are incorporated in the main function of the hybrid scheme, are presented. Keywords: ant colony optimization algorithm, firefly algorithm, portfolio optimization, hybrid NII algorithm.
1 Introduction

Nowadays, a non-trivial task for investment managers, as well as investors in general, is to construct efficient portfolios of assets which satisfy demanding objectives. There are several factors, market-related or not, which affect the performance of the constructed portfolio. One potential objective when optimizing a portfolio is to outperform or track a benchmark (stock) index. Stock indexes reflect many market factors. On the other hand, investing in the index itself is not an optimal strategy, for several reasons (transaction costs, difficulty of management). So, an obvious solution is to construct a portfolio containing a portion of the assets from the stock index itself, with the specific aim of tracking the movement of the stock index. This is known as passive portfolio management [1]. Portfolio optimization problems are concerned with finding the optimal combination of assets, as well as their corresponding weights, i.e. optimization in two search spaces: one discrete (for assets) and one continuous (for weights). This kind of problem is considered NP-hard, i.e. there is no deterministic algorithm known
that can find an exact solution within polynomial time. In order to demonstrate the complexity of the portfolio optimization problem, consider the following situation. If the objective is to find the optimal portfolio consisting of 10 out of 100 assets, then the number of alternative combinations would be C(100, 10) ≈ 1.73×10^13. What is more, for each of these combinations, the optimal weights have to be found (continuous solution space). As can be seen, the alternative combinations in the continuous solution space tend to infinity. As a result, exhaustive search algorithms, or other traditional approaches from the field of operational research, are unable to find the optimal solution efficiently or, at best, get stuck in local optima [2]. A potential solution is the introduction of intelligent metaheuristics. Nature-inspired algorithms in the field of artificial intelligence correspond to techniques that are based on how biological systems and natural networks deal with real-world situations in nature [3]. Some examples are the ant colony optimization (ACO) algorithm, which stems from the basic functions of real ant colonies, and the particle swarm optimization (PSO) algorithm, which is based on how flocks of birds move and communicate. The main advantage of nature-inspired intelligent algorithms over traditional methodologies which deal with optimization problems is their searching ability. While methodologies from the field of operational research tend to get stuck in local optima, nature-inspired intelligent techniques combine unique characteristics for global exploration of the solution space. Finally, hybrid schemes combine unique characteristics of two or more intelligent methods so as to enhance the search of the solution space. The aim of this study is twofold. First and foremost, this paper introduces a new hybrid scheme which combines characteristics of two nature-inspired intelligent algorithms, namely the ant colony (ACO) and the firefly (FA) algorithm. The main characteristic of ACO is the way real ant colonies behave in order to find a food source. Ant colony algorithms have proven their efficiency in searching the solution space globally, both in discrete and continuous cases. The main characteristic of FA is the way real fireflies move and communicate with each other. Swarm-based algorithms are very effective in finding near-optimum solutions in complex continuous spaces. All in all, nature-inspired methodologies have a very good searching ability. As a result, by combining two methodologies with good heuristic rules for searching complex solution spaces, good-quality solutions are expected to be found. Secondly, some basic mechanisms of these algorithms, which deal with their searching ability, are studied. More specifically, comparative results from the implementation of alternative mechanisms are presented. The application domain concerns a complex portfolio management problem, whose objective is to maximize a financial ratio under a constraint on the index tracking ability of the constructed portfolio. The main contribution of this work is to highlight the importance and efficiency of hybrid nature-inspired intelligent schemes and to give better insight into the functionality of particular mechanisms of the main algorithms. The structure of this paper is as follows. In section 1, an introduction to some main concepts is given. In section 2, findings from the literature review are presented in brief.
In section 3, the basic methodological issues are shown. In section 4, the mathematical formulation of the optimization problem is presented. Computational results and a brief discussion are presented in section 5. Finally, in section 6 some basic conclusions and future research potentials are presented.
2 Literature Review

In what follows, a selection of some representative studies from the literature is presented in brief. These studies concern the application of hybrid or non-hybrid nature-inspired intelligent (NII) algorithms to various portfolio optimization problems. Studies in this field are limited, and only a selection of them is presented here.

Table 1. Basic studies from the literature

Reference  Applied Methodology                                        Portfolio Optimization Problem
[1,4]      Genetic Algorithm & Quadratic Programming (hybrid)         Minimize tracking error volatility
[5]        Evolutionary Algorithm & Quadratic Programming (hybrid)    Minimize tracking error volatility
[6]        Hill Climbing, Simulated Annealing, Tabu Search            Minimize portfolio's risk
[7]        Genetic Algorithms, Evolutionary Algorithms, Memetic Algorithms    Minimize portfolio's risk
[8]        Genetic Algorithm & Simulated Annealing                    Minimize portfolio's risk; constraint on portfolio's expected return
[9]        Ant Colony Optimization algorithm                          Maximize Sharpe Ratio
[10]       Particle Swarm Optimization                                Minimize portfolio's risk; constraint on portfolio's expected return
[11]       Particle Swarm Optimization                                Maximize excess return; constraint on tracking error volatility
[12]       Ant Colony Optimization & non-linear programming algorithm (hybrid)    Minimize probability of tracking error falling below a threshold
Some interesting conclusions drawn from the literature review are the following:
• In most of the studies, some benchmark methodologies are used against the proposed hybrid intelligent schemes. These methodologies include Monte Carlo-based search, techniques from the field of operational research (such as quadratic programming and other non-linear programming algorithms), local search heuristics, as well as other versions of intelligent metaheuristics. The results are indicative of the efficiency of hybrid schemes as far as the quality of the solution is concerned. Their main disadvantage is that they need more computational time in order to explore the solution space in a more efficient way.
• Another important observation is that genetic-algorithm and evolutionary-based metaheuristics have mostly been applied to this kind of problem. On the one hand, this might be indicative of their efficiency. On the other hand, the NII field offers many competitive algorithms, and only in recent years has academia dealt with them.
• As far as the portfolio optimization problem is concerned, the main focus has been on classical formulations of the problem (referring to the mean-variance framework). However, in recent years an attempt has been made to shed some light on other, more important, aspects of the problem. Passive and active portfolio management offer a brand new class of portfolio optimization problems whose main focus is on tracking or beating the market.
• To sum up, the findings from the literature review highlight the importance of using hybrid NII techniques in order to solve the portfolio optimization problem under the passive and active management framework. In particular, new, more complex formulations of the problem offer new challenges to the research community. The combination of unique characteristics from two or even more NII algorithms is encouraged. In this study, results from such a hybrid NII scheme are compared with those of a classical Monte Carlo-based algorithm.
3 Methodological Issues

A hybrid scheme, which combines two nature-inspired intelligent metaheuristics, is used in this study. More specifically, an ant colony optimization (ACO) algorithm is used in order to find the optimum, or near-optimum, combination of assets, whereas a firefly algorithm (FA) is used to find the optimum, or near-optimum, weights for the constructed portfolio. The ACO algorithm was first proposed by Dorigo in the beginning of the 1990s [14]. ACO belongs to a class of metaheuristics whose main attribute is that they yield high-quality near-optimum solutions in a reasonable amount of time, which can be considered an advantage if the solution space is of high dimensionality. The ACO metaheuristic is mainly inspired by the foraging behavior of real ant colonies. Ants start searching for potential food sources in a random manner, due to the lack of knowledge about the surrounding environment. When they come upon a food source, they carry some of it back to the nest. On the way back, each ant deposits a chemical substance, called pheromone, in an amount that is a function of the quality and the quantity of the food source. In this way, chemical trails are formed along each ant's path. However, as time passes, this chemical evaporates. Only paths with strong pheromone trails, corresponding to high-quality food sources, manage to survive. As a consequence, all ants from the nest tend to follow the path, or paths, with a large amount of pheromone. This indirect kind of communication is called stigmergy. The firefly algorithm (FA) was first introduced by Yang in 2007 [15]. FA is a swarm-based intelligent metaheuristic which is based on the way real fireflies move and communicate. In nature, fireflies produce light through a chemical process (transformation of chemical into light energy, i.e. bioluminescence) and use it in order to communicate with each other. Based on the brightness, the frequency and the time period of the light, real fireflies are able to estimate the distance to each other. An important aspect of this process is that the attractiveness of a firefly is related to the brightness of its light. So, brighter fireflies attract less bright ones, for two reasons: mating or predation. In the proposed hybrid metaheuristic scheme, ACO is used for the discrete optimization (assets), while FA is used for the continuous optimization (weights). The flowchart below demonstrates the main function of the hybrid algorithm. Last, but not least, alternative techniques regarding some main mechanisms of the hybrid scheme were proposed with the aim of enhancing the efficiency of the algorithm. More specifically, two basic mechanisms were studied, as described after the flowchart below.
Fig. 1. Flowchart of the proposed hybrid scheme: after initialization of the algorithm's parameters and basic structures/variables, for every generation each ant in the population constructs a portfolio, i.e. a combination of assets (ACO process), and near-optimum weights are found for the constructed portfolio (FA process); at the end of each generation the best-so-far portfolio is updated.
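The loop of Fig. 1 can be outlined as follows. In this sketch the ACO construction step is reduced to pheromone-proportional sampling and the firefly weight search is replaced by a random-restart placeholder, so it only illustrates the control flow under our own naming and parameter choices, not the authors' implementation.

```python
import numpy as np

def construct_portfolio(pheromone, n_assets):
    # ACO step: sample a combination of assets with probability
    # proportional to their pheromone values.
    p = pheromone / pheromone.sum()
    return np.random.choice(len(pheromone), size=n_assets, replace=False, p=p)

def optimize_weights(fitness, assets, n_trials=500):
    # Placeholder for the firefly search in the continuous (weight) space:
    # keep the best of random weight vectors that sum to one.
    best_w, best_f = None, -np.inf
    for _ in range(n_trials):
        w = np.random.dirichlet(np.ones(len(assets)))
        f = fitness(assets, w)
        if f > best_f:
            best_w, best_f = w, f
    return best_w, best_f

def hybrid_aco_fa(fitness, n_universe, n_assets=5, n_ants=30, n_gen=100, rho=0.1):
    pheromone = np.ones(n_universe)
    best = (None, None, -np.inf)
    for _ in range(n_gen):
        for _ in range(n_ants):
            assets = construct_portfolio(pheromone, n_assets)
            w, f = optimize_weights(fitness, assets)
            if f > best[2]:
                best = (assets, w, f)
        pheromone *= (1.0 - rho)              # pheromone evaporation
        pheromone[best[0]] += rho             # reinforce the best-so-far assets
    return best                               # (assets, weights, fitness)
```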
Firstly, in the portfolio construction process of the ACO, a roulette wheel mechanism is applied in order to ensure a level of randomness; otherwise, there is a high probability of premature convergence. A tournament selection mechanism is proposed as an alternative [16]. This mechanism initially selects a random sample of assets out of the universe of assets, and then picks the asset with the highest selection probability. In contrast, in the roulette wheel process, all assets are placed in a roulette-wheel draw based on their selection probabilities. Secondly, the Euclidean distance is applied in order to calculate the spatial distance between two artificial fireflies in the FA. Alternatively, the Manhattan distance is used [17]. The difference between these two measures is that the formula for the Euclidean distance contains squares and a square root, which require extra computational effort.
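The two pairs of alternative mechanisms can be contrasted in a few lines; the tournament size below is illustrative, and the probability vectors are assumed to be the ACO selection probabilities of the assets.

```python
import numpy as np

def roulette_wheel(probabilities):
    # every asset takes part, with chance proportional to its selection probability
    p = np.asarray(probabilities, dtype=float)
    return int(np.random.choice(len(p), p=p / p.sum()))

def tournament(probabilities, sample_size=5):
    # draw a random sample of assets, then pick the most probable one in the sample
    p = np.asarray(probabilities, dtype=float)
    candidates = np.random.choice(len(p), size=sample_size, replace=False)
    return int(candidates[np.argmax(p[candidates])])

def euclidean(a, b):
    return float(np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def manhattan(a, b):
    # cheaper alternative: no squares and no square root
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b))))
```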
4 Application Domain

Harry M. Markowitz, with his seminal paper [18], established a new framework for the study of portfolio management, and most financial researchers have built on his work in order to advance the topic of portfolio optimization. As mentioned
above, the portfolio optimization problem deals with finding a combination of assets, as well as the corresponding amount of capital invested in these assets, with the aim of optimizing a given objective function (the investor's goal) under certain constraints. There are various formulations of the objective function, linear or not, and each of them corresponds to a different type of problem.
Passive portfolio management is adopted by investors who believe that financial markets are efficient, i.e. that it is impossible to consistently beat the market. So, the main objective is to achieve a level of returns and risk as close as possible to that of a certain benchmark. One passive portfolio management strategy is index tracking, i.e. the construction of a portfolio, using assets from a universe of assets (such as a stock index), that attempts to reproduce the performance of the stock index itself [19]. In this study, a constraint on tracking error volatility, i.e. a measure of the deviation between the portfolio's and the benchmark's returns, is imposed.
The objective of the portfolio optimization problem is to maximize a financial ratio, namely the Sortino ratio [20]. The Sortino ratio is based on the preliminary work of Sharpe (Sharpe ratio) [21], who developed a reward-to-variability ratio. The main concept was to create a criterion that takes into consideration both an asset's expected return and its volatility (risk). However, in recent years, investors have started to adopt the concepts of "good volatility", which considers returns above a certain threshold, and "bad volatility", which considers returns below a certain threshold. Investors are mainly interested in the minimization of "bad volatility" when investing in an asset. So, the Sortino ratio considers only the volatility of returns which fall below a defined threshold. The formulation of the financial optimization problem is presented below:

maximize  S = (E(rP) − rf) / θ0(rP)                                   (1)

s.t.
Σi wi = 1                                                             (2)
0 ≤ wi ≤ 1, for every asset i                                         (3)
number of assets included in the portfolio ≤ N                        (4)
tracking error volatility with respect to the benchmark ≤ H          (5)
where
E(rP) is the portfolio's expected return,
rf is the risk-free return,
θ0(rP) is the volatility of the returns which fall below a certain threshold and equals
θ0(rP) = ( ∫−∞^rf (rf − r)² f(r) dr )^(1/2)                            (6)

wi is the percentage of capital invested in the i-th asset,
N is the maximum number of assets contained in a portfolio,
rB is the benchmark's daily return,
H is the upper threshold for the tracking error volatility,
f(r) is the probability density function of the portfolio's returns.
Assuming that the portfolio's returns follow a normal distribution, the probability density function can be defined as:

f(r) = (1 / (σP √(2π))) · exp( −(r − E(rP))² / (2σP²) ),

where σP denotes the standard deviation of the portfolio's returns.
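As an illustration of the objective in (1) and the downside deviation in (6), the following sketch estimates a Sortino ratio from a sample of daily portfolio returns; the sample-based estimator and the use of the risk-free rate as the downside threshold are assumptions made for this example.

```python
import numpy as np

def sortino_ratio(returns, risk_free=0.0):
    """Sample estimate of (E(r_P) - r_f) / theta_0(r_P), with r_f as the threshold."""
    returns = np.asarray(returns, dtype=float)
    excess = returns.mean() - risk_free
    downside = np.minimum(returns - risk_free, 0.0)      # only returns below the threshold
    downside_dev = np.sqrt(np.mean(downside ** 2))       # discrete analogue of (6)
    return excess / downside_dev if downside_dev > 0 else np.inf

# Illustrative daily returns of a candidate portfolio
print(sortino_ratio([0.01, -0.02, 0.015, 0.003, -0.007], risk_free=0.0))
```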
5 Computational Study

In this part of the paper, some basic simulation results are presented. The data used in this study comprise daily closing stock prices from the S&P 500, a stock index comprising the 500 US stocks with the highest capitalization. The time period for this study is six months (01/12/2008 – 01/05/2009), a significant period due to the global financial crisis. In the next table, the values of the basic parameters for both the hybrid scheme and the application domain are presented.

Table 2. Parameters for hybrid scheme and financial optimization problem

Parameters for the ACO algorithm
  Generations: 10
  Population: 50
  Percentage of best ants in each generation: 10%
  Pheromone evaporation rate: 0.2
Parameters for the Firefly Algorithm
  Gamma (light absorption coefficient): 0.2
  b0 (attractiveness coefficient from a hypothetical light source): 1
  Alpha (randomness coefficient): 0.4
  M (coefficient used for the calculation of the level of attractiveness): 2
  Generations: 2
  Population: 100
Financial parameters
  Cardinality: 5
  Lower and upper bound for weights: [0 1]
  Tracking error volatility constraint: 0.0001
The selection of configuration settings for these parameters was based both on evidence from the literature and on previous simulation studies. Also, due to the stochastic behavior of the NII metaheuristics, four sets of simulations of a hundred independent runs each were conducted.
Table 3. Sets of simulations

  First set of simulations:  Roulette Wheel (ACO) - Euclidean Distance (FA)
  Second set of simulations: Tournament Selection (ACO) - Euclidean Distance (FA)
  Third set of simulations:  Roulette Wheel (ACO) - Manhattan Distance (FA)
  Fourth set of simulations: Tournament Selection (ACO) - Manhattan Distance (FA)
Table 4. Basic simulation results

Quantiles of the distribution of objective function values (0.975 | 0.50 | 0.025):
  1st set: 0.5717 | 0.5040 | 0.4626
  2nd set: 0.5614 | 0.4959 | 0.4646
  3rd set: 0.5567 | 0.5022 | 0.4580
  4th set: 0.5718 | 0.4969 | 0.4582

Best results found (Sortino ratio | portfolio | weights | execution time in days):
  1st set: 0.5921 | [45,87,36,5,449,485] | [0.5645,0,0.0472,0,0.3882] | 3.2
  2nd set: 0.5717 | [3,45,229,242,309] | [0,0.7869,0.0347,0.1779,0] | 3
  3rd set: 0.5718 | [45,66,79,107,242] | [0.7816,0,0,0,0.2184] | 2.92
  4th set: 0.5996 | [81,115,174,242,438] | [0.0902,0,0.3202,0.2368,0.3528] | 3.07
Benchmark methods:
  Monte Carlo & non-linear programming (1): 0.5548 | [45,242,450,289,484] | [0.7817,0.2183,0,0,0]
  5 highest capitalization stocks: 0.2800 | [492,47,3,251,204] | [0,0,0.7655,0,0.2345]
In order to analyze the results presented in Table 4, the notion of quantiles has to be properly described. If the 0.975 quantile of X equals a, then there is a probability of 97.5% that X takes values less than a. As far as the distribution of the objective function's values is concerned, it is desirable to have two basic properties:
− a fat right tail, which indicates a high probability of finding portfolios with large Sortino ratios;
− a thin left tail, which indicates a low probability of finding portfolios with small Sortino ratios.
So, it is preferable for the quantiles to have large values, indicating a distribution that is shifted as far to the right as possible. Based on the numerical findings, all four sets yield quite similar results; however, the first set seems to yield slightly better results than the other three. At this point, it has to be noted that these results may be considered a preliminary approach, for two reasons:
− firstly, a hundred independent runs can give only limited insight into the efficiency and functionality of the alternative mechanisms;
− secondly, the configuration settings require further investigation and experimentation (very low values were used for the generation and population parameters of the two NII algorithms). On the other hand, setting these parameters to high values would require more computational effort and time; it can be observed that, with the current settings, the average execution time was about three days.
(1) The non-linear algorithm is based on the Levenberg-Marquardt method, which combines the Gauss-Newton and the steepest descent methods.
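The quantile-based comparison described above can be reproduced with a short script; the vector of Sortino ratios below stands in for the hundred values obtained from the independent runs of one simulation set and is generated synthetically for illustration.

```python
import numpy as np

# Hypothetical Sortino ratios from 100 independent runs of one simulation set
rng = np.random.default_rng(0)
sortino_values = rng.normal(loc=0.50, scale=0.03, size=100)

# The three quantiles reported in Table 4
q975, q50, q025 = np.quantile(sortino_values, [0.975, 0.50, 0.025])
print(f"0.975: {q975:.4f}  0.50: {q50:.4f}  0.025: {q025:.4f}")
```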
Also, the fourth set managed to find the portfolio with the maximum Sortino ratio. Another noticeable fact is that the optimal portfolio (fourth set) includes stock 242, which corresponds to Sun Microsystems Inc. (JAVA). The inclusion of this stock strongly affected the expected return of the portfolio and, as a consequence, the value of the Sortino ratio. In this particular time period, there were advanced discussions regarding the takeover of Sun Microsystems Inc. by IBM, a fact that might have boosted the value of the Sun Microsystems Inc. stock, leading to high returns.
What is more, some benchmark heuristics were used with the aim of checking the efficiency of the hybrid scheme against other techniques. The first approach was a Monte-Carlo-based algorithm (for finding the combination of stocks) hybridized with a non-linear programming algorithm (for finding the weights). The second approach was a simple financial heuristic rule. As can be observed in Table 4, the best value for the objective function (Sortino ratio) was found by the proposed hybrid scheme. The Monte Carlo approach yielded a slightly worse Sortino ratio, whereas the portfolio constructed using the financial heuristic rule performed rather poorly in terms of the objective function.
6 Conclusion and Future Research

In this study a hybrid NII scheme, which combines an ACO and an FA algorithm, was proposed for solving a certain formulation of the constrained passive portfolio optimization problem. More specifically, the objective was to maximize a financial ratio, namely the Sortino ratio, under a constraint on the tracking ability of the portfolio. The focus of this work was on two aspects. Firstly, the efficiency of the hybrid scheme was checked against other traditional local search or naïve approaches (Monte Carlo, non-linear programming). Secondly, a comparative study regarding the introduction of alternative mechanisms at certain points of the proposed scheme was conducted.
Although the simulation results are preliminary, some useful conclusions can be drawn. Firstly, the hybrid NII algorithm yielded better (near-optimum) solutions compared to the traditional benchmark techniques. Secondly, as far as the alternative mechanisms are concerned, the use of the roulette wheel (ACO) and Euclidean distance (FA) mechanisms gave slightly better results. This indicates that further research must be done regarding the basic mechanisms of NII algorithms; other NII techniques or heuristic rules have great potential and should be thoroughly studied. The main disadvantages of the proposed hybrid technique were: a) the computational effort required and b) the use of non-optimized parameter settings. All in all, the use of hybrid nature-inspired intelligent algorithms is quite promising due to their good searching ability in complex solution spaces. Traditional approaches tend to get stuck in local optima in most cases, a fact which highlights the advantage of nature-inspired techniques. However, an important issue that has to be tackled is the intensive computational effort of these methodologies. As far as the application domain is concerned, near-optimum portfolios which track the S&P 500 stock index at an acceptable level were found. In a period of financial crisis, it is important to construct portfolios which manage to track stock indexes.
Finally, some future research directions might be the following. Firstly, other NII algorithms, hybrid or not, should be used as benchmark methodologies so as to lead to safer conclusions about the proposed methodology. Another important aspect is the optimization of the configuration settings, which might ensure better results. Also, the study of other alternative mechanisms, either from the field of artificial intelligence or from other heuristics, is encouraged. What is more, further investigation is required in order to reach safer conclusions about the functionality of the alternative mechanisms proposed in this study. As far as the application domain is concerned, studies on other time periods should be conducted as well. Furthermore, other formulations of the portfolio optimization problem should be investigated, specifically those which reflect up-to-date objectives.
References

1. Jeurissen, R.V.: Optimized Index Tracking using a Hybrid Genetic Algorithm. In: IEEE Congress on Evolutionary Computation, pp. 2327–2334 (2008)
2. Maringer, D.: Portfolio Management with Heuristic Optimization. Advances in Computational Science. Springer, Heidelberg (2005)
3. Vassiliadis, V., Dounias, G.: Nature-inspired intelligence: a review of selected methods and applications. International Journal on Artificial Intelligence Tools 18(4), 487–516 (2009)
4. Shapcott, J.: Index Tracking: Genetic Algorithms for Investment Portfolio Selection. EPCC-SS92-24, 1–24 (1992)
5. Ruiz, R.T., Suarez, A.: A hybrid optimization approach to index tracking. Annals of Operations Research 166(1), 57–71 (2008)
6. Schaerf, A.: Local search techniques for constrained portfolio selection problems. Computational Economics 20(3), 177–190 (2002)
7. Streichert, F., Ulmer, H., Zell, A.: Evolutionary algorithms and the cardinality constrained portfolio optimization problem. In: Selected Papers of the International Conference on Operations Research (OR 2003), pp. 253–260 (2003)
8. Gomez, M.A., Flores, C.X., Osorio, M.A.: Hybrid search for cardinality constrained portfolio optimization. In: GECCO 2006, pp. 1865–1866 (2006)
9. Maringer, D.: Small is beautiful. Diversification with a limited number of assets. Center for Computational Finance and Economic Agents, working paper (2006)
10. Chen, W., Zhang, R.T., Cai, Y.M., Xu, F.S.: Particle swarm optimization for constrained portfolio selection problems. In: 5th International Conference on Machine Learning and Cybernetics, pp. 2425–2429 (2006)
11. Thomaidis, N.S., Angelidis, T., Vassiliadis, V., Dounias, G.: Active portfolio management with cardinality constraints: an application of particle swarm optimization. New Mathematics and Natural Computation, working paper (2007)
12. Vassiliadis, V., Thomaidis, N., Dounias, G.: Active portfolio management under a downside risk framework: comparison of a hybrid nature-inspired scheme. In: Corchado, E., Wu, X., Oja, E., Herrero, Á., Baruque, B. (eds.) HAIS 2009. LNCS, vol. 5572, pp. 702–712. Springer, Heidelberg (2009)
13. Jorion, P.: Portfolio Optimization with tracking-error constraints. Financial Analysts Journal 59(5), 70–82 (2003)
14. Dorigo, M., Stützle, T.: Ant Colony Optimization. MIT Press, Cambridge (2004)
15. Yang, X.S.: Nature-Inspired Metaheuristic Algorithms. Luniver Press (2008)
16. Miller, B.L., Goldberg, D.E.: Genetic algorithms, tournament selection, and the effects of noise. Complex Systems 9(3), 193–212 (1995)
17. Swanepoel, K.J.: Vertex degrees of Steiner minimal trees in lp and other smooth Minkowski spaces. Discrete and Computational Geometry 21, 437–447 (1999)
18. Markowitz, H.: Portfolio Selection. The Journal of Finance 7(1), 77–91 (1952)
19. Jeurissen, R.: A hybrid genetic algorithm to track the Dutch AEX-Index. Bachelor thesis, Informatics & Economics, Faculty of Economics, Erasmus University of Rotterdam (2005)
20. Kuhn, J.: Optimal risk-return tradeoffs of commercial banks and the suitability of profitability measures for loan portfolios. Springer, Berlin (2006)
21. Sharpe, W.F.: The Sharpe ratio. Journal of Portfolio Management, 49–58 (1994)
Associations between Constructive Models for Set Contraction

Vasilis Giannopoulos and Pavlos Peppas

Dept of Business Administration, University of Patras, Patras 265 00, Greece
[email protected], [email protected]
Abstract. Belief Change is one of the central research topics in Knowledge Representation, and theory revision and contraction are two of the most important operators in Belief Change. Recently the original axiomatization of revision and contraction was extended to include epistemic input represented by a (possibly infinite) set of sentences (as opposed to a single sentence), giving rise to the operators of set revision (also known as multiple revision) and set contraction. Set revision and set contraction have been characterized in terms of constructive models called systems of spheres and epistemic grasps respectively. Based on these links, in this paper we provide a characterization of set contraction in terms of systems of spheres, and we identify the necessary and sufficient conditions under which the system-of-spheres model and the epistemic-grasp model give rise to the same set contraction. Keywords: Belief Revision, Belief Merging, Knowledge Representation.
1 Introduction
Imagine a situation where each agent has her own beliefs. If all conditions remain stable and nothing unexpected happens, then each agent will continue maintaining her beliefs. What will happen, however, if some new information (alias, epistemic input) is received which comes contrary to the existing beliefs? How will each agent react? The area of Belief Revision studies these questions thoroughly, offering formal models for the process of belief change. Each agent, as a separate personality, reacts differently when new information comes up and revises the body of her beliefs differently. The challenge taken up by the first authors who dealt with belief change was the creation of a theory general enough to encompass this whole range of different agent reactions. The article which is considered to mark the birth of the area of Belief Revision is [1], by Alchourron, Gardenfors and Makinson. The framework that evolved from [1] - known as the AGM paradigm - is to this date the dominant framework in Belief Revision. Since then many articles have been published (see [10] for a recent survey), addressing the need to grow and extend the theory and to cover weaknesses and gaps of the initial articles.
The original AGM paradigm focuses on belief change with respect to epistemic input encoded as a single logical sentence ϕ. Later, many studies generalised the AGM paradigm to include epistemic input expressed as a (possibly infinite) set of sentences. This process is known as multiple or set belief change. The present paper is a contribution to this line of research.
More precisely, the two most prominent types of belief change are revision and contraction. For both these processes axiomatic as well as constructive models have been proposed, and representation results have been produced establishing the equivalence between axioms and constructions. For multiple revision, one of the most popular constructive models is that based on a system of spheres [6], [8], [9]. For set contraction, a new constructive model based on the notion of epistemic grasp has recently been developed in [11], which (unlike previous models) characterizes precisely the axioms for set contraction without the need for the so-called limit postulate. Building on these links, in this paper we, firstly, express set contraction in terms of systems of spheres, and, secondly, we formulate a necessary and sufficient condition under which a system of spheres and an epistemic grasp generate the same set contraction.
The paper is structured as follows. In the next section we introduce some notation. In Section 3 we review the axiomatic approaches to multiple revision and contraction. This is followed by a quick review of the constructive models for these types of belief change. In Section 5 we present our main results. The last section contains some concluding remarks.
2 Preliminaries
Throughout this paper we shall be working with a formal language L governed by a logic identified by its consequence relation ⊢. Very little is assumed about L and ⊢. In particular, L is taken to be closed under all Boolean connectives, and ⊢ has to satisfy the following properties:
(i) ⊢ ϕ for all truth-functional tautologies ϕ.
(ii) If ⊢ (ϕ → ψ) and ⊢ ϕ, then ⊢ ψ.
(iii) ⊢ is consistent, i.e. not every sentence of L is derivable.
(iv) ⊢ satisfies the deduction theorem, that is, ϕ1, . . . , ϕn ⊢ ψ iff ⊢ (ϕ1 ∧ ϕ2 ∧ ... ∧ ϕn) → ψ.
(v) ⊢ is compact.
For a set of sentences Γ of L, we denote by Cn(Γ) the set of all logical consequences of Γ, i.e. Cn(Γ) = {ϕ ∈ L: Γ ⊢ ϕ}. A theory K of L is any set of sentences of L closed under ⊢, i.e. K = Cn(K). We denote the set of all theories of L by KL. A theory K of L is complete iff for all sentences ϕ ∈ L, ϕ ∈ K or ¬ϕ ∈ K. We denote the set of all consistent complete theories of L by ML. For a set of sentences Γ of L, [Γ] denotes the set of all consistent complete theories of L that contain Γ. Often we use the notation [ϕ] for a sentence ϕ ∈ L, as an abbreviation of [{ϕ}]. For a theory K and a set of sentences Γ of L, we denote by K + Γ the closure under ⊢ of K ∪ Γ, i.e. K + Γ = Cn(K ∪ Γ). For a sentence
ϕ ∈ L we often write K + ϕ as an abbreviation of K + {ϕ}. Finally, the symbols ⊤ and ⊥ will be used to denote an arbitrary tautology and contradiction of L respectively.
3 Axiomatic Models for Belief Change
In the AGM paradigm a belief set is modelled as a logical theory K. The epistemic input was originally encoded as a single sentence ϕ, but, as already mentioned, recent generalizations model the epistemic input as a set of sentences. With these ingredients, the process of multiple belief revision is modelled as a function ∗ mapping a theory K and a nonempty, consistent, possibly infinite set of sentences Γ to a new theory K ∗ Γ that satisfies the following postulates:
(K ∗ 1) K ∗ Γ is a theory of L.
(K ∗ 2) Γ ⊆ K ∗ Γ.
(K ∗ 3) K ∗ Γ ⊆ K + Γ.
(K ∗ 4) If K ∪ Γ is consistent then K + Γ ⊆ K ∗ Γ.
(K ∗ 5) If Γ is consistent then K ∗ Γ is also consistent.
(K ∗ 6) If Cn(Γ) = Cn(Δ) then K ∗ Γ = K ∗ Δ.
(K ∗ 7) K ∗ (Γ ∪ Δ) ⊆ (K ∗ Γ) + Δ.
(K ∗ 8) If (K ∗ Γ) ∪ Δ is consistent then (K ∗ Γ) + Δ ⊆ K ∗ (Γ ∪ Δ).
These postulates, proposed by Fuhrmann [2] and refined by Lindstrom [7], are a straightforward generalization of the celebrated AGM postulates for sentence revision developed by Gardenfors in [4], which are widely considered to have captured much of the essence of rational belief revision.
Like revision, the process of belief contraction is also modelled as a function −̇ mapping a theory K and a nonempty consistent set of sentences Γ to a new theory K −̇ Γ. However, contrary to revision, the generalization of the original postulates for contraction was not straightforward. The reason is that there are (at least) three different ways of interpreting multiple contraction, generating three different operators called package contraction, choice contraction, and set contraction. The first two are due to Fuhrmann and Hansson [3], while the third has been introduced and analyzed by Zhang and Foo [12], [13]. In this paper we focus on set contraction.
Given a set of sentences Γ as epistemic input, a set contraction function −̇ contracts a belief set K so that the resulting set K −̇ Γ is consistent with Γ. Formally, −̇ is defined as a function from KL × 2^L to KL that satisfies the following postulates:
(K −̇ 1) K −̇ Γ is a theory of L.
(K −̇ 2) K −̇ Γ ⊆ K.
(K −̇ 3) If Γ is consistent with K then K −̇ Γ = K.
(K −̇ 4) If Γ is consistent, then Γ is consistent with K −̇ Γ.
(K −̇ 5) If ϕ ∈ K and Γ ⊢ ¬ϕ then K ⊆ (K −̇ Γ) + ϕ.
(K −̇ 6) If Cn(Γ) = Cn(Δ) then K −̇ Γ = K −̇ Δ.
(K −̇ 7) If Γ ⊆ Δ then K −̇ Δ ⊆ (K −̇ Γ) + Δ.
(K −̇ 8) If Γ ⊆ Δ and Δ is consistent with K −̇ Γ, then K −̇ Γ ⊆ K −̇ Δ.
Not surprisingly, set contraction and multiple revision are closely related. The two identities below are generalizations of the ones proposed for sentence revision and contraction, and are named after Isaac Levi and William Harper who originally devised them:
K ∗ Γ = (K −̇ Γ) + Γ (Generalized Levi Identity).
K −̇ Γ = (K ∗ Γ) ∩ K (Generalized Harper Identity).
Results in [13] show that given a set contraction function −̇, the function ∗ generated by the Generalized Levi Identity satisfies the postulates (K ∗ 1) - (K ∗ 8). Likewise, for any multiple revision function ∗ the function −̇ produced from the Generalized Harper Identity is a set contraction function (i.e. it satisfies (K −̇ 1) - (K −̇ 8)). In fact the relation between the two types of belief change is even stronger: starting with any set contraction function −̇, the multiple revision function ∗ induced by the Generalized Levi Identity is such that, when fed to the Generalized Harper Identity, it generates the same set contraction function −̇ we started with.
4 Constructive Models
To give more substance to the AGM postulates, several authors have proposed constructive models for belief revision and contraction. These models are used both for sentence and set belief change, customized depending on the case. Below we examine the two constructive models that will be the focus of our main technical results.

4.1 System of Spheres Model
Given an initial belief set K, Grove [6] introduces a system of spheres centred on [K] to be in essence a preorder on consistent complete theories (which in this context play the role of possible worlds), representing comparative plausibility: the closer a world is to the start of the ordering, the more plausible it is from K's point of view. Peppas [8], [9] refined Grove's model to render it adequate for multiple belief revision. This refined version of a system of spheres, called a well ranked system of spheres, is formally defined as a collection S of subsets of ML, called spheres, satisfying the following conditions:
(S1) S is totally ordered with respect to set inclusion; that is, if V, U ∈ S then V ⊆ U or U ⊆ V.
(S2) The smallest sphere in S is [K]; that is, [K] ∈ S, and if U ∈ S then [K] ⊆ U.
(S3) ML ∈ S (and therefore ML is the largest sphere in S).
(SM) For every nonempty consistent set of sentences Γ, there exists a smallest sphere in S intersecting [Γ] (denoted c(Γ)).
(SD) For every nonempty Γ ⊆ L, if there is a smallest sphere c(Γ) in S intersecting [Γ], then [⋂(c(Γ) ∩ [Γ])] ⊆ c(Γ) ∩ [Γ].
As already mentioned, a system of spheres represents comparative plausibility, with worlds closer to the center being more plausible than distant ones. Suppose now that we want to revise K by a set of sentences Γ. Intuitively, the rational thing to do is to select the most plausible Γ-worlds and define through them the new belief set K ∗ Γ:
(S*) K ∗ Γ = ⋂(c(Γ) ∩ [Γ]).
In [6] Grove proved that the functions produced from systems of spheres through (S*) match exactly the ones satisfying the AGM postulates for sentence revision. Peppas [9] later generalized this result to multiple revision.

4.2 Epistemic Grasp
A second very popular constructive model in belief change is that based on epistemic entrenchment. An epistemic entrenchment is a special preorder on sentences (rather than on worlds) and it is used to construct contraction functions (rather than revision functions). It was introduced by Gardenfors and Makinson in [5], and was shown to be sound and complete with respect to the postulates for contraction.
Zhang and Foo [13] generalized the epistemic entrenchment model for set contraction, introducing a structure called a nicely ordered partition. However, the functions generated from nicely ordered partitions are only a subset of set contraction functions (namely the subset that, in addition to (K −̇ 1) - (K −̇ 8), satisfy the so-called limit postulate). A more adequate generalization of epistemic entrenchment for set contraction was developed very recently in [11], in the form of an epistemic grasp. Formally, an epistemic grasp ≤ related to a theory K is defined as a preorder on the powerset of L (i.e. it compares sets of sentences rather than single sentences), satisfying the following properties:
(EG1) If Γ ⊢ Δ then Γ ≤ Δ.
(EG2) If Γ ≤ Δ and Δ ≤ E then Γ ≤ E.
(EG3) If Γ ⊬ ⊥, then there exists a z ∈ [Γ] such that Γ ≤ z.
(EG4) If K ⊬ ⊥, then Δ is consistent with K iff Γ ≤ Δ for all Γ ⊆ L.
(EG5) If Γ ≤ Δ for all Δ ⊆ L, then Γ ⊢ ⊥.
(EG6) If for all δ ∈ Cn(Δ), Γ ≤ Γ ∪ {δ}, then Γ ≤ Γ ∪ Δ.
Given an epistemic grasp ≤, a set contraction function −̇ can be constructed by means of the following condition:
(EC) x ∈ K −̇ Γ iff x ∈ K and Γ ∪ {¬x} < Γ.
It was shown in [11] that the functions generated from (EC) match precisely the ones satisfying (K −̇ 1) - (K −̇ 8).
5 Connecting Systems of Spheres with Epistemic Grasps
Figure 1 below summarizes the definitions and results presented so far: the axioms (K ∗ 1)-(K ∗ 8) and (K −̇ 1)-(K −̇ 8) define multiple revision and set contraction respectively, which are related to each other via the Generalized Levi and Harper Identities. Moreover, multiple revisions are characterized in terms of well ranked systems of spheres, and set contractions in terms of epistemic grasps, via (S*) and (EC) respectively. In view of these links, it should in principle be possible to provide a characterization of set contractions in terms of systems of spheres, and subsequently to formulate necessary and sufficient conditions under which a system of spheres and an epistemic grasp generate the same set contraction. This is our mission in this section; i.e. to establish the dashed lines labelled (SC) and (ES) in Figure 1.
[Figure 1 shows four nodes: Multiple Revisions (K∗1)-(K∗8), Set Contractions (K−̇1)-(K−̇8), Systems of Spheres, and Epistemic Grasps. Multiple revisions and set contractions are connected by the Generalized Levi & Harper Identities; systems of spheres are connected to multiple revisions via (S*), and epistemic grasps to set contractions via (EC). The dashed links (SC) and (ES) are the new connections established in this paper.]

Fig. 1. Old and New Links in the AGM paradigm
Starting with the first link, let K be a theory and S a well ranked system of spheres centered on [K]. As already mentioned above, S represents comparative plausibility, i.e. the closer a world is to the center of S, the more plausible it is. Suppose now that we want to make K consistent with a given set of sentences Γ, while giving away as little of K as possible (this is known as the principle of minimal change). In terms of possible worlds, this amounts to adding some Γ-worlds to [K]. But which ones? Intuitively, choosing the most plausible Γ-worlds is the right thing to do:
(SC) K −̇ Γ = ⋂([K] ∪ (c(Γ) ∩ [Γ]))
It turns out that condition (SC) is not only intuitively appealing, but it also complies with our formal expectations. The following result proves that the functions generated from (SC) coincide precisely with the ones satisfying (K −̇ 1)-(K −̇ 8).
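For intuition, the following small sketch applies condition (SC) over a toy propositional language whose possible worlds are modelled as truth assignments; the encoding of worlds, spheres and belief sets as Python structures is an illustrative assumption, not part of the paper's formal machinery.

```python
from itertools import product

ATOMS = ("p", "q")
WORLDS = [dict(zip(ATOMS, vals)) for vals in product((True, False), repeat=2)]

def models(worlds, formula):
    """Worlds satisfying a formula given as a Python predicate over a world."""
    return [w for w in worlds if formula(w)]

def sc_contraction(spheres, k_worlds, gamma):
    """(SC): the belief set determined by [K] plus the closest Gamma-worlds."""
    gamma_worlds = models(WORLDS, gamma)
    # smallest sphere intersecting [Gamma], i.e. c(Gamma)
    c_gamma = next(s for s in spheres if any(w in s for w in gamma_worlds))
    kept = k_worlds + [w for w in c_gamma if w in gamma_worlds and w not in k_worlds]
    return kept  # the contracted belief set, represented by its set of worlds

# K believes p and q; spheres grow outward from [K]; contract so as to allow not-p.
k_worlds = models(WORLDS, lambda w: w["p"] and w["q"])
spheres = [k_worlds,
           models(WORLDS, lambda w: w["q"]),   # next most plausible: the q-worlds
           WORLDS]                             # outermost sphere: all worlds
print(sc_contraction(spheres, k_worlds, lambda w: not w["p"]))
```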
Theorem 1. Let K be a theory. For any well ranked system of spheres S centred on [K] the function −̇ defined by (SC) satisfies (K −̇ 1)-(K −̇ 8). Conversely, for any set contraction function −̇ satisfying (K −̇ 1)-(K −̇ 8) there exists a well ranked system of spheres centred on [K] such that (SC) holds.

Sketch of Proof
(⇒) Let S be a well ranked system of spheres centered on [K] and −̇ the function defined via (SC). We show that −̇ satisfies the postulates (K −̇ 1) - (K −̇ 8). Postulates (K −̇ 1), (K −̇ 2), (K −̇ 4), and (K −̇ 6) follow immediately from the construction of −̇. For (K −̇ 3), assume that the nonempty set of sentences Γ is consistent with K. Then clearly c(Γ) = [K] and consequently ([K] ∪ (c(Γ) ∩ [Γ])) = [K]. Hence by (SC), K −̇ Γ = K. For (K −̇ 5), assume that a nonempty consistent set of sentences Γ entails ¬ϕ, where ϕ is a sentence that belongs to K. Clearly then, all worlds in (c(Γ) ∩ [Γ]) are ¬ϕ-worlds and all worlds in [K] are ϕ-worlds. Hence expanding ([K] ∪ (c(Γ) ∩ [Γ])) with ϕ results in K. Therefore (K −̇ 5) is true. For (K −̇ 7), assume that Γ, Δ are nonempty consistent sets of sentences such that Γ ⊆ Δ. If there is no Δ-world in c(Γ), then by (SC), (K −̇ Γ) + Δ = L and therefore (K −̇ 7) trivially holds. Assume therefore that c(Γ) ∩ [Δ] ≠ ∅. Given that Γ ⊆ Δ, this entails that c(Γ) = c(Δ) and moreover c(Δ) ∩ [Δ] = (c(Γ) ∩ [Γ]) ∩ [Δ]. This again entails (via (SC)) that K −̇ Δ = (K −̇ Γ) + Δ and consequently, once again, (K −̇ 7) holds. Finally, for (K −̇ 8), assume that Γ, Δ are nonempty consistent sets of sentences such that Γ ⊆ Δ and moreover Δ is consistent with K −̇ Γ. Then there is at least one Δ-world in [K] ∪ (c(Γ) ∩ [Γ]), which again entails that c(Δ) ∩ [Δ] ⊆ c(Γ) ∩ [Γ]. From this we derive that K −̇ Γ ⊆ K −̇ Δ as desired.
(⇐) Let −̇ be a set contraction function satisfying (K −̇ 1) - (K −̇ 8), and let ∗ be the revision function generated from −̇ via the Generalized Levi Identity. Then, as shown in [9], there is a well ranked system of spheres S such that (S*) holds. Consider now any nonempty consistent set of sentences Γ. By the Generalized Harper Identity, K −̇ Γ = (K ∗ Γ) ∩ K and therefore [K −̇ Γ] = [K ∗ Γ] ∪ [K]. From (S*) we then derive that [K −̇ Γ] = (c(Γ) ∩ [Γ]) ∪ [K]. Hence S satisfies (SC) as desired.

Turning to our second goal in this section, let K be a theory of L, ≤ an epistemic grasp related to K, and S a well ranked system of spheres centered on [K]. As already mentioned, both ≤ and S induce set contraction functions via (EC) and (SC) respectively. Theorem 2 below proves that the necessary and sufficient condition under which ≤ and S generate the same set contraction function is the following:
(ES)
Γ ≤ Δ iff c(Δ) ⊆ c(Γ ).
In the above condition, Γ and Δ denote arbitrary nonempty consistent sets of sentences.
Theorem 2. Let K be a theory, ≤ an epistemic grasp relative to K and S a well ranked system of spheres centred on [K]. Then ≤ and S generate the same set contraction function −̇ by means of (EC) and (SC) respectively, iff condition (ES) is satisfied.

Sketch of Proof
Before we proceed with the actual proof we recall a useful result from [11], according to which for any epistemic grasp ≤ the induced set contraction −̇ is such that the following condition is satisfied for all nonempty consistent sets of sentences Γ, Δ:
(CE) Γ ≤ Δ iff Δ is consistent with K −̇ (Γ ∨ Δ) or Γ ⊢ ⊥.
In the condition above, Γ ∨ Δ is defined to be the set Γ ∨ Δ = {x ∨ y : x ∈ Γ and y ∈ Δ}.
(⇒) Assume that ≤ and S generate the same set contraction function −̇ by means of (EC) and (SC) respectively. Let Γ, Δ be arbitrary nonempty consistent sets of sentences. First assume that Γ ≤ Δ. By (CE), Δ is consistent with K −̇ (Γ ∨ Δ). By (SC) this entails that there is at least one Δ-world in [K] ∪ (c(Γ ∨ Δ) ∩ [Γ ∨ Δ]). Moreover it is not hard to verify that [Γ ∨ Δ] = [Γ] ∪ [Δ]. Combining the above we derive that c(Δ) ⊆ c(Γ) as desired. Conversely, assume that c(Δ) ⊆ c(Γ). Then, given that [Γ ∨ Δ] = [Γ] ∪ [Δ], Δ is consistent with [K] ∪ (c(Γ ∨ Δ) ∩ [Γ ∨ Δ]), which by (SC) entails that Δ is consistent with K −̇ (Γ ∨ Δ), which in turn entails by (CE) that Γ ≤ Δ.
(⇐) Assume that (ES) is true. Let −̇ be the set contraction function generated by (EC) and −̇′ the set contraction function generated by (SC). We will show that for any nonempty consistent set of sentences Γ, K −̇ Γ = K −̇′ Γ.
Consider any x ∈ K −̇ Γ. By (EC) this entails that x ∈ K and Γ ∪ {¬x} < Γ. From (ES) we then derive that c(Γ) ⊂ c(Γ ∪ {¬x}), which again entails that all Γ-worlds in c(Γ) contain x. From (SC) we then derive that x ∈ K −̇′ Γ. Since x was chosen arbitrarily, we derive that K −̇ Γ ⊆ K −̇′ Γ.
For the converse, consider any x in K −̇′ Γ. By (SC), x ∈ K and moreover all worlds in c(Γ) ∩ [Γ] contain x. Consequently, c(Γ) ⊂ c(Γ ∪ {¬x}), which by (ES) entails that Γ ∪ {¬x} < Γ. Then from (EC) we derive that x ∈ K −̇ Γ. Hence K −̇′ Γ ⊆ K −̇ Γ. Combining the above we get K −̇ Γ = K −̇′ Γ as desired.
6 Conclusion
The two most prominent types of belief change are belief revision and belief contraction. Both these processes have been studied extensively in the literature in terms of axiomatic as well as constructive models, and a wealth of representation results exists connecting the various models with one another. Recently these
models have been generalized to revision/contraction by sets of sentences. This paper is a contribution to this line of research.
More specifically, there are two new results reported herein. Firstly, the system-of-spheres model originally developed for revision was reshaped to characterize set contraction (Theorem 1). Secondly, a necessary and sufficient condition was formulated under which a system of spheres and an epistemic grasp (an alternative constructive model for set contraction) give rise to the same contraction function (Theorem 2). The missing link in Figure 1, between the postulates (K*1) - (K*8) for multiple revision and epistemic grasps, is clearly an interesting avenue for future work.
References

1. Alchourron, C., Gardenfors, P., Makinson, D.: On the logic of theory change: Partial meet functions for contraction and revision. Journal of Symbolic Logic 50, 510–530 (1985)
2. Fuhrmann, A.: Relevant Logics, Modal Logics and Theory Change (Doctoral Dissertation). Department of Philosophy and Automated Reasoning Project, Australian National University (1988)
3. Fuhrmann, A., Hansson, S.: A survey on multiple contraction. Journal of Logic, Language, and Information 3, 39–76 (1994)
4. Gardenfors, P.: Epistemic importance and minimal changes of belief. Australasian Journal of Philosophy 62, 136–157 (1984)
5. Gardenfors, P., Makinson, D.: Revisions of knowledge systems using epistemic entrenchment. In: Proceedings of Theoretical Aspects of Reasoning about Knowledge, pp. 83–95. Morgan Kaufmann, San Francisco (1988)
6. Grove, A.: Two modellings for theory change. Journal of Philosophical Logic 17, 157–170 (1988)
7. Lindstrom, S.: A semantic approach to nonmonotonic reasoning: Inference operations and choice. Technical report, Dept of Philosophy, Uppsala University (1991)
8. Peppas, P.: Well behaved and multiple belief revision. In: Proceedings of the 12th European Conference on Artificial Intelligence (ECAI 1996), pp. 90–94 (1996)
9. Peppas, P.: The limit assumption and multiple revision. Journal of Logic and Computation 14(3), 355–371 (2004)
10. Peppas, P.: Belief revision. In: van Harmelen, F., Lifschitz, V., Porter, B. (eds.) Handbook of Knowledge Representation, pp. 317–359. Elsevier, Amsterdam (2008)
11. Peppas, P.: Epistemic Grasp in Set Contraction. Manuscript (2010)
12. Zhang, D.: Belief revision by sets of sentences. Journal of Computer Science and Technology 11, 108–125 (1996)
13. Zhang, D., Foo, N.: Infinitary belief revision. Journal of Philosophical Logic 30(6), 525–570 (2001)
Semantic Awareness in Automated Web Service Composition through Planning

Ourania Hatzi1, Dimitris Vrakas2, Nick Bassiliades2, Dimosthenis Anagnostopoulos1, and Ioannis Vlahavas2

1 Department of Informatics and Telematics, Harokopio University of Athens, Greece
{raniah,dimosthe}@hua.gr
2 Department of Informatics, Aristotle University of Thessaloniki, Greece
{dvrakas,nbassili,vlahavas}@csd.auth.gr
Abstract. PORSCE II is a framework that performs automatic web service composition by transforming the composition problem into AI planning terms and utilizing external planners to obtain solutions. A distinctive feature of the system is that throughout the entire process, it achieves semantic awareness by exploiting semantic information extracted from the OWL-S descriptions of the available atomic web services and the corresponding ontologies. This information is then used in order to enhance the planning domain and problem. Semantic awareness facilitates approximations when searching for suitable atomic services, as well as modification of the produced composite service. The alternatives for modification include the replacement of a certain atomic service that takes part in the composite service by an equivalent or a semantically relevant service, the replacement of an atomic service through planning, or the replanning from a certain point in the composite service. The system also provides semantic representation of the produced composite service. Keywords: Automatic Web Service Composition, Semantic Web Services, AI Planning, Semantic Awareness, Semantic Matching Relaxation.
1 Introduction

Web services play an important role in the Web today, as they accommodate the increasing need for interoperability and collaboration between heterogeneous systems that expose their functionality over a network. The web services technology provides a way to communicate and interact with such information systems through a standard interface, which is independent of platform and internal implementation. In many cases, as user requirements shift towards more complex functionality, they cannot be fulfilled by a simple atomic web service. This shortcoming can be handled by web service composition, that is, the appropriate combination of certain atomic web services in order to achieve a complex goal. The task of web service composition becomes significantly difficult, time-consuming and inefficient as the number of available atomic web services increases continuously; therefore, the capability to automate the web service composition process proves essential.
Automated web service discovery, invocation, composition and interoperation is significantly facilitated by the existence of semantic information in the atomic web services description [13]; such information represents knowledge about the actual meaning of services. The incorporation of semantics in the description of web services is accommodated through the development of a number of standard languages such as OWL-S [3] and SAWSDL [5], leading to the notion of semantic web services, which are defined, evolve and operate in the Semantic Web. The existence of semantics facilitates composition using intelligent techniques, such as AI Planning. Without the presence of semantic information, a high degree of human expertise would be required in order to compose web services meaningfully and not based on circumstantial syntactic similarities. The focus of this paper is on the incorporation of the semantic information in the web service composition process and its effects on various aspects of the PORSCE II framework. PORSCE II aims at automated semantic web service composition by utilizing planning techniques. The process exploits information in the OWL-S descriptions of atomic web services, translates the web service composition problem to a planning problem, exports it to PDDL [2] and invokes external planning systems to acquire plans, which constitute descriptions of the desired composite service. Each composite service is evaluated in terms of statistic and accuracy measures, while a visual component is also integrated, which accommodates composite service visualization and manipulation. Modification in the composite web service is performed by atomic service replacement, either with an alternative equivalent atomic service, or through finding a sub-plan that can substitute it. If necessary, the composite service can also be modified through replanning. Finally, in order to provide full-cycle support, and render the result of the composition process independent from planning, the composite service is translated back to OWL-S, presenting the user with a description in the same standard as the initial atomic services and facilitating composite service deployment. Semantic awareness in PORSCE II is achieved by exploiting semantic information extracted from the OWL-S descriptions of the available atomic web services and the corresponding ontologies, and analyzing this information based on semantic distance measures and user-defined thresholds. The derived knowledge is then used to enhance the planning domain and problem, achieve approximate solutions, when necessary, and accommodate intervention to the composite service. The rest of the paper is organized as follows: Section 2 discusses some related work, while Section 3 describes the system architecture and functionality. Section 4 elaborates on the effects of semantic awareness on various aspects of the framework, along with examples. Finally, Section 5 concludes the paper and poses future directions.
2 Related Work OWLS-Xplan [7] uses the semantic descriptions of atomic web services in OWL-S to derive planning domains and problems, and invokes the XPlan planning module to generate the composite services. The system is PDDL compliant, as the authors have developed an XML dialect of PDDL called PDDXML. Although the system imports semantic descriptions in OWL-S, the semantic information provided from the domain
ontologies is not utilized and semantic awareness is not achieved; therefore, the planning module requires exact matching for service inputs and outputs.
Another system that attempts automatic web service composition through AI Planning is SHOP-2 [6]. The system uses service descriptions in DAML-S, the predecessor of OWL-S, and performs Hierarchical Task Network (HTN) planning to solve the problem. The main disadvantage of this approach lies in the fact that the planning process, due to its hierarchical nature, requires the specification of certain decomposition rules, which have to be encoded in advance by an expert in the specific domain, with the help of a DAML-S process ontology.
Other approaches for automatic web service composition can be found in [8] and [9]; however, they are not further discussed here as they do not deal with semantic descriptions of web services or with incorporating semantics in the composition process.
The main advantage of the proposed framework with respect to the aforementioned systems is the extended utilization of semantic information in order to perform planning under semantic awareness and relaxation, so as to find better and, when necessary, approximate solutions. Furthermore, PORSCE II does not require any prior, domain-specific knowledge to formulate valid, desired composite services; the OWL-S descriptions of the atomic web services and the corresponding ontologies suffice. Finally, the system is able to handle cases of service failure or unavailability dynamically, through composite service modification, taking into account not only syntactic similarities but also semantics, which is an important feature not covered in the aforementioned frameworks.
3 Framework Architecture and Overview

PORSCE II aims at a higher degree of integration as, in addition to the core transformation and composition component, it contains a visual interface, composite web service manipulation features, and interconnection with multiple external planners. The key features of the framework include:
• Translation of OWL-S atomic web service descriptions into planning operators.
• Interaction with the user in order to acquire their preferences regarding the composite service and desired metrics for semantic relaxation.
• Enhancement of the planning domain with semantically similar concepts.
• Exporting the web service composition problem as a PDDL planning domain and problem.
• Acquisition of solutions by invoking external planners.
• Assessing the accuracy of the composite services.
• Visualizing and modifying the solution by service replacement or replanning.
• Transformation of the solution (composite web service) back to OWL-S.
PORSCE II comprises the OWL-S Parser, the Transformation Component, the OWL Ontology Manager (OOM), the Visualizer and the Service Replacement Component. An overview of the architecture and the interactions among the components is depicted in Fig. 1.
Fig. 1. The architecture of PORSCE II
The OWL-S Parser is responsible for parsing a set of OWL-S web service profiles and determining the corresponding ontologies that organize the concepts appearing in the web service descriptions as inputs and outputs. The OWL Ontology Manager (OOM), utilizing the inferencing capabilities of the Pellet DL Reasoner, applies the selected distance measure and thresholds for discovering concepts that are semantically relevant to a query concept. Upon request, it is able to provide the rest of the system with advice on semantically relevant or equivalent concepts, facilitating the implementation of semantic awareness. The Transformation Component is the core component of the system. It is responsible for a number of operations that include the formulation of the planning problem from the initial web service composition problem, its consequent solving, and the transformation of the produced composite web service back to OWL-S. Throughout the process, it requests advice from the OOM in order to semantically enhance these procedures. The Transformation Component is also responsible for evaluating the produced composite services, according to their semantic distances obtained from the OOM. The purpose of the Visualizer is to facilitate comprehension of the results, by providing the user with a visual representation of the plan, which in fact is the description of the composite service. The composite service is visualized as a web service graph, that is, a graph G=(V, E), where the nodes in V correspond to all the atomic services in the plan and the edges (x→y) in E, where x and y are nodes in V, define that web service x produces an output that serves y as an input. Finally, the Service Replacement Component enables the user to employ a number of alternative techniques in order to replace a specific atomic web service in the composite service sequence. In order to be able to perform replacement approximately, in cases than no exact matching candidates can be found, the Service Replacement Component requests semantic information by the OOM.
More on the interoperability between the systems and the functionality of PORSCE II can be found at [11], while the system, along with test cases, is available online at http://www.dit.hua.gr/~raniah/porsceII_en.html.
4 Effects of Semantic Awareness

In order to achieve semantic awareness, the system needs to be aware of semantic equivalences and similarities among syntactically different concepts used as inputs and outputs of the web services. Semantic awareness is implemented by including all the required semantic information in a pre-processing phase, and letting the planning system handle the problem as a classical planning problem. The advantages of this approach include independence from the planner and minimization of the interactions between the planning system and the OOM. The OOM provides the rest of the system modules with advice on demand about the concepts that are equivalent or semantically relevant to a query concept. A semantic distance can be assigned to each pair of ontology concepts, based on their hierarchical relationship (subclass, superclass, sibling, etc.) and semantic distance metrics (edge-counting or upwards cotopic) [11]. Semantic awareness affects the system in four main aspects:
• enhancement of the planning domain and problem with semantic information
• inclusion of semantically equivalent and relevant services during composition
• search among semantically equivalent and relevant services for replacement
• semantic representation of the composite service
The examples that will be used in this section are extracted from a case study which combines and modifies the books and finance domains of the OWLS-TC (version 2.2 revision 1) semantic web service descriptions test sets [1]. The implemented scenario concerns the electronic purchase of a book. The user provides as inputs a book title and author, credit card info and the address that the book will be shipped to, and requires a charge to their credit card for the purchase, as well as the shipping dates and the customs cost for the specific item. The initial state of the planning problem is produced by the inputs of the composite service, while the goal state is produced by the desired composite service outcomes. A concise presentation of the inputs and outputs of the web services of interest for this scenario is provided in Table 1.

Table 1. Inputs and outputs of the web services of interest for the specific examples

  Service                   Inputs                  Outputs
  BookToPublisher           Book, Author            Publisher
  CreditCardCharge          OrderData, CreditCard   Payment
  ElectronicOrder           Electronic              OrderData
  PublisherElectronicOrder  PublisherInfo           OrderData
  ElectronicOrderInfo       Electronic              OrderInformation
  Shipping                  Address, OrderData      ShippingDate
  WaysOfOrder               Publisher               Electronic
  CustomsCost               Publisher, OrderData    CustomsCost
4.1 Semantic Domain Enhancement

In a pre-processing phase, the Transformation Component probes the OOM in order to acquire all the semantically relevant concepts for both the facts of the initial state and the outputs of the planning operators. Consequently, semantic enhancement abides by these three rules:
1. The concepts of the initial state together with the semantically equivalent and similar concepts form a new set of facts noted as the Expanded Initial State (EIS).
2. The goals of the problem remain the same.
3. The Enhanced Action Set (EAS) is produced by altering the description of each operator by enhancing its effects with all equivalent and semantically similar concepts. Note that the initial size of the set is preserved.
Fig. 2 shows an example of the planning problem produced by the aforementioned web service composition domain, before and after the semantic enhancement, along with the semantically relevant concepts returned by the OOM. For legibility purposes, a shortened version of the domain described above is used.
Fig. 2. An example of the semantic domain and problem enhancement
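The three enhancement rules above can be illustrated with a small sketch; the similarity map and the operator representation below are simplified assumptions made for this example and are not the actual PORSCE II data structures.

```python
# Semantically relevant concepts as returned by an ontology manager (illustrative)
RELEVANT = {"OrderInformation": {"OrderData"}, "OrderData": {"OrderInformation"}}

def expand(concepts):
    """Add all semantically equivalent/similar concepts to a set of concepts."""
    expanded = set(concepts)
    for c in concepts:
        expanded |= RELEVANT.get(c, set())
    return expanded

def enhance(initial_state, goals, operators):
    """Rules 1-3: expanded initial state, unchanged goals, effect-enhanced operators."""
    eis = expand(initial_state)                                   # rule 1 (EIS)
    eas = [{"name": op["name"],
            "pre": set(op["pre"]),
            "eff": expand(op["eff"])} for op in operators]        # rule 3 (EAS)
    return eis, set(goals), eas                                   # rule 2: goals unchanged

ops = [{"name": "ElectronicOrderInfoService",
        "pre": {"Electronic"}, "eff": {"OrderInformation"}}]
print(enhance({"Book", "Author"}, {"ShippingDate"}, ops))
```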
The new problem, namely <EIS, EAS, G>, is encoded into PDDL and forwarded to an external PDDL-compliant planning system in order to acquire solutions. Note that the semantic information is encoded in such a way that it is transparent to the external planner, which can solve the problem as any other classical planning problem.

4.2 Semantic Composition

The produced plans, or descriptions of the desired composite service, are consequently imported into the Visualizer Component, where they are transformed into a web service graph and represented visually. While exact matching of input to output concepts is obligatory in classical planning domains, in the web services world the case can be different, as it is preferable to present the
user with a composite service that approximates the required functionality than to present no service at all. The semantic awareness achieved in the PORSCE II system enables the composition of alternative services that approximate the desired one in case there are no exact matches, by performing semantically relaxed concept matching. Such a case is presented in Fig. 3. The ElectronicOrderInfoService produces as output an instance of the concept OrderInformation, while the available atomic services that are needed to fulfill the goals (CreditCardCharge, CustomsCost and Shipping) accept as input instances of the concept OrderData. Without semantic awareness and relaxation, an approximate matching between these concepts would not be possible, and the approximate plan of Fig. 3 would not be produced. However, under semantic relaxation, these two concepts are annotated as semantically relevant by the OOM. As this is not an exact matching service, the accuracy measure is changed to reflect it, following the accuracy definitions in [11].
Fig. 3. Approximate composite service
4.3 Semantic Service Replacement The simplest alternative for composite service modification is the replacement of an atomic service included in the composite service (plan) with a semantically equivalent or relevant one. The replacement takes into account semantics, therefore the discovery of all actions that could be used alternatively instead of the chosen one is guided by advice from the OOM. An action A is considered an alternative for an action Q of the plan as far as it does not disturb the plan sequence and the intermediate states, that is, prec(A) ⊆ prec(Q) and add(A) ⊇ add(Q). In cases when none of the semantically equivalent or relevant services that correspond to an atomic service is considered suitable, or in cases where there are no alternative services, the system offers the option to substitute the atomic service with a partially ordered set of atomic services, in the form of a subplan, found through planning. In this case, the world states right before and after the execution of the action being replaced serve as the initial and goal states for the planning process,
respectively. The world state right before the application of the selected action can be found by starting at the original initial state and progressively including all the add effects of all intermediate actions up to the selected one. Likewise, the world state after the execution of the selected action can be found by starting at the original goal state, subtracting all add effects and including all preconditions of all intermediate operators, going backwards from the end of the plan to the selected action. Note that the replanning process is bound to return the atomic service being replaced itself, especially if the external planner used produces the optimal plan in each case. In order to prevent that, this specific service has to be removed from the set of available services before the replanning process proceeds. In both cases, the selected alternative substitutes the original service and the new quality metrics are incorporated in the quality metrics of the entire plan. If replacement of an atomic service, either by an equivalent or through replanning, is not a suitable option, the user can resort to replanning from a certain point in the plan, or even replanning from scratch. When replanning from a certain point, the world state at that point has to be calculated starting at the original initial state and progressively including all the add effects of all intermediate actions up to that point. As an example, consider that replacement with equivalent is selected on the CreditCardChargeService. The resulting plan after the replacement with a semantically equivalent service is shown in Fig. 4.
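The progression and regression of world states described above can be sketched as follows; the plan representation (a list of actions with precondition and add-effect sets) is an assumption made for this example.

```python
def state_before(initial_state, plan, index):
    """World state right before plan[index]: progress add effects of earlier actions."""
    state = set(initial_state)
    for action in plan[:index]:
        state |= action["add"]
    return state

def state_after_goal_view(goal_state, plan, index):
    """World state required right after plan[index]: regress from the goal backwards."""
    state = set(goal_state)
    for action in reversed(plan[index + 1:]):
        state -= action["add"]
        state |= action["pre"]
    return state

plan = [{"name": "ElectronicOrderInfoService", "pre": {"Electronic"}, "add": {"OrderData"}},
        {"name": "CustomsCostService", "pre": {"Publisher", "OrderData"}, "add": {"CustomsCost"}},
        {"name": "ShippingService", "pre": {"Address", "OrderData"}, "add": {"ShippingDate"}}]
init = {"Electronic", "Publisher", "Address"}
goal = {"CustomsCost", "ShippingDate"}
# Initial and goal states for replanning a substitute for plan[1] (CustomsCostService)
print(state_before(init, plan, 1), state_after_goal_view(goal, plan, 1))
```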
Fig. 4. Composite service after replacement by semantically equivalent service
Fig. 5. The service replacement interface
Fig. 6. Composite service after replacement through replanning operation
If the replacement-through-replanning alternative is selected, for example for the CustomsCostService, the corresponding planner is re-invoked to find a new
sequence of actions that can substitute the selected service. The user interface for the replacement options is depicted in Fig. 5, while the resulting composite service after the modifications is depicted in Fig. 6.

4.4 Semantic Composite Service Representation

Semantic descriptions of web services in OWL-S [3] allow their use by software agents. As far as the composition of web services is concerned, OWL-S establishes a framework for semantically defining composite processes or services. A composite process is a set of atomic processes, combined together using a number of control constructs, such as Sequence, Split, Split+Join, Choice, Any-Order, Condition, If-Then-Else, Iterate, Repeat-While, and Repeat-Until. The main reasons for using these constructs when defining a composite web service are: a) to enable the definition of compact services (through the use of Iterate, Repeat-While and Repeat-Until), b) to facilitate the definition of alternative paths (through the use of Condition and If-Then-Else constructs) and c) to speed up the invocation of the composite web service, by allowing multiple atomic processes to be invoked concurrently (through the use of Split and Split+Join constructs). For the purposes of the PORSCE II framework, the use of these constructs is not mandatory as far as the proper invocation of the atomic processes is concerned. Since the modeling of the web service composition problem as a planning problem is based on the STRIPS formalism, there is no need for defining alternative paths. However, in order to speed up the invocation of the composite service by allowing the parallel execution of certain atomic processes, we have developed a set of algorithms that translate plans (linear or non-linear) to composite web services using the Sequence, Split and Split+Join constructs. The basic algorithm creates a composite service, given a web service graph G=(V, E), as defined in Section 3. The process of obtaining a web service graph from the plan is straightforward and, due to space limitations, we do not elaborate on it. The output of the algorithm is either a composite construct of the form sequence(c1, c2) or split(c1, c2, ..., cn), where c1 to cn are either NULL or composite constructs. The next step is to replace the Join construct with Split+Join wherever this is possible. This requires a search over all possible pairs of Split arguments, in order to find a common ending part. The last step is to simplify the composite service by removing NULLs and constructs with single arguments. More information concerning the methodology for the representation of composite services can be found in [12].
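A much simplified way to expose the parallelism of a plan with sequence/split constructs is topological layering of the web service graph, sketched below. This is not the algorithm of [12]; the layering strategy, the tuple-based construct encoding and the reuse of the Fig. 3 service names are illustrative assumptions only.

```python
from collections import defaultdict

def plan_graph_to_construct(vertices, edges):
    """
    Translate a web service graph (a DAG) into nested sequence/split constructs:
    services whose predecessors have all been scheduled form one split group,
    and successive groups are chained in a sequence.
    """
    preds = defaultdict(set)
    for u, v in edges:                      # edge u -> v: u must run before v
        preds[v].add(u)
    remaining, scheduled, layers = set(vertices), set(), []
    while remaining:
        layer = sorted(s for s in remaining if preds[s] <= scheduled)
        if not layer:
            raise ValueError("graph contains a cycle")
        layers.append(("split", layer) if len(layer) > 1 else layer[0])
        scheduled |= set(layer)
        remaining -= set(layer)
    return ("sequence", layers) if len(layers) > 1 else layers[0]

# CreditCardCharge, CustomsCost and Shipping can run in parallel after
# ElectronicOrderInfo (service names reused from Fig. 3 for illustration).
g = ["ElectronicOrderInfo", "CreditCardCharge", "CustomsCost", "Shipping"]
e = [("ElectronicOrderInfo", s) for s in g[1:]]
print(plan_graph_to_construct(g, e))
# ('sequence', ['ElectronicOrderInfo', ('split', ['CreditCardCharge', 'CustomsCost', 'Shipping'])])
```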
5 Conclusions and Future Work

This paper presents the semantic awareness aspects of the automated web service composition process in the PORSCE II framework, an integrated system that exploits planning to approach the automated web service composition problem. Semantic awareness, facilitated by the semantic analysis of the OWL-S descriptions of the web services and ontologies, enables semantic enhancement of the planning problem. Moreover, it enables approximations when no exact solutions can be found, and it permits semantic composite service modification and representation. Future goals include the extension of the system to deploy the produced composite services through OWL-S execution systems such as the OWL-S Virtual Machine [10], whose feedback can be used to automate service replacement. Another goal concerns accelerating the composition process by inserting the produced services into the base of available services. Finally, integration with VLEPPO [4] can accommodate the visual design of web service composition.
References 1. OWLS-TC version 2.2 revision 1, http://projects.semwebcentral.org/projects/owls-tc/ 2. Ghallab, M., Howe, A., Knoblock, C., McDermott, D., Ram, A., Veloso, M., Weld, D., Wilkins, D.: PDDL – the Planning Domain Definition Language. Technical report, Yale University, New Haven, CT (1998) 3. OWL-S 1.1., http://www.daml.org/services/owl-s/1.1/ 4. Hatzi, O., Vrakas, D., Bassiliades, N., Anagnostopoulos, D., Vlahavas, I.: VLEPpO: A Visual Language for Problem Representation. In: PlanSIG 2007, pp. 60–66 (2007) 5. SAWSDL, http://www.w3.org/2002/ws/sawsdl/ 6. Sirin, E., Parsia, B., Wu, D., Hendler, J., Nau, D.: HTN planning for web service composition using SHOP2. Journal of Web Semantics 1(4), 377–396 (2004) 7. Klusch, M., Gerber, A., Schmidt, M.: Semantic Web Service Composition Planning with OWLS-XPlan. In: AAAI Fall Symposium on Semantic Web and Agents, USA (2005) 8. Rao, J., Su, X.: A Survey of Automated Web Service Composition Methods. In: Cardoso, J., Sheth, A.P. (eds.) SWSWPC 2004. LNCS, vol. 3387, pp. 43–54. Springer, Heidelberg (2005) 9. Dustdar, S., Schreiner, W.: A survey on web services composition. Int. J. Web and Grid Services 1(1), 1–30 (2005) 10. Paolucci, M., Ankolekar, A., Srinivasan, N., Sycara, K.: The DAML-S Virtual Machine. In: Fensel, D., Sycara, K., Mylopoulos, J. (eds.) ISWC 2003. LNCS, vol. 2870, pp. 290– 305. Springer, Heidelberg (2003) 11. Hatzi, O., Meditskos, G., Vrakas, D., Bassiliades, N., Anagnostopoulos, D., Vlahavas, I.: Semantic Web Service Composition using Planning and Ontology Concept Relevance. In: IEEE / WIC / ACM Int. Conference on Web Intelligence (WI 2009), Milan, Italy (2009) 12. Hatzi, O., Vrakas, D., Bassiliades, N., Anagnostopoulos, D., Vlahavas, I.: The PORSCE II Framework: Using AI Planning for Automated Semantic Web Service Composition, Technical Report TR-LPIS-303-09, LPIS, Dept. of Informatics, Aristotle University of Thessaloniki, Greece (2009), http://lpis.csd.auth.gr/publications/hatzi_porsceII.pdf 13. Mentzas, G., Friesen, A. (eds.): Semantic Enterprise Application Integration for Business Processes: Service-Oriented Frameworks. Business Science Reference (2009)
Unsupervised Web Name Disambiguation Using Semantic Similarity and Single-Pass Clustering Elias Iosif Dept. of Electronics and Computer Engineering, Technical University of Crete, Chania 73100, Greece
Abstract. In this paper, we propose a method for name disambiguation. For a given set of names and documents, we cluster the documents and map each cluster to the appropriate name. The proposed method incorporates an unsupervised metric for semantic similarity computation and a computationally low-cost clustering algorithm. We experimented with the data used in the Web People Search Task of SemEval-2007, in which 16 different teams participated. The proposed system performs on a par with the best official system.
1 Introduction
The World Wide Web is frequently searched in order to retrieve information about specific person names. In general, web search serves different needs for different groups of users, ranging from common users who want to find information about a popular person to researchers who are interested in the work of a scientist. It is noteworthy that a significant subset, 30%, of queries submitted to search engines consists of person names [5]. This fact shows that the requested web information is often associated with a particular person name. This association can be two-fold: the requested information is directly related to the person name, i.e., finding information about a person, and/or the person name is used to implicitly determine the thematic area. The latter can be considered a special case of the theoretical framework suggesting that proper names are logically related to the features of the entity they mention [14]. A name can be used as a person identifier in extremely small networks, such as the team of colleagues in a laboratory, but names are ambiguous in larger networks. This is the case for networks like the scientific domain and, of course, the World Wide Web. For example, according to a report of the U.S. Census Bureau, 90,000 different names are distributed over 100 million people [1,5]. This phenomenon holds even for specialized domains. For instance, in Greek literature the name “Nikos Kavvadias” is shared by a famous poet and a less popular novelist. The World Wide Web can be considered the largest resource that one can search for retrieving information about person names in a plethora of domains. In most cases, web search is accomplished by querying a search engine, and the response is a ranked list of documents. In this framework we formulate the problem of name disambiguation for web documents as a document
clustering problem, where the documents of each cluster should refer to the same person. In the literature several approaches have been proposed for this problem, employing some widely used natural language techniques. The “bag-of-words” model has been used in order to exploit the lexical environment of the person names for disambiguation purposes [2,4]. A slightly more sophisticated approach is proposed by [11], where the lexical features are not treated uniformly. Instead, more weight is given to biographic features, which can better discriminate person names. An interesting fusion of lexical and semantic information is proposed in [13]. Beyond the individual research efforts, the 4th International Workshop on Semantic Evaluations (SemEval-2007) hosted a task devoted to the problem of web people search [7]. In particular, the task was defined as: “. . . systems receive a set of web pages (which are the result of a web search for a person name), and they have to cluster them in as many sets as entities sharing the name” [1]. The disambiguation of person names can be used in many applications such as information retrieval, question answering, personalized web search, etc. For the case of information retrieval, consider a search engine that returns a set of document collections, organized according to the person names that appear in them. This can be extended to a personalized search, in which the retrieved documents are filtered according to the user’s preferences. In the same manner, question answering systems can use name disambiguation in order to identify more effectively the documents from which the information about a person is extracted. In this paper, we propose a method for person name disambiguation over a collection of documents returned by a search engine. We propose the use of an unsupervised similarity metric, which exploits the contextual features of person names in order to estimate the semantic similarity between them. Next, the documents in which the person names appear are clustered according to a straightforward single-pass clustering algorithm. The novelty of our work lies in the unsupervised semantic similarity computation in conjunction with the employment of a simple, but effective, clustering algorithm, which has a minimal computational cost. In particular, we used the test dataset, as well as the evaluation system, of the Web People Search Task of SemEval-2007, despite the fact that we did not participate in this task. However, we believe the resources of such events continue to be useful for the development and evaluation of systems, even after the closing of the event. Given these data resources and evaluation procedures we are able to compare our work with other systems with respect to a common ground truth. The rest of the paper is organized as follows. Section 2 describes an unsupervised metric that is used in order to estimate the semantic similarity of names. In Section 3, a simple single-pass clustering algorithm is defined. The experimental corpus and procedure are described in Section 4. Section 5 reports the evaluation results and compares the performance of the proposed system to the official systems. Finally, in Section 6 we give some brief conclusions, as well as directions for future work.
2 Semantic Similarity Metric
In this section we present a metric that exploits the contextual environment of person names in order to estimate the semantic similarity between them, and consequently the semantic similarity of the documents in which the names appear. The key idea of this approach is that similarity of context implies similarity of meaning [6]. In our approach we use the “bag-of-words” model [15] for representing the contextual features of names. Of course, this model assumes that the features are independent [10]. The applied context-based similarity metric uses a context window size, WS, for feature extraction. In particular, the right and left contexts of length WS are taken into consideration for each occurrence of a person name p: {w_WS,L ... w_2,L w_1,L} p {w_1,R w_2,R ... w_WS,R}. In this representation, w_i,L and w_i,R stand for the features appearing to the left and to the right of p, respectively. Given a window size, the feature extraction is constrained to the WS left and right features of p. For each person name, the contextual environment is explored according to the value of WS, and a feature vector is built. The feature vector for every p is defined as Q_p,WS = (q_1, q_2, ..., q_V), where q_i is an integer and WS is the contextual window size. The size of the feature vector is equal to the size of the corpus vocabulary, V. The i-th feature value, q_i, denotes the appearance of the i-th vocabulary word, v_i, within the left or right contexts of p that were extracted according to WS. The value of each feature q_i is set according to one of two different schemes: the Binary (B) and the Frequency (F) scheme. The Binary scheme sets q_i = 1 if v_i is included in the extracted contexts of p, and q_i = 0 otherwise. The Frequency scheme sets q_i equal to the frequency of v_i, computed over the extracted contexts of p; as before, if v_i is not included in the extracted contexts, q_i = 0. Applying one of the above schemes, the semantic similarity S(p_1, p_2) between two person names p_1 and p_2 is estimated as the cosine similarity of their feature vectors, Q_{p_1,WS} and Q_{p_2,WS}, as follows:

$$S(p_1, p_2) = \frac{\sum_{i=1}^{V} q_{p_1,i}\, q_{p_2,i}}{\sqrt{\sum_{i=1}^{V} (q_{p_1,i})^2}\;\sqrt{\sum_{i=1}^{V} (q_{p_2,i})^2}} \qquad (1)$$

The metric of Eq. (1) assigns a similarity score of 1 if the names occur in identical contexts, while the score is 0 if the names do not share any contextual feature. Our goal is to apply the metric of Eq. (1) to the task of web name disambiguation, as described in Section 1. For example, assume that there are d documents which refer to m different persons that share the same name, and that a particular document is related to only one of the m different persons. The similarity metric is applied over all possible pairs of documents (for d documents, d(d−1)/2 similarity scores are computed). Finally, we estimate the similarity between two documents as the semantic similarity of the person names that appear in these documents.
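A minimal sketch of the feature extraction and of the cosine similarity of Eq. (1) follows, assuming pre-tokenized documents; the function names and the simple exact-token matching of the name are assumptions made here, not part of the original system.

```python
import math
from collections import Counter

def context_features(tokens, name, ws, scheme="B"):
    """Collect the ws tokens to the left and right of every occurrence of `name`."""
    feats = Counter()
    for i, tok in enumerate(tokens):
        if tok == name:
            window = tokens[max(0, i - ws):i] + tokens[i + 1:i + 1 + ws]
            feats.update(window)
    if scheme == "B":                       # Binary scheme: presence only
        feats = Counter({w: 1 for w in feats})
    return feats                            # Frequency scheme: raw counts

def cosine_similarity(q1, q2):
    """Eq. (1): cosine similarity of two sparse feature vectors."""
    dot = sum(q1[w] * q2[w] for w in q1.keys() & q2.keys())
    n1 = math.sqrt(sum(v * v for v in q1.values()))
    n2 = math.sqrt(sum(v * v for v in q2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

The similarity of two documents is then taken as the similarity of the feature vectors built around the target name in each document, as stated above.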
3 Single-Pass Clustering Algorithm
In this section we present a clustering algorithm which uses the semantic similarity scores of documents in order to cluster them. In particular, we want each cluster to contain documents that refer to the same person, i.e., to resolve the name ambiguity. Before applying the clustering algorithm, the pairs of documents are ranked into a list according to their semantic similarity, from semantically similar to semantically dissimilar. Since this list includes all possible pairs of documents, it is important to select a fixed number of top-ranking pairs, pruning the less similar ones. This approach was inspired by the work of Pargellis et al. [12], although their goal was the iterative induction of semantic classes using a simpler form of the presented algorithm. Specifically, in [12] a semantic class (cluster) was created for every pair of words. A limitation of the algorithm proposed in [12] is that there is no way to merge more than two items in one step. In the past [8], we extended the work of Pargellis et al., developing an agglomerative clustering algorithm for the task of semantic class induction. In this work, we propose a variation of the algorithm presented in [8] for a slightly different problem, i.e., document clustering for name disambiguation. In particular, the algorithm applied in this work is single-pass, which means that it generates the required clusters in a single iteration. The algorithm explores multiple pairs in the ranked similarity list and identifies those pairs that share a common member. Pairs with common members create a cluster that contains the union of their members. Assume that the pairs (D1, D2), (D1, D3), (D2, D4) were ranked at the upper part of the list. The proposed algorithm merges the three pairs and creates the cluster (D1, D2, D3, D4). In order to avoid over-generalization, only pairs that are ranked closely are allowed to participate in the merging procedure. An experimental parameter called “Search Margin”, SM, is used to control the maximum distance between two pairs (in the ordered list) that are allowed to create a single cluster. The following example illustrates the use of SM:

Position in list:  1         2         3         4         5
Pair:              (D1, D2)  (D2, D3)  (D5, D6)  (D6, D7)  (D3, D4)

where D1, D2, D3, D4, D5, D6, D7 are documents for which similarity was computed according to the procedure described in Section 2. For SM = 2 the clusters (D1, D2, D3) and (D5, D6, D7) will be generated. Increasing the value of SM by one, the clusters (D1, D2, D3, D4) and (D5, D6, D7) will be generated. The SM parameter was observed to preserve the semantic homogeneity of the created clusters [8]. Next, we describe in more detail the steps of the clustering algorithm for SM = 2, using the data of the previous example. The algorithm begins with the first pair of the list. An initial cluster, ClusterA, is created and documents D1 and D2 are assigned to it. The algorithm proceeds to the second position, where documents D2 and D3 exist. Since D2 is already a member of ClusterA, and the (list) distance between the first and the second pair is less than the
value of SM, document D3 is assigned to ClusterA. The algorithm continues to the third pair of the list, where documents D5 and D6 exist. Neither of them is present in ClusterA, so a new cluster, ClusterB, is created and documents D5 and D6 are assigned to it. In the same manner, document D7, which belongs to the fourth pair of the list, is assigned to ClusterB. Finally, the algorithm examines the fifth pair of the list, which shares a common member, D3, with ClusterA. However, the distance between the last pair that was used for the creation of ClusterA (the second pair of the list) and the current pair (the fifth pair of the list) is greater than the value of SM. Thus, the fifth pair is not taken into consideration and the clustering algorithm terminates. The created clusters are ClusterA = (D1, D2, D3) and ClusterB = (D5, D6, D7). The algorithm for computing semantic similarity, described in Section 2, has quadratic time complexity, since it computes pairwise similarities between the documents. Given that the similarities between the documents have already been computed, the time required by the clustering algorithm is linearly proportional to the number of document pairs.
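A minimal sketch of the single-pass procedure just described is given below, assuming the document pairs have already been ranked by decreasing similarity. The function signature, the optional cap on the number of clusters (MaxC) and the handling of pairs that fail the SM check (here simply skipped) are illustrative readings of the description above.

```python
def single_pass_cluster(ranked_pairs, sm, max_clusters=None):
    """
    ranked_pairs: list of (doc_a, doc_b) sorted from most to least similar.
    sm: Search Margin, the maximum list distance between the pair that last
        extended a cluster and a new pair that is allowed to extend it.
    """
    clusters = []          # each cluster: {"members": set, "last_pos": int}
    for pos, (a, b) in enumerate(ranked_pairs):
        touching = [c for c in clusters if a in c["members"] or b in c["members"]]
        if touching:
            c = touching[0]
            if pos - c["last_pos"] <= sm:      # close enough in the ranked list
                c["members"] |= {a, b}
                c["last_pos"] = pos
            # otherwise the pair is discarded (it would over-generalize the cluster)
        elif max_clusters is None or len(clusters) < max_clusters:
            clusters.append({"members": {a, b}, "last_pos": pos})
    return [c["members"] for c in clusters]

pairs = [("D1", "D2"), ("D2", "D3"), ("D5", "D6"), ("D6", "D7"), ("D3", "D4")]
print(single_pass_cluster(pairs, sm=2))   # two clusters: {D1, D2, D3} and {D5, D6, D7}
```

With sm=3 the same call reproduces the second case of the example, merging D4 into the first cluster.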
4 Experimental Corpus and Procedure
For the name disambiguation task we used the test data of the Web People Search Task of the SemEval-2007 Workshop [1,7]. In total, 30 different names were randomly selected from Wikipedia, the ACL-06 Conference and U.S. Census data. As mentioned, each name is shared by numerous persons. For each name the following steps are performed: (a) approximately 100 top-ranked documents are downloaded using the Yahoo! search engine (the documents were downloaded and provided by the workshop organizers), (b) the semantic similarities between documents are computed by exploiting the contexts of the person names that appear in them, as described in Section 2, and (c) the clustering algorithm of Section 3 is applied. The experimental parameters are: (i) the window size (WS), defined in Section 2, (ii) the Search Margin (SM), defined in Section 3, and (iii) the maximum number of generated clusters (MaxC). The MaxC parameter is needed because the number of document clusters with respect to a person name is unknown in advance; thus, the clustering algorithm may generate a different number of document clusters for different names. It should be noted that we experimented with several values of these parameters; however, we report the values that achieved the highest results. Their optimal setting remains an open issue, which depends on the experimental data and the nature of the domain.
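Putting the earlier sketches together, the per-name procedure (steps a–c) might look as follows; it reuses the hypothetical helpers sketched in Sections 2 and 3 above, and the document representation and default parameter values are illustrative only.

```python
from itertools import combinations

def disambiguate(name, documents, ws=10, sm=2, max_clusters=50):
    """documents: dict mapping a document id to its token list (the output of step (a))."""
    feats = {d: context_features(toks, name, ws, scheme="B")
             for d, toks in documents.items()}
    # Step (b): pairwise document similarities via the name's context vectors.
    scored = [(cosine_similarity(feats[a], feats[b]), a, b)
              for a, b in combinations(documents, 2)]
    scored.sort(reverse=True)
    ranked_pairs = [(a, b) for _, a, b in scored]
    # Step (c): single-pass clustering of the ranked pairs.
    return single_pass_cluster(ranked_pairs, sm, max_clusters)
```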
5 Evaluation Results
In this section, we present the evaluation results of the proposed method, in comparison with the 16 systems that participated in the Web People Search Task of the SemEval-2007 Workshop. According to [1,7], five annotators worked to build the gold standard for evaluation purposes. Despite the fact that we
did not participate in the workshop, we used its official evaluation system. The official evaluation system generates a detailed report about the created document clusters that includes several measurements: purity, inverse purity, and two weighted schemes of purity and inverse purity. For the final ranking of systems, the organizers used the harmonic mean of purity and inverse purity [3], Fα=0.5, defined in [1] as:

$$F_{\alpha=0.5} = \frac{1}{\alpha\,\frac{1}{\text{purity}} + (1-\alpha)\,\frac{1}{\text{inverse purity}}} \qquad (2)$$

We also used the measure of Eq. (2). Note that purity is related to the precision measure [9] and favors clusters with fewer erroneous document assignments [1]. On the other hand, inverse purity is related to the recall measure [9], favoring clusters with more correct document assignments with respect to the total correct assignments [1]. In Fig. 1 we present the Fα=0.5 scores for several values of MaxC, for both the Binary and the Frequency scheme. Although we experimented with numerous values of SM, in this figure we present the values of SM for which the highest results were achieved, using WS = 10. Also, in Fig. 1 we include two baseline scenarios, which were suggested by the organizers of the Web People Search Task of the SemEval-2007 Workshop: “1in1” and “ALLin1”. According to the “1in1” baseline scenario, each document forms a single cluster containing only itself, so the purity score takes its highest value, which is equal to 1. According to the “ALLin1” baseline scenario, all the documents are assigned to a single cluster, which gives the highest inverse purity score, equal to 1. The Fα=0.5 scores for the 1in1 and ALLin1 baseline scenarios were reported to be 0.61 and 0.40, respectively [1]. We observe that the performance of the proposed system becomes higher as the maximum number of clusters, MaxC, increases. For MaxC ≥ 50, the Fα=0.5 score reaches its highest value, which remains constant: 0.78 and 0.77 for the Binary and Frequency scheme, respectively.
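For concreteness, the ranking measure of Eq. (2) with α = 0.5 is simply the harmonic mean of purity and inverse purity; the numeric values below are made-up examples, not scores from the task.

```python
def f_alpha(purity, inverse_purity, alpha=0.5):
    """Eq. (2): weighted harmonic mean of purity and inverse purity."""
    return 1.0 / (alpha / purity + (1 - alpha) / inverse_purity)

# Hypothetical example: a clustering with purity 0.85 and inverse purity 0.70.
print(round(f_alpha(0.85, 0.70), 2))   # 0.77
```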
Fig. 1. Fα=0.5 scores vs. baseline scores (x-axis: maximum number of clusters, MaxC, from 10 to 80; y-axis: Fα=0.5; curves: B scheme (SM=3), F scheme (SM=2), and the 1in1 and ALLin1 baselines)
Table 1. Ranking of official systems (taken from [1]) vs. our approach (NEW)

Rank  SystemID              Fα=0.5
1     CU COMSEM             0.78
NEW   Proposed (B scheme)   0.78
NEW   Proposed (F scheme)   0.77
2     IRST-BP               0.75
3     PSNUS                 0.75
4     UVA                   0.67
5     SHEF                  0.66
6     FICO                  0.64
7     UNN                   0.62
–     Baseline 1in1         0.61
8     AUG                   0.60
9     SWAT-IV               0.58
10    UA-ZSA                0.58
11    TITPI                 0.57
12    JHU1-13               0.53
13    DFKI2                 0.50
14    WIT                   0.49
15    UC3M-13               0.48
16    UBC-AS                0.40
–     Baseline ALLin1       0.40
At this point, the system performance outperforms both baseline scores. Also, it is interesting to observe that these scores are achieved for small values of SM, 2 and 3 for the Binary and Frequency scheme, respectively. This happens because such values preserve the semantic homogeneity of the created clusters. Regarding the values of WS, we observed that for WS ≥ 10 similar results were achieved, in contrast to smaller window sizes. This suggests that the immediate context cannot provide adequate information for the disambiguation task, so a larger context should be exploited. Furthermore, Table 1 presents the ranking of the 16 official systems that participated in the Web People Search Task of SemEval-2007, along with the two best performances of our system. The evaluation scores for the 16 systems were taken from [1]. The two baseline scores are also included. We observe that the proposed system with the Binary scheme achieves results equal to those of the best official system, CU COMSEM. Also, the performance obtained by the Frequency scheme is greater than that of the second-best official system, IRST-BP. For the majority of the official systems¹, several variations of agglomerative clustering were applied. This is also the case for the best official system, which employed agglomerative clustering with single linkage. Our approach is simpler
¹ A complete and detailed description of the official systems can be found in the proceedings of SemEval-2007.
compared to the typical agglomerative approach, because it does not compute similarities between the progressively created clusters. Instead, as described in Section 3, pairwise similarities between documents are computed only once, and the clusters are created according to the common members shared between the top-ranked document pairs. Other clustering approaches were employed by a smaller number of official systems, including k-means clustering, as well as more sophisticated techniques such as stochastic graph-based clustering. In general, several NLP techniques were applied by the official systems for feature selection, in order to take into consideration richer information than simple lexical types. For example, the three best official systems employed named entity recognition. Also, the best official system incorporated syntactic phrase chunking. In contrast, our system does not rely on such techniques, since it uses only the lexical information extracted from the considered contextual environment, described in Section 2.
6 Conclusions and Future Work
In this work, we presented a method for person name disambiguation which employs a semantic similarity metric in combination with a computationally low-cost clustering algorithm. The similarity metric exploits the contextual environment of names and is fully unsupervised, i.e., there is no need for knowledge resources. The employed clustering algorithm has low computational cost, since it performs only a single pass over the document similarity scores. This characteristic can be considered an important advantage compared to the more sophisticated clustering approaches followed by the official systems. Finally, the proposed system was evaluated using the official evaluation system of the Web People Search Task of SemEval-2007, and its performance was found to be equal to the highest official score. Future work includes the enhancement of the applied similarity metric by investigating additional contextual features, such as part-of-speech tags. It would also be interesting to conduct further experiments with more and different clustering algorithms. Finally, we believe it is worthwhile to try the proposed method on further disambiguation tasks.
References 1. Artiles, J., Gonzalo, J., Sekine, S.: The SemEval-2007 WePS Evaluation: Establishing a Benchmark for the Web People Search. In: Proc. ACL 4th International Workshop on Semantic Evaluations, SemEval 2007 (2007) 2. Bagga, B., Baldwin, B.: Entity-based Cross-document Coreferencing using the Vector Space Model. In: Proc. COLING (1998) 3. Duda, R., Stork, D., Hart, P.: Pattern Classification. John Wiley & Sons, Chichester (2000) 4. Gooi, H.C., Allan, J.: Cross-document Coreference on a Large Scale Corpus. In: Proc. HLT/NAACL (2004)
5. Guha, V.R., Garg, A.: Disambiguating People in Search. In: Proc. 13th World Wide Web Conference (2004) 6. Herbert, R., Goodenough, B.J.: Contextual Correlates of Synonymy. Communications of the ACM 8 (1965) 7. http://www.senseval.org/ 8. Iosif, E., Tegos, A., Pangos, A., Fosler-Lussier, E., Potamianos, A.: Unsupervised Combination of Metrics for Semantic Class Induction. In: Proc. Spoken Language Technology Workshop (2006) 9. Jurafsky, D., Martin, J.H.: Speech and Language Processing. Prentice Hall. Upper Saddle River (2000) 10. Lewis, D.: Naive Bayes at Forty: The Independence Assumption in Information Retrieval. In: Proc. of European Conference on Machine Learning (1998) 11. Mann, S.G., Yarowsky, D.: Unsupervised Personal Name Disambiguation. In: Proc. CoNLL (2003) 12. Pargellis, A., Fosler-Lussier, E., Lee, C., Potamianos, A., Tsai, A.: Auto-Induced Semantic Classes. Speech Communication 43, 183–203 (2004) 13. Phan, X.-H., Nguyen, L.-M., Horiguchi, S.: Personal Name Resolution Crossover Documents by a Semantics-based Approach. IEICE Inf. and Syst. E89-D (2006) 14. Searle, R.J.: Proper Names. Mind 67, 166–173 (1958) 15. Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1) (2002)
Time Does Not Always Buy Quality in Co-evolutionary Learning Dimitris Kalles and Ilias Fykouras Hellenic Open University, Patras, Greece [email protected], [email protected]
Abstract. We review an experiment in co-evolutionary learning of game playing, showing experimental evidence that the straightforward composition of individually learned models more often than not dilutes what was earlier learned, and that self-playing can lead to plateaus of uninteresting playing behavior. These observations suggest that learning cannot be easily distributed when one hopes to harness multiple experts to develop a quality computer player, and they reinforce the need to develop tools that facilitate the mix of expert-based tuition and computer self-learning.
1 Introduction

Games are a domain with a significant human-computer interaction component and have witnessed significant inroads by machine learning techniques; Shannon [1] and Samuel [2] provided the first stimulating examples, Deep Blue defeated Kasparov at chess in 1997 [3] and, more recently, Schaeffer’s team solved checkers completely [4]. The advances of machine learning techniques, beyond also serving a thriving market for playing machines for the public, now allow us to tackle a much more difficult and challenging question: how can a person instruct a computer to play a game by simply showing it how to play? To put it in a nutshell, we are given a computer that is equipped with a generic learning mechanism and we must design a “syllabus” of experience that will allow it to formulate playing knowledge. This paper is about using humans as experts who attempt to teach a computer how to play a strategy board game. We pursue this line of work to investigate whether expert playing behavior can be generalized from brief, not-so-expert, training sessions. Our main experimental result is that while individual training sessions between a human and a computer can improve the computer’s performance, a straightforward composition of individually learned behaviors is not yet possible. To arrive at this observation we designed and carried out 1,100,000 computer-vs.-computer games based on an earlier experiment of more than 1,000 human-vs.-computer games. For our work, we used two distinct groups of users, one consisting of high-school students and one consisting of their tutors, across various disciplines. The workbench we use is a relatively simple strategy board game that, for legacy reasons, is called RLGame, since it uses reinforcement learning as the sole learning mechanism to infer how to play by observing games in action. We were interested in investigating both pupils and teachers primarily because these two groups have a huge gap in their
academic development, experience and perception of how one instructs, so that our tutoring sample would have a healthy diversity. This paper is structured in four subsequent sections. We first present the rules of the game we use as a workbench and briefly review our earlier work to establish the basis on which we build. We then review the experiments we designed and carried out with high-school pupils and their teachers, as well as the automated experiments that allow us to make a comparison. We then discuss our findings, drawing references to related work, and conclude, also discussing future aspects of our work.
2 A Brief Background on a Strategy Game Workbench

The game [5] is played by two players on an n x n square board. Two a x a square bases are on opposite board corners; the white player starts off the lower left base and the black player starts off the upper right one. Initially, each player possesses β pawns. The goal is to move a pawn into the opponent’s base. Currently, we use n = 8, a = 2 and β = 10. The base is considered as a single square, therefore a pawn can move out of the base to any adjacent free square. Players take turns and pawns move one at a time. A pawn can move vertically or horizontally to an adjacent free square, provided that the maximum distance from its base is not decreased (so, backward moves are not allowed). The distance from the base is defined as the maximum of the horizontal and the vertical distance from the base. A pawn that cannot move is lost (more than one pawn may be lost in one move). A player also loses by running out of pawns.
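As an illustration of the distance rule only (base handling, captures and turn order are omitted), a move-legality check might look like the following; the coordinate convention, the treatment of the base as a single corner point and the function names are assumptions made here.

```python
def distance_from_base(pos, base):
    """Maximum of the horizontal and vertical distance from the (assumed) base corner."""
    return max(abs(pos[0] - base[0]), abs(pos[1] - base[1]))

def legal_step(pos, new_pos, base, occupied, n=8):
    """A pawn may step to an adjacent free square without decreasing its distance from its own base."""
    dx, dy = abs(new_pos[0] - pos[0]), abs(new_pos[1] - pos[1])
    adjacent = dx + dy == 1                       # vertical or horizontal step only
    on_board = 0 <= new_pos[0] < n and 0 <= new_pos[1] < n
    free = new_pos not in occupied
    no_retreat = distance_from_base(new_pos, base) >= distance_from_base(pos, base)
    return adjacent and on_board and free and no_retreat
```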
Fig. 1. Examples of game rules application
The leftmost board in Fig. 1 demonstrates a legal and an illegal move for the pawn pointed to by the arrow, the illegal move being due to the rule that does not allow decreasing the distance from the home (black) base. The rightmost boards demonstrate the loss of pawns, with arrows showing pawn casualties. A “trapped” pawn is automatically removed from the game; so, when there is no free square next to the base, the rest of the pawns of the base are lost. In RLGame the a priori knowledge of the system consists of the rules only. To judge what the next move should be, we use reinforcement learning [6] to learn an optimal policy that will maximize the expected sum of rewards over a specific time horizon, determining which action should be taken next given the current state of the environment. We approximate the value function on the game state space with neural
networks [7], where each next possible move and the current board configuration are fed as input, and the network outputs a score that represents the expectation of winning by making that move [5], as initially adopted in Neurogammon [8]. We use a commonly applied ε-greedy policy with ε=0.9 (the system chooses the best-valued action with a probability of 0.9 and a random action with a probability of 0.1), assign the same initial value to all states but the final one, and update values after each move through TD(0.5), thus halving the influence of credits and penalties for every backward step that we consider. For each neural network, the input layer nodes are the board positions for the next possible move, plus a binary attribute on whether a pawn has entered an enemy base, plus some more binary attributes on whether the number of pawns in the home base has exceeded some thresholds, totalling n²−2a²+10 input nodes. The hidden layer consists of half as many nodes. There is one output node; it stores the probability of winning when one starts from a specific game-board configuration and then makes a specific move. Note that, since we use a neural network to approximate the state space, we do not have access to the original state space, which would be needed to compute an optimal deterministic policy [9].

2.1 Reviewing the Effects of Expert Involvement

The initial experiments demonstrated that, when trained with self-playing, both players would converge to having nearly equal chances to win [5], and that self-playing would achieve weaker performance compared to a computer playing against a human player, even with limited human involvement [10] (from this point onwards, we use the terms CCk and HCk to indicate a session of k games in computer-vs-computer and human-vs-computer mode, respectively). We next devised a way to measure the relative effectiveness of the policies learned by two distinct approaches [11]. Assuming that we have available a player X, with its associated white and black (neural network) components, WX and BX, we compare it to Y by first pairing WX with BY for a CC1000 session, then pairing WY with BX for a further CC1000 session, and subsequently calculating the number of games won and the average number of moves per game won (see Table 1 for an example).

Table 1. Comparative evaluation of learned policies X and Y (X collectively wins over Y)
              Games Won            Average # of Moves
              White      Black     White     Black
WX vs. BY     715 (X)    285 (Y)   291       397
WY vs. BX     530 (Y)    470 (X)   445       314
We then used the above reporting scheme in round-robin tournaments of learning policies (each player competing against every other player; note the difference from elimination tournaments, where only winners advance to the next round) and observed that, in general, a low average number of moves per session was associated with one of the sides being a comprehensive winner, as reflected in the number of games won.
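The learning mechanism outlined in this section (a neural-network value estimate per candidate move, ε-greedy selection, and temporal-difference updates whose influence is halved per backward step) can be sketched as follows; the network interface, the learning rate and the simplified credit assignment are assumptions of this sketch, not the RLGame implementation.

```python
import random

def choose_move(value_net, state, candidate_moves, epsilon=0.9):
    """Epsilon-greedy as described: best-valued move with probability epsilon, random otherwise."""
    if random.random() < epsilon:
        return max(candidate_moves, key=lambda m: value_net.predict(state, m))
    return random.choice(candidate_moves)

def td_lambda_update(value_net, history, new_value, alpha=0.1, lam=0.5):
    """TD(lambda)-style update after a move: the newest value estimate is propagated
    backwards over the visited (state, move) pairs, its influence halved per step for lam=0.5."""
    trace = 1.0
    for state, move in reversed(history):          # most recent move first
        prediction = value_net.predict(state, move)
        value_net.train(state, move, prediction + alpha * trace * (new_value - prediction))
        trace *= lam
```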
3 Experimentation and Analysis via Tournaments

For the work reported in this paper, the experimental session consisted of two distinct stages. During the first stage we collect data based on HC40 sessions; this is the stage where humans do their best to teach a computer within a limited number (40) of games. During the second stage, the learned policies are paired in CC1000 rounds of various types of elimination tournaments to obtain insight into whether some individual attained a clearly good training of its “computer” players and to examine whether the composition of players may deliver a better player. The data collection session took place in a high-school setting where one of the authors serves as a teacher. Twenty students aged about 13 were assigned to play RLGame for 40 consecutive games each; the neural networks were initialized before the first game and were updated throughout the HC40 session. Additionally, twelve teachers were assigned to play RLGame for 40 consecutive games. Both groups attended a short presentation on the rules of the game; furthermore, a one-page brochure serving as a quick reference, outlining the rules and the concept of the experimentation, was distributed as an aid throughout the experiments. This brochure helped reinforce their capacity to look for information on their own and helped develop a positive attitude towards the experiment. We instructed all users that we were asking them to attain two main goals: to beat the computer and to teach it. We emphasized “winning” to limit the degrees of freedom of our experiment and to ensure an as equal as possible footing for all learned policies. We also emphasized that the computer learns from wins, losses and pawn eliminations. Based on the current geometric configuration of RLGame, a player needs a minimum of 10 moves to navigate to the enemy base; the median number of moves played in all games in both groups was slightly over 11.

3.1 All-Inclusive Knock-Out Tournaments

The first knock-out tournament paired all students and teachers in successive CC1000 elimination rounds. This was a synthesis elimination tournament; therein, winners advanced to the next round using the modified neural networks, as evolved during the matches with their opponents. In Fig. 2 we show the details of the tournament (each internal node signifies a match between the two players that point into it, and its label shows the winning player, which, however, advances to the next round in its modified form). Since the number of players and teachers happened to be a power of 2, all players participated in the first round (which is why we refer to these tournaments as complete). However, we took care to match a teacher with a student in the first round. A cursory review of the tournament reveals that teachers outperform students not only in terms of who wins the tournament but also in terms of relative wins in matches between them, along all tournament legs. Subsequently, we carried out a similar tournament that paired all students and teachers in successive CC10000 elimination rounds (Fig. 3).
Fig. 2. The CC1000 tournament
A cursory review of this tournament reveals that, this time, teachers outperform students in terms of relative wins in matches between them in the early legs of the tournament; eventually, however, students improve and a student wins the tournament. To assess the relative significance of these results we employ two metrics [11]:

- The speed ratio (at least 1); values close to 1 suggest that the two CC1000 sessions are of roughly the same length.
- The advantage ratio (at least 1); large values indicate that one of the players was a comfortable winner (in Table 1, note that WY-BX is nearly a draw and WX-BY produces a clear winner; the advantage ratio of X-Y is near 2).

We then analyzed for each tournament the path of the winning player from the first round up to the final. We summarize our findings below:

- For the CC1000 tournament, the T3 player sustained an advantage ratio of at least 3 in all its games except the second one (against T3), where it still won clearly. T3 sustained speed ratios > 1.5 in all tournament legs except the first one. T3 also did very well (ratios > 5) in the semi-final against S7.
- For the CC10000 tournament, the S19 player scored an overwhelming win against S14 in the first leg and then proceeded with relatively close wins, eventually winning the final with an advantage ratio of only 1.04. Speed ratios did not show any interesting deviations from the CC1000 tournament.
Fig. 3. The CC10000 tournament
- The CC10000 tournament seems to have produced significantly larger average numbers of moves per game than the CC1000 tournament. Coupled with the observation that S19’s advance was accompanied by close wins, this finding confirms earlier results suggesting that it is not productive to run long sessions of CC games, since they tend to overstretch game durations [11].

To further investigate whether long-session tournaments actually contribute to some improvement, we ran two further direct comparison tests:

- For the CC1000 tournament, we compared the four finalists, T3-5, T11-4, T8-3, and S7-3, to their original versions, T3, T11, T8, and S7 respectively, via CC1000 games. The original players lost on all occasions.
- For the CC10000 tournament, we compared S19-5, T8-4, T1-3, and T7-3 to S19, T8, T1, and T7 respectively, and observed that two games were won by the original players and two by the evolved ones. Moreover, the CC10000 comparison games were considerably lengthier than their CC1000 counterparts.
We reinforced the above findings by observing that T3-5 beat S19-5 in both a CC1000 session and a CC10000 session. However, the most important finding of this duel was that the increased duration from a CC1000 session to a CC10000 one slightly increased
(by about 1/10) the percentage of games won by S19-5, and at the same time decreased by about 1/3 the average number of moves per game won by S19-5. We can, therefore, claim that, drawing from all the data on T3-5, an initially trained player will invariably be led to diluted performance (the ability to win games and to do so fast) when extensive automatic co-evolutionary playing is allowed.

3.2 Focused Knock-Out Tournaments against Mini-max Trained Players

In the next experimentation stage, we pit the four finalists of the above tournaments against a selection of mini-max guided white players with an increasing look-ahead of 1, 3, 5, 7 moves (these correspond to 2n+1 moves: n+1 moves for the white player and n responses for the black one) [12]. We refer to these mini-max players as MC1, MC3, MC5 and MC7; all were automatically built via CC100 sessions. We first tackled the CC1000 tournament with two follow-up tournaments:
- We ran a new CC1000 tournament pitting T3-5, T11-4, T8-3, and S7-3 against MC7, MC5, MC3 and MC1 respectively, with a random initial pairing (see the left part of Fig. 4). Therein, a T3-based player was pronounced the winner (with clear advantage ratios of > 2).
- We ran a new CC1000 tournament pitting T3, T11, T8, and S7 against MC5, MC3, MC7 and MC1 respectively, with a random initial pairing (see the right part of Fig. 4). This time, the T3-based player was eliminated in the first round. However, when we analysed its games, we saw that it lost to the MC5-based player by a close margin, and that the MC5-based player then proceeded to a clear win in the second leg, only to lose the final in a close contest.
We observe, therefore, that T3–based players demonstrated a (relatively speaking) consistent quality throughout all CC1000 experiments.
Fig. 4. The CC1000 tournaments with mini-max opponents
We then followed up the CC10000 tournament in a similar fashion:
- We ran a new CC10000 tournament pitting S19-5, T8-4, T1-3, and T7-3 against MC7, MC5, MC3 and MC1 respectively, with a random initial pairing. Therein, a T8-based player was pronounced the winner, and S19-5 was clearly eliminated in the first round.
- We ran a new CC10000 tournament pitting T8, T7, T1 and S19 against MC7, MC5, MC3 and MC1, with a random initial pairing. Therein, an MC3-based player was pronounced the winner, beating a T8-based player in the final (the T8-based finalist had, in turn, also comfortably won the first two rounds).
This time, the longer experiments deliver a picture of player quality that is not as far removed from the picture delivered by the shorter games.
4 On the Validity of the Results

The cautious reader may question the use of a high-school teacher or a student as an expert in our experiments. It is true that these people are not experts but, at the current level of computers playing RLGame, any reasonable human opponent is expert enough. Moreover, we also aim to explore in our research the suggestion that the pleasure of interaction may be a key success factor of entertainment robotics [13,14]. Capturing a player’s style is, first of all, an exercise in developing adequate infrastructure to codify and store that “style”. A couple of dozen games, however, only provide a snapshot of that style. Enlarging that snapshot can be accomplished by obtaining more instances of that style (i.e., letting experts play more games), by attempting to generalize from the given instances (i.e., attempting to automatically evolve the learned attitude by extensive automatic self-playing), or by combining the two approaches (note the similarity to the cycle of instructive demonstration, generalization and practice trial, as it appears in the robot task learning terminology [13]). Judging what the best combination may be necessarily entails a workflow of human-computer interaction activities that may be automated only if we have credible metrics that relate to some notion of game playing quality [11]. Such an examination would also have to take into account the relative richness (or lack thereof) of the tactics employed by the human player during the game [15]; when this richness is constrained by the decision of an expert (human or mini-max) player to pursue only a limited number of options, the result is that only a small part of the value function gets a chance to be learned. Sometimes, when the original players are consistent in their behavior (as are MC players with a small look-ahead), that small part may be able to guide the player’s behavior for an extended number of games. But it seems that, after all, combining small parts of knowledge developed via reinforcement learning is also likely to create confusion, since value updates may take less time to affect previously learned parameters.
5 Conclusions

In this paper we have reviewed experiments to develop game players based on input from sample games played by experts. We have used two player groups, one consisting
of high-school students and one consisting of their tutors, and then added a group of mini-max trained computer players. The experimentation suggested that attempting to merge such behaviors in a straightforward fashion does not result in improved automatic game playing. Initially this seems to suggest that we must rethink how to deploy such synthetic approaches to game-play learning; it looks like composition that can exploit parallelism is not as easy as intuition would have led us to believe. Interactive evolution is a promising direction. In such a course, one would ideally switch from focused expert-based training to autonomous crawling between promising alternatives. But, as we have discovered during the preparation of this work, the interactivity requirements of the process of improving the computer player are very tightly linked to the availability of a computer-automated environment that supports this development. It is a must to strive to put the expert in the loop as efficiently as possible. In our approach, this has meant the development of scripts to generate game identifiers, to generate matches and calculate the results, as well as to interface with the job submission procedure of hellasgrid (http://www.hellasgrid.gr). In terms of the experiments described above, we have noticed several features of an experimentation system that we deem indispensable if one views the project from the point of view of system efficiency. Such features range from being able to easily design an experimentation batch, direct its results to a specially designed database (also facilitating reproducibility), automatically process the game statistics and observe correlations, and link experimentation batches in terms of succession, to being able to pre-design a whole series of linked experiments with varying parameters of duration and succession and then guide experts to play according to that design. We need to improve interaction if we aim for interactive evolution.
Acknowledgements

This paper has not been published elsewhere and has not been submitted for publication elsewhere. The paper shares some setting-the-context paragraphs and suggestions for further work with related papers, with which it is neither identical nor similar. A related report is available at http://arxiv.org/abs/0911.1021. Code and data are available for academic purposes, upon request.
References 1. Shannon, C.: Programming a computer for playing chess. Philosophical Magazine 41(4), 265–275 (1950) 2. Samuel, A.: Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development 3, 210–229 (1959) 3. Hsu, F.-H.: Behind Deep Blue: Building the Computer that Defeated the World Chess Champion. Princeton University Press, Princeton (2002) 4. Schaeffer, J., Bjoernsson, Y., Burch, N., Kishimoto, A., Mueller, M., Lake, R., Lu, P., Sutphen, S.: Solving Checkers. In: International Joint Conference on Artificial Intelligence (2005)
5. Kalles, D., Kanellopoulos, P.: On Verifying Game Design and Playing Strategies using Reinforcement Learning. In: ACM Symposium on Applied Computing, special track on Artificial Intelligence and Computation Logic, Las Vegas (2001) 6. Sutton, R.: Learning to Predict by the Methods of Temporal Differences. Machine Learning 3(1), 9–44 (1988) 7. Sutton, R., Barto, A.: Reinforcement Learning - An Introduction. MIT Press, Cambridge (1988) 8. Tesauro, G.: Temporal Difference Learning and TD-Gammon. Communications of the ACM 38(3), 58–68 (1995) 9. Littman, M.L.: Markov Games as a Framework for Multi-Agent Reinforcement Learning. In: 11th International Conference on Machine Learning, San Francisco, pp. 157–163 (1994) 10. Kalles, D., Ntoutsi, E.: Interactive Verification of Game Design and Playing Strategies. In: IEEE International Conference on Tools with Artificial Intelligence, Washington D.C. (2002) 11. Kalles, D.: Player co-modelling in a strategy board game: discovering how to play fast. Cybernetics and Systems 39(1), 1–18 (2008) 12. Kalles, D., Kanellopoulos, P.: A Minimax Tutor for Learning to Play a Board Game. In: 18th European Conference on Artificial Intelligence, workshop on Artificial Intelligence in Games, Patras, Greece, pp. 10–14 (2008) 13. Nicolescu, M.N., Matarić, M.J.: Natural methods for robot task learning: Instructive demonstrations, generalization and practice. In: 2nd International Conference on Autonomous Agents and Multi-Agent Systems, Melbourne, pp. 241–248 (2003) 14. Kaplan, F., Oudeyer, P.-Y., Kubinyi, E., Miklosi, A.: Robotic Clicker Training. Robotics and Autonomous Systems 38(3–4), 197–206 (2002) 15. Kalles, D.: Measuring Expert Impact on Learning how to Play a Board Game. In: 4th IFIP Conference on Artificial Intelligence Applications and Innovations, Athens, Greece (2007)
Visual Tracking by Adaptive Kalman Filtering and Mean Shift Vasileios Karavasilis, Christophoros Nikou, and Aristidis Likas Department of Computer Science, University of Ioannina, PO Box1186, 45110 Ioannina, Greece Ph.: + (30) 26510 08802 {vkaravas,cnikou,arly}@cs.uoi.gr
Abstract. A method for object tracking combining the accuracy of mean shift with the robustness to occlusion of Kalman filtering is proposed. At first, an estimation of the object’s position is obtained by the mean shift tracking algorithm and it is treated as the observation for a Kalman filter. Moreover, we propose a dynamic scheme for the Kalman filter as the elements of its state matrix are updated on-line depending on a measure evaluating the quality of the observation. According to this measure, if the target is not occluded the observation contributes to the update equations of the Kalman filter state matrix. Otherwise, the observation is not taken into consideration. Experimental results show significant improvement with respect to the standard mean shift method both in terms of accuracy and execution time. Keywords: Visual tracking, Kalman filter, mean shift algorithm.
1 Introduction
Tracking is the procedure of generating an inference about motion given a sequence of images: based on a set of measurements in image frames, the object’s true position should be estimated. Tracking algorithms may be classified into two categories [1]. The first category is based on filtering and data association, while the second family of methods relies on target representation and localization. The algorithms based on filtering assume that the moving object has an internal state which may be measured and, by combining the measurements with a model of state evolution, the object’s true position is estimated. The first method of that category is the Kalman filter, which successfully tracks objects even in the case of occlusion if the assumed type of motion is correctly modeled [2]. This category also includes the Condensation algorithm [3], which is more general than Kalman filtering, as it does not assume a specific type of density and, using factored sampling, has the ability to predict an object’s location under occlusion as well. On the other hand, tracking algorithms relying on target representation and localization employ a probabilistic model of the object appearance and try to detect this model in consecutive frames of the image sequence. More specifically,
color or texture features of the object, masked by an isotropic kernel, are used to create a histogram. Then, the object's position is estimated by minimizing a cost function between the model's histogram and candidate histograms in the next image. A representative method in this category is the mean shift algorithm [1]. Other approaches using multiple kernels [4], the Earth Mover's Distance [5] and a Newton-style optimization procedure [6] have also been proposed. Combinations of Kalman filtering with mean shift have also been proposed in [7,8,9,10,11]. Other works track many objects simultaneously [12]. A Gaussian mixture model (GMM) was used in [13] to represent the object in a joint spatial-color space. Furthermore, the object may be represented by its contour [14], and multiple object representations have also been combined to make the tracking procedure more robust [15].

In this paper we propose to consider the estimated location of the target obtained by mean shift as a measurement (observation) of a time-varying Kalman filter, in order to address cases presenting occlusions or abrupt motion changes. Hence, the prediction for the object's location is forwarded to a Kalman filter whose state matrix parameters are not constant but are updated on-line based on the recent history of the estimated motion.

The remainder of the paper is organized as follows: in sections 2 and 3, the mean shift algorithm and the Kalman filter are respectively reviewed. In section 4, the combination of the algorithms in order to address the problem of occlusion is described. Experimental results are shown in section 5, which are followed by our conclusion in section 6.
2 Background on Mean Shift Tracker
The mean shift [1] is an algorithm trying to locate the object by finding the local maximum of a function. The object target pdf is approximated by a histogram of $m$ bins $\hat{q} = \{\hat{q}_u\}_{u=1...m}$, $\sum_{u=1}^{m}\hat{q}_u = 1$, with $\hat{q}_u$ being the $u$-th bin. To form the histogram, only the pixels inside an ellipse surrounding the object are taken into account. The center of the ellipse is assumed to be at the origin of the axes. Due to the fact that the ellipse contains both object pixels and background pixels, a kernel with profile $k(x)$, $k : [0, \infty) \to \mathbb{R}$, is applied to every pixel to make pixels near the center of the ellipse be considered more important. To reduce the influence of the different lengths of the ellipse axes on the weights, the pixel locations are normalized by dividing the pixel's coordinates with the ellipse's semi-axes dimensions $h_x$ and $h_y$. Let $\{x_i^*\}_{i=1...n}$ be the normalized pixel spatial locations. The $u$-th histogram bin is given by

$$\hat{q}_u = C \sum_{i=1}^{n} k(\|x_i^*\|^2)\,\delta[b(x_i^*) - u] \qquad (1)$$

where $b : \mathbb{R}^2 \to \{1 \dots m\}$ associates each pixel with a bin in the quantized feature space, $\delta$ is the Kronecker delta function and $C$ is a normalization factor such that $\sum_{u=1}^{m}\hat{q}_u = 1$.
In the next image, the object candidate is inside the same ellipse with its center at the normalized spatial location $y$. Let $\{x_i\}_{i=1...n}$ be the normalized pixel coordinates inside the target candidate ellipse. The pdf of the target candidate is also approximated by an $m$-bin histogram $\hat{p}(y) = \{\hat{p}_u(y)\}_{u=1...m}$, $\sum_{u=1}^{m}\hat{p}_u(y) = 1$, with each histogram bin given by

$$\hat{p}_u(y) = C_c \sum_{i=1}^{n} k(\|y - x_i\|^2)\,\delta[b(x_i) - u] \qquad (2)$$

where $C_c$ is a normalization factor such that $\sum_{u=1}^{m}\hat{p}_u(y) = 1$.

The distance between $\hat{q}$ and $\hat{p}(y)$ is defined as:

$$d(y) = \sqrt{1 - \rho[\hat{p}(y), \hat{q}]} \qquad (3)$$

where

$$\rho[\hat{p}(y), \hat{q}] = \sum_{u=1}^{m} \sqrt{\hat{p}_u(y)\,\hat{q}_u} \qquad (4)$$

is the similarity function between $\hat{q}$ and $\hat{p}(y)$, called the Bhattacharyya coefficient.

To locate the object correctly in the image, the distance in (3) must be minimized, which is equivalent to maximizing (4). The ellipse center is initialized at a location $\hat{y}_0$, which is the ellipse center in the previous image frame. The probabilities $\{\hat{p}_u(\hat{y}_0)\}_{u=1...m}$ are computed and, using a linear Taylor approximation of (4) around these values:

$$\rho[\hat{p}(y), \hat{q}] \approx \frac{1}{2}\sum_{u=1}^{m}\sqrt{\hat{p}_u(\hat{y}_0)\,\hat{q}_u} + \frac{C_c}{2}\sum_{i=1}^{n} w_i\, k(\|y - x_i\|^2), \qquad (5)$$

where

$$w_i = \sum_{u=1}^{m}\sqrt{\frac{\hat{q}_u}{\hat{p}_u(\hat{y}_0)}}\,\delta[b(x_i) - u]. \qquad (6)$$
As the first term of (5) is independent of $y$, the second term of (5) must be maximized. The maximization of this term may be accomplished by employing the mean shift algorithm [1], which yields the following update:

$$\hat{y}_1 = \frac{\sum_{i=1}^{n} x_i\, w_i\, g(\|\hat{y}_0 - x_i\|^2)}{\sum_{i=1}^{n} w_i\, g(\|\hat{y}_0 - x_i\|^2)}, \qquad (7)$$

where $g(x) = -k'(x)$ and $k(x)$ is a kernel with an Epanechnikov profile. The complete algorithm [1] is summarized in Algorithm 1.

Algorithm 1. Maximizing the Bhattacharyya coefficient $\rho[\hat{p}(y), \hat{q}]$
Input: The target model $\{\hat{q}_u\}_{u=1...m}$ and its location $\hat{y}_0$ in the previous frame.
1. Initialize the center of the ellipse in the current frame at $\hat{y}_0$, compute $\{\hat{p}_u(\hat{y}_0)\}_{u=1...m}$ using (2) and evaluate $\rho[\hat{p}(\hat{y}_0), \hat{q}]$ using (4).
2. Compute the weights $\{w_i\}_{i=1...n}$ according to (6).
3. Compute the next location $\hat{y}_1$ of the target candidate according to (7).
4. Compute $\{\hat{p}_u(\hat{y}_1)\}_{u=1...m}$ using (2) and evaluate $\rho[\hat{p}(\hat{y}_1), \hat{q}]$ using (4).
5. If $\|\hat{y}_1 - \hat{y}_0\| < \epsilon$, stop. Otherwise set $\hat{y}_0 \leftarrow \hat{y}_1$ and go to Step 2.
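To make the procedure concrete, the following Python sketch (not part of the original paper) illustrates one iteration of Algorithm 1 for an Epanechnikov profile, for which $g(x) = -k'(x)$ is constant over the kernel support; the bin-assignment function bin_index, the number of bins m and the array layout are illustrative assumptions.

import numpy as np

def weighted_histogram(pixels, offsets, m, bin_index):
    """Kernel-weighted color histogram (Eqs. 1-2); offsets are pixel positions
    relative to the ellipse center, normalized by the semi-axes (hx, hy)."""
    k = np.maximum(0.0, 1.0 - np.sum(offsets ** 2, axis=1))   # Epanechnikov profile
    hist = np.zeros(m)
    for weight, px in zip(k, pixels):
        hist[bin_index(px)] += weight
    return hist / max(hist.sum(), 1e-12)                       # normalize so the bins sum to 1

def mean_shift_step(y0, pixels, offsets, q_hat, m, bin_index):
    """One iteration of Algorithm 1: returns the new candidate center (Eqs. 6-7).
    y0, offsets and q_hat are NumPy arrays."""
    p_hat = weighted_histogram(pixels, offsets, m, bin_index)
    bins = np.array([bin_index(px) for px in pixels])
    w = np.sqrt(q_hat[bins] / np.maximum(p_hat[bins], 1e-12))  # Bhattacharyya weights (Eq. 6)
    # For the Epanechnikov profile g is constant, so (7) reduces to a weighted mean of positions.
    return y0 + (offsets * w[:, None]).sum(axis=0) / w.sum()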
3 Kalman Filter
In general, we assume that there is a linear process governed by an unknown inner state producing a set of measurements. More specifically, there is a discrete time system and its state at time $n$ is given by the vector $x_n$. The state in the next time step $n+1$ is given by

$$x_{n+1} = F_{n+1,n}\, x_n + w_{n+1} \qquad (8)$$

where $F_{n+1,n}$ is the transition matrix from state $x_n$ to $x_{n+1}$ and $w_{n+1}$ is white Gaussian noise with zero mean and covariance matrix $Q_{n+1}$. The measurement vector $z_{n+1}$ is given by

$$z_{n+1} = H_{n+1}\, x_{n+1} + v_{n+1} \qquad (9)$$

where $H_{n+1}$ is the measurement matrix and $v_{n+1}$ is white Gaussian noise with zero mean and covariance matrix $R_{n+1}$. In equation (9), the measurement $z_{n+1}$ depends only on the current state $x_{n+1}$, and the noise vector $v_{n+1}$ is independent of the noise $w_{n+1}$. The Kalman filter computes the minimum mean-square error estimate of the state $x_k$ given the measurements $z_1, \dots, z_k$. The solution is a recursive procedure [2], which is described in Algorithm 2.
Algorithm 2. Kalman filter
1. Initialization:
$$\hat{x}_0 = E[x_0], \qquad P_0 = E[(x_0 - E[x_0])(x_0 - E[x_0])^T].$$
2. Prediction:
$$\hat{x}_n^- = F_{n,n-1}\,\hat{x}_{n-1}, \qquad P_n^- = F_{n,n-1}\, P_{n-1}\, F_{n,n-1}^T + Q_n, \qquad G_n = P_n^- H_n^T\, [H_n P_n^- H_n^T + R_n]^{-1}.$$
3. Estimation:
$$\hat{x}_n = \hat{x}_n^- + G_n (z_n - H_n \hat{x}_n^-), \qquad P_n = (I - G_n H_n)\, P_n^-.$$
Go to the Prediction step for the next time step.
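For reference, a minimal Python sketch of one prediction and estimation cycle of Algorithm 2 is given below; it is an illustrative implementation, with matrix names matching the notation above, and makes no claim of being the authors' code.

import numpy as np

def kalman_step(x_prev, P_prev, z, F, H, Q, R):
    """One prediction/estimation cycle of Algorithm 2 (all arguments are NumPy arrays)."""
    # Prediction
    x_pred = F @ x_prev
    P_pred = F @ P_prev @ F.T + Q
    # Kalman gain
    G = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    # Estimation (correction with the measurement z)
    x_est = x_pred + G @ (z - H @ x_pred)
    P_est = (np.eye(len(x_prev)) - G @ H) @ P_pred
    return x_est, P_est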
4 The Proposed Method
The main idea is to find the position of the object with Algorithm 1 (considered as the measurement, or observation, in Kalman filter terminology) and forward it to Algorithm 2 to obtain the current position of the object (estimation). Moreover, in this section, we propose a dynamic scheme for the Kalman filter, as the elements of its state matrix are updated on-line depending on a measure evaluating the quality of the observation. By these means the tracking procedure may be significantly accelerated.

We assume that the object is described by its center coordinates $(x, y)$, that the ellipse axes are $(h_x, h_y)$ and that the size of the ellipse does not change through time. The state vector $x_n = [x_n, y_n, 1]^T$ denotes the true position of the center in the image in homogeneous coordinates ($x_n$ and $y_n$ are the horizontal and vertical coordinates, respectively) and its position varies over time according to equation (8). Matrix $F_{n+1,n}$ is defined as:

$$F_{n+1,n} = \begin{bmatrix} 1 & 0 & dx_{n+1,n} \\ 0 & 1 & dy_{n+1,n} \\ 0 & 0 & 1 \end{bmatrix}$$

where $dx_{n+1,n}$ and $dy_{n+1,n}$ are the horizontal and vertical translations of the object's center. Parameters $dx_{n+1,n}$ and $dy_{n+1,n}$ are not constant in time (figure 1), but they are computed dynamically, as will be explained in what follows. The noise vector $w_{n+1} = [w_{n+1,x}, w_{n+1,y}, 1]^T$ has covariance matrix

$$Q = \begin{bmatrix} \sigma_{Qx} & 0 & 0 \\ 0 & \sigma_{Qy} & 0 \\ 0 & 0 & 0 \end{bmatrix}$$

where $\sigma_{Qx} = h_x$ and $\sigma_{Qy} = h_y$. This means that the assumed noise perturbs the object center inside the ellipse.

We employ Algorithm 1 to obtain the measurement vector $z_{n+1} = [x'_{n+1}, y'_{n+1}]^T$, where $x'_{n+1}$ and $y'_{n+1}$ are the horizontal and vertical coordinates of the ellipse center. In general, these measurements differ from the state variables $x_{n+1}$ and $y_{n+1}$ of vector $x_{n+1}$ due to the presence of noise $v_{n+1}$. The relation between the measurement $z_{n+1}$ and the state $x_{n+1}$ is given by (9), where

$$H = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$$

and the measurement noise $v_{n+1} = [v_{n+1,x}, v_{n+1,y}]^T$ has covariance matrix

$$R = \begin{bmatrix} \sigma_{Rx} & 0 \\ 0 & \sigma_{Ry} \end{bmatrix}$$

where $\sigma_{Rx} = h_x$ and $\sigma_{Ry} = h_y$. This means that the ellipse that actually contains the object and the ellipse we measure in Kalman filtering overlap. The only problem that remains to be solved is the automatic evaluation of $dx_{n+1,n}$ and $dy_{n+1,n}$. Using Algorithm 1, we obtain:
– the measurement $z_{n+1}$,
– the distance between the mixture components of the target model and the target candidate.

The main idea is to use the computed distance to determine whether the object was detected or not. This provides a quality measure of the current estimate of the object. If the distance is small, then there is a good chance that the object's center is near the predicted center. If this distance is large, then the target is lost. This distance is embedded in a normalized coefficient:

$$a(y) = f(d(y)) \qquad (10)$$

where $d(y)$ is given by (3) and is the distance between the model and the candidate histogram at position $y$, and $f$ is a decreasing function. Experiments have been made with various formulas for $f$. Relying on the value of $a(y)$ in (10), the parameter $d_{n+1,n} = [dx_{n+1,n}, dy_{n+1,n}]^T$ is automatically updated by:

$$d_{n+2,n+1} = (1 - a)\, d_{n+1,n} + a\,(\hat{x}_{n+1} - \hat{x}_n) \qquad (11)$$

where $\hat{x}_{n+1}$ is the vector containing the estimated values of the horizontal and vertical coordinates of the ellipse center at time $n+1$. In view of (11), the estimate $\hat{x}_{n+1}$ contributes to the update of the displacement $d_{n+2,n+1}$ only when the current estimate resembles the source object model, that is, when $a(y) \to 1$. On the other hand ($a(y) \to 0$), the displacements included in the state matrix $F_{n+2,n+1}$ remain nearly unchanged, as they were at step $n+1$, considering that the object is occluded. This process has the advantage that matrix $F_{n+2,n+1}$, incorporating information on the object movement, can be updated by the tracking algorithm.

In order to clarify the impact of the parameter $a(y)$ in (10), a representative schema is shown in figure 1. The ellipses with the time steps on top of them show the position of the object in the respective frame. The dashed lines show the iterations of the mean shift for one frame. The solid arrows show the displacements due to the state matrix $F_{n+1,n}$. In frame 4, we assume that an occlusion takes place, so the object is lost. The first row, indicated by MS, shows the results for mean shift, while the second row (MS-K) shows the enhanced mean shift with Kalman filter. Using mean shift, the object is successfully tracked in frames $n = 2$ and $n = 3$, but due to occlusion it is lost after frame $n = 4$. On the other hand, in row 2, the initial state matrix $F_{2,1}$ has $dx_{2,1} = 0$ and $dy_{2,1} = 0$. The object is tracked using mean shift at time $n = 2$. Assuming that $a(y) \to 1$, the state matrix is updated and $F_{3,2}$ has $dx_{3,2} = d_1$ and $dy_{3,2} = 0$. The starting position $y_0^3$ at time $n = 3$ will not be the same as the end position $y^2$ of time $n = 2$, but using matrix $F_{3,2}$ the initial position will be closer to the true object. So the number of iterations of the mean shift is significantly reduced. The state matrix $F_{4,3}$ is updated with $dx_{4,3} = d_2$ and $dy_{4,3} = 0$. Using this state matrix, the object is assumed to be in position $y_0^4$ at time $n = 4$, and because mean shift cannot find the object (occlusion) the end position is $y^4 = y_0^4$, leaving $F_{5,4} = F_{4,3}$ because $a(y) \to 0$. In the last frame ($n = 5$), by using the state matrix $F_{5,4}$, the initial position $y_0^5$ bypasses the object, but due to mean shift (which moves its center backwards) the object is successfully located.

Fig. 1. The displacements $d_1$, $d_2$, $d_3$ and $d_4$ are different. The first row (MS) shows the displacements using only mean shift. In the second row (MS-K) a Kalman filter is combined with mean shift. The dots represent distinct iterations of mean shift in a single frame. The solid arrows show the displacement predicted by the Kalman filter. In frame $n = 4$, an occlusion takes place and the object is lost by the mean shift, while using the Kalman filter the tracking is successful.

Algorithm 3 presents the mean shift with Kalman filter tracking algorithm with occlusion handling.
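As an illustration of the adaptive scheme of Eqs. (10)-(11), a short Python sketch follows; it assumes $f_1(x) = 1 - x$ (one of the three functions tested in Section 5) and uses illustrative variable names rather than the authors' actual implementation.

import numpy as np

def update_displacement(d_prev, x_est, x_est_prev, dist):
    """Adaptive update of the displacement d = [dx, dy] stored in F (Eq. 11)."""
    a = min(max(1.0 - dist, 0.0), 1.0)          # a(y) = f1(d(y)) = 1 - d(y), clipped to [0, 1]
    return (1.0 - a) * d_prev + a * (x_est[:2] - x_est_prev[:2])

def build_state_matrix(d):
    """State matrix F_{n+1,n} acting on the homogeneous state [x, y, 1]^T."""
    F = np.eye(3)
    F[0, 2], F[1, 2] = d[0], d[1]
    return F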
5 Experimental Results
To evaluate the proposed Algorithm 3, we have performed comparisons with the standard mean shift algorithm [1]. Various test sequences were employed in the evaluation. These sequences consist of outdoor and indoor testing situations. Representative frames are shown in figure 2. Each object is described by its center, in image coordinates, and the size of the ellipse around it (the ellipse has axes parallel to the image axes). The ground truth in every image was determined manually. In all tests, the number of histogram bins for the mean shift algorithms was 16, as suggested in [1]. The experiments were carried out with a Core 2 Duo 1.66 GHz processor with 2 GB RAM under Matlab.

To estimate the accuracy of the compared algorithms we measure the normalized Euclidean distance between the true center ($c$) of the object (as determined by the ground truth) and the estimated location of the ellipse center ($\hat{c}$). The normalized Euclidean distance is defined by

$$NED(c, \hat{c}) = \sqrt{\left(\frac{c_x - \hat{c}_x}{h_x}\right)^2 + \left(\frac{c_y - \hat{c}_y}{h_y}\right)^2} \qquad (12)$$

where we recall that $h_x$ and $h_y$ are the ellipse dimensions. This implies that if $NED(c, \hat{c}) < 1$, then the estimated ellipse center $\hat{c}$ is inside the ground truth ellipse. By these means, the image size and the ellipse dimensions do not influence the relative distance between $c$ and $\hat{c}$.
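The metric in (12) translates directly into a few lines of Python; the sketch below is illustrative only and follows the reconstruction of the formula given above.

import math

def ned(c, c_hat, hx, hy):
    """Normalized Euclidean distance (Eq. 12); a value below 1 means the
    estimated center lies inside the ground-truth ellipse."""
    return math.sqrt(((c[0] - c_hat[0]) / hx) ** 2 + ((c[1] - c_hat[1]) / hy) ** 2)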
Algorithm 3. Mean shift with Kalman filter
1. Initialization: $\hat{x}_0 \leftarrow$ initial object location;
$$P_0 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad Q = \begin{bmatrix} h_x & 0 & 0 \\ 0 & h_y & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad R = \begin{bmatrix} h_x & 0 & 0 \\ 0 & h_y & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad F_{1,0} = I_{3\times 3}.$$
2. Compute the initial histogram $\hat{q}$ in the first frame as described in (1).
3. Prediction:
$$\hat{x}_n^- = F_{n,n-1}\,\hat{x}_{n-1}, \quad P_n^- = F_{n,n-1}\, P_{n-1}\, F_{n,n-1}^T + Q, \quad G_n = P_n^- H_n^T\, [H_n P_n^- H_n^T + R]^{-1}.$$
4. Measurement: compute the new center ($z_n$), $\hat{p}(y)$ and the distance between $\hat{q}$ and $\hat{p}$ using Algorithm 1.
5. Estimation:
$$\hat{x}_n = \hat{x}_n^- + G_n (z_n - H_n \hat{x}_n^-), \quad P_n = (I - G_n H_n)\, P_n^-.$$
The output $\hat{x}_n$ is the object's new location.
6. Update the elements of $F_n$ using (11).
7. Go to the Prediction step for the next iteration.
Fig. 2. Representative frames of the image sequences. In the first sequence (walk 1) a woman is walking from the center of the image to the right. In the second sequence (walk 2), a man is walking from the left side to the right and backward. In the third sequence (car 1) the car is moving from the left to the right and backward. In the fourth sequence (car 2) the car is moving from the left to the right and an occlusion takes place at the right side of the image.
The proposed algorithm is tested using three different functions for $f$ in (10): $f_1(x) = 1 - x$, $f_2(x) = 1 - \sqrt[10]{x}$ and $f_3(x) = e^{-10x}$. Tables 1 and 2 summarize the comparisons. As can be seen, the mean shift with Kalman filter has better performance in terms of accuracy and execution time than the original algorithm. Moreover, the type of function $f$ in (10) is not critical for the performance of the algorithm in ordinary tracking scenarios.

Table 1. Tracking accuracy. The average normalized Euclidean distance between the true object center and the estimated object center is presented for the compared methods.

Sequence  Frames  MS     MS-Kf1  MS-Kf2  MS-Kf3
Walk1     80      0.381  0.183   0.206   0.215
Walk2     156     0.229  0.365   0.404   0.398
Car1      183     0.303  0.226   0.321   0.286
Car2      72      lost   0.391   0.452   0.365

Table 2. Execution times for the compared methods (sec/frame)

Sequence  Frames  MS      MS-Kf1  MS-Kf2  MS-Kf3
Walk1     80      7.983   4.004   3.827   3.858
Walk2     156     8.369   4.302   6.432   6.141
Car1      183     10.39   5.958   7.736   6.831
Car2      72      2.262   1.387   1.775   1.708

In figure 3, an example with occlusion is presented. As the car is moving, it is occluded by trees at the right side of the image. The standard mean shift misses the object. On the other hand, the proposed algorithm successfully tracks the car. This happens because the state matrix $F_{n+1,n}$ remains unchanged if the distance (3) is not small enough.

Fig. 3. Car 2: frames of a moving car with occlusions. The first row shows the result of the original mean shift. The second row shows the results of mean shift with Kalman filtering using $f_1(x) = 1 - x$.
6 Conclusion
We have proposed a method combining mean shift with Kalman filter for object tracking in long image sequences. Using mean shift, we obtain an estimation of the object’s location which is then forwarded as an observation to an adaptive Kalman filter. The filter’s state matrix is automatically updated with respect to a quality indicator concerning the obtained measurement. This on-line update of the filter’s parameters may lead to a significant improvement in accuracy and
computation time. Consequently, abrupt motion changes and partial or total occlusions may be successfully addressed. Future work will consider the tracking of multiple targets.
References

1. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(5), 564–577 (2003)
2. Cuevas, E., Zaldivar, D., Rojas, R.: Kalman filter for vision tracking. Technical Report B 05-12, Freier Universitat Berlin, Institut fur Informatik (2005)
3. Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision 29, 5–28 (1998)
4. Fan, Z., Yang, M., Wu, Y.: Multiple collaborative kernel tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(7), 1268–1273 (2007)
5. Zhao, Q., Tao, H.: Differential Earth Mover's Distance with its application to visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(2), 274–287 (2010)
6. Hager, G.D., Dewan, M., Stewart, C.V.: Multiple kernel tracking with SSD. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 1, pp. 790–797 (2004)
7. Zhu, Z., Ji, Q., Fujimura, K., Lee, K.: Combining Kalman filtering and mean shift for real time eye tracking under active IR illumination. In: Proceedings of the 16th International Conference on Pattern Recognition (ICPR 2002), vol. 4, p. 40318 (2002)
8. Babu, R.V., Pérez, P., Bouthemy, P.: Robust tracking with motion estimation and local kernel-based color modeling. Image and Vision Computing 25(8), 1205–1216 (2007)
9. Qi, Y., Jing, Z., Hu, S., Zhao, H.: New method for dynamic bias estimation: Gaussian mean shift registration. Optical Engineering 47(2), 26401 (2008)
10. Lu, H., Zhang, R., Chen, Y.W.: Head detection and tracking by mean-shift and Kalman filter. In: Proceedings of the 2008 3rd International Conference on Innovative Computing Information and Control (ICICIC 2008), p. 357 (2008)
11. Zhao, J., Qiao, W., Men, G.Z.: An approach based on mean shift and Kalman filter for target tracking under occlusion. In: International Conference on Machine Learning and Cybernetics, vol. 4(12–15), pp. 2058–2062 (2009)
12. Bugeau, A., Perez, P.: Track and cut: Simultaneous tracking and segmentation of multiple objects with graph cuts. EURASIP Journal on Image and Video Processing 2008, ID:317278 (2008)
13. Wang, H., Suter, D., Schindler, K.: Effective appearance model and similarity measure for particle filtering and visual tracking. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 606–618. Springer, Heidelberg (2006)
14. Yilmaz, A., Li, X., Shah, M.: Contour-based object tracking with occlusion handling in video acquired using mobile cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(11), 1531–1536 (2004)
15. Moreno-Noguer, F., Sanfeliu, A., Samaras, D.: Dependent multiple cue integration for robust tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(4), 670–685 (2008)
On the Approximation Capabilities of Hard Limiter Feedforward Neural Networks

Konstantinos Koutroumbas1 and Yannis Bakopoulos2

1 Institute for Space Applications and Remote Sensing, National Observatory of Athens, Greece
[email protected]
2 Computational Application Group, Division of Applied Technologies, NCSR Demokritos, Athens, Greece
[email protected]
Abstract. In this paper the problem of the approximation of decision regions bordered by (a) closed and/or (b) open and unbounded convex hypersurfaces using feedforward neural networks (FNNs) with hard limiter nodes is considered. Specifically, a constructive proof is given for the fact that a two or a three layer FNN with hard limiter nodes can approximate with arbitrary precision a given decision region of the above kind. This is carried out in three steps. First, each hypersurface is approximated by hyperplanes. Then each one of the regions formed by the hypersurfaces is appropriately approximated by regions defined via the previous hyperplanes. Finally, a feedforward neural network with hard limiter nodes is constructed, based on the previous hyperplanes and the regions defined by them.

Keywords: open and closed convex hypersurfaces, hard limiter feedforward neural networks, approximation of decision regions.
1 Introduction
An important and widely used class of neural networks is that of the sigmoid feedforward neural networks (FNNs), where the nodes are arranged into layers and are modeled by functions of the form $y = \sigma(w^T x + b)$, where $w$ and $x$ are the parameter and input vectors, respectively, $b$ is the bias of the node and $\sigma$ satisfies $\sigma(z) \to a$ as $z \to -\infty$ and $\sigma(z) \to b$ as $z \to +\infty$ ($a < b$). In the special case where $\sigma(z) = 1\,(0)$ for $z \geq (<)\,0$, $\sigma$ is called a hard limiter (or step activation) function, and the corresponding node is called a hard limiter node.

Several results concerning the capabilities of sigmoid feedforward neural networks (FNNs) in approximating functions of the form $g : S \to S'$ ($S \subseteq \mathbb{R}^l$, $S' \subseteq \mathbb{R}^p$) have appeared in the bibliography in the last two decades. These can be divided into two major categories according to whether they examine the case where $g$ is continuous or discontinuous. Significant results of the former category are given in [3], [6], [1], [2], [4], [10].

For the case where $g$ is discontinuous, the available results are more sparse. For example, in [11] the case where $g$ has a finite number of known discontinuities
is considered. This category also includes functions of the form $g : \mathbb{R}^l \to A$ that partition the $\mathbb{R}^l$ space into regions such that each one of them is assigned to one of the $c$ available classes of $A$. In the sequel, functions of this kind are also called partition functions (a work that is in close affinity with this framework is described in [9]). The case where $g$ partitions $\mathbb{R}^l$ into polyhedral regions (i.e. regions bordered by hyperplanes) has been considered in [5], [8], [12].

In this paper, we consider the problem of approximating functions of the form $g : \mathbb{R}^l \to A$ that partition the (non-compact) $\mathbb{R}^l$ space with convex hypersurfaces, using FNNs with hard limiter nodes (HLFNNs). Specifically, it is shown that two or three layer HLFNNs can approximate with arbitrary accuracy any function of the form of $g$ that partitions the $\mathbb{R}^l$ space using (a) closed and/or (b) open and unbounded convex hypersurfaces (note that one of the two subsets of $\mathbb{R}^l$ bordered by a convex hypersurface is a convex set); see fig. 1. The general idea is to approximate (a) each one of the hypersurfaces with hyperplanes (see figs. 2 and 3) and (b) each one of the regions formed by the hypersurfaces by regions formed by the previous hyperplanes. The hyperplanes as well as the regions formed by them are taken into account for the construction of the appropriate HLFNN (see e.g. [8], [12]).

In the next section we give some preliminary definitions and propositions. In section 3 the problem is stated explicitly. In section 4, procedures for approximating (a) closed and (b) open and unbounded convex hypersurfaces using hyperplanes are discussed. In section 5, a procedure for the approximation of regions bordered by more than one hypersurface with regions bordered by the hyperplanes approximating these hypersurfaces is given. Finally, section 6 contains the concluding remarks.
2 Preliminaries
A hyperplane $H$ in the $\mathbb{R}^l$ space is defined as $H = \{x \in \mathbb{R}^l : w^T x = \sum_{i=1}^{l} w_i x_i + w_0 = 0\}$, where $w = [w_1, w_2, \dots, w_l]^T$, $x = [x_1, x_2, \dots, x_l]^T$ and $w_i \in \mathbb{R}$, $i = 0, \dots, l$. Each hyperplane separates the $\mathbb{R}^l$ space into two sets, denoted by $H^+$ (positive halfspace of $H$) and $H^-$ (negative halfspace of $H$), which are defined as $H^+ (H^-) = \{x \in \mathbb{R}^l : \sum_{i=1}^{l} w_i x_i + w_0 \geq (<)\, 0\}$. Note that for each hyperplane there exists an associated hard-limiter node, whose parameter vector consists of the $w_i$'s of $H$ and whose bias is set equal to $w_0$. Clearly, if its input $x$ lies in the positive (negative) halfspace of $H$, the output of the node will be 1 (0).

Consider the hyperplanes $H_1, H_2, \dots, H_k$ in the $\mathbb{R}^l$ space. We define a cell as a region of $\mathbb{R}^l$ bordered by $\lambda (\leq k)$ of the above hyperplanes, such that none of the remaining $k - \lambda$ hyperplanes intersects it.

A hypersurface $C$ in the $\mathbb{R}^l$ space is characterized by a continuous nonlinear functional $F : \mathbb{R}^l \to \mathbb{R}$ and is defined as $C = \{x \in \mathbb{R}^l : F(x, w) = 0\}$, where $w$ is the vector containing the parameters of $F$. Each hypersurface separates the $\mathbb{R}^l$ space into two sets, denoted by $C^+$ (positive side of $C$) and $C^-$ (negative side of $C$) and defined as $C^+ (C^-) = \{x \in \mathbb{R}^l : F(x, w) \geq (<)\, 0\}$. A hypersphere $S(a, R)$, with radius $R$, centered at $a$, is defined as $S(a, R) = \{x \in \mathbb{R}^l : d^2_{a,x} \equiv (x - a)^T (x - a) = R^2\}$, where $d_{a,x}$ denotes the Euclidean distance between $a$ and $x$. Setting $F(x, a, R) = R - d_{a,x}$, the positive and the negative sides of $S(a, R)$, $S(a, R)^+$ and $S(a, R)^-$, are $S(a, R)^+ = \{x \in \mathbb{R}^l : F(x, a, R) \geq 0\}$ and $S(a, R)^- = \{x \in \mathbb{R}^l : F(x, a, R) < 0\}$, respectively. An $l$-dimensional ball $\bar{B}^l(a, R)$, with center $a$ and radius $R$, is the set of points $x \in \mathbb{R}^l$ with $d_{a,x} \leq R$. An $l$-dimensional open ball $B^{(o)l}(a, R)$ contains the points of $\bar{B}^l(a, R)$ for which $d_{a,x} < R$.

A set $A \subset \mathbb{R}^l$ is called bounded if and only if there exists a ball $\bar{B}^l(x, r)$, $x \in \mathbb{R}^l$, such that $A \subset \bar{B}^l(x, r)$. A set is called unbounded if it is not bounded. If $\forall x, y \in A$ it is $\lambda x + (1 - \lambda) y \in A$, $\lambda \in [0, 1]$, $A$ is called a convex set.

Consider the convex hypersurfaces $C_1, C_2, \dots, C_k$ in the $\mathbb{R}^l$ space. Without loss of generality, we assume that each $C_i$ leaves the convex set it defines on its positive side. A sub-region $R_i \subset \mathbb{R}^l$ is a region formed by the intersection of positive or negative sides defined by the above hypersurfaces, such that: (a) for each hypersurface, only one side (positive or negative), at the most, can contribute to the definition of $R_i$, and (b) no $C_j$ that does not border $R_i$ intersects $\bar{R}_i$, that is, $(\bar{R}_i - \bigcup_{C_q \text{ borders } R_i} C_q) \cap C_j = \emptyset$, where $\bar{R}_i$ is the convex hull of $R_i$. Each $R_i$ is represented by a string $s_i$ of $k$ components. Its $j$-th component, $s_{ij}$, equals 1 or 0 according to whether $R_i$ lies on the positive or on the negative side of $C_j$, respectively. The number of 1's encountered in $s_i$ determines the rank of $R_i$, which is denoted by $n_i$ (see fig. 1).
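The definitions above translate directly into a small Python sketch (illustrative only): a hard limiter node realizes the indicator of a positive halfspace, and the vector of node outputs encodes the cell, or the sub-region string $s_i$, in which a point lies.

import numpy as np

def hard_limiter(w, w0, x):
    """Output of the hard limiter node associated with hyperplane H = {x : w.x + w0 = 0}:
    1 if x lies in H+ and 0 if it lies in H-."""
    return 1 if np.dot(w, x) + w0 >= 0 else 0

def side_code(hyperplanes, x):
    """Bit string locating x with respect to a list of hyperplanes [(w, w0), ...];
    points sharing the same code lie in the same cell."""
    return tuple(hard_limiter(w, w0, x) for w, w0 in hyperplanes)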
Fig. 1. A decision region defined by hypersurfaces. Each formed sub-region is assigned to one of the two classes 1 and 0, as denoted in the figure. Five sub-regions constitute class 1 and six constitute class 0. The sequences of 0's and 1's in each region denote the respective representative string $s_i$ (see section 2).
In the sequel, we state some propositions that are prerequisite for the subsequent analysis.

Proposition 1: Let $C$, $C^+$, $C^-$ be defined as before and let $C$ be (a) convex and (b) homeomorphic to the $(l-1)$-dimensional unit hypersphere $S^{(l-1)} (\subset \mathbb{R}^l)$. Then, $\forall x \in C^+\backslash C$ (the set $C^+\backslash C$ contains all the points of $C^+$ except those that belong to $C$) and for every straight line $\varepsilon$ that passes through $x$ there exist $x', x'' \in \varepsilon$ such that $\{x', x''\} = \varepsilon \cap C$.

Proof: If $x \in C^+\backslash C$ and $\varepsilon$ goes through $x$, then by applying the Jordan theorem it is obvious that at least two points $x', x''$ exist on $\varepsilon \cap C$, one at each side of $x$. We prove that $\varepsilon \cap C$ contains only $x'$ and $x''$, that is, $\varepsilon \cap C = \{x', x''\}$. Suppose on the contrary that, apart from $x'$ and $x''$, $\varepsilon \cap C$ contains a third point $y'$, different from the other two, that lies between $x$ and $x'$. Since $x \in C^+\backslash C$, $\exists \delta > 0$ such that $B^l(x, \delta) \subset C^+\backslash C$. In addition, since $y'$ lies on $C \equiv Bdr(C^+)$ (the topological boundary of $C^+$), $\forall \delta' > 0$, $\exists z' \in C^-$ such that $d_{z'y'} \equiv \|z' - y'\| < \delta'$. By choosing $\delta'$ small enough such that $\|y' - x\| > \delta'$, a line $\zeta'$ passing through $x'$ and $z'$ may have distance from $x$, $d(x, \zeta') \equiv \min_{w' \in \zeta'} \|x - w'\|$, less than $\delta$. So, for any $z \in \zeta' \cap B^{(o)l}(x, \delta)$ the situation arises where $z, z', x' \in \zeta'$, $z \in C^+\backslash C$, $x' \in C$, $z' \in C^-$ and $z'$ lies between $z$ and $x'$. This contradicts the hypothesis that $C$ is convex. Since the same argument holds for the case where $y'$ lies between $x$ and $x''$, it follows that $\{x', x''\} \equiv \varepsilon \cap C$. QED
Proposition 2: Let $C$, $C^+$, $C^-$ be defined as before and let $C$ be (a) convex, (b) unbounded and (c) homeomorphic to $\mathbb{R}^{(l-1)} \subset \mathbb{R}^l$. Let $x \in C^+\backslash C$, $x' \in C$ and $\varepsilon$ the straight line defined by $x$ and $x'$. Let also $H$ be the $(l-1)$-dimensional hyperplane that passes through $x$ and is perpendicular to $\varepsilon$. If $x$ and $x'$ are chosen such that $\{x'\} = \varepsilon \cap C$, then $\forall y \in H$, $\exists y' \in C$ such that the line $\varepsilon'$ defined by them is perpendicular to $H$ and $\{y'\} = \varepsilon' \cap C$.

Proof: First, due to the properties (b) and (c) of $C$, there exist $x$ and $x'$ such that for the line $\varepsilon$ they define it is $\{x'\} = \varepsilon \cap C$. Let $y \in H$ and let $\varepsilon'$ be the straight line passing through $y$ that is parallel to $\varepsilon$. Then, since $C$ is homeomorphic to $\mathbb{R}^{l-1}$ and thus simply connected, there exists at least one loop $J$ on $C$ surrounding $\varepsilon$ and $\varepsilon'$, which can be shrunk to a point $y' \in \varepsilon' \cap C$ by a suitable homotopic transformation. So $\varepsilon' \cap C \neq \emptyset$.

Suppose now that $\varepsilon' \cap C$ has more than one point. Let $Z$ be the two-dimensional plane defined by $\varepsilon'$ and $\varepsilon$. Since $\{x'\} \equiv \varepsilon \cap C$, any $x'' \in \varepsilon$ such that $x''$ lies between $x$ and $x'$ belongs to $C^+\backslash C$. Consider now a point $y'' \neq y'$, with $y'' \in \varepsilon' \cap C$. We examine the following two cases:

(i) $y''$ lies between $y$ and $y'$. Since $y \in C^+\backslash C$, $\exists \delta > 0$ such that $B^l(y, \delta) \subset C^+\backslash C$. In addition, since $y'' \in C \equiv Bdr(C^+)$ (the topological boundary of $C^+$), $\forall \delta' > 0$, $\exists z' \in C^-$ such that $\|z' - y''\| < \delta'$. By choosing $\delta'$ small enough such that $\|y'' - y\| > \delta'$, and by proper choice of $z'$, the straight line $h$ passing through $y'$ and $z'$ has distance from $y$, $d(y, h) = \min_{w' \in h} \|y - w'\|$, less than $\delta$. So, for any $z \in h \cap B^{(o)l}(y, \delta)$, the situation arises where $z, z', y' \in h$, $z \in C^+\backslash C$, $z' \in C^-$, $y' \in C$ and $z'$ lies between $y'$ and $z$. This contradicts the hypothesis that $C$ is convex. So any point of $\varepsilon'$ between $y$ and $y'$ belongs to $C^+$.

(ii) $y'$ lies between $y$ and $y''$. Then $\exists y''' \in \varepsilon' \cap C^-$, with $y'''$ lying between $y'$ and $y''$. Since $y''' \in C^-$, there exists $\delta > 0$ such that $\forall z \in Z \cap \bar{B}^l(y''', \delta)$, $z \in Z \cap C^-$. Let us choose $z'$ on $Z$ such that: (a) $z'$ lies between $\varepsilon$ and $\varepsilon'$ and (b) the straight line $\zeta'$ passing through $y''$ and $z'$ intersects $\varepsilon$ at a point $x'' \in \varepsilon \cap (C^+\backslash C)$. Then it is $x'' \in \varepsilon \cap Z \cap C^+$ and $z' \in Z \cap C^-$. Thus, the point $z' \in C^-$ lies between $x'' \in C^+$ and $y'' \in C^+$. This contradicts the fact that $C$ is convex. Therefore $\{y'\} \equiv \varepsilon' \cap C$. QED
Proposition 3: Let $C$, $C^+$, $C^-$ be defined as before and let $C$ be (a) convex, (b) unbounded and (c) homeomorphic to $\mathbb{R}^{(l-1)} \subset \mathbb{R}^l$. Then, there exists a hyperplane $H$ that cuts $C$ such that, for any $x' \in C \cap H$, the line $\varepsilon'$ that is perpendicular to $H$ and passes through $x'$ does not intersect $C$ at any other point.

Proof: Choose $x \in C^+\backslash C$ and $x' \in C$ such that for the line $\varepsilon$ defined by $x$ and $x'$ it is $\{x'\} = \varepsilon \cap C$. Then, consider the hyperplane $H$ that passes through $x$ and is perpendicular to $\varepsilon$. Employing Proposition 2, the claim follows. QED
3 Definition of the Problem

Let us consider $k$ oriented convex hypersurfaces $C_1, C_2, \dots, C_k$ in the $\mathbb{R}^l$ space. Hyperplanes may be viewed as a degenerate case of hypersurfaces. Consider the function

$$g : \mathbb{R}^l \to A, \qquad (1)$$

where $A = \{[0, 0, \dots, 0], [1, 0, \dots, 0], [0, 1, \dots, 0], \dots, [0, 0, \dots, 1]\}$, that partitions $\mathbb{R}^l$ into sub-regions using $C_1, C_2, \dots, C_k$ such that all points lying in the same sub-region $R_i$ are assigned to the same class. The set $A$ consists of $c$ $(c-1)$-dimensional vectors, i.e., $[0, 0, \dots, 0]$, $[1, 0, \dots, 0]$, $[0, 1, \dots, 0]$, ..., $[0, 0, \dots, 1]$, which indicate the classes $0, 1, 2, \dots, c-1$, respectively.

Let $S_p$ be the set consisting of the points of all sub-regions in $\mathbb{R}^l$ that are assigned to class $p$, $p = 0, \dots, c-1$. Note that $\bigcup_{p=0}^{c-1} S_p = \mathbb{R}^l$. Also, let $m_p$ be the number of sub-regions that constitute $S_p$, $p = 0, \dots, c-1$ (see fig. 1).

The problem is to construct a two or a three layer HLFNN, $N_h$, with $c-1$ output nodes that approximates $g$. That is, for almost all $x \in S_p$, $p = 0, 1, \dots, c-1$, the output of $N_h$ will be the $(c-1)$-dimensional vector $[0, 0, \dots, 1, \dots, 0]$ (with the 1 at the $p$-th position) for $p = 1, \dots, c-1$, and $[0, 0, \dots, 0]$ for $p = 0$.
4 Approximation of Convex Hypersurfaces Using Hyperplanes
Since HLFNNs consist of hard limiter nodes, each one being associated with a hyperplane, we have to approximate the hypersurfaces $C_1, C_2, \dots, C_k$ with hyperplanes. In the trivial case where a $C_i$ is a hyperplane, a single node, whose associated hyperplane is $C_i$, suffices to implement it. In the next two sections we discuss ways for the approximation of (a) closed and (b) open and unbounded convex hypersurfaces using hyperplanes.
4.1 Approximating Closed Convex Hypersurfaces Using Hyperplanes
A procedure for the approximation of a closed convex hypersurface $C$ is described next. We cut $C$ with a hyperplane $H$ (see fig. 2(a)) and we choose a set $B = \{x_1, \dots, x_l\}$ of $l$ points in $C \cap H$ such that each point is equidistant from the remaining points of $B$. Let $y$ be the center of the hypersphere defined by the points of $B$. Clearly, $y$ lies in the convex hull of $B$ and, since $C^+$ and $H$ are both convex, $y$ lies in $C^+ \cap H$. Let $\varepsilon$ be the line that passes through $y$ and is perpendicular to $H$. From Proposition 1, $\varepsilon$ intersects $C$ at two points $y'$, $y''$. Without loss of generality, assume that the distance of $y'$ from $H$ is less than or equal to the distance of $y''$ from $H$, i.e. $d(y', H) \leq d(y'', H)$.

Let us focus for a while on $y'$. Consider the sets $B_i = \{x_1, \dots, x_{i-1}, y', x_{i+1}, \dots, x_l\}$, $i = 1, \dots, l$, and let $H_1, H_2, \dots, H_l$ be the hyperplanes defined by these sets, such that the point $x_i$ lies on the positive side of $H_i$, $H_i^+$. In a similar manner, we can define the sets $B'_i = \{x_1, \dots, x_{i-1}, y'', x_{i+1}, \dots, x_l\}$, $i = 1, \dots, l$, and let $H'_1, H'_2, \dots, H'_l$ be the hyperplanes defined by the $B'_i$'s, such that the point $x_i$ lies on the positive side of $H'_i$, $H'^+_i$. Then, we drop $H$. $C$ is now approximated by $H_1, H_2, \dots, H_l, H'_1, H'_2, \dots, H'_l$ and the region $C^+$ is approximated by $H_1^+ \cap H_2^+ \cap \dots \cap H_l^+ \cap H'^+_1 \cap H'^+_2 \cap \dots \cap H'^+_l$ (see fig. 2(b)).

In order to achieve better approximation accuracy, the above described procedure may be applied to each one of $H_1, H_2, \dots, H_l, H'_1, H'_2, \dots, H'_l$, only for their $y'$, i.e. the intersection point of $\varepsilon$ and $C$ that lies closer to the $H_i$ or $H'_i$ at hand, $i = 1, \dots, l$. Clearly, each one of these hyperplanes will be replaced by $l$ new hyperplanes (see fig. 2(c)).

Taking the last observation into account, we notice that, at the first approximation level, $C$ is approximated by $2l$ hyperplanes (those that replace $H$), at the second approximation level, $2l^2$ hyperplanes are used and, in general, at the $n$-th approximation level, $2l^n$ hyperplanes are used. Clearly, as $n \to \infty$, the intersection of the positive half-spaces of these hyperplanes tends to the convex set defined by $C$. Finally, it should be noticed that it is not necessary to substitute all hyperplanes at a given approximation level.
Fig. 2. (a) The hyperplane $H$ cuts the hypersurface $C$. (b) The $H_1$, $H'_1$, $H_2$ and $H'_2$ hyperplanes produced from $H$, as described in the text. (c) The $H_{11}$, $H_{12}$ ($H_{21}$, $H_{22}$) hyperplanes defined by $H_1$ ($H_2$) and the $H'_{11}$, $H'_{12}$ ($H'_{21}$, $H'_{22}$) hyperplanes defined by $H'_1$ ($H'_2$), as described in the text.
4.2 Approximating Open Unbounded Convex Hypersurfaces Using Hyperplanes
A procedure for the approximation of an open unbounded convex hypersurface $C$ is described next. Let $H$ be a hyperplane that cuts $C$. Consider $l$ equidistant points $x_1, \dots, x_l \in C \cap H$ and let $\varepsilon_1, \dots, \varepsilon_l$ be the lines that are perpendicular to $H$ and pass through $x_1, \dots, x_l$, respectively, such that none of them intersects $C$ at any other point (see fig. 3(a)). If the selected hyperplane does not meet the above requirements, we discard it and choose one that does (such a hyperplane $H$ exists due to Proposition 3).

Let $H_0$ be the hyperplane that (a) is parallel to $H$, (b) is tangent to $C$ at a point $z$ and (c) leaves $C$ on its negative side (see fig. 3(a)). We next consider the sequence of hyperplanes $H_1, H_2, \dots, H_q$ that (a) are parallel to $H_0$, (b) have the same polarity as $H_0$ and (c) are such that the distance between two successive hyperplanes $H_{i-1}$ and $H_i$ is equal to $\delta$. Here $\delta$ is a user-defined parameter that affects the approximation accuracy of $C$ (see fig. 3(b) and fig. 4(a)).

Consider now the points $w_{i1}, w_{i2}, \dots, w_{il} \in C \cap H_i$, $i = 1, \dots, q$, that are equidistant to each other, such that $z, w_{1j}, \dots, w_{ij}, \dots$ lie on the same hyperplane, $j = 1, 2, \dots, l$, $i = 1, 2, \dots, q$ (see fig. 4(a)). Then, for each pair of consecutive hyperplanes $H_i$ and $H_{i+1}$ ($i \geq 1$), we consider the sets $D_i^k = \{w_{i1}w_{i+1,1}, \dots, w_{i,k-1}w_{i+1,k-1}, w_{i,k+1}w_{i+1,k+1}, \dots, w_{i,l}w_{i+1,l}\}$, $k = 1, \dots, l$, where $w_{ij}w_{i+1,j}$ denotes the line defined by $w_{ij}$ and $w_{i+1,j}$. Let $H_{ij}^k$ be the hyperplane defined by $D_i^k$. For $i = 0$, we define the hyperplanes specified by the sets $D_0^k = \{zw_{11}, \dots, zw_{1,k-1}, zw_{1,k+1}, \dots, zw_{1l}\}$, $k = 1, \dots, l$, where $zw_{1j}$ denotes the line defined by $z$ and $w_{1j}$. As long as we have defined the $H_{ij}^k$'s, we drop the $H_i$'s.
Fig. 3. (a) The definition of $H_0$. (b) Definition of $H_1, H_2, H_3, \dots$, that are parallel to $H_0$. Also, the points where each $H_i$ intersects $C$ are shown. (c) The hyperplanes $H_{01}^1$, $H_{01}^2$, $H_{12}^1$, $H_{12}^2$, $H_{23}^1$, $H_{23}^2$ that are used for the approximation of $C$ are depicted.

For each one of the $H_{ij}^k$ hyperplanes, $i = 0, 1, \dots, q$, $k = 1, 2, \dots, l$, $j = 1, \dots, l-1$, we can apply the procedure described for the closed hypersurfaces only for the point $y'$, the intersection point of $\varepsilon$ and $C$ that lies closer to the $H_{ij}^k$ at hand, in order to improve the approximation accuracy of $C$. Finally, notice that the sequence of $H_i$'s is finite. Thus the number of hyperplanes used for the approximation is also finite (see figs. 3 and 4(a)).
5 Approximation of Subregions by Unions of Cells
Let $H_1, H_2, \dots, H_p$ be the hyperplanes defined by the above described approximation procedures when applied to the hypersurfaces $C_1, C_2, \dots, C_k$. The idea is to approximate each sub-region $R_i$ by the union of the cells, defined by $H_1, \dots, H_p$, whose major part lies in $R_i$, since HLFNNs can handle only regions bordered by hyperplanes, i.e. cells. These cells are assigned to the class where the corresponding sub-regions belong.

We remind here that we have assumed that the convex sets defined by the hypersurfaces $C_i$ are the $C_i^+$. Thus, regions of higher rank lie at the intersection of more convex sets $C^+$ and, as a consequence, they are in general of smaller volume than those having lower rank. Thus, since it is desirable to represent all sub-regions formed by the given hypersurfaces, we start the assignment of cells $Q_i$ from the sub-regions having higher rank. Of course, a prerequisite for the approximation of all sub-regions is that each one of them contains at least one cell.
The following procedure determines explicitly which cells are used for the representation of each sub-region. Let $n_{max}$ be the maximum among the ranks of all sub-regions.

– For $n = n_{max}$ down to 0 with step $-1$, do
  • Determine all sub-regions $R_i$ with rank $n$.
  • Represent each $R_i$ by the union of all the cells that lie entirely in $R_i \cup (\bigcup_{rank(R_j) > n} R_j)$ and are not used for the representation of an $R_j$ with $rank(R_j) > n$.
– End

Note that all the cells used to represent a sub-region $R_i$ are assigned to the class where $R_i$ is assigned (see fig. 4(b),(c) for an example).
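The assignment loop above can be expressed as a short Python sketch; the geometric predicate lies_entirely_in and the data structures are assumptions introduced for illustration only, not part of the original paper.

def assign_cells(cells, subregions, rank, lies_entirely_in):
    """Assign cells to sub-regions, starting from the highest rank (illustrative sketch).
    rank is a dict mapping a sub-region to its rank; lies_entirely_in(cell, regions)
    is an assumed geometric predicate testing containment in the union of 'regions'."""
    used = set()
    assignment = {r: [] for r in subregions}
    for n in sorted({rank[r] for r in subregions}, reverse=True):
        higher = [r for r in subregions if rank[r] > n]
        for r in [s for s in subregions if rank[s] == n]:
            target = [r] + higher
            for i, cell in enumerate(cells):
                if i not in used and lies_entirely_in(cell, target):
                    assignment[r].append(i)   # cell i inherits the class of sub-region r
                    used.add(i)
    return assignment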
Fig. 4. (a) Approximation of a 3-dimensional open unbounded convex surface using hyperplanes. (b) Partition of $\mathbb{R}^2$ into four sub-regions, $R_1$, $R_2$, $R_3$ and $R_4$ (the corresponding representative strings are also given). (c) Representation of the sub-regions using cells. Specifically, $R_1$ is represented by $Q_1$, $Q_2$ and $Q_3$, $R_2$ is represented by $Q_5$, $Q_6$ and $Q_7$, $R_3$ is represented by $Q_4$ and, finally, $R_4$ is represented by all the remaining cells formed by the above hyperplanes (they are not explicitly named).
After all the above, it is clear that $g$, which partitions $\mathbb{R}^l$ using the hypersurfaces $C_1, C_2, \dots, C_k$, is approximated by $g' : \mathbb{R}^l \to A$, which (a) partitions $\mathbb{R}^l$ into cells using the hyperplanes $H_1, H_2, \dots, H_p$ that have resulted from the approximation of the hypersurfaces $C_1, \dots, C_k$, and (b) assigns each one of the resulting cells to the class where the corresponding sub-region belongs. $g'$ can now be implemented by a three layer HLFNN, where the first layer nodes correspond to the hyperplanes $H_1, \dots, H_p$, the second layer nodes correspond to the cells defined by the previous hyperplanes and the third layer nodes correspond to the classes where the above cells are assigned (see e.g. [8], [12]; in some cases a two-layer HLFNN suffices, see e.g. [12]).
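The standard three-layer construction referred to above (cf. [8], [12]) can be sketched in Python as follows; the data structures (lists of hyperplanes, cells described by required sides, and cell indices per class) are illustrative assumptions, not the authors' implementation.

import numpy as np

def step(v):
    """Hard limiter (step) activation."""
    return 1 if v >= 0 else 0

def hlfnn_output(x, hyperplanes, cells, classes):
    """Evaluate a three-layer hard limiter FNN.
    hyperplanes: list of (w, w0) pairs; cells: list of dicts mapping a hyperplane
    index to the required bit (side); classes: list of cell-index lists, one per class 1..c-1."""
    # First layer: one hard limiter node per hyperplane
    h = [step(np.dot(w, x) + w0) for w, w0 in hyperplanes]
    # Second layer: a cell node fires iff x lies on the required side of every bordering hyperplane (AND)
    q = [step(sum(1 for i, bit in cell.items() if h[i] == bit) - len(cell)) for cell in cells]
    # Third layer: a class node fires iff at least one of its assigned cells fires (OR)
    return [step(sum(q[i] for i in idx) - 1) for idx in classes]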
6 Concluding Remarks
In this paper it is shown that hard limiter feedforward neural networks are capable of approximating with arbitrary accuracy any decision region (partition function) that partitions the $\mathbb{R}^l$ space using (a) closed and/or (b) open and unbounded convex hypersurfaces. The main idea is to approximate each hypersurface with hyperplanes and then to represent each sub-region $R_i$ with cells $Q_j$ (bordered by the above hyperplanes) whose major part lies in $R_i$. Methods for the approximation of closed as well as open and unbounded convex hypersurfaces with hyperplanes are discussed. Also, a procedure is given that assigns cells to sub-regions. As soon as the above steps have been carried out, a hard limiter feedforward neural network, with three layers at the most, is constructed based on the hyperplanes used for the approximation of the hypersurfaces as well as on the cells formed by them (see e.g. [8], [12]).
References

1. Barron, A.: Universal approximation bounds for superposition of a sigmoidal function. IEEE Transactions on Information Theory 3, 930–945 (1993)
2. Blum, E., Li, K.: Approximation theory and feedforward networks. Neural Networks 4, 511–515 (1991)
3. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2, 304–314 (1989)
4. Geva, S., Sitte, J.: A constructive method for multivariate function approximation by multilayer perceptrons. IEEE Transactions on Neural Networks 3(4), 621–623 (1992)
5. Gibson, G.J., Cowan, C.F.N.: On the decision regions of multilayer perceptrons. Proceedings of the IEEE 78(10), 1590–1594 (1990)
6. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2(5), 359–366 (1989)
7. Lelek, A.: Introduction to set theory and topology. Trohalia (translated in Greek) (1992)
8. Lippmann, R.P.: An introduction to computing with neural networks. IEEE ASSP Magazine 4(2), 4–22 (1987)
9. Sandberg, I.W.: General structures for classification. IEEE Transactions on Circuits and Systems I 41(5), 372–376 (1994)
10. Sarselli, F., Tsoi, A.C.: Universal approximation using feedforward neural networks: A survey of some existing methods and some new results. Neural Networks 11(1), 15–37 (1998)
11. Selmic, R.R., Lewis, F.L.: Neural network approximation of piecewise continuous functions: application to friction compensation. IEEE Transactions on Neural Networks 13(3), 745–751 (2002)
12. Theodoridis, S., Koutroumbas, K.: Pattern Recognition, 4th edn. Academic Press, London (2009)
EMERALD: A Multi-Agent System for Knowledge-Based Reasoning Interoperability in the Semantic Web

Kalliopi Kravari, Efstratios Kontopoulos, and Nick Bassiliades

Dept. of Informatics, Aristotle University of Thessaloniki, GR-54124 Thessaloniki, Greece
{kkravari,skontopo,nbassili}@csd.auth.gr
Abstract. The Semantic Web aims at augmenting the WWW with meaning, assisting people and machines in comprehending Web content and better satisfying their requests. Intelligent agents are considered to be greatly favored by Semantic Web technologies, because of the interoperability the latter will achieve. One of the main problems in agent interoperation is the great variety in reasoning formalisms, as agents do not necessarily share a common rule or logic formalism. This paper reports on the implementation of EMERALD, a knowledge-based framework for interoperating intelligent agents in the Semantic Web. More specifically, a multi-agent system was developed on top of JADE, featuring trusted, third party reasoning services, a reusable agent prototype for knowledge-customizable agent behavior, as well as a reputation mechanism for ensuring trust in the framework. Finally, a use case scenario is presented that illustrates the viability of the proposed framework.

Keywords: semantic web, intelligent agents, multi-agent system, reasoning.
1 Introduction

The Semantic Web (SW) is a rapidly evolving extension of the WWW that derives from Sir Tim Berners-Lee's vision of a universal medium for data, information and knowledge exchange. The SW aims at augmenting Web content with meaning (i.e. semantics), making it possible for people and machines to comprehend the available information and better satisfy their requests. Until now, the fundamental SW technologies (content representation, ontologies) have been established and researchers are currently focusing their attention on logic and proofs.

Intelligent agents (IAs) are considered the most prominent means towards realizing the SW vision [1]. Via the use of IAs, programs are extended to perform tasks more efficiently and with less human intervention. The gradual integration of multi-agent systems (MAS) with SW technology will affect the use of the Web in the future; the next generation of the Web will consist of groups of intercommunicating agents traversing it and performing complex actions on behalf of their users. Intelligent agents are considered to be greatly favored by SW technologies, because of the interoperability the latter will achieve.

Nevertheless, a critical issue is now raised: agent interoperation is overwhelmed by the variety of representation and reasoning technologies. On the other hand, agents do not necessarily share a common rule or logic formalism. In fact, it will often
be the case that two or more intercommunicating agents will 'understand' different (rule) languages. On the other hand, it would be unrealistic to attempt imposing specific logic formalisms in a rapidly changing world like the Web. We propose a novel, more viable approach, which involves trusted, third-party reasoning services that will infer knowledge from an agent's rule base and verify the results.

More specifically, this paper reports on the implementation of EMERALD, a framework for interoperating, knowledge-based IAs in the SW. A JADE MAS was extended with reasoning capabilities, provided as agents. Furthermore, the framework features a generic, reusable agent prototype for knowledge-customizable agents (KC-Agents), consisting of an agent model (KC Model), a yellow pages service (Advanced Yellow Pages Service) and several external Java methods (Basic Java Library). Also, since the notion of trust is vital here, a reputation mechanism was integrated in the framework. Finally, the paper presents a use case scenario that illustrates the usability of the framework and the integration of all the technologies involved.

The rest of the paper is structured as follows: Section 2 gives a brief overview of EMERALD, followed by descriptions of its various components. More specifically, the featured reasoning services are presented, as well as the knowledge-customizable agent prototype (KC-Agents). Since trust is essential in a framework like EMERALD, section 5 presents the deployed reputation mechanism. Finally, section 6 illustrates an apartment renting use case scenario, which better displays the potential of the framework. The paper is concluded with references to related work and conclusions, as well as directions for future improvements.
2 Framework Overview

As mentioned in the introduction, EMERALD is a common framework for interoperating knowledge-based intelligent agents in the SW. The motivation behind our work was to address the weaknesses in agent intercommunication outlined above and to deploy trusted, third-party reasoning services instead. EMERALD is developed on top of JADE [2], the popular MAS Java framework.

Fig. 1 illustrates a general overview of EMERALD: each human user controls a single all-around agent. Agents can intercommunicate, but do not necessarily share a common rule/logic formalism; therefore, it is vital for them to find a way to exchange their position arguments seamlessly. Our approach does not rely on translation between rule formalisms but on exchanging the rule base results. The receiving agent uses an external reasoning service to grasp the semantics of the rulebase, i.e. the set of rule base conclusions.

In EMERALD, reasoning services are “wrapped” by an agent interface, called the Reasoner, allowing other IAs to contact them via ACL messages. Reasoners are, in essence, agents offering reasoning services to the rest of the agent community. Currently, the framework features a few but widely diverse Reasoners (see section 3), but the available array can be easily extended. Moreover, agents are knowledge-customizable, meaning that they are not confined to having their logics and strategies/policies hard-wired. Instead, they can be either generic or customizable; each agent contains a rule base that describes its knowledge of the environment, its behaviour pattern as well as its strategy/policy. By altering the
rule base, the agent's knowledge and/or behaviour will instantly be modified accordingly. Currently, EMERALD provides a knowledge-based agent module based on the Jess language, but our goal is to provide a range of modules based on a variety of rule languages (e.g. Prolog/Prova, RuleML). The use case scenario presented later in this work (section 6) demonstrates this feature. Overall, the goal is to apply as many standards as possible (ACL, RuleML, RDF/S, OWL), in order to encourage the application and development of the framework. In practice, the SW serves as the framework infrastructure.

Fig. 1. Overview of the proposed framework
3 Reasoning Services

EMERALD integrates a number of reasoning engines that use various logics. The Reasoner, a reasoning service provided by a JADE IA, can call an associated reasoning engine in order to perform inference and provide results (Fig. 2). The procedure is straightforward: each Reasoner stands by for new requests (ACL messages with a “REQUEST” communication act) and as soon as it receives a valid request, it launches the associated reasoning engine and returns the results (ACL message with an “INFORM” communication act). Consequently, although Reasoners seem to be fully autonomous agents, they actually behave more like web services.
Fig. 2. Input – Output of a Reasoner agent
EMERALD currently implements a number of Reasoners that offer reasoning services in two major reasoning formalisms: deductive and defeasible reasoning. Deductive reasoning is based on classical logic arguments, where conclusions are proved to be valid when the premises of the argument (rule condition) are true. EMERALD implements two such reasoners, which are actually based on the logic programming paradigm, namely R-DEVICE [3] and Prova [4]. R-DEVICE is a deductive object-oriented knowledge base system for querying and reasoning about
RDF metadata. R-DEVICE transforms RDF triples into objects and uses a deductive rule language for querying and reasoning about them, in a forward-chaining Datalog fashion. Prova is a rule engine that supports distributed inference services, rule interchange and rule-based decision logic and combines declarative rules, ontologies and inference with dynamic object-oriented programming.

Defeasible reasoning [5] constitutes a simple rule-based approach for efficient reasoning with incomplete and inconsistent information. When compared to mainstream non-monotonic reasoning, the main advantages of defeasible reasoning are enhanced representational capabilities and low computational complexity. Currently, EMERALD supports two Reasoners that use defeasible reasoning, the DR-Reasoner and the SPINdle-Reasoner, based on DR-DEVICE [6] and SPINdle [7], respectively. The DR-Reasoner was presented in [8]; DR-DEVICE accepts as input the address of a defeasible logic rule base, written in an OORuleML-like syntax. The rule base contains only rules; the facts for the rule program are contained in RDF documents, whose addresses are declared in the rule base. Finally, conclusions are exported as an RDF document. On the other hand, the SPINdle-Reasoner supports reasoning on both standard and modal defeasible logic. It accepts defeasible logic theories represented using XML or plain text (with pre-defined syntax), processes them and finally exports the results via XML.
4 KC-Agents: The Prototype

EMERALD provides a generic, reusable agent prototype for knowledge-customizable agents (KC-Agents) that offers certain advantages concerning, among others, modularity, reusability and interoperability of behavior between agents. The current prototype consists of an agent model (KC Model), a yellow pages service (Advanced Yellow Pages Service - AYPS) and some external Java methods (Basic Java Library - BJL). Fig. 3 displays the above prototype.
Fig. 3. The KC-Agents Prototype
4.1 The KC Model

The KC Model is a model for customizable agents equipped with a rule engine and a knowledge base (KB) that contains environment knowledge (in the form of facts), behaviour patterns (in the form of rules) and strategies. By altering the KB, both the agent's knowledge and behaviour are modified accordingly. The KC Model's abstract specification, presented in [9], contains facts and rules. A short description is presented below for better comprehension.
The generic rule format is: result ← rule(preconditions). The agent's internal knowledge is a set of facts F ≡ Fu ∪ Fe, where Fu ≡ {fu1, fu2, …, fuk} are user-defined facts and Fe ≡ {fe1, fe2, …, fem} are environment-asserted facts. The agent's behaviour is represented as a set of potential actions–rules P ≡ A ∪ S, where A ≡ {a | fe ← a(fu1, fu2, …, fun) ∧ {fu1, fu2, ..., fun} ⊆ Fu ∧ fe ∈ Fe} are the rules that derive new facts by inserting them into the KB and S ≡ C ∪ J are the rules that lead to the execution of a special action. Note that special actions can either refer to agent communication, C ≡ {c | ACLMessage ← c(f1, f2, …, fp) ∧ {f1, f2, ..., fp} ⊆ F}, or to Java calls, J ≡ {j | JavaMethod ← j(f1, f2, …, fq) ∧ {f1, f2, ..., fq} ⊆ F}.

The overall goal for EMERALD is to provide a variety of KC modules (implementations of the KC Model) based on several rule engines and languages. Currently, EMERALD provides a KCj module, based on Jess [10]. The KCj module is equipped with a Jess rule engine and its knowledge base contains knowledge, in the form of Jess facts, and behaviour patterns, in the form of Jess production rules.

4.2 Advanced Yellow Pages Service (AYPS)

Moreover, an advanced customized procedure for the yellow pages service is provided, both for registered and required services. In JADE, the traditional yellow pages service is provided by the Directory Facilitator (DF) agent. This procedure allows an agent to get the proper results from the DF agent, but, since these results have the form of ACL message content, the developer has to manually convert them in order to use them as facts. With our advanced yellow pages service (AYPS) model, the service is fully automated: the services that each agent provides and/or requests are declared in a separate file, accessible only by the specific agent. As soon as the agent wishes to advertise one or more services, AYPS is activated and places them into a repository. On the other hand, if the agent requires some specific service, AYPS traverses the repository and returns the proper providers. Concerning the KCj module, AYPS returns the providers as Jess facts with a designated format: (service_type (provider provider_name)). AYPS also implements a reputation mechanism (section 5), available to each EMERALD agent at any time. Thus, AYPS is able to provide not only the name but also the reputation value of a provider.

4.3 Basic Java Library (BJL)

In order to provide a standard communication interface between the KB and the JADE agent, we have developed a library of Java methods (Basic Java Library - BJL) that can be evoked from a KC-agent's actions (in the current implementation, Jess production rules). A generic syntax specification is:

(defrule call_method
  ;;; preconditions
  =>
  (bind ?t (new Basic))
  (bind ?str (?t method's_name argument(s))))

where ?t is bound to a new instance of BJL and ?str is the returned value. For instance, method createRulebase creates a backup of a rule base, determining the RDF data input, and method extractTriples extracts triples from an RDF file and stores them as facts. In order to extract 'knowledge' from these facts, suitable rules can be executed. A sample rule that finds the attribute A of B, in the context of the KCj module, is the following:

(defrule find_A
  (triple (subject ?x) (predicate rdf:type) (object dr-device:xx))
  (triple (subject ?x) (predicate dr-device:B) (object ?A))
  =>
  (assert (A ?A)))
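To make the abstract specification above more concrete, the following short Python sketch (our own illustration with hypothetical names, not part of the EMERALD implementation) models a KC-agent's knowledge base as the union F = Fu ∪ Fe and fires rules whose preconditions hold, distinguishing fact-deriving rules from communication rules.

# Minimal sketch of the abstract KC model (hypothetical names, not EMERALD code).
# A rule maps a tuple of precondition facts to an action that either asserts a
# new environment fact or produces an ACL-style message.

class KCAgent:
    def __init__(self, user_facts, rules):
        self.Fu = set(user_facts)      # user-defined facts
        self.Fe = set()                # environment-asserted facts (e.g., via AYPS)
        self.rules = rules             # behaviour patterns

    @property
    def F(self):
        return self.Fu | self.Fe       # F = Fu union Fe

    def run(self):
        for preconditions, action in self.rules:
            if set(preconditions) <= self.F:          # all preconditions hold
                result = action(self.F)
                if isinstance(result, str) and result.startswith("ACL:"):
                    print("send", result)             # communication rule (C)
                elif result is not None:
                    self.Fe.add(result)               # fact-deriving rule (A)

# Example: a rule that derives a new fact once 'ruleml_path' is known.
agent = KCAgent(
    user_facts={"ruleml_path"},
    rules=[(("ruleml_path",), lambda F: "rule_base_content")],
)
agent.run()
print(agent.F)   # now contains 'rule_base_content'

In the same spirit, Java calls (the set J) would be dispatched by the action functions; the actual KCj module delegates this to Jess and the BJL instead.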
5 Trust

Trust has been recognized as a key issue in SW MAS, where agents have to interact under uncertain and risky conditions. Thus, a number of researchers have proposed, from different perspectives, models and metrics of trust, some involving past experience or using only a single agent's previous experience. Five such metrics are described in [11]; among them, Sporas seems to be the most widely used metric, while CR (Certified Reputation) is one of the most recently proposed methodologies. The overall goal for EMERALD is to adopt a variety of trust models, both proposed in the literature and original. Currently, EMERALD adopts two reputation mechanisms, a decentralized and a centralized one. The decentralized mechanism is a combination of Sporas and CR and was presented in [8]. In the centralized approach, presented here, AYPS keeps the references given by agents interacting with Reasoners or other agents in EMERALD. Each reference has the form Ref_i = (a, b, cr, cm, flx, rs), where a is the trustee, b is the trustor and cr (Correctness), cm (Completeness), flx (Flexibility) and rs (Response time) are the evaluation criteria. Ratings vary from -1 (terrible) to 1 (perfect), r ∈ [-1, 1], while newcomers start with a reputation equal to 0 (neutral). The final reputation value R_b is based on the weighted sum of the relevant references stored in AYPS and is calculated according to the formula R_b = Σ (w1·cr + w2·cm + w3·flx + w4·rs), where the sum ranges over the relevant references and w1 + w2 + w3 + w4 = 1. AYPS supports two options for R_b: a default one, where the weights are equal, namely w_k = 0.25 for k = 1, …, 4, and a user-defined one, where the weights vary from 0 to 1 depending on user priorities. The simple evaluation formula of this approach, compared to the decentralized one, saves time, as it needs less evaluation and calculation effort. Moreover, it provides more reliable results (R_b), since the centralized storage in AYPS overcomes the difficulty of locating references in a distributed mechanism. Agents can use either one of the above mechanisms or even both, complementarily,
namely, they can use the centralized mechanism provided by AYPS in order to find the most trusted service provider and/or the decentralized approach for the rest of the EMERALD agents.
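As an illustration of the centralized mechanism, the short Python sketch below (our own, with hypothetical names) computes R_b from the references stored for a provider. It assumes the weighted sums are averaged over the stored references and uses the default equal weights w_k = 0.25.

# Sketch of the centralized reputation computed by AYPS (illustrative only).
# Each reference is (trustee, trustor, cr, cm, flx, rs) with ratings in [-1, 1];
# newcomers without references start at 0 (neutral).

def reputation(references, trustee, weights=(0.25, 0.25, 0.25, 0.25)):
    w1, w2, w3, w4 = weights
    assert abs(w1 + w2 + w3 + w4 - 1.0) < 1e-9          # w1 + w2 + w3 + w4 = 1
    relevant = [r for r in references if r[0] == trustee]
    if not relevant:
        return 0.0                                      # neutral reputation
    total = sum(w1*cr + w2*cm + w3*flx + w4*rs
                for (_, _, cr, cm, flx, rs) in relevant)
    return total / len(relevant)

refs = [("DR-Reasoner", "customer", 0.9, 0.8, 0.6, 0.7),
        ("DR-Reasoner", "broker",   1.0, 0.9, 0.5, 0.8)]
print(reputation(refs, "DR-Reasoner"))                  # 0.775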
6 Use Case

Reasoning is widely used in various applications. This section presents an apartment-renting use case that applies both deductive and defeasible logic. The scenario aims at demonstrating the overall functionality of the framework and, more specifically, the usability of the Reasoners, the modularity of the KC-Agents prototype and its ability to adapt easily to various applications. The scenario, adopted from [12], involves two independent parties, represented by IAs, and two of the four Reasoners provided in EMERALD. The first party is the customer, a potential renter who uses defeasible logic and wishes to rent an apartment based on his requirements (e.g. size, location) and personal preferences. The other party is the broker, who uses deductive logic and possesses a database of available apartments. His role is to match the customer's requirements with the apartment specifications and eventually propose suitable apartments to the potential renter. The R-Reasoner and the DR-Reasoner are the two Reasoners involved, and the scenario is carried out in ten distinct steps (shown in Fig. 4). A similar but more simplistic brokering scenario was presented in [8], where only one type of logic (defeasible) was applied and the broker did not possess any private interaction strategy expressed in any kind of logic, but merely mediated between the customer and the Reasoner.
Initially, the customer finds a broker by asking the AYPS (step 1). The AYPS returns a number of potential brokers accompanied by their reputation ratings (step 2). The customer selects the most trusted broker and sends his requirements to him, in order to get back all the available apartments with the proper specifications (step 3). The broker has a list of all available apartments, which cannot be communicated to the customer because it belongs to the broker's private assets. However, since the broker cannot process the customer's requirements, which are expressed in defeasible logic, he finds a defeasible logic reasoner (DR-Reasoner) (step 4) by using the AYPS (this lookup is not shown). The DR-Reasoner returns the apartments that fulfill all requirements (step 5); however, the broker checks the results in order to exclude unavailable apartments or apartments reserved for a reason private to the broker. Thus, the broker agent requests that the R-Reasoner process the results with the broker's own private interaction strategy, expressed in a deductive logic rulebase (step 6). When he receives the remaining apartments (step 7), he sends them to the customer's agent (step 8). Eventually, the customer receives the appropriate list and has to decide which apartment he prefers. However, his agent does not want to send the customer's preferences to the broker, in order not to be exploited; thus, the customer's agent selects his most preferred apartment by sending his preferences, as a defeasible logic rulebase, along with the list of acceptable apartments, to the DR-Reasoner (step 9). The Reasoner replies and proposes the best apartment to rent (step 10). The apartment selection procedure ends here. Now the customer has to negotiate with the owner for the renting contract. This process is carried out in two basic steps, as shown in Fig. 5: first
Fig. 4. Apartment renting scenario steps
Fig. 5. Negotiation scenario steps
the customer's agent has to find out the apartment owner's name and then negotiate with him for the rent. The customer sends a REQUEST message to the broker containing the chosen apartment and waits for its owner's name. The broker sends back his reply via an INFORM message. Afterwards, the customer starts a negotiation process with the owner, negotiating, among other things, the price and the rental duration.
Following the generic, abstract specification for agents, the customer agent's description contains a fact, ruleml_path, which is part of its internal knowledge and represents the rulebase URL. Moreover, due to the dynamic environment (AYPS is constantly updating the environment), a new fact with the agent name (agent_name) is added to the working memory. Agent behaviour is represented by rules; one of these is the 'read' rule that calls the BJL's fileToString method. It has only a single precondition (actually a fact), the ruleml_path, as shown below.
F_u^cust ≡ {ruleml_path}, F_e^cust ≡ {agent_name}
J^cust ≡ {rule_base_content ← (bind ((new Basic) fileToString ruleml_path))}

Similarly, the broker agent's description contains facts and rules: fact url represents (part of) its internal knowledge and stands for the URL of the RDF document containing all available apartments, while reasoner_name (the DR-Reasoner's name) is added by the environment due to AYPS, and the rules "request" and "read" (BJL's fileToString) comprise part of the agent's behaviour.
F_u^brok ≡ {url}, F_e^brok ≡ {reasoner_name}
C^brok ≡ {(ACLMessage (communicative-act REQUEST) (sender Broker) (receiver reasoner_name) (content "request")) ← request(reasoner_name)}
J^brok ≡ {rule_base_content ← (bind ((new Basic) fileToString url))}
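The two agent descriptions can be read as concrete instantiations of the abstract KC model; the following self-contained Python fragment (illustrative only, all names hypothetical) writes them down as plain data structures.

# Illustrative encoding of the use-case agent specifications (hypothetical names).
customer = {
    "Fu": {"ruleml_path"},                 # user-defined facts
    "Fe": {"agent_name"},                  # asserted by the environment (AYPS)
    "J":  {"rule_base_content": ("fileToString", "ruleml_path")},   # BJL call
}
broker = {
    "Fu": {"url"},
    "Fe": {"reasoner_name"},
    "C":  {"request": ("REQUEST", "Broker", "reasoner_name")},      # ACL message
    "J":  {"rule_base_content": ("fileToString", "url")},
}
# A rule may fire only when its precondition facts are present:
assert "ruleml_path" in customer["Fu"] and "reasoner_name" in broker["Fe"]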
7 Related Work

DR-BROKERING, a system for brokering and matchmaking, is presented in [13]. The system applies RDF for representing offerings and a deductive logical language for expressing requirements and preferences. Three agent types are featured (Buyer, Seller and Broker), and a DF agent plays the role of the yellow pages service. Also, DR-NEGOTIATE [14], another system by the same authors, implements a negotiation scenario using JADE and DR-DEVICE. Our approach applies the same technologies and identifies similar roles, such as Broker and Buyer. In contrast, we provide a number of independent reasoning services, offering both deductive and defeasible logic. Moreover, our approach takes trust issues into account, providing two reputation approaches in order to guarantee the safety of interactions.
The Rule Responder [15] project builds a service-oriented methodology and a rule-based middleware for interchanging rules in virtual organizations, as well as for negotiating about their meaning. Rule Responder demonstrates the interoperation of distributed platform-specific rule execution environments, with Reaction RuleML as a platform-independent rule interchange format. We share a similar view of reasoning services for agents and of the use of RuleML; also, both approaches allow utilizing a variety of rule engines. However, contrary to Rule Responder, EMERALD is based on the FIPA specifications, achieving a fully FIPA-compliant model, and deals with trust issues. Finally, and most importantly, our framework does not rely on a single rule interchange language, but allows each agent to follow its own rule formalism and still be able to exchange its rule base with other agents, which use trusted third-party reasoning services to infer knowledge based on the received ruleset.
8 Conclusions and Future Work

This paper argued that agents are vital in realizing the Semantic Web vision and presented EMERALD, a knowledge-based multi-agent framework designed for the SW that provides reasoning interoperability. EMERALD, developed on top of JADE, is fully FIPA-compliant and features trusted, third-party reasoning services and a generic, reusable agent prototype for knowledge-customizable agents, consisting of an agent model, a yellow pages service and several external Java methods. Also, since the notion of trust is vital here, a reputation mechanism was integrated for ensuring trust in the framework. Finally, the paper presented a use case scenario that illustrates the usability of the framework and the integration of all the technologies involved.
As for future directions, it would be interesting to verify our model's capability to adapt to an even wider variety of Reasoners and KC modules, in order to form a generic environment for cooperating agents in the SW. Another direction would be towards developing and integrating more trust mechanisms. As pointed out, trust is essential, since each agent will have to make subjective trust judgements about the services provided by other agents. Considering the parameter of trust would certainly lead to more realistic and efficient applications. A final direction could be towards equipping the KC-Agents prototype with user-friendly GUI editors for each KC module, such as the KCj module.
References
[1] Hendler, J.: Agents and the Semantic Web. IEEE Intelligent Systems 16(2), 30–37 (2001)
[2] JADE, http://jade.tilab.com/
[3] Bassiliades, N., Vlahavas, I.: R-DEVICE: An Object-Oriented Knowledge Base System for RDF Metadata. Int. J. on Semantic Web and Information Systems 2(2), 24–90 (2006)
[4] Prova, http://www.prova.ws
[5] Nute, D.: Defeasible Reasoning. In: 20th Int. Conference on Systems Science, pp. 470–477. IEEE Press, Los Alamitos (1987)
[6] Bassiliades, N., Antoniou, G., Vlahavas, I.: A Defeasible Logic Reasoner for the Semantic Web. Int. J. on Semantic Web and Information Systems 2(1), 1–41 (2006)
[7] Lam, H., Governatori, G.: The Making of SPINdle. In: RuleML 2009: Int. Symp. on Rule Interchange and Applications, pp. 315–322 (2009)
[8] Kravari, K., Kontopoulos, E., Bassiliades, N.: A Trusted Defeasible Reasoning Service for Brokering Agents in the Semantic Web. In: 3rd Int. Symp. on Intelligent Distributed Computing (IDC 2009), Cyprus, vol. 237, pp. 243–248. Springer, Heidelberg (2009)
[9] Kravari, K., Kontopoulos, E., Bassiliades, N.: Towards a Knowledge-based Framework for Agents Interacting in the Semantic Web. In: IEEE/WIC/ACM Int. Conf. on Intelligent Agent Technology (IAT 2009), Italy, vol. 2, pp. 482–485 (2009)
[10] JESS, http://www.jessrules.com/
[11] Macarthur, K.: Trust and Reputation in Multi-Agent Systems. AAMAS, Portugal (2008)
[12] Antoniou, G., van Harmelen, F.: A Semantic Web Primer. MIT Press, Cambridge (2004)
[13] Antoniou, G., Skylogiannis, T., Bikakis, A., Bassiliades, N.: DR-BROKERING – A Defeasible Logic-Based System for Semantic Brokering. In: IEEE Int. Conf. on E-Technology, E-Commerce and E-Service, pp. 414–417 (2005)
[14] Skylogiannis, T., Antoniou, G., Bassiliades, N., Governatori, G., Bikakis, A.: DR-NEGOTIATE – A System for Automated Agent Negotiation with Defeasible Logic-based Strategies. Data & Knowledge Engineering 63(2), 362–380 (2007)
[15] Paschke, A., Boley, H., Kozlenkov, A., Craig, B.: Rule Responder: RuleML-based Agents for Distributed Collaboration on the Pragmatic Web. In: 2nd Int. Conf. on Pragmatic Web, vol. 280, pp. 17–28. ACM, New York (2007)
An Extension of the Aspect PLSA Model to Active and Semi-Supervised Learning for Text Classification

Anastasia Krithara¹, Massih-Reza Amini², Cyril Goutte², and Jean-Michel Renders³

¹ National Center for Scientific Research (NCSR) 'Demokritos', Athens, Greece
² National Research Council Canada, Gatineau, Canada
³ Xerox Research Centre Europe, Grenoble, France
Abstract. In this paper, we address the problem of learning aspect models with partially labeled examples. We propose a method which benefits from both semi-supervised and active learning frameworks. In particular, we combine a semi-supervised extension of the PLSA algorithm [11] with two active learning techniques. We perform experiments over four different datasets and show the effectiveness of the combination of the two frameworks.
1 Introduction
The explosion of available information during the last years has increased the interest of the Machine Learning (ML) community in the learning problems raised by most information access applications. In this paper we are interested in the study of two of these problems: handling partially labeled data and modeling the generation of textual observations.
Semi-Supervised Learning (SSL) emerged in the Machine Learning community in the late 1990s. Under this framework, the aim is to establish a decision rule based on both labeled and unlabeled training examples. To achieve this goal, the decision rule is learned by simultaneously optimizing a supervised empirical learner on the labeled set, while respecting the underlying structure of the unlabeled training data in the input space. In the same vein, Active Learning also addresses the issue of the annotation burden, but from a different perspective. Instead of using all the unlabeled data together with the labeled ones, it tries to minimize the annotation cost by labeling as few examples as possible and focusing on the most useful ones. Different types of active learning methods have been introduced in the literature, such as uncertainty-based methods [13,21,3], expected error minimization methods [10,19,7] and query-by-committee methods [20,8,17,6].
By combining semi-supervised and active learning, an attempt is made to benefit from both frameworks to address the annotation burden problem. The semi-supervised learning component improves the classification rule and the measure
of its confidence, while the active learning component queries labels for the most relevant and potentially useful examples.
On the other hand, new generative aspect models have recently been proposed which aim to take into account data with multiple facets. In this class of models, observations are generated by a mixture of aspects, or topics, each of which is a distribution over the basic features of the observations. Aspect models [9,1] have been successfully used for various textual information access and image analysis tasks, such as document clustering and categorization or scene segmentation. In many of these tasks, acquiring the annotated data necessary to apply supervised learning techniques is a major challenge, especially in very large data sets. These annotations require humans who can understand the scene or the text, and are therefore very costly, especially in technical domains.
In this paper, we explore the possibility of learning such models with the help of unlabeled examples, by combining SSL and active learning. This work is the continuation of, and builds on, earlier work on SSL for PLSA [11]; in particular, it combines the SSL variant of PLSA with two active learning techniques.
2 Combining SSL and Active Learning
The idea of combining active and semi-supervised learning was first introduced by [15]. The idea is to integrate an EM algorithm with unlabeled data into an active learning framework, and more particularly into a query-by-committee (QBC) method. The committee members are generated by sampling classifiers according to the distribution of classifier parameters specified by the training data. In [16], Co-EMT is proposed. This algorithm combines Co-Testing and Co-EM. As opposed to the Co-Testing algorithm, which learns hypotheses h1 and h2 based only on the labeled examples, Co-EMT learns the two hypotheses by running Co-EM on both labeled and unlabeled examples. Then, in the active learning step, it annotates the example on which the predictions of h1 and h2 are the most divergent, that is, the example for which h1 and h2 have an equally strong confidence at predicting a different label. [24] also presents a combination of semi-supervised and active learning using Gaussian fields and harmonic functions. [23] presented the so-called Semi-Supervised Active Image Retrieval (SSAIR) method for the different task of relevance feedback. The method was inspired by co-training [2] and co-testing [17], but instead of using two sufficient but redundant views of the dataset, it employs two different learners on the same data. In the context of multi-view active learning, [18] proposed a method which combines semi-supervised and active learning. The first step uses Co-EM with naive Bayes as the semi-supervised algorithm. They present an approximation to Co-EM with naive Bayes that can incorporate user feedback almost instantly and can use any sample-selection strategy for active learning.
Why should the combination work? The combination of semi-supervised and active learning appears to be particularly beneficial in reducing the annotation burden for the following reasons:
1. It constitutes an efficient way of solving the exploitation/exploration problem: semi-supervised learning is more focused on exploitation, while active learning is more dedicated to exploration. Semi-supervised learning alone may lead to poor performance in the case of very scarce initial annotation. It strongly suffers from poorly represented classes, while being very sensitive to noise and potential instability. On the other hand, active learning alone may spend too much time querying useless examples, as it cannot exploit the information given by the unlabeled data.
2. In the same vein, it may alleviate the data imbalance problem caused by each method separately. Semi-supervised learning tends to over-weight easy-to-classify examples that will dominate the process, while active learning has the opposite strategy, resulting in exploring more deeply the hard-to-classify examples [22].
3. Semi-supervised learning is able to provide a more motivated estimation of the confidence score associated to the class prediction for each example, taking into account the whole data set, including the unlabeled data. As a consequence, active learning based on these better confidence scores is expected to be more efficient.
3 Semi-Supervised PLSA with a Mislabeling Error Model
In this section we present the semi-supervised variant of the Probabilistic Latent Semantic Analysis (PLSA) model which is used in combination with active learning. This method incorporates a mislabeling error model (hence the name ssPLSA-mem) [11]. We assume that the labeling errors made by the generative model for unlabeled data come from a stochastic process and that these errors are inherent to semi-supervised learning algorithms. The idea here is to characterize this stochastic process in order to reduce the labeling errors computed by the classifier for the unlabeled data in the training set. We assume that for each unlabeled example x ∈ X_u there exists a perfect, true label y and an imperfect label ỹ estimated by the classifier. Assuming also that the estimated label depends on the true one, we can model these labels by the following probabilities:

β_kh = P(ỹ = k | y = h), ∀(k, h) ∈ C × C,   (1)

subject to the constraint that ∀h, Σ_k β_kh = 1. The underlying generation process associated to this latent variable model for unlabeled data is:
– Pick an example x with probability P(x),
– Choose a latent variable α according to its conditional probability P(α | x),
– Generate a feature w with probability P(w | α),
– Generate the latent class y according to the probability P(y | α),
– Generate the imperfect class label ỹ with probability β_{ỹ|y} = P(ỹ | y).
The values of P(y|α) depend on the value of the latent topic variable α. The cardinality of α is given, and the number of latent topics α per class is also known for both labeled and unlabeled examples. We initialize by forcing to zero the values P(y|α) for the latent topic variables α which do not belong to the particular class y, and these values remain fixed; in other words, we perform hard clustering. Note that the hard clustering is done for each class separately, since within each class y the corresponding feature examples may aggregate into several clusters.
Algorithm 1 describes the estimation of the model parameters Φ = {P(α | x), P(w | α), β_{ỹ|y} : x ∈ X, w ∈ W, α ∈ A, y ∈ C, ỹ ∈ C}. It is an EM-like algorithm; n(x, w) denotes the frequency of feature w in example x. For more information about this model, please refer to [11].

Algorithm 1. ssPLSA-mem
Input:
– a set of partially labeled data X = X_l ∪ X_u,
– random initial model parameters Φ^(0).
j ← 0
Run a simple PLSA algorithm for the estimation of the initial ỹ of each unlabeled example.
repeat
  E-step: estimate the latent class posteriors
    π_α(w, x, y) = P(α|x) P(w|α) P(y|α) / Σ_α P(α|x) P(w|α) P(y|α),   if x ∈ X_l
    π̃_α(w, x, ỹ) = P(α|x) P(w|α) Σ_y P(y|α) β_{ỹ|y} / Σ_α P(α|x) P(w|α) Σ_y P(y|α) β_{ỹ|y},   if x ∈ X_u
  M-step: estimate the new model parameters Φ^(j+1) by maximizing the complete-data log-likelihood
    P^(j+1)(w|α) ∝ Σ_{x∈X_l} n(w, x) π_α^(j)(w, x, y(x)) + Σ_{x∈X_u} n(w, x) π̃_α^(j)(w, x, ỹ(x))
    P^(j+1)(α|x) ∝ Σ_w n(w, x) × { π_α^(j)(w, x, y(x)) for x ∈ X_l ;  π̃_α^(j)(w, x, ỹ(x)) for x ∈ X_u }
    β^(j+1)_{ỹ|y} ∝ Σ_w Σ_{x∈X_u} n(w, x) Σ_{α | α∈y} π̃_α^(j)(w, x, ỹ)
  j ← j + 1
until convergence of the complete-data log-likelihood
Output: a generative classifier with parameters Φ
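The update equations of Algorithm 1 map almost directly onto array operations. The following NumPy sketch (our own simplified illustration, not the authors' implementation) runs one EM iteration under the additional assumption of a single latent topic per class, so that α coincides with y; all variable names are ours.

import numpy as np

# Minimal one-iteration EM for ssPLSA-mem (illustrative sketch only).
# Simplifications: one latent topic per class (alpha == y), dense count matrices,
# and the imperfect labels y~ of the unlabeled documents given as input.

rng = np.random.default_rng(0)
W, C = 6, 3                                            # vocabulary size, classes
n_l = rng.integers(0, 4, size=(2, W)).astype(float)    # labeled doc-term counts
n_u = rng.integers(0, 4, size=(3, W)).astype(float)    # unlabeled doc-term counts
y_l = np.array([0, 2])                                 # true labels (labeled docs)
y_u = np.array([1, 1, 2])                              # imperfect labels from PLSA

def normalize(a, axis):
    s = a.sum(axis=axis, keepdims=True)
    return a / np.where(s == 0, 1, s)

P_w_a = normalize(rng.random((W, C)), axis=0)          # P(w|alpha)
P_a_xl = normalize(rng.random((len(n_l), C)), axis=1)  # P(alpha|x) for labeled x
P_a_xu = normalize(rng.random((len(n_u), C)), axis=1)  # P(alpha|x) for unlabeled x
beta = np.full((C, C), 1.0 / C)                        # beta[k, h] = P(y~=k | y=h)

# E-step: latent class posteriors, normalized over alpha.
pi_l = np.zeros((len(n_l), W, C))
for i, y in enumerate(y_l):
    mask = np.zeros(C); mask[y] = 1.0                  # P(y|alpha) = 1 iff alpha == y
    pi_l[i] = normalize(P_a_xl[i] * P_w_a * mask, axis=1)
pi_u = np.zeros((len(n_u), W, C))
for i, yt in enumerate(y_u):
    pi_u[i] = normalize(P_a_xu[i] * P_w_a * beta[yt], axis=1)

# M-step: re-estimate P(w|alpha), P(alpha|x) and the mislabeling matrix beta.
P_w_a = normalize(np.einsum('iw,iwa->wa', n_l, pi_l) +
                  np.einsum('iw,iwa->wa', n_u, pi_u), axis=0)
P_a_xl = normalize(np.einsum('iw,iwa->ia', n_l, pi_l), axis=1)
P_a_xu = normalize(np.einsum('iw,iwa->ia', n_u, pi_u), axis=1)
beta_new = np.zeros((C, C))
for i, yt in enumerate(y_u):
    beta_new[yt] += (n_u[i][:, None] * pi_u[i]).sum(axis=0)   # sum over w and alpha in y
beta = normalize(beta_new, axis=0)                     # columns sum to 1 over y~
print(P_w_a.round(2), beta.round(2), sep="\n")

Iterating the E- and M-steps until the complete-data log-likelihood stabilizes yields the parameter estimates used by the classifier.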
4 Active Learning
In this section, we extend the semi-supervised model presented above by combining it with two active learning methods. The motivation is to take advantage of the characteristics of both frameworks. In both methods, we choose to annotate the least confident examples; their difference lies in the measure of confidence they use.
Margin-Based Method. The first active learning method (the so-called margin-based method) chooses to annotate the example which is closest to the class boundaries [12], which gives us a notion of the confidence the classifier has in the classification of these examples. In order to measure this confidence we use the following class-entropy measure for each unlabeled example:

B(x) = −Σ_y P(y|x) log P(y|x),   where x ∈ X_u.   (2)
The larger B(x) is, the less confident the classifier is about the labeling of the example. After having selected an example, we annotate it and add it to the initial labeled set X_l. More than one example can be selected at each iteration, because, especially for classification problems with a large number of examples and many classes, annotating only one example at a time can prove time-consuming, as a considerable amount of labeled examples will be needed in order to achieve good performance. If we do select several examples, it is not wise to choose examples that are next to each other, as together they cannot give us significantly more information than each of them does individually. As a result, it is better to choose, for instance, examples with large class entropy which have been given different labels. That way the classifier can get information about different classes and not only about a single one.
Entropy-Based Method. Based on the method presented in [5], we calculate the entropy of the annotation of the unlabeled data during the iterations of the model.

Algorithm 2. Combining ssPLSA and Active Learning
Input: a set of partially labeled examples X = X_l ∪ X_u
repeat
  – Run the ssPLSA algorithm (and calculate the P(y|x)).
  – Estimate the confidence of the classifier on the unlabeled examples.
  – Choose the example(s) with low confidence (if we choose more than one example to label, we choose examples which have been classified into different classes), annotate them and add them to the labeled dataset X_l.
until a certain number of queries has been made or a certain performance has been reached
Output: a generative classifier
The entropy-based method can be seen as a query-by-committee approach where, in contrast to the method of [5], the committee members are the different iterations of the same model. In contrast to the margin-based method presented previously, the current one does not use the probabilities P(y|x) of an example x being assigned the label y; instead, it uses the deterministic votes of the classifier during the different iterations. We denote by V(y, x) the number of times that label y was assigned to example x during the previous iterations, and define the vote entropy of an example x as:

VE(x) = −Σ_y (V(y, x) / iters) log (V(y, x) / iters),   (3)
where iters refers to the number of iterations. The examples to be labeled are chosen using equation (3), that is, examples with higher vote entropy are selected. As new examples are added during the iterations, the labeling of some examples will change, since new information is given to the classifier. The strategy chooses the examples for which the classifier changes its decision most often during the iterations. We have to note that during the first two or three iterations we do not have enough information to choose the best examples to label, but the active learner quickly manages to identify such examples. The intuition behind this method is that examples which tend to change labels are those about which the classifier seems most undecided. Algorithm 2 gives the general framework under which the above active learning methods can be combined with the semi-supervised variant of the PLSA model.
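Both selection criteria amount to a few lines of code; the fragment below is an illustrative Python sketch (ours, not the authors' implementation) of the class-entropy measure of Eq. (2) and the vote entropy of Eq. (3).

import math

# Illustrative implementations of the two selection criteria (our own sketch).

def class_entropy(p_y_given_x):
    """Eq. (2): B(x) = -sum_y P(y|x) log P(y|x); higher means less confident."""
    return -sum(p * math.log(p) for p in p_y_given_x if p > 0)

def vote_entropy(votes, iters):
    """Eq. (3): VE(x) from the labels assigned to x over previous iterations."""
    counts = {}
    for y in votes:
        counts[y] = counts.get(y, 0) + 1
    return -sum((v / iters) * math.log(v / iters) for v in counts.values())

# Select the example the classifier is least confident about:
posteriors = {"doc1": [0.9, 0.05, 0.05], "doc2": [0.4, 0.35, 0.25]}
print(max(posteriors, key=lambda x: class_entropy(posteriors[x])))   # doc2

history = {"doc1": ["A", "A", "A"], "doc2": ["A", "B", "A"]}
print(max(history, key=lambda x: vote_entropy(history[x], iters=3))) # doc2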
5 Experiments
In our experiments we used four different datasets: two collections from the CMU World Wide Knowledge Base project – WebKB¹ [4] and 20Newsgroups² – the widely used Reuters text collection (Reuters-21578)³, and a real-world dataset from Xerox. As mentioned before, we concentrate on document classification; nevertheless, the algorithms described in the previous sections can also be used for different applications in which there is a co-occurrence relation between objects and variables, such as image classification. The three public datasets were pre-processed by removing the email tags and other numeric terms, discarding the tokens which appear in fewer than 5 documents, and removing a total of 608 stopwords from the CACM stoplist⁴. No other form of preprocessing (stemming, multi-word recognition, etc.) was used on the documents. Table 1 summarizes the characteristics of these datasets.
¹ http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
² http://people.csail.mit.edu/jrennie/20Newsgroups/
³ http://www.daviddlewis.com/resources/testcollections/reuters21578/
⁴ http://ir.dcs.gla.ac.uk/resources/test collections/cacm/
Table 1. Characteristics of the datasets

                                20Newsgroups   WebKB   Reuters
Collection size                        20000    4196      4381
# of classes, K                           20       4         7
Vocabulary size, |W|                   38300    9400      4749
Training set size, |X_l ∪ X_u|         16000    3257      3504
Test set size                           4000     839       876
In addition to the previous datasets, which are widely used for the evaluation of classification algorithms in the Machine Learning community, we used a real-world dataset (called XLS) which comes from a Xerox Business Group. This dataset consists of 20000 documents in the training set and 34770 in the test set. The documents consist of approximately 40% emails, 20% Microsoft Word documents, 20% Microsoft Excel documents, 10% Microsoft PowerPoint documents and 10% PDF and miscellaneous documents. We want to classify the documents as Responsive or Non-Responsive to a particular given case.
Evaluation Measures. In order to evaluate the performance of the models, we used the micro-averaged F-score. For each classifier G_f, we first compute its micro-averaged precision P and recall R by summing over all the individual decisions it made on the test set:

R(G_f) = Σ_{k=1}^K θ(k, G_f) / Σ_{k=1}^K (θ(k, G_f) + ψ(k, G_f)),   P(G_f) = Σ_{k=1}^K θ(k, G_f) / Σ_{k=1}^K (θ(k, G_f) + φ(k, G_f)),

where θ(k, G_f), φ(k, G_f) and ψ(k, G_f) respectively denote the true positive, false positive and false negative documents in class k found by G_f, and K denotes the number of classes. The F-score measure is then defined as [14]:

F(G_f) = 2 P(G_f) R(G_f) / (P(G_f) + R(G_f)).
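These micro-averaged measures can be computed directly from the per-class counts; a small illustrative Python helper (ours, not from the paper) is shown below.

# Illustrative computation of micro-averaged precision, recall and F-score.
def micro_f_score(counts):
    """counts: per-class (tp, fp, fn) tuples, i.e. (theta, phi, psi)."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(micro_f_score([(50, 10, 5), (30, 5, 15)]))   # roughly 0.82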
5.1 Results
We run experiments for all semi-supervised variants, for both active learning techniques, and for all four datasets. In our experiments, we label one example in each iteration; 100 iterations are performed for WebKB and Reuters, and 150 for the 20Newsgroups dataset. For the XLS dataset we label two examples in each iteration and perform 100 iterations (as the dataset is bigger than the other three, more labeled data are needed to achieve good performance). For the margin-based method, it is not wise to choose two examples that are next to each other, as together they cannot give us much more information than each of them does individually. As a result, we chose the two examples with the largest class entropy which, in addition, had been assigned different labels.
Fig. 1. F-score (y-axis) versus the number of labeled examples in the training set (x-axis) for the combination of the two ssPLSA algorithms with active learning (entropy, margin and random selection methods) on the Reuters, WebKB, 20Newsgroups and XLS datasets
In order to evaluate the performance of the active learning methods, we also ran experiments for the combination of the semi-supervised algorithms with a random selection method, where in each iteration the documents to be labeled are chosen randomly. As we can see from Figure 1, the use of active learning helps, in comparison with random querying, for all four datasets. The performance of the two active learning techniques is comparable, and their difference is not statistically significant. Nevertheless, they clearly outperform the random method, especially when very few labeled data are available. For the XLS dataset in particular, active learning also helps compared to the random method, although the gain is smaller than for the other three datasets. As in the previous cases, the two active learning methods give similar results.
6 Conclusions
In this work, a variant of the semi-supervised PLSA algorithm has been combined with two active learning techniques. Experiments on four different datasets validate a consistent, significant increase in performance. The evaluation we performed has shown that this combination can further increase the classifier's
performance. Using active learning, we manage to choose our labeled training set carefully, using the most informative examples. Working this way, we can achieve better performance using fewer labeled examples.
This work focused on the PLSA model. Nevertheless, this does not mean that the developed methods can be used exclusively with it; on the contrary, the proposed techniques are easily applicable to other aspect models. Another possible extension is the use of different active learning techniques. Also, the combination of more than one active learning technique could be considered.
Acknowledgment

This work was supported in part by the IST Program of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.
References
1. Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. Journal of Machine Learning Research (2003)
2. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the Conference on Computational Learning Theory (COLT), pp. 92–100 (1998)
3. Campbell, C., Cristianini, N., Smola, A.J.: Query learning with large margin classifiers. In: Proceedings of the 17th International Conference on Machine Learning (ICML), San Francisco, CA, USA, pp. 111–118 (2000)
4. Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., Slattery, S.: Learning to extract symbolic knowledge from the World Wide Web. In: Proceedings of the 15th Conference of the American Association for Artificial Intelligence, Madison, US, pp. 509–516 (1998)
5. Dagan, I., Engelson, S.P.: Committee-based sampling for training probabilistic classifiers. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 150–157 (1995)
6. Davy, M., Luz, S.: Active learning with history-based query selection for text categorisation. In: Amati, G., Carpineto, C., Romano, G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 695–698. Springer, Heidelberg (2007)
7. Dönmez, P., Carbonell, J.G., Bennett, P.N.: Dual strategy active learning. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 116–127. Springer, Heidelberg (2007)
8. Freund, Y., Seung, H., Shamir, E., Tishby, N.: Selective sampling using the query by committee algorithm. Machine Learning 28(2-3), 133–168 (1997)
9. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 42(1-2), 177–196 (2001)
10. Iyengar, V., Apte, C., Zhang, T.: Active learning using adaptive resampling. In: Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining, pp. 92–98 (2000)
11. Krithara, A., Amini, M.-R., Renders, J.-M., Goutte, C.: Semi-supervised document classification with a mislabeling error model. In: European Conference on Information Retrieval (ECIR), Glasgow, Scotland (2008)
12. Krithara, A., Goutte, C., Amini, M.-R., Renders, J.-M.: Reducing the annotation burden in text classification. In: 1st International Conference on Multidisciplinary Information Sciences and Technologies (InSCiT), Merida, Spain (2006)
13. Lewis, D., Gale, W.A.: A sequential algorithm for training text classifiers. In: Proceedings of the 17th International Conference on Research and Development in Information Retrieval (SIGIR), Dublin, pp. 3–12 (1994)
14. Lewis, D., Ringuette, M.: A comparison of two learning algorithms for text categorization. In: Proceedings of the Symposium on Document Analysis and Information Retrieval (SDAIR), pp. 81–93 (1994)
15. McCallum, A., Nigam, K.: Employing EM and pool-based active learning for text classification. In: Proceedings of the 15th International Conference on Machine Learning (ICML), pp. 350–358 (1998)
16. Muslea, I., Minton, S., Knoblock, C.: Active + Semi-supervised Learning = Robust Multi-View Learning. In: Proceedings of the 19th International Conference on Machine Learning (ICML), pp. 435–442 (2002)
17. Muslea, I., Minton, S., Knoblock, C.A.: Selective sampling with redundant views. In: Proceedings of the 17th National Conference on Artificial Intelligence (AAAI), pp. 621–626 (2000)
18. Probst, K., Ghani, R.: Towards 'interactive' active learning in multi-view feature sets for information extraction. In: Proceedings of the European Conference on Machine Learning (ECML), pp. 683–690 (2007)
19. Roy, N., McCallum, A.: Toward optimal active learning through sampling estimation of error reduction. In: Proceedings of the 18th International Conference on Machine Learning (ICML), San Francisco, CA, pp. 441–448 (2001)
20. Seung, H.S., Opper, M., Sompolinsky, H.: Query by committee. In: Computational Learning Theory, pp. 287–294 (1992)
21. Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. In: Proceedings of the 17th International Conference on Machine Learning (ICML), Stanford, US, pp. 999–1006 (2000)
22. Tür, G., Hakkani-Tür, D., Schapire, R.: Combining active and semi-supervised learning for spoken language understanding. Speech Communication 45(2), 171–186 (2005)
23. Zhou, Z.-H., Chen, K.-J., Dai, H.-B.: Enhancing relevance feedback in image retrieval using unlabeled data. ACM Transactions on Information Systems 24(2), 219–244 (2006)
24. Zhu, X., Lafferty, J., Ghahramani, Z.: Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the 20th International Conference on Machine Learning (2003)
A Market-Affected Sealed-Bid Auction Protocol

Claudia Lindner

Institut für Informatik, Universität Düsseldorf, 40225 Düsseldorf, Germany
Abstract. Multiagent resource allocation addresses the issue of having to distribute a set of resources among a set of agents, aiming at a fair and efficient allocation. Resource allocation procedures can be evaluated with regard to properties such as budget balance and strategy-proofness. Designing a budget-balanced and strategy-proof allocation procedure that always provides a fair (namely, envy-free) and efficient (namely, Pareto-optimal) allocation poses a true challenge. To the best of our knowledge, none of the existing procedures combines all four properties. Moreover, previous literature pays no attention to allocating unwanted resources (i.e., resources that seem to be of no use to any agent) in a way that maximizes social welfare. Yet, dealing inappropriately with unwanted resources may decrease each agent's benefit. Therefore, we extend the scope of sealed-bid auctions by involving market prices, so as to always provide an optimal solution under consideration of each agent's preferences. We present a new market-affected sealed-bid auction protocol (MSAP) where agents submit sealed bids on indivisible resources, and we allow monetary side-payments. We show this protocol to be budget-balanced and weakly strategy-proof, and to always provide an allocation that maximizes both utilitarian and egalitarian social welfare and is envy-free and Pareto-optimal.

Keywords: multiagent systems, multiagent resource allocation, auctions.
1 Introduction

In multiagent resource allocation, agents participate in an allocation procedure to obtain a fair and efficient allocation of a set of resources.¹ There are two types of procedures: centralized and decentralized. Auction protocols are a good example of centralized procedures: all agents are asked to state their preferences (utility values) for the resources given, and based on these the protocol makes a decision on the final allocation (see, e.g., [11,6]). In contrast, in a decentralized environment the final allocation is the result of a sequence of negotiations conducted between single agents (see, e.g., [7]). Many resource allocation approaches are about allocating each of the goods to some agent (see, e.g., [14,1,17]), and thus an important aspect is missed: among the goods to be allocated there may be "unwanted goods", i.e., goods none of the agents is interested in. Though some procedures consider that a particular agent may not be interested in every good available (see, e.g., [17]), no attention is given to the issue of unwanted goods. The market-affected sealed-bid auction protocol (MSAP) to be presented
Supported in part by the DFG under grant RO 1202/12-1 (within the ESF's EUROCORES program LogICCC: "Computational Foundations of Social Choice").
¹ We will use "resource" and "good", and "procedure" and "protocol", interchangeably.
in Section 4 fills this gap by involving market prices. This new auction protocol has highly relevant properties such as budget balance, weak strategy-proofness, and always providing an allocation that maximizes both utilitarian and egalitarian social welfare and that is envy-free and Pareto-optimal (see Section 5).
Utilitarian and egalitarian social welfare are the most common notions of social welfare (see, e.g., [5]). Informally speaking, for a given allocation, utilitarian social welfare states the sum of all agents' utilities, whereas egalitarian social welfare states the utility of the worst-off agent. Both notions make meaningful statements about the quality of an allocation: utilitarian social welfare measures the overall benefit for society, and egalitarian social welfare measures the level of fairness with respect to satisfying minimum needs. Another substantial concept of fairness is envy-freeness: an allocation is said to be envy-free if none of the agents has an incentive to swap his or her share with any other agent's share. Regarding efficiency, the most fundamental concept is the notion of Pareto-optimality: an allocation is Pareto-optimal if any change to make an agent better off results in making another agent worse off. With regard to resource allocation protocols that involve monetary side-payments, budget balance states that all payments sum up to zero, i.e., the application of the protocol causes neither a profit nor a loss. These notions and the general framework are specified in Section 2. Moreover, the MSAP is proven to be "weakly strategy-proof", a notion to be motivated and introduced in Section 3. In short, weak strategy-proofness is a somewhat milder notion of strategy-proofness, implying that a cheating attempt may be successful but always comes at the risk of an overall loss.
Pioneering work in the field of auction procedures for indivisible goods was done by Knaster. His auction protocol of sealed bids² is about agents that submit sealed bids for single goods; the agent whose bid is the highest is assigned the good, and monetary side-payments are used for compensation. Knaster's procedure always provides an efficient allocation but does not guarantee envy-freeness (see, e.g., [3]). Just as in Knaster's procedure of sealed bids, the MSAP asks agents to submit sealed bids on single goods reflecting their individual welfare from receiving these goods. This has the advantage of making winner determination easy (as opposed to the hardness of winner-determination problems in combinatorial auctions, see [11,6]). In this regard, in order to create a mutual bid basis and to account for unwanted goods, the option to sell goods on the open market is included, i.e., the MSAP combines the actual value of a good with each agent's preferences. Furthermore, we allow monetary side-payments, and a central authority manages the allocation procedure. Put simply, the MSAP extends the scope of sealed-bid auctions by involving market prices in order to always provide a fair and efficient allocation, even when taking unwanted goods into account.
The New Economy, which is rooted in continuous information-technological progress, provides the basis for the MSAP to act on the global market. The internet and related technologies overcome geographical borders and increase market transparency, and hence create some kind of perfect-information environment [4].
Being generated and used by internet users worldwide, information (e.g., on demand and supply) turns out to be the crucial factor when performing global market activities. The global market provides a shared platform for market research, giving each agent the same chance to sell unwanted goods, and thus to make bids on a comparable basis.
² This protocol has been proposed by Knaster and was first presented in Steinhaus [13].
Application areas could be the allocation of inheritance items or collective raffle prizes, or, in the case of two agents, even the allocation of household goods within a divorce settlement. In general, the MSAP can be applied to a mixture of insignificant replaceable goods (e.g., a car that is of no (considerable) personal significance to a particular agent) and significant replaceable goods, i.e., goods that are in a sense irreplaceable to an agent but for which a certain monetary compensation would be accepted (e.g., a car that is of considerable personal significance to a particular agent for some reason, such as the agent having been born in this car, but for which the agent would be willing to accept some money to set the car aside). The MSAP can also be applied to a set of either solely significant or solely insignificant replaceable goods.
2 Preliminaries and Notation

Let A = {a_1, a_2, …, a_n} be a set of n agents, and let G = {g_1, g_2, …, g_m} be a set of m indivisible and nonshareable goods (i.e., each good is to be allocated in its entirety and to one agent only). If some amount of money is among the goods to be allocated, this money is excluded from G and its value is split equally among all agents in A. Moreover, neither the number of agents nor the number of goods is restricted, and there is no limitation on how many goods may be allocated per agent. While previous literature mostly focuses on scenarios where only one single good is to be assigned per agent (see, e.g., [14,15,9]), we do not need this restriction (for related work, see, e.g., [2,17]). Let U = {u_1, u_2, …, u_n} be a set of n utility functions representing each agent's preferences (i.e., bids), where u_j : G → R specifies agent a_j's utility of each single good in G. An allocation of G is a mapping X : A → G with X(a_j) ∩ X(a_k) = ∅ for any two agents a_j and a_k, j ≠ k. Here, u_j(X(a_j)) gives agent a_j's additive utility of the subset (bundle) of goods allocated to him or her by allocation X; to simplify notation, we will write u_j(X) instead of u_j(X(a_j)). Note that agents do not have any knowledge about the utility values of other agents. Let C = {c_1, c_2, …, c_n} be a set of n side-payments that agents a_j in A either make (i.e., c_j ∈ R^-) or receive (i.e., c_j ∈ R^+) in conjunction with an allocation. Here, the value of money is supposed to be the same for all agents. The MSAP asks agents to bid on single goods only; hence, direct synergetic effects caused by the allocation of bundles of goods are disregarded. However, in Section 5 it is shown that the final allocation involves bundles when accounting for side-payments, and that statements regarding social welfare, fairness, and efficiency can be made.
Various criteria have been introduced to measure the quality of an allocation, such as the concepts of social welfare, envy-freeness, and Pareto-optimality. Concerning a given society of agents, concepts of social welfare measure the benefit an allocation yields; transferring this measure to individual agents gives the notion of individual welfare. Definitions 1, 2, 3, and 4 are each given with reference to a resource allocation setting where monetary side-payments are allowed.

Definition 1. Consider an allocation X of a set G of goods to a set A of agents, where the agents' preferences are represented by utility functions U. Let C be the set of side-payments agents in A either make or receive, as appropriate. The individual welfare of agent a_j obtained through allocation X and side-payment c_j is defined as iw_j(X(a_j)) = u_j(X(a_j)) + c_j. We will write iw_j(X) instead of iw_j(X(a_j)).
Note that for an allocation X and any two agents a_j and a_k in A, a_j ≠ a_k, the individual welfare agent a_j would obtain through the assignment of agent a_k's share is defined as iw_j(X(a_k)) = u_j(X(a_k)) + c_k. In terms of social welfare, in this paper we focus on the following two types (see, e.g., [5]).

Definition 2. Consider an allocation X of a set G of goods to a set A of agents, where iw_j(X) is the individual welfare agent a_j in A obtains through allocation X. (1) The utilitarian social welfare is defined as sw_u(X) = Σ_{a_j ∈ A} iw_j(X). (2) The egalitarian social welfare is defined as sw_e(X) = min{iw_j(X) | a_j ∈ A}.

We now define envy-freeness and Pareto-optimality. Informally speaking, an allocation is envy-free if every agent is at least as happy with his or her share as he or she would be with any other agent's share. An allocation is Pareto-optimal if no agent can be made better off without making another agent worse off.

Definition 3. Let a set G of goods and a set A of agents be given, and let X and Y be two allocations of G to A. Let iw_j(X) and iw_j(Y) be the individual welfares agent a_j in A obtains through allocations X and Y. (1) X is said to be envy-free if for any two agents a_j and a_k in A, we have iw_j(X(a_j)) ≥ iw_j(X(a_k)). (2) Y is said to be Pareto-dominated by X if for each agent a_j in A, we have iw_j(X) ≥ iw_j(Y), and there exists some agent a_k in A such that iw_k(X) > iw_k(Y). An allocation is said to be Pareto-optimal (or Pareto-efficient) if it is not Pareto-dominated by any other allocation.

The notion of budget balance makes a statement about the quality of a resource allocation protocol that involves monetary side-payments.

Definition 4. A resource allocation protocol is said to be budget-balanced if, for every allocation obtained, all side-payments sum up to zero, i.e., Σ_{c_j ∈ C} c_j = 0.
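A compact Python sketch (our own illustration; all helper names are hypothetical) makes Definitions 1–3 operational for a given allocation with side-payments.

# Illustrative checks for Definitions 1-3 (hypothetical helpers, not from the paper).
# utilities[j][g]: agent j's utility for good g; allocation[j]: bundle of agent j;
# payments[j]: side-payment c_j (negative = pays, positive = receives).

def individual_welfare(j, bundle, payment, utilities):
    return sum(utilities[j][g] for g in bundle) + payment        # Definition 1

def social_welfare(allocation, payments, utilities):
    iw = [individual_welfare(j, allocation[j], payments[j], utilities)
          for j in range(len(allocation))]
    return sum(iw), min(iw)                                      # utilitarian, egalitarian

def is_envy_free(allocation, payments, utilities):
    n = len(allocation)
    return all(
        individual_welfare(j, allocation[j], payments[j], utilities) >=
        individual_welfare(j, allocation[k], payments[k], utilities)
        for j in range(n) for k in range(n))                     # Definition 3(1)

utilities = [{"g1": 100, "g2": 50}, {"g1": 80, "g2": 60}]
allocation, payments = [{"g1"}, {"g2"}], [-20, 20]
print(social_welfare(allocation, payments, utilities))           # (160, 80)
print(is_envy_free(allocation, payments, utilities))             # True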
3 Motivation

When designing a multiagent resource allocation procedure, the common goal is to guarantee a fair (namely, envy-free) and efficient (namely, Pareto-optimal) allocation. In this context, two more desirable properties for a resource allocation protocol to possess are budget balance and strategy-proofness (i.e., none of the agents has an incentive to bid dishonestly). While the well-known Groves mechanisms [8] satisfy both Pareto-optimality and strategy-proofness, they in general do not guarantee an envy-free allocation and are not budget-balanced (see, e.g., [10]). In fact, Tadenuma and Thomson [16] showed that envy-freeness and strategy-proofness are mutually exclusive. Thus, in this paper, we aim at guaranteeing envy-freeness, Pareto-optimality and budget balance while weakening the requirement of strategy-proofness. In the literature it is common to define strategy-proofness via incentive compatibility, which states that truthfulness is the dominant strategy (see, e.g., [12]). In order to deal with the impossibility result given above, we focus on a somewhat weaker notion of strategy-proofness.

Definition 5. Given that agents have no knowledge about the utility functions of other agents, a resource allocation protocol is said to be weakly strategy-proof if a cheating agent always risks a loss and is never guaranteed to cheat successfully.
In the context of additive valuation functions, Knaster's procedure of sealed bids satisfies Pareto-optimality and budget balance, but lacks envy-freeness and strategy-proofness, though it is weakly strategy-proof according to Definition 5 (see, e.g., [3]). Willson [17] presented a procedure that is indeed envy-free, Pareto-optimal, budget-balanced and weakly strategy-proof (according to Definition 5). However, this procedure does not give consideration to welfare maximization in the presence of unwanted goods. According to Willson's procedure, agents are expected to state negative values for those goods they do not want (i.e., to specify some monetary compensation to be paid to them in order to persuade them to accept those goods nonetheless), but there are no restrictions in terms of some sort of value limit. Thus, regarding an unwanted good (i.e., a good that is a burden on each of the agents), it is possible that the absolute value of each single negative bid is higher than the overall value of all other goods to be allocated. This allows agents to receive compensations that, in the end, may cause a money-losing allocation. As an example, consider a setting with two agents a_1 and a_2, three goods g_1, g_2 and g_3, and the following utility values: u_1(g_1) = 100, u_1(g_2) = 50, u_1(g_3) = −300, u_2(g_1) = 80, u_2(g_2) = 60 and u_2(g_3) = −250. According to the procedure given in [17], each good is assigned to the highest bidder. Thus, good g_1 is assigned to agent a_1, and goods g_2 and g_3 are assigned to agent a_2. The overall benefit sums up to 100 + 60 + (−250) = −90, and hence agent a_1 has to make a side-payment of 145 to agent a_2, resulting in a negative share of −45 for each of the agents. Just one unwanted good, here g_3, can ruin a whole allocation. But if an agent does not want a good for personal use, this does not necessarily mean that this good is worth nothing to the agent, as he or she may have good selling opportunities. For example, assume good g_3 is a car and both agents have no use for it; hence, they only see the cost involved, such as the cost for scrapping or insurance. By missing the option to sell unwanted goods and to distribute the related profit, agents may end up paying rather than benefiting. Moreover, there may be goods that, though being wanted by some agents, are of no high personal significance. Having no common basis for the specification of utility values, agents with a similar preference for one particular good (e.g., considering the good to be of no personal significance) may state significantly different utilities. Without a common basis, the values stated may neither be related to the actual value of the good nor to one another, and thus a lower overall benefit may result. To address the issues mentioned above, we present and analyze a new market-affected sealed-bid auction protocol that is proven to be budget-balanced and weakly strategy-proof, and to always provide an allocation that is envy-free and Pareto-optimal, and that maximizes both utilitarian and egalitarian social welfare.
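The two-agent example above can be verified mechanically; the following Python snippet (ours) assigns every good to its highest bidder and then equalizes the shares by side-payments, reproducing the money-losing outcome described above.

# Reproduces the two-agent example: highest-bidder assignment plus equal split.
bids = {"a1": {"g1": 100, "g2": 50, "g3": -300},
        "a2": {"g1": 80,  "g2": 60, "g3": -250}}

winners = {g: max(bids, key=lambda a: bids[a][g]) for g in ["g1", "g2", "g3"]}
utility = {a: sum(bids[a][g] for g, w in winners.items() if w == a) for a in bids}
overall = sum(utility.values())                        # 100 + 60 - 250 = -90
fair_share = overall / len(bids)                       # -45 per agent
payments = {a: fair_share - utility[a] for a in bids}  # a1 pays 145, a2 receives 145
print(winners, overall, payments)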
4 A Market-Affected Sealed-Bid Auction Protocol

The MSAP is about allocating goods that are to be given away in as fair a way as possible. However, agents may have diverse preferences for the goods in G, and thus some allocations may result in an advantage for one agent and a disadvantage for another, which is unfair, as every agent is to be treated equally. For the purpose of achieving not only an efficient but also a fair allocation, the aim is to assign all goods in G in such a way that the individual welfares of all agents are equalized according to how valuable the goods
are to them. Regarding a good that is significant to an agent, the utility value reflects the level of personal significance of this good, whereas, regarding an insignificant good, the utility value states the profit the agent could make by selling this good.³
Note that the central authority (CA) managing the allocation procedure is not one of the agents. If the CA needs to be paid for its job, this is done proportionally by all agents once the protocol is finished; it is assumed, though, that the CA generally does not have to be paid for organizing the allocation. Moreover, the MSAP may involve side-payments, but, as opposed to other approaches (see, e.g., [16,17]), none of the goods in G needs to be infinitely divisible. Furthermore, neither any agent nor the CA will lose any value by the application of the MSAP, because all side-payments are covered by the overall value of the goods in G. Let X^Σ denote the final allocation obtained by the MSAP. We write sw^Σ instead of sw(X^Σ), and iw_j^Σ instead of iw_j(X^Σ).
The MSAP is a multi-stage resource allocation protocol and consists of three phases: the bidding phase, the assignment phase and the compensation phase. In the course of the bidding phase, agents are asked to specify a utility value for each of the goods. During the assignment phase, each good is assigned to the agent whose benefit from receiving this good is the highest, where ties can be broken arbitrarily. Finally, the compensation phase is about equalizing the individual welfares of all agents by means of monetary side-payments. The MSAP is presented in detail in Figure 1.

Remark 1. Some remarks on the steps of the protocol in Figure 1 are in order:
1. B5. Agent a_j states u_j(g_i) (i.e., the individual welfare agent a_j would obtain from receiving good g_i) according to the following rules. (a) If a_j is not interested in g_i, a_j would not keep g_i but would sell it, and thus states a utility value fulfilling u_j(g_i) ≤ M(g_i). Agent a_j would receive revenue M(g_i) for selling g_i, but he or she may also have some expenses S_j(g_i) caused by selling g_i, i.e., u_j(g_i) = M(g_i) − S_j(g_i). If agent a_j wants to make sure that he or she does not receive good g_i, he or she states a utility value of zero.⁴ (b) If a_j would like to have g_i for him- or herself, a_j states a utility value fulfilling u_j(g_i) ≥ M(g_i). Here, the value difference between u_j(g_i) and M(g_i) expresses the degree of significance of g_i to a_j, i.e., the higher the difference, the more significant g_i is to a_j.
2. A3. If u_k(g_i) < P_CA(g_i) holds, the CA has the lowest selling cost for g_i and none of the agents in A is interested in keeping g_i for him- or herself, i.e., g_i is an unwanted good.
3. C1. If none of the goods in G needs to be sold, it holds that sw^Σ ≥ M^Σ = Σ_{i=1}^m M(g_i). After the assignment phase is finished, all agents concerned and the CA go into the matter of selling those goods that are to be sold, since the compensation phase involves all related profits. However, if the agents concerned have sufficient cash at hand, the selling could be done later on, though at the risk of financial loss and the chance of additional profit, as market prices may change over time.
3 There is no need to consider selling opportunities for significant goods, since agents are interested in keeping those goods due to their significance.
4 Note that stating "0" would simplify the process for agent a_j (i.e., no selling activity would be required), but this may cause a lower overall benefit compared to when a_j would sell g_i.
Bidding Phase: For each good g_i in G perform steps B1 to B5.
B1. Based on market research, the CA determines market price M(g_i), i.e., the revenue the CA or an agent would receive when selling good g_i on the market.
B2. The CA determines selling cost S_CA(g_i), i.e., the cost caused by the CA selling g_i on the market (e.g., the cost for meeting a potential buyer, or the cost for shipping the good).
B3. The CA calculates profit P_CA(g_i) := M(g_i) − S_CA(g_i).
B4. The CA discloses market price M(g_i), but conceals selling cost S_CA(g_i) and profit P_CA(g_i).
B5. Each agent a_j in A specifies utility value u_j(g_i) ≥ 0 and submits it to the CA.
Assignment Phase: For each good g_i in G perform steps A1 to A3.
A1. Find agent a_k in A such that there is no agent a_j in A with u_j(g_i) > u_k(g_i), i.e., find a highest bidder for g_i. (Ties can be broken arbitrarily.)
A2. If u_k(g_i) ≥ P_CA(g_i) and u_k(g_i) > 0, good g_i is allocated to agent a_k and the highest bid for good g_i is recorded by setting u*(g_i) := u_k(g_i).
A3. If u_k(g_i) < P_CA(g_i) or u_k(g_i) = P_CA(g_i) = 0, the CA is going to keep g_i for the time being and the highest bid for g_i is recorded by setting u*(g_i) := P_CA(g_i).
Compensation Phase:
C1. For final allocation X^Σ calculate the overall social welfare by sw^Σ := Σ_{i=1}^{m} u*(g_i).
C2. In compliance with the values u_j(X^Σ), divide set A into three disjoint sets R, S, and T with A = R ∪ S ∪ T, such that: (1) u_r(X^Σ) > (1/n) · sw^Σ for all a_r in R; (2) u_s(X^Σ) < (1/n) · sw^Σ for all a_s in S; (3) u_t(X^Σ) = (1/n) · sw^Σ for all a_t in T.
C3. All agents a_r in R (i.e., all advantaged agents) have to make side-payments c_r ∈ R^− such that iw_r^Σ = u_r(X^Σ) + c_r = (1/n) · sw^Σ.
C4. For goods g_λ in G with u*(g_λ) = P_CA(g_λ) that had to be sold by the CA, the CA has to make side-payments c_CA ∈ R^− such that c_CA = −Σ_λ P_CA(g_λ), where the sum ranges over all such goods g_λ.
C5. All agents a_s in S (i.e., all disadvantaged agents) receive side-payments c_s ∈ R^+ such that iw_s^Σ = u_s(X^Σ) + c_s = (1/n) · sw^Σ. Note that −Σ_{a_s∈S} c_s = c_CA + Σ_{a_r∈R} c_r.
The CA discloses social welfare sw^Σ, the assignment of goods g_i according to X^Σ, and side-payments c_j for each agent a_j in A.
Fig. 1. A market-affected sealed-bid auction protocol for any number of goods and agents
4. C5. Side-payments cr and cs can be made to or received from several agents, and agents either make side-payments or receive side-payments. Agents at in T have been in possession of a proportional share of social welfare swΣ after the assignment phase already, i.e., ct = 0 and iwtΣ = ut (X Σ ) = (1/n) · swΣ for all at in T .
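To make the flow of Figure 1 concrete, the following is a minimal Python sketch of the assignment and compensation phases, under the simplifying assumption that all bids have already been collected; the data structures and helper names are illustrative and not part of the protocol specification.

```python
# Minimal MSAP sketch (illustrative helper names; ties broken by order,
# all side-payments of an agent aggregated into a single value).
def msap(goods, agents, market_price, selling_cost_ca, bids):
    """goods, agents: lists of ids; market_price[g], selling_cost_ca[g]: CA data;
    bids[a][g]: utility value u_a(g) >= 0 submitted by agent a for good g."""
    profit_ca = {g: market_price[g] - selling_cost_ca[g] for g in goods}

    # Assignment phase (steps A1-A3)
    allocation, u_star = {}, {}
    for g in goods:
        k = max(agents, key=lambda a: bids[a][g])            # a highest bidder
        if bids[k][g] >= profit_ca[g] and bids[k][g] > 0:     # A2
            allocation[g], u_star[g] = k, bids[k][g]
        else:                                                 # A3: CA keeps/sells g
            allocation[g], u_star[g] = "CA", profit_ca[g]

    # Compensation phase (steps C1-C5)
    sw = sum(u_star.values())                                 # C1: overall social welfare
    share = sw / len(agents)                                  # (1/n) * sw
    welfare = {a: sum(u_star[g] for g in goods if allocation[g] == a)
               for a in agents}
    # positive value: agent receives money (C5); negative: agent pays (C3);
    # goods kept by the CA contribute their profit to sw and are covered by C4.
    side_payment = {a: share - welfare[a] for a in agents}
    return allocation, side_payment, sw
```

In this sketch the side-payments sum to exactly the profit of the goods kept by the CA, which mirrors the budget balance discussed in the next section.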
5 Results and Discussion
The simplest way of equitably allocating all goods in G would be for the CA itself to sell all goods on the market and distribute the profit made proportionally among all agents. In this case, the overall social welfare sw^Σ would equal P^Σ = Σ_{i=1}^{m} P_CA(g_i). Thus, (1/n) · P^Σ specifies the minimal individual welfare each agent in A is guaranteed to obtain through the allocation of all goods in G by the MSAP. However, taking each agent's preferences into consideration may increase all individual welfares, and the overall social welfare accordingly, up to any amount. Concerning
unwanted goods, the MSAP includes the option to sell those goods with the best possible profit, which gives an opportunity to increase the overall social welfare and guarantees that the overall social welfare is not devaluated by some "out-of-favor" good. Furthermore, the MSAP takes into account that agents may have exceptionally low selling costs for one or the other good, and thereby keeps all selling costs as low as possible. After the application of the MSAP, every agent a_j in A possesses a bundle of goods (which may be empty) and some side-payments (which may be positive, negative, or zero). Note that each agent's individual welfare iw_j^Σ is at least as high as a proportional share of the overall social welfare that could have been achieved if all goods in G had been allocated to this agent, i.e., iw_j^Σ ≥ (1/n) · Σ_{i=1}^{m} u_j(g_i). Consequently, by including all agents' preferences, each agent experiences an increase of what he or she actually receives over what he or she anticipated to receive according to his or her measure. Combining utility values and side-payments, individual welfare iw_j^Σ can be interpreted as the bundle (consisting of goods and/or money) agent a_j received by final allocation X^Σ. Analogously, social welfare sw^Σ can be linked to the concept of utilitarian social welfare, i.e., sw_u(X^Σ) = sw^Σ, and to the concept of egalitarian social welfare, i.e., sw_e(X^Σ) = (1/n) · sw^Σ, again in consideration of monetary side-payments.
Theorem 1. Every allocation obtained by the MSAP maximizes both utilitarian social welfare and egalitarian social welfare according to the agents' valuations.
Proof. The allocation of all goods g_i in G is conducted in a way such that each good is assigned to a highest bidder, or to the CA, respectively. By this, the overall social welfare to be distributed, which turns out to correspond to the utilitarian social welfare of final allocation X^Σ (including side-payments), is maximized on the basis of all agents' valuations of the goods. Given each agent's utility that would result from receiving good g_i, every other allocation that assigns at least one good to an agent other than one of the highest bidders would result in a lower utilitarian social welfare. Maximization of egalitarian social welfare follows immediately from maximization of utilitarian social welfare, since the MSAP makes each agent receive a proportional share of utilitarian social welfare sw_u(X^Σ) according to his or her measure.
From an inter-agent perspective, each agent values every other agent's bundle at most as much as his or her own, and thus an envy-free allocation is guaranteed. Moreover, no agent can be made better off without making any other agent worse off.
Theorem 2. Every allocation obtained by the MSAP is Pareto-optimal and envy-free.
Proof. The notion of efficiency implies that there is no better overall outcome for the set of agents involved [3]. Taking this statement into account, Pareto-optimality follows immediately from Theorem 1. Envy-freeness is easy to see when considering that each good g_i in G is allocated to one of the agents that bid the highest value u*(g_i), and that this very agent has to make side-payments (each valued (1/n) · u*(g_i)) to the n − 1 other agents.5 Since each of the n − 1 other agents values good g_i at most u*(g_i), this guarantees that none of those n − 1 agents envies this agent for having received good g_i. Concerning goods that are sold by the CA, each agent receives the same proportional monetary share of the profit made, and hence, in this case too, no envy is created.
5 For the sake of convenience, all side-payments of agent a_j sum up to one final side-payment c_j.
Budget balance of the MSAP (in the sense of Definition 4) follows immediately from steps C3, C4, and C5 as given in Figure 1. Corollary 1. The MSAP is budget-balanced. Our last result shows that the market-affected sealed-bid auction protocol presented in Figure 1 is weakly strategy-proof (in the sense of Definition 5). Theorem 3. The MSAP is weakly strategy-proof. Proof. With reference to any good gi in G, a “cheater” (i.e., an agent in A not telling the truth) could cheat by either stating a higher or a lower than his or her true utility value. In the former case, if the cheater wins the bid for gi , he or she, just as all other agents, will obtain a higher individual welfare, but at the expense of the cheater as he or she has to compensate for the difference—a difference which in fact is only fictitious. That is, the cheater has to pay compensations out of a fund that does not exist and which is based on untruthful values only. Consequently, all agents but the cheater would benefit from this type of cheating. A cheating attempt would be reasonable only if a utility value lower than the true one is stated. Referring to this, if an agent’s true utility value is higher than the market price M(gi ) (i.e., if this agent wants good gi for him- or herself due to its high significance), he or she is motivated not to cheat by stating a lower utility value, since in this case he or she may end up not getting gi at all. On the other hand, if an agent that is not interested in keeping good gi would cheat by stating a lower than the true utility value, this cheating attempt would succeed if, firstly, the cheater has the highest bid, and secondly, the cheater’s bid is at least as high as profit PCA (gi ). In contrast, if the second condition is not fulfilled, good gi would be sold by the CA causing a decreased individual welfare for all agents, including the cheater. This is the reason why for all gi in G selling cost SCA (gi ) and profit PCA (gi ) are not disclosed, aiming to motivate all agents to state true utility values as otherwise they would risk their share. To sum up, trying to cheat by stating a higher than the true utility value is of no advantage to the cheater, and trying to cheat by stating a lower than the true utility value always bears the risk of ending up with even less. Thus, the MSAP is weakly strategy-proof.
6 Conclusions
We have proposed a new market-affected sealed-bid auction protocol that can be applied to any number of agents and any number of goods. We have shown this budget-balanced and weakly strategy-proof protocol to possess nice properties, such as always providing an envy-free and Pareto-optimal allocation with maximal utilitarian and egalitarian social welfare. In addition, this protocol guarantees that each agent receives a bundle that is worth at least as much as a proportional share of all goods, according to his or her measure. These advantages notwithstanding, we mention the following limitations of this protocol. Depending on the scenario, agents who keep the goods they receive may need to have some cash at hand or to hold sufficient liquid assets in order to make side-payments. In this regard, poorer agents not holding sufficient liquid assets could, to play it safe, accept each good for selling only, and thereby avoid any trouble.
However, the total of the side-payments to be made by an agent never exceeds the value of all goods assigned to this agent, and thus each agent is guaranteed to gain in individual welfare by the application of the MSAP. Moreover, having huge amounts of the same good to be sold may have an impact on the market price of this good. Note also that weak strategy-proofness (in the sense of Definition 5) is a considerably weaker concept than the common notion of strategy-proofness. In terms of future work, one direction to go could be the involvement of each agent's wealth by using weighted utility values for all significant goods. In this way, poorer agents could be motivated to make bids that reflect each good's true significance, rather than, fearing side-payments, accepting each good for selling only.
References
1. Aragones, E.: A solution to the envy-free selection problem in economies with indivisible goods. Technical Report 984, Northwestern University, Center for Mathematical Studies in Economics and Management Science (April 1992)
2. Beviá, C.: Fair allocation in a general model with indivisible goods. Review of Economic Design 3(3), 195–213 (1998)
3. Brams, S., Taylor, A.: Fair Division: From Cake-Cutting to Dispute Resolution. Cambridge University Press, Cambridge (1996)
4. Cassiman, B., Sieber, S.: The impact of the internet on market structure. Technical Report D/467, IESE Business School (July 2002)
5. Chevaleyre, Y., Dunne, P., Endriss, U., Lang, J., Lemaître, M., Maudet, N., Padget, J., Phelps, S., Rodríguez-Aguilar, J., Sousa, P.: Issues in multiagent resource allocation. Informatica 30, 3–31 (2006)
6. Conitzer, V., Sandholm, T., Santi, P.: Combinatorial auctions with k-wise dependent valuations. In: Proceedings of the 20th National Conference on Artificial Intelligence, pp. 248–254. AAAI Press, Menlo Park (2005)
7. Dunne, P., Wooldridge, M., Laurence, M.: The complexity of contract negotiation. Artificial Intelligence 164(1–2), 23–46 (2005)
8. Groves, T.: Incentives in teams. Econometrica 41(4), 617–631 (1973)
9. Ohseto, S.: Implementing egalitarian-equivalent allocation of indivisible goods on restricted domains. Journal of Economic Theory 23(3), 659–670 (2004)
10. Pápai, S.: Groves sealed bid auctions of heterogeneous objects with fair prices. Social Choice and Welfare 20(3), 371–385 (2003)
11. Sandholm, T., Suri, S., Gilpin, A., Levine, D.: Winner determination in combinatorial auction generalizations. In: Proceedings of the 1st International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 69–76. ACM Press, New York (2002)
12. Shoham, Y., Leyton-Brown, K.: Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, New York (2009)
13. Steinhaus, H.: The problem of fair division. Econometrica 16, 101–104 (1948)
14. Svensson, L.: Large indivisibles: An analysis with respect to price equilibrium and fairness. Econometrica 51(4), 939–954 (1983)
15. Tadenuma, K., Thomson, W.: No-envy and consistency in economies with indivisible goods. Econometrica 59(6), 1755–1767 (1991)
16. Tadenuma, K., Thomson, W.: Games of fair division. Games and Economic Behavior 9(2), 191–204 (1995)
17. Willson, S.: Money-egalitarian-equivalent and gain-maximin allocations of indivisible items with monetary compensation. Social Choice and Welfare 20(2), 247–259 (2003)
A Sparse Spatial Linear Regression Model for fMRI Data Analysis Vangelis P. Oikonomou and Konstantinos Blekas Department of Computer Science, University of Ioannina P.O. Box 1186, Ioannina 45110 - GREECE {voikonom,kblekas}@cs.uoi.gr
Abstract. In this study we present an advanced Bayesian framework for the analysis of functional Magnetic Resonance Imaging (fMRI) data that simultaneously employs both spatial and sparse properties. The basic building block of our method is the general linear model (GLM), which constitutes a well-known probabilistic approach for regression. By treating the regression coefficients as random variables, we can apply an appropriate Gibbs distribution function in order to capture spatial constraints of fMRI time series. At the same time, sparse properties are also embedded through an RVM-based sparse prior over the coefficients. The proposed scheme is described as a maximum a posteriori (MAP) approach, where the well-known Expectation Maximization (EM) algorithm is applied, offering closed form update equations. We have demonstrated that our method produces improved performance and enhanced functional activation detection in both simulated data and real applications.
1 Introduction
Functional magnetic resonance imaging (fMRI) measures the tiny metabolic changes that take place in an active part of the brain. It is becoming a common diagnostic method for the behavior of a normal, diseased or injured brain, as well as for assessing the potential risks of surgery or other invasive treatments of the brain. Functional MRI is based on the increase in blood flow to the local vasculature that accompanies neural activity of the brain [1]. When neurons are activated, the resulting increased need for oxygen is overcompensated by a large increase in perfusion. As a result, the venous oxyhemoglobin concentration increases and the deoxyhemoglobin concentration decreases. The latter has paramagnetic properties and the intensity of the fMRI images increases in the activated areas. The signal in the activated voxels increases and decreases according to the paradigm. fMRI detects changes of deoxyhemoglobin levels and generates blood oxygen level dependent (BOLD) signals related to the activation of the neurons [1]. The fMRI data analysis consists of two basic stages: preprocessing and statistical analysis. The first stage is usually carried out in four steps: slice timing, motion correction, spatial normalization and spatial smoothing [1]. Statistical analysis can be done using the parametric general linear regression model
(GLM) [2] under a Maximum Likelihood (ML) framework for parameter estimation. Subsequently, the t or F statistic is used in order to form a so-called statistical parametric map (SPM) that maps the desired active areas. A significant drawback of the basic GLM approach is that spatial and temporal properties of fMRI data are not taken into account. However, it is well known that the BOLD signal is constrained spatially due to its physiological nature and preprocessing steps such as realignment and spatial normalization [1]. Within the literature there are several methods that incorporate spatial and temporal correlations into the estimation procedure. A common approach is to apply Gaussian filter smoothing or adaptive thresholding techniques that adjust the statistical significance of active regions according to their size. Alternatively, spatial characteristics of fMRI can be naturally described in a Bayesian framework through the use of Markov Random Field (MRF) priors [3, 4] and autoregressive (AR) spatio-temporal models [5, 6]. The estimation process of most of these works is achieved by either Markov Chain Monte Carlo (MCMC) or a Variational Bayes framework. An alternative methodology has been presented in [7], where the image of the regression coefficients is first spatially decomposed using wavelets, and then a sparse prior is applied over the wavelet coefficients. Apart from spatial modeling, another desired property of the analysis is to embody a mechanism that automatically selects the model order. This is a very important issue in many model-based applications, including regression. If the order of the regression model is too large it may overfit the observations and not generalize well. On the other hand, if it is too small it might miss trends in the data. Sparse Bayesian regression offers a solution to the above problem [8, 9] by introducing sparse priors on the model parameters. In this paper we propose a model-based framework that simultaneously employs both spatial and sparse properties in a more systematic way. The basic regression model (GLM) can be spatially constrained by considering that the regression coefficients follow a Gibbs distribution [10]. By using a modification of the clique potential function, we can then allow the incorporation of sparse properties based on the notion of the Relevance Vector Machine (RVM) [8]. A maximum a posteriori expectation maximization algorithm (MAP-EM) [11] is applied next to train this model. This is very efficient since it leads to update rules of the model parameters in closed form during the M-step and improves data fitting. The performance of the proposed methodology is evaluated using a variety of simulated and real datasets. Comparison has been made with the typical maximum likelihood (ML) approach and the spatially-variant-only regression model. As the experimental study has shown, the proposed method is more flexible and robust, providing quantitatively and qualitatively superior results. In section 2 we briefly describe the general linear model and its spatially variant version obtained by setting a Gibbs prior. The proposed simultaneous spatial and sparse regression model and the MAP-based learning procedure are then presented in section 3. To assess the performance of the proposed methodology we present in section 4 numerical experiments with artificial and real fMRI datasets. Finally, in section 5 we give conclusions and suggestions for future research.
2 A Spatially Variant Generalized Linear Regression Model
Suppose we are given a set of N fMRI time-series Y = {y_1, . . . , y_N}, where each observation y_n is a sequence of M values over time, i.e. y_n = {y_nm}_{m=1}^{M}. The Generalized Linear Model (GLM) assumes that the fMRI time series y_n are described in the following manner:

y_n = Φ w_n + e_n ,    (1)
where Φ is the design matrix of size M × D and w_n is the vector of the D regression coefficients, which are unknown and must be estimated. Moreover, the last term e_n in Eq. 1 is an M-dimensional vector determining the error term, which is assumed to be Gaussian with zero mean, independent over time, with precision (inverse variance) λ_n, i.e. e_n ∼ N(0, λ_n^{-1} I). The design matrix Φ contains some explanatory variables that describe various experimental factors. In block design related experiments it usually has one regressor for the BOLD response plus the mean constant, i.e. it is a two-column matrix. However, we can expand it with regressors related to other components of the fMRI time series, such as drift and movement effects [6]. In fMRI data analysis the goal is to find the involvement of the experimental factors in the generation process of the time series, which is achieved through the estimation of the coefficients w_n. Since Φ w_n is deterministic, we can model the probability density of the sequence y_n with the normal distribution p(y_n | w_n, λ_n) = N(Φ w_n, λ_n^{-1} I). Thus, the problem becomes a maximum likelihood (ML) estimation problem for the regression parameters Θ = {w_n, λ_n}_{n=1}^{N}. The maximization of the log-likelihood function:

L_ML(Θ) = Σ_{n=1}^{N} log p(y_n | w_n, λ_n) = Σ_{n=1}^{N} [ (M/2) log λ_n − (λ_n/2) ||y_n − Φ w_n||² ] ,    (2)

leads to the following rules:

ŵ_n = (Φ^T Φ)^{-1} Φ^T y_n ,    λ̂_n = M / ||y_n − Φ ŵ_n||² .    (3)
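As an illustration, a minimal NumPy sketch of the ML estimates of Eq. (3) could look as follows; the function and variable names are ours and not part of the original implementation.

```python
# Minimal GLM/ML sketch: Phi and the fMRI time series are assumed given.
import numpy as np

def glm_ml(Phi, Y):
    """Phi: (M, D) design matrix; Y: (N, M) array of fMRI time series.
    Returns per-voxel coefficient estimates W (N, D) and noise precisions (N,)."""
    M = Phi.shape[0]
    pinv = np.linalg.pinv(Phi)                    # (Phi^T Phi)^{-1} Phi^T
    W = Y @ pinv.T                                # w_n = (Phi^T Phi)^{-1} Phi^T y_n
    residuals = Y - W @ Phi.T
    lam = M / np.sum(residuals ** 2, axis=1)      # lambda_n = M / ||y_n - Phi w_n||^2
    return W, lam
```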
After the estimation procedure, we calculate the t-statistic for each voxel in order to draw the statistical map and identify the activation regions. The fMRI data are biologically generated by structures that involve spatial properties, since adjacent voxels tend to have similar activation levels [12]. Moreover, the produced ML-based activation maps contain many small activation islands, and so there is a need for spatial regularization. The Bayesian formulation offers a natural platform for automatically incorporating these ideas. We assume that the vector of coefficients w_n follows a Gibbs density function of the following form:

p(w_n | β_n) ∝ β_n^{|N_n|} exp( − (β_n/2) Σ_{k∈N_n} ||w_n − w_k||² ) ,    (4)
where β_n is the regularization parameter. The summation term denotes the clique potential function within the neighborhood N_n of the n-th voxel, i.e. the |N_n| horizontally, vertically or diagonally adjacent voxels, while the first term β_n^{|N_n|} acts as a normalizing factor. In addition, a Gamma prior is imposed on the regularization parameter β_n, as well as on the noise precision parameter λ_n, with Gamma parameters {c_β, b_β} and {c_λ, b_λ}, respectively. The estimation problem can now be formulated as a maximum a posteriori (MAP) approach, in the sense of maximizing the posterior of Θ = {w_n, β_n, λ_n}_{n=1}^{N}:

L_MAP(Θ) = Σ_{n=1}^{N} [ log p(y_n | w_n, λ_n) + log p(w_n | β_n) + log p(β_n) + log p(λ_n) ]    (5)
It can easily be shown that the maximization leads to the following update rules:

ŵ_n = (λ_n Φ^T Φ + B_n)^{-1} (λ_n Φ^T y_n + BW_n) ,    (6)

β̂_n = (|N_n| + c_β) / ( (1/2) Σ_{k∈N_n} ||ŵ_n − ŵ_k||² + b_β ) ,    (7)

λ̂_n = (M + c_λ) / ( (1/2) ||y_n − Φ ŵ_n||² + b_λ ) ,    (8)

where B_n = Σ_{k∈N_n} (β_n + β_k) I and BW_n = Σ_{k∈N_n} (β_n + β_k) w_k determine the contribution of the neighbors inside the clique. Equations 6-8 are applied iteratively until the convergence of the MAP log-likelihood function. The above scheme can also be described within an Expectation-Maximization (EM) framework [11], where the E-step computes the expectation of the hidden variables (w_n) and the M-step then uses them to update the model parameters. This approach will be referred to next as SVGLM.
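The following rough NumPy sketch illustrates how the iterative updates of Eqs. (6)-(8) could be organized over a voxel grid; the neighbour handling, initialization and hyperparameter defaults are illustrative assumptions rather than the authors' implementation.

```python
# SVGLM update-loop sketch (Eqs. 6-8); neighbours[n] lists the voxels adjacent
# to voxel n. Hyperparameter defaults are placeholders.
import numpy as np

def svglm_updates(Phi, Y, neighbours, iters=20,
                  c_beta=1.0, b_beta=1.0, c_lam=1e-8, b_lam=1e-8):
    N, M = Y.shape
    W = Y @ np.linalg.pinv(Phi).T                          # ML initialisation
    beta = np.ones(N)
    lam = M / np.maximum(np.sum((Y - W @ Phi.T) ** 2, axis=1), 1e-12)
    PtP = Phi.T @ Phi
    for _ in range(iters):
        for n in range(N):                                 # Eq. (6): E-step for w_n
            nb = neighbours[n]
            Bn = np.sum(beta[n] + beta[nb]) * np.eye(Phi.shape[1])
            BWn = ((beta[n] + beta[nb])[:, None] * W[nb]).sum(axis=0)
            W[n] = np.linalg.solve(lam[n] * PtP + Bn,
                                   lam[n] * Phi.T @ Y[n] + BWn)
        for n in range(N):                                 # Eqs. (7)-(8): M-step
            nb = neighbours[n]
            diff = np.sum((W[n] - W[nb]) ** 2)
            beta[n] = (len(nb) + c_beta) / (0.5 * diff + b_beta)
            res = np.sum((Y[n] - Phi @ W[n]) ** 2)
            lam[n] = (M + c_lam) / (0.5 * res + b_lam)
    return W, beta, lam
```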
3 Simultaneous Sparse and Spatial Regression
A desired property of the linear regression model is to offer an automatic mechanism that will zero out the coefficients that are not significant and maintain only large coefficients that are considered significant based on the model. Moreover, an important issue when using the regression model is how to define its order D. The problem can be tackled using the Bayesian regularization method that has been successfully employed in the Relevance Vector Machine (RVM) model [8]. In order to capture both spatial and sparse properties over the regression coefficients, the Gibbs distribution function needs to be reformulated. This can be accomplished by using the following Gibbs density function:

p(w_n | β_n, z_n, α_n) ∝ β_n^{|N_n|} Π_{k∈N_n} z_nk^{1/2} Π_{d=1}^{D} α_nd^{1/2} exp( − (1/2) [ V_{N_n}^{(1)}(w_n) + V_{N_n}^{(2)}(w_n) ] ) .    (9)
The first term in the exponential part of this function is the sparse term used for describing local relationships of the n-th voxel coefficients. This is given by:

V_{N_n}^{(1)}(w_n) = w_n^T A_n w_n ,    (10)
where A_n is a diagonal matrix containing the D elements of the hyperparameter vector α_n = (α_n1, . . . , α_nD)^T. By imposing a Gamma prior over the hyperparameters, a two-stage hierarchical prior is achieved, which is actually a Student-t distribution with heavy tails [8]. This scheme enforces most α_nd to be large, so that the corresponding coefficients w_nd are set to zero and finally eliminated. The second term of the exponential part (Eq. 9) captures the spatial property and is responsible for the clique potential of the n-th voxel:

V_{N_n}^{(2)}(w_n) = β_n Σ_{k∈N_n} z_nk ||w_n − w_k||² .    (11)
In comparison with the potential function of the SVGLM method (Eq. 4), here each neighbor contributes with a different weight, denoted by the parameters z_nk, to the computation of the clique energy value. The introduction of these weights can increase the flexibility of spatial modeling. As shown experimentally, this proves advantageous in cases around the borders of activation regions (edges). Finally, the first part of Eq. 9 acts as a normalization factor. We also assume that the regularization parameter β_n, the noise precision λ_n and the weights z_nk follow Gamma distributions. Training of the proposed model is therefore converted into a MAP-estimation problem for the set of model parameters Θ = {θ_n}_{n=1}^{N} = {w_n, β_n, λ_n, z_n, α_n}_{n=1}^{N}:

L_MAP(Θ) = Σ_{n=1}^{N} [ log p(y_n | θ_n) + log{ p(w_n | β_n, z_n, α_n) p(β_n) p(λ_n) p(z_n) p(α_n) } ] .    (12)

By setting the partial derivatives equal to zero, the following closed form update rule for the regression coefficients can be obtained:

ŵ_n = (λ_n Φ^T Φ + BZ_n + A_n)^{-1} (λ_n Φ^T y_n + BZW_n) ,    (13)

where the matrices BZ_n and BZW_n are: BZ_n = β_n Σ_{k∈N_n} (z_nk + z_kn) I and BZW_n = β_n Σ_{k∈N_n} (z_nk + z_kn) w_k. For the other model parameters we have:

β̂_n = (|N_n| + c_β) / ( (1/2) Σ_{k∈N_n} z_nk ||w_n − w_k||² + b_β ) ,    (14)

ẑ_nk = (1 + c_z) / ( (1/2) β̂_n ||ŵ_n − ŵ_k||² + b_z ) ,    (15)

α̂_nd = (1 + 2c_a) / ( ŵ_nd² + 2b_a ) ,    (16)
while the noise precision λ_n has the same form as previously defined for SVGLM (Eq. 8). The whole procedure can be integrated in an EM framework, where the
expectation of the regression coefficients is computed in the E-step (Eq. 13), and the maximization of the complete-data log-likelihood is performed during the M-step (Eqs. 14-16), giving update equations for the model parameters. The above scheme is iteratively applied until the convergence of the MAP function. Notice that in the above equations we take into consideration that the weights of the n-th voxel appear twice in the summation terms: once with n as the central voxel, and |N_n| times with n as a neighbor of other voxels. We call this method SSGLM.
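A per-voxel sketch of the additional SSGLM updates (Eqs. 13-16), extending the SVGLM loop shown earlier, is given below; again, the names, array layouts and hyperparameter defaults are illustrative assumptions rather than the authors' code.

```python
# SSGLM per-voxel update sketch (Eqs. 13-16): adds per-neighbour weights z
# and sparsity hyperparameters alpha to the spatially variant updates.
import numpy as np

def ssglm_voxel_update(Phi, y_n, lam_n, beta_n, W, n, neighbours, z, alpha_n,
                       c_z=1.0, b_z=1.0, c_a=1e-8, b_a=1e-8,
                       c_beta=1.0, b_beta=1.0):
    """W: (N, D) coefficients; z: (N, N) weight matrix; alpha_n: (D,) vector."""
    nb = neighbours[n]
    w_nb = z[n, nb] + z[nb, n]                              # z_nk + z_kn
    BZ = beta_n * np.sum(w_nb) * np.eye(Phi.shape[1])
    BZW = beta_n * (w_nb[:, None] * W[nb]).sum(axis=0)
    A = np.diag(alpha_n)
    w_new = np.linalg.solve(lam_n * Phi.T @ Phi + BZ + A,
                            lam_n * Phi.T @ y_n + BZW)                       # Eq. (13)
    sq = np.sum((w_new - W[nb]) ** 2, axis=1)               # ||w_n - w_k||^2 per neighbour
    beta_new = (len(nb) + c_beta) / (0.5 * np.sum(z[n, nb] * sq) + b_beta)   # Eq. (14)
    z[n, nb] = (1 + c_z) / (0.5 * beta_new * sq + b_z)                       # Eq. (15)
    alpha_new = (1 + 2 * c_a) / (w_new ** 2 + 2 * b_a)                       # Eq. (16)
    return w_new, beta_new, alpha_new
```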
4 Experimental Results
We have tested the proposed method, SSGLM, using various simulated and real datasets. Comparison has been made with the simple ML method and the SVGLM presented in Section 2. The SVGLM and SSGLM have been initialized in the same manner. First, the ML estimates of the regression coefficients w_n are obtained and are then used to initialize the remaining model parameters β_n, λ_n, z_nk and α_nd, according to Eqs. (14)-(16), respectively. During the experiments the parameters of the Gamma prior distributions were set to c_β = b_β = c_z = b_z = 1, c_λ = b_λ = 10^{-8} and b_α = c_α = 10^{-8} (making them non-informative, as suggested by the RVM methodology [8]).
4.1 Experiments with Simulated Data
The simulated datasets used in our experiments were created using the following generation mechanism. We applied a design matrix Φ of size M × 2 with two pre-specified regressors, the first capturing the BOLD signal (Fig. 1(a)) and the second being a constant column of ones. Then, we constructed an image with the activation regions that corresponds to the value of the first coefficient (w_n1), while the second coefficient w_n2 had a constant value equal to 100. In our study we have used two such images of size 80 × 80 with different shapes of activation areas, rectangular (Fig. 1(b)) and circular (Fig. 1(c)), respectively. The time series data (y_n) were finally produced by using the generative equation of the GLM (Eq. 1) with additive white Gaussian noise at various signal-to-noise ratio (SNR) levels, where we performed 50 runs and computed their mean performance.
Fig. 1. Simulated data generative features: (a) Bold signal, (b) rectangular and (c) circular shape image of true activated areas
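For concreteness, the generation mechanism described above can be sketched as follows; the BOLD regressor, the activation image and the SNR convention (signal power over noise power in dB) are assumptions, since the paper does not spell them out.

```python
# Simulated-data generation sketch; `bold` and `activation_img` are placeholders.
import numpy as np

def make_simulated_data(bold, activation_img, snr_db, rng=None):
    """bold: (M,) BOLD regressor; activation_img: (80, 80) map of w_n1 values."""
    rng = rng or np.random.default_rng(0)
    M = bold.shape[0]
    Phi = np.column_stack([bold, np.ones(M)])        # M x 2 design matrix
    w1 = activation_img.ravel()                      # first coefficient per voxel
    w2 = np.full(w1.shape, 100.0)                    # constant second coefficient
    signal = np.outer(w1, bold) + w2[:, None]        # y_n = Phi w_n (noise-free)
    signal_power = np.mean(signal ** 2, axis=1, keepdims=True)
    noise_power = signal_power / (10 ** (snr_db / 10.0))   # assumed SNR convention
    noise = rng.normal(size=signal.shape) * np.sqrt(noise_power)
    return Phi, signal + noise
```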
Evaluation was done using two criteria: 1) the Area Under Curve (AUC) of the Receiver Operating Characteristic (ROC) curve based on t-statistic calculations, and 2) the normalized mean square error (NMSE) between the estimated and the true coefficients responsible for the BOLD signal. We present in Table 1 the comparative performance results in terms of the above two criteria for several SNR values in the case of rectangular and circular activation regions, respectively. As is evident, the proposed spatial sparse model (SSGLM) improves functional activation detection quality, especially for the lower examined SNR values. In all cases both MAP-based approaches perform significantly better than the simple ML method. Figure 2 presents the mapping results of a typical run in the case of SNR = -20 dB. As is evident, the proposed SSGLM approach manages to construct much smoother maps of brain activity than the spatial SVGLM model. What is interesting to observe is that the SVGLM method has the tendency to overestimate the activation areas and
Table 1. Comparative results for simulated data in various noisy environments
                    circular areas                             rectangular areas
              AUC                 NMSE                   AUC                 NMSE
 SNR   SSGLM SVGLM   ML    SSGLM SVGLM    ML      SSGLM SVGLM   ML    SSGLM SVGLM    ML
   0   0.999 0.999 0.999   0.118 0.177 0.294      0.998 0.995 0.980   0.129 0.170 0.255
  -5   0.998 0.999 0.929   0.551 0.464 0.933      0.998 0.992 0.819   0.415 0.318 0.812
 -10   0.998 0.998 0.795   0.704 0.633 1.642      0.995 0.991 0.712   0.541 0.478 1.439
 -15   0.986 0.988 0.674   0.807 0.802 2.948      0.978 0.972 0.624   0.641 0.665 2.554
 -20   0.920 0.914 0.600   0.993 1.084 5.214      0.898 0.883 0.570   0.855 0.971 4.579
 -30   0.763 0.724 0.558   1.748 1.854 9.257      0.747 0.716 0.536   1.437 1.641 8.074
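The two evaluation criteria reported in Table 1 can be computed along the following lines; the exact NMSE normalization used by the authors is not stated, so the version below is an assumption.

```python
# Evaluation-criteria sketch: NMSE of the estimated BOLD coefficients and
# AUC of the ROC built from per-voxel t-values against the true mask.
import numpy as np
from sklearn.metrics import roc_auc_score

def nmse(w_true, w_est):
    # assumed normalization: squared error relative to the true coefficient energy
    return np.sum((w_est - w_true) ** 2) / np.sum(w_true ** 2)

def auc_from_tvalues(t_values, true_mask):
    return roc_auc_score(true_mask.ravel().astype(int), t_values.ravel())
```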
[Figure 2: statistical maps of a typical run at SNR = -20 dB. Panel row (a), rectangular activated areas: SSGLM AUC=0.9969, NMSE=0.708; SVGLM AUC=0.9976, NMSE=0.730; ML AUC=0.7025, NMSE=3.052. Panel row (b), circular activated areas: SSGLM AUC=0.9604, NMSE=0.579; SVGLM AUC=0.9526, NMSE=0.599; ML AUC=0.6317, NMSE=2.653.]
Fig. 2. An example of the statistical maps produced by the three comparative methods for two kinds of activity: (a) rectangular and (b) circular. The SNR value is -20 dB.
[Figure 3: columns show, from left to right, the SSGLM, SVGLM and ML maps.]
Fig. 3. Maps of the estimated BOLD signal (w_n1) obtained by the three methods
discover larger regions than their true size. The proposed SSGLM exhibits very clean edges between activated and non-activated areas, and thus a visual improvement. Finally, the ML approach completely fails to discover any activation pattern in this experiment.
4.2 Experiments with Real fMRI Data
The proposed approach was also evaluated in real applications. Experiments were made using a block-design real fMRI dataset that was downloaded from the SPM web page1 and was designed for an auditory processing task on a healthy volunteer. In our study, we followed the standard preprocessing steps of the statistical parametric mapping package (SPM) manual, which are realignment, segmentation, and spatial normalization, without performing the spatial smoothing step. We selected slice 29 of this dataset for our experiments. Figure 3 presents the maps of the BOLD signal (regression coefficients w_n1) as estimated by the three comparative approaches SSGLM, SVGLM and ML. As is evident, the proposed SSGLM approach achieves significantly smoother results, where brain activity is found on the auditory cortex, as was expected. In addition, the produced activation areas are less noisy and very clean in comparison with those produced by the SVGLM, which overestimates the brain activity, thus making the decision harder. On the other hand, the resulting map of the ML method is confused, without showing any significant distinction between the activated and non-activated areas. Moreover, we find it useful to visually inspect the resulting activation maps obtained by the t-test. In Figure 4 the SPMs of each method are shown, calculated without setting a threshold (Figure 4a) or by using a threshold (t0 = 1.6) on the t-value (Figure 4b). Notice that the activation maps of the SSGLM approach are similar in both cases, which makes our approach less sensitive to the threshold value. The latter becomes more apparent by plotting in Figure 5(a) the estimated size (number of voxels) of the activation areas of each method in terms of the threshold value t0. This behavior can prove very useful, since there is no need to resort to multiple comparisons between t-tests. This can also be viewed in Figure 5(b), where we plot the calculated t-values of the SSGLM and the SVGLM methods.
1 http://www.fil.ion.ucl.ac.uk/spm/
[Figure 4: columns show the SSGLM, SVGLM and ML maps; row (a) is unthresholded, row (b) is thresholded.]
Fig. 4. Statistical parametric maps from the t-statistics (a) without and (b) with a threshold value t0 = 1.6
[Figure 5: (a) number of activated voxels versus threshold value tn for SSGLM, SVGLM and ML; (b) t-values per voxel for SSGLM and SVGLM.]
Fig. 5. (a) Plots of the estimated number of activated voxels in terms of threshold value used for producing the SPMs. (b) Plots of the t-values as computed by comparative methods SSGLM (thick line) and SVGLM (thin line).
The distinction between the activated and non-activated areas is much more apparent in the case of the SSGLM plot.
5 Conclusions
In this work we present an advanced regression model for fMRI time series analysis by incorporating both spatial correlations and sparse capabilities. This is done by using an appropriate prior over the regression coefficients based on the MRF and the RVM schemes. Training is achieved through a maximum a posteriori (MAP) framework that allows the EM algorithm to be effectively used for
estimating the model parameters. This has the advantage of establishing update rules in closed form during the M-step, and thus data fitting is computationally efficient. Experiments on artificial and real datasets have demonstrated the ability of the proposed approach to improve detection performance by providing cleaner and more accurate estimates. We are planning to make experiments with an extended kernel design matrix and also to improve its specification by an adaptation mechanism, as well as to examine the appropriateness of other types of sparse priors [9].
References
1. Frackowiak, R.S.J., Ashburner, J.T., Penny, W.D., Zeki, S., Friston, K.J., Frith, C.D., Dolan, R.J., Price, C.J.: Human Brain Function, 2nd edn. Elsevier Science, USA (2004)
2. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2007)
3. Descombes, X., Kruggel, F., von Cramon, D.Y.: fMRI signal restoration using a spatio-temporal Markov Random Field preserving transitions. NeuroImage 8, 340–349 (1998)
4. Gössl, C., Auer, D.P., Fahrmeir, L.: Bayesian spatiotemporal inference in functional magnetic resonance imaging. Biometrics 57, 554–562 (2001)
5. Woolrich, M.W., Jenkinson, M., Brady, J.M., Smith, S.M.: Fully Bayesian spatio-temporal modeling of fMRI data. IEEE Transactions on Medical Imaging 23(2), 213–231 (2004)
6. Penny, W.D., Trujillo-Barreto, N.J., Friston, K.J.: Bayesian fMRI time series analysis with spatial priors. NeuroImage 24, 350–362 (2005)
7. Flandin, G., Penny, W.: Bayesian fMRI data analysis with sparse spatial basis function priors. NeuroImage 34, 1108–1125 (2007)
8. Tipping, M.E.: Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research 1, 211–244 (2001)
9. Seeger, M.: Bayesian Inference and Optimal Design for the Sparse Linear Model. Journal of Machine Learning Research 9, 759–813 (2008)
10. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence 6, 721–741 (1984)
11. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38 (1977)
12. Harrison, L.M., Penny, W., Daunizeau, J., Friston, K.J.: Diffusion-based spatial priors for functional magnetic resonance images. NeuroImage 41(2), 408–423 (2008)
A Reasoning Framework for Ambient Intelligence Theodore Patkos, Ioannis Chrysakis, Antonis Bikakis, Dimitris Plexousakis, and Grigoris Antoniou Institute of Computer Science, FO.R.T.H. {patkos,hrysakis,bikakis,dp,antoniou}@ics.forth.gr
Abstract. Ambient Intelligence is an emerging discipline that requires the integration of expertise from a multitude of scientific fields. The role of Artificial Intelligence is crucial not only for bringing intelligence to everyday environments, but also for providing the means for the different disciplines to collaborate. In this paper we describe the design of a reasoning framework, applied to an operational Ambient Intelligence infrastructure, that combines rule-based reasoning with reasoning about actions and causality on top of ontology-based context models. The emphasis is on identifying the limitations of the rule-based approach and the way action theories can be employed to fill the gaps.
1 Introduction
The Ambient Intelligence (AmI) paradigm has generated an enabling multidisciplinary research field that envisages to bring intelligence to everyday environments and facilitate human interaction with devices and the surroundings. Artificial Intelligence has a decisive role to play in the realization of this vision, promising commonsense reasoning and better decision making in dynamic and highly complex conditions, as advocated by recent studies [1]. Within AmI environments human users do not passively experience the functionalities of smart spaces; instead, they participate actively in them by performing actions that change their state in different ways. At the same time, the smart space itself and its devices are expected to perform actions and generate plans, either in response to changes in the context or to predict user desires and adapt to user needs. In this paper we describe the design of a reasoning framework intended for use in an AmI infrastructure that is being implemented in our institute. The framework integrates Semantic Web technologies for representing contextual knowledge with rule-based and causality-based reasoning methodologies for supporting a multitude of general-purpose and domain-specific reasoning tasks imposed by the AmI system. Given that during the first phases of this project we have fully implemented the rule-based features of the reasoner and applied them in practice,
This work has been supported by the FORTH-ICS internal RTD Programme “Ambient Intelligence and Smart Environments”.
the present study concentrates primarily on the identified limitations concerning certain challenging issues of this new field, and on the way action theories can be employed to offer efficient solutions. Action theories, a fundamental field of research within KR&R, are formal tools for performing commonsense reasoning in dynamic domains. In this paper we present how they can contribute to the AmI vision and what types of problems they can resolve. We report on our experiences in developing the proposed solutions as distinct functionalities, but also describe how we plan to integrate them in the overall framework in order to provide a powerful hybrid reasoning tool for application in real-world situations. Our objective is to illustrate the impact of combining logic-based AI methods for addressing a broad range of practical issues. The paper is organized as follows. We first describe the overall architecture of the framework and continue with the tasks assigned to the rule-based reasoning component. In section 4 we elaborate on the contribution of causality-based approaches to AmI, and we conclude with a discussion of related work.
2 Event-Based Architecture
The design goals of the reasoning framework have been the efficient representation, monitoring and dissemination of any low- or high-level contextual information in our AmI infrastructure, as well as the support for a number of general-purpose and domain-specific inferencing tasks. For that purpose we deploy the hybrid event-based reasoning architecture shown in Fig. 1, which comprises four main components: the Event Manager that receives and processes
Fig. 1. Event management framework architecture
incoming events from the ambient infrastructure, the Reasoner that can perform both rule-based and causality-based reasoning, the Knowledge Base that stores semantic information represented using ontology-based languages, and the Communication Module that forwards Reasoner requests for action execution to the appropriate services. A middleware layer undertakes the role of connecting applications and services implemented by different research groups and with different technologies. Services denote standalone entities that implement specific functionalities about world aspects, such as voice recognition, localization, light management etc., whereas applications group together service instances to provide an AmI experience in smart rooms. The semantic distinction between events and actions is essential for coordinating the behavior of AmI systems. Events are generated by services and declare changes in the state of context aspects. In case these aspects can be modified on demand, e.g., CloseDoor(doorID), the allocated service also provides the appropriate interface for performing the change; otherwise the service plays the role of a gateway for monitoring environmental context acquired from sensors or devices. Actions, on the other hand, reflect the desire of an application for an event to occur. In a sense, events express atomic, transient facts that have occurred, while actions can either be requests for atomic events or complex combinations of event occurrences according to certain operators that form compound events, i.e., sets of event patterns. It is the responsibility of the reasoner to examine individual actions before allowing their execution and to guarantee that the state of the system remains consistent at all times.
Implementation: Our framework is part of a large-scale AmI facility that is being implemented at FORTH and has completed its first year of life. It spans a three-room setup, where a multitude of hardware and software technologies contribute services, such as camera network support for 2D person localization and 3D head pose estimation, RFID, iris and audio sensors for person identification and speech recognition, multi-protocol wireless communications etc. The middleware responsible for creating, connecting and consuming the services is CORBA-based and provides support for the C++, Java, .NET, Python, and Flash/ActionScript languages. The rule-based component of the reasoner module uses Jess1 as its reasoning engine, while the Validator component uses both Jess and the DEC Reasoner2, a SAT-based Event Calculus reasoner. The former is responsible for run-time action validation, while the latter performs more powerful reasoning tasks, such as specification analysis and planning. All available knowledge is encoded in OWL ontologies, using the Protégé platform for editing and the Protégé-OWL API for browsing and querying the corresponding models.
3 Context Information Modeling and Reasoning
The task of context management in an AmI environment requires an open framework to support seamless interoperability and mutual understanding of
1 Jess, http://www.jessrules.com/
2 DEC Reasoner, http://decreasoner.sourceforge.net/
the meaning of concepts among different devices and services. Ontology-based models are arguably the most enabling approach for modeling contextual information, satisfying the representation requirements set by many studies in terms of type and level of formality, expressiveness, flexibility and extensibility, generality, granularity and valid context constraining [2,3]. In our framework we design ontologies that capture the meaning and relations of concepts regarding low-level context acquired from sensors, high-level context inferred through reasoning, user and device profiling information, spatial features and resource characteristics (Fig. 1). An aspect of the framework that acknowledges the benefit of ontologies is the derivation of high-level context knowledge. Complex context, inferred by means of rule-based reasoning tasks on the basis of raw sensor data or other high-level knowledge, is based on ontology representations and may concern a user's emotional state, identity, intentions, location etc. For instance, the following rule specifies that a user is assumed to have left the main room and entered the adjacent warmup room only if she was standing near the open main-room door and is no longer tracked by the localizer, which is installed only in this room:

(user (id ?u) (location DOORMAIN)) ∧ (door (id DOORMAIN) (state OPEN)) ∧ (event (type USERLOST) (user ?u)) ⇒ (user (id ?u) (location WARMUP))

Under a different context the same event will trigger a different set of rules, capturing for example the case where the user stands behind an obstacle. Rule-based reasoning is a commonly adopted solution for high-level context inference in AmI [4]. Within our system it contributes to the design of enhanced applications and also provides feedback to services for sensor fusion purposes, in order to resolve conflicts of ambiguous or imprecise context and to detect erroneous context. Furthermore, rule-based reasoning is also employed to coordinate the overall system behavior and offer explanations; the operation is partitioned into distinct modes that invoke appropriate rulesets, according to relevant context and the functionality that we wish to implement. Finally, this component also provides procedures for domain-specific reasoning tasks for search and optimization problems driven by application demands, such as determining the best camera viewpoint to record a user interacting inside a smart room.
4 Causality-Based Reasoning in Ambient Intelligence
Event-based architectures offer opportunities for flexible processing of the information flow and of knowledge evolution over time. Rule-based languages provide only limited expressiveness to describe certain complex features, such as compound actions; therefore, they do not fully exploit the potential of the event-based style to solve challenging event processing tasks raised by ubiquitous computing domains. The Event-Condition-Action (ECA) paradigm that is most frequently applied can be used for reacting to detected events, viewing them as transient atomic instances and consuming them upon detection. ECA rules do not consider the duration of events, how far into the past or future their effects extend, nor do they investigate causality issues originating from the fact that other events are known
to have occurred or are planned to happen. Paschke [5] has already shown that the ECA treatment may even result in unintended semantics for some compositions of event patterns. However, real-life systems demand formal semantics for verification and traceability purposes. To this end, we apply techniques for reasoning about actions and causality. Action theories are formal tools, based on the predicate calculus, particularly designed for addressing key issues of reasoning in dynamically changing worlds by axiomatizing the specifications of their dynamics and exploiting logic-based techniques in order to draw conclusions. Different formalisms have been developed that model action preconditions and effects and solve both deduction and abduction problems about a multitude of commonsense reasoning phenomena, such as effect ramifications, non-deterministic actions, qualifications and others. For our purposes we apply the Event Calculus formalism [6,7], which establishes a linear time structure enabling temporal reasoning in order to infer the intervals in which certain world aspects hold. The notion of time, inherently modeled within the Event Calculus as opposed to other action theories, provides crucial leverage for event-based systems, e.g. to express partial ordering of events or timestamps. In the remaining subsections we describe our approach to integrating action theories with our Ambient Intelligence semantic infrastructure and present the types of reasoning problems that have been assigned to them.
4.1 Event Ontology and Complex Actions
For any large-scale event-based system it is important to identify event patterns that describe the structure of complex events built from atomic or other complex event instances. Their definition and processing must follow formal rules with well-defined operators, in order for their meaning to be understood by all system entities. In an Ambient Intelligence infrastructure in particular, this need is even more critical due to the multidisciplinary nature of the field and the demand for collaborative actions by entities with significantly different backgrounds. In order to promote a high-level description of the specifications of applications and to achieve a high degree of interoperability among services, we design an event ontology to capture the notions of atomic and compound events and define operators among event sets, such as sequence, concurrency, disjunction and conjunction (Fig. 2). An event is further characterized by its initiation and termination occurrence times (for atomic events they coincide), the effect that it causes to context resources, the physical location at which it occurred, the service that detected or triggered it etc. Our intention is to focus only on generic event attributes that satisfy the objectives of our system, based on previous studies that define common top ontologies for events (e.g. [8]), rather than reproduce a complete domain-independent top event ontology, which would be in large part out-of-focus and result in a less scalable and efficient implementation. To define formal semantics for the operators, we implement them as container elements that collect resources and translate them to Event Calculus axioms. Notice the compound event e1 in Fig. 2, for instance, that expresses the partially ordered event type [[TurnOnLight; StartLocalizer] ∧ StartMapService]
Fig. 2. Event ontology sample. Event operators, such as sequenceSetOf and concurrentSetOf, are implemented as rdf:Seq, rdf:Bag and rdf:Alt container elements.
where (;) represents the sequence operator. In order to utilize e1 for reasoning tasks we axiomatize its temporal properties in the Event Calculus:

Happens(Start(e1), t) ≡ Happens(TurnOnLight(l), t1) ∧ Happens(StartLocalizer(), t2) ∧ Happens(StartMapService(), t3) ∧ (t1 < t2) ∧ (t = min(t1, t3))

(respectively for Happens(Stop(e1), t)), as well as its causal properties:

Initiates(Start(e1), LightOn(l), t) ∧ Terminates(Stop(e1), TrainingMode(), t)

We may formalize the effects of compound events to act cumulatively or to cancel the effects of their atomic components, and we can also specify whether certain effects hold at beginning times or at ending times. We currently model the duration of compound events in terms of their Start and Stop times, but also investigate the potential of other approaches, such as the interval-based Event Calculus [5] or the three-argument Happens Event Calculus axiomatization [9]. Main advantages of the Event Calculus, in comparison to rule-based approaches, are its inherent ability to perform temporal reasoning considering both relative and absolute times of event occurrences, and that it can reevaluate different variations of event patterns as time progresses. With ECA-style reactive rules, events are consumed as they are detected and cannot contribute to the detection of other complex events afterwards. Finally, the combination of semantic event representation and causality-based event processing makes the process of describing the specifications of applications much more convenient for non-AI-expert developers in our system, without undermining the system's reasoning capabilities. As we show next, these descriptions are expressive enough to enable inferences about future world states during application execution, as well as the identification of potential system restriction violations.
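As a toy illustration of the temporal axiom above (and not of the actual DEC Reasoner encoding), the following sketch checks whether the compound event e1 has started, given timestamped occurrences of its atomic components; the event names follow the example, everything else is illustrative.

```python
# Toy detection of Happens(Start(e1), t) for
# e1 = [[TurnOnLight; StartLocalizer] AND StartMapService].
def start_time_e1(occurrences):
    """occurrences: dict mapping atomic event names to lists of timestamps,
    e.g. {"TurnOnLight": [3], "StartLocalizer": [7], "StartMapService": [5]}."""
    t_light = occurrences.get("TurnOnLight", [])
    t_loc = occurrences.get("StartLocalizer", [])
    t_map = occurrences.get("StartMapService", [])
    # sequence operator: some StartLocalizer occurrence strictly after TurnOnLight
    seq_ok = any(t2 > t1 for t1 in t_light for t2 in t_loc)
    if not (seq_ok and t_light and t_map):
        return None                      # Happens(Start(e1), t) does not hold
    # the compound event starts at the earliest of t1 and t3, as in the axiom
    return min(min(t_light), min(t_map))
```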
4.2 Design-Time Application Verification
It is of particular significance to the management of an AmI system to separate the rules that govern its behavior from the domain-specific functionalities, in order to enable efficient and dynamic adaptation to changes during development. We implement a modular approach that distinguishes the rules expressing system policies and restrictions, which guarantee a consistent and error-free overall execution at all times, from service specifications, which change frequently and are edited by a multitude of users, as well as from application specifications, which are usually under the responsibility of non-experts who only possess partial knowledge about the system restrictions. Table 1 shows samples of the type of information that these specifications contain, expressed in Event Calculus axioms. The specifications of services, for instance, retrieved from the different ontologies, describe the domains and express inheritance relations, instantiations of entities, and potentially context-dependent effect properties. Application descriptions express primarily the intended behavior of a developed application as a narrative of context-dependent action occurrences. Finally, system restrictions capture assertions about attributes of system states that must hold for every possible system execution (sometimes also called safety properties [10]). A core task of our framework is to verify that the specifications of AmI applications are in compliance with the overall system restrictions and to detect errors early in the development phase. This a priori analysis is performed at design-time and can formally be defined as the abductive reasoning process of finding a set P of permissible actions that lead a consistent system to a state where some of its constraints are violated, given a domain description D, an application description AP_i for application i and a set of system constraints C:

D ∧ AP_i ∧ P |= ∃t ¬C(t), where D ∧ AP_i ∧ P is consistent

Table 1. Defined specification axioms for application verification
In fact, if such a plan is found, it acts as a counterexample providing diagnostic information about the violated safety properties. Apparently, such inferences are computationally expensive and, most importantly, semi-decidable. Nevertheless, Russo et al. [10] proved that a reduction considering only two timepoints, current (tc) and next (tn), can transform such an abductive framework into a fully decidable and tractable one under certain conditions (no nested temporal quantifiers):

D(T) ∧ APi(T) ∧ C(tc) ∧ P |= ¬C(tn),  given a 2-timepoint structure T

This way, we do not need to fully specify the state at time tc; the generated plan P is a mixture of HoldsAt and Happens predicates that does not require a complete description of the initial system state, in contrast to similar model-checking techniques. We plan to expand the type of system specifications further, to allow for optimizing the process of application design, capturing for instance inefficient action executions that may seem harmless or unimportant to the developer.

Example. A developer uploads an application description file to the system containing, among others, the two axioms shown in Table 1. The new application must first be examined for consistency with respect to the set of restrictions already stored in the system by service engineers. The developer executes the ApplicationCheck functionality of the Validator, accessible through the middleware, which identifies a potential restriction violation whenever a user sits on a chair: the event causes the TurnOffLight action to occur, which conflicts with the Localization service being in a Running state (any substantial change in lighting destabilizes the localization process). As a result, the developer needs to review the application, for instance pausing the Localizer before turning off the lights.
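To make the two-timepoint reduction concrete, the sketch below performs a brute-force version of the design-time check: it enumerates partial states at tc and candidate actions from an application description, and reports a state/action pair whose effects violate a restriction at tn. The fluent and action names, their effect lists and the single restriction are illustrative stand-ins echoing the lighting/localization example above; the actual framework performs this check abductively over Event Calculus specifications rather than by enumeration.

```python
from itertools import product

FLUENTS = ["LightOn", "LocalizerRunning"]
EFFECTS = {  # action -> (initiated fluents, terminated fluents); illustrative only
    "TurnOffLight": (set(), {"LightOn"}),
    "PauseLocalizer": (set(), {"LocalizerRunning"}),
}

def violates(state_tc, state_tn):
    # System restriction: lighting must not change while the localizer is running.
    return state_tc["LocalizerRunning"] and state_tc["LightOn"] != state_tn["LightOn"]

def find_counterexample(actions):
    """Search partial states at tc and permissible actions for a violation at tn."""
    for values in product([True, False], repeat=len(FLUENTS)):
        state_tc = dict(zip(FLUENTS, values))
        for action in actions:
            initiated, terminated = EFFECTS[action]
            state_tn = {f: (f in initiated) or (state_tc[f] and f not in terminated)
                        for f in FLUENTS}
            if violates(state_tc, state_tn):
                return state_tc, action
    return None

print(find_counterexample(["TurnOffLight"]))
# e.g. ({'LightOn': True, 'LocalizerRunning': True}, 'TurnOffLight')
```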
4.3 Run-Time Validation
Although application analysis can be accomplished at design-time and in isolation, action validation must be performed at run-time, considering the current state of the system as well as potential conflicts with other applications that might share the same resources. For that purpose, action validation is not implemented as an abduction process as before. Instead, a projection of the current state is performed to determine potentially abnormal resulting states, which is a more efficient approach for the needs of run-time reasoning. We have identified the following situations where action theory reasoning can contribute solutions:

Resource management. An AmI reasoner needs to resolve conflicts raised by applications that request access to the same resource (e.g., speakers). We introduce axioms to capture integrity constraints, as below:

∀app1, app2, t  HoldsAt(InUseBy(Speaker01, app1), t) ∧ HoldsAt(InUseBy(Speaker01, app2), t) ⇒ (app1 = app2)

We also intend to expand the Validator's resource allocation policies with short-term planning based on known demands.

Ramifications and priorities. Apart from direct conflicts between applications, certain actions may cause indirect side-effects to the execution of others. Terminating a service may affect applications that do not use it directly but instead
invoke services that depend on it. Since the reasoner is the only module that is aware of the current state of the system as a whole, it can detect unsafe effect ramifications and take measures to prevent unintended situations from emerging, either by denying the initial actions or by reconfiguring certain system aspects. Towards this direction, actions are executed according to prioritization policies.

Uncertainty handling. Imagine a system constraint requiring the main-room door to be in a locked state iff no user is located inside. The multi-camera localization component may lose track of users under certain circumstances, e.g., if they stand behind an obstacle. In cases where the reasoner is not aware of whether the user has left the room or not, it needs to reason based on partial knowledge. As a result, the state constraint must be formulated as:

∀user, t  HoldsAt(Knows(¬UserInRoom(user)), t) ⇔ HoldsAt(Knows(DoorLocked(DOORMAIN)), t)

Situations of ambiguous knowledge are very common in AmI systems. For that purpose, we plan to integrate a recent extension of the Event Calculus axiomatization that accounts for knowledge-producing actions in partially observable domains, enabling knowledge update based on sense actions and context [11].
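As a small illustration of the resource-management case, the check below evaluates the integrity constraint on Speaker01 against the set of fluents holding at the current time; the fluent encoding and the application names are assumptions made for this sketch only.

```python
def speaker_users(holds_at):
    """Applications currently holding Speaker01, given the fluents that hold now,
    encoded as (fluent_name, (resource, application)) pairs."""
    return {args[1] for (fluent, args) in holds_at
            if fluent == "InUseBy" and args[0] == "Speaker01"}

def constraint_satisfied(holds_at):
    # Integrity constraint: at most one application may use the speaker at a time.
    return len(speaker_users(holds_at)) <= 1

state = {("InUseBy", ("Speaker01", "MusicApp")),
         ("InUseBy", ("Speaker01", "AlarmApp"))}
print(constraint_satisfied(state))  # -> False: the reasoner must resolve the conflict
```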
5 Discussion and Conclusions
Responding to the need for event processing, rule-based approaches have contributed in manifold ways to the development of event-driven IT systems in recent years, from the field of active databases to distributed event-notification systems [12]. Within the field of ubiquitous computing, this paradigm is most frequently applied to the design of inference techniques for recognizing high-level context, such as user activity in smart spaces. For instance, FOL rules for managing and deriving high-level context objects and resolving conflicts are applied in SOCAM [13], whereas [14] describes a system that executes rules for the in-home detection of daily living activities for elderly health monitoring. Our framework applies rule-based techniques in a style similar to these systems but, as previously argued, a full-scale AmI system requires much more potent reasoning, both for context modeling and for regulating the overall system operation. The need for hybrid approaches is highlighted in many recent studies [3,4]. Towards this direction, the COSAR system [15] combines ontological with statistical inferencing techniques, but concentrates on the topic of activity recognition. In [16] a combination of rule-based reasoning, Bayesian networks and ontologies is applied to context inference. Our approach, on the other hand, combines two logic-based approaches, namely rule- and causality-based reasoning, achieving a general-purpose reasoning framework for AmI, able to address a broad range of aspects that arise in a ubiquitous domain. The proposed integration of technologies is a novel and enabling direction for the implementation of the Ambient Intelligence vision. It is our intention, while developing the run-time functionalities of the framework, to contribute to action theories research as well. A Jess-based Event Calculus reasoner, offering efficient online inferencing, is already under implementation.
References

1. Ramos, C., Augusto, J.C., Shapiro, D.: Ambient Intelligence – the Next Step for Artificial Intelligence. IEEE Intelligent Systems 23(2), 15–18 (2008)
2. Strang, T., Linnhoff-Popien, C.: A Context Modeling Survey. In: 1st International Workshop on Advanced Context Modelling, Reasoning and Management (2004)
3. Bettini, C., Brdiczka, O., Henricksen, K., Indulska, J., Nicklas, D., Ranganathan, A., Riboni, D.: A Survey of Context Modelling and Reasoning Techniques. Pervasive and Mobile Computing (2009)
4. Bikakis, A., Patkos, T., Antoniou, G., Plexousakis, D.: A Survey of Semantics-based Approaches for Context Reasoning in Ambient Intelligence. In: Proceedings of the Workshop Artificial Intelligence Methods for Ambient Intelligence, pp. 15–24 (2007)
5. Paschke, A.: ECA-RuleML: An Approach combining ECA Rules with temporal interval-based KR Event/Action Logics and Transactional Update Logics. CoRR abs/cs/0610167 (2006)
6. Kowalski, R., Sergot, M.: A Logic-based Calculus of Events. Foundations of Knowledge Base Management, 23–51 (1989)
7. Miller, R., Shanahan, M.: Some Alternative Formulations of the Event Calculus. In: Computational Logic: Logic Programming and Beyond, Essays in Honour of Robert A. Kowalski, Part II, pp. 452–490. Springer, London (2002)
8. Kharbili, M.E., Stojanovic, N.: Semantic Event-Based Decision Management in Compliance Management for Business Processes. In: Intelligent Event Processing – AAAI Spring Symposium 2009, pp. 35–40 (2009)
9. Shanahan, M.: The Event Calculus Explained. In: Veloso, M.M., Wooldridge, M.J. (eds.) Artificial Intelligence Today. LNCS (LNAI), vol. 1600, pp. 409–431. Springer, Heidelberg (1999)
10. Russo, A., Miller, R., Nuseibeh, B., Kramer, J.: An Abductive Approach for Analysing Event-Based Requirements Specifications. In: Stuckey, P.J. (ed.) ICLP 2002. LNCS, vol. 2401, pp. 22–37. Springer, Heidelberg (2002)
11. Patkos, T., Plexousakis, D.: Reasoning with Knowledge, Action and Time in Dynamic and Uncertain Domains. In: 21st International Joint Conference on Artificial Intelligence, pp. 885–890 (2009)
12. Paschke, A., Kozlenkov, A.: Rule-Based Event Processing and Reaction Rules. In: Governatori, G., Hall, J., Paschke, A. (eds.) RuleML 2009. LNCS, vol. 5858, pp. 53–66. Springer, Heidelberg (2009)
13. Gu, T., Pung, H.K., Zhang, D.Q.: A Service-oriented Middleware for Building Context-aware Services. Journal of Network and Computer Applications 28(1), 1–18 (2005)
14. Cao, Y., Tao, L., Xu, G.: An Event-driven Context Model in Elderly Health Monitoring. Ubiquitous, Autonomic and Trusted Computing, 120–124 (2009)
15. Riboni, D., Bettini, C.: Context-Aware Activity Recognition through a Combination of Ontological and Statistical Reasoning. In: 6th International Conference on Ubiquitous Intelligence and Computing, pp. 39–53 (2009)
16. Bulfoni, A., Coppola, P., Della Mea, V., Di Gaspero, L., Mischis, D., Mizzaro, S., Scagnetto, I., Vassena, L.: AI on the Move: Exploiting AI Techniques for Context Inference on Mobile Devices. In: 18th European Conference on Artificial Intelligence, pp. 668–672 (2008)
The Large Scale Artificial Intelligence Applications – An Analysis of AI-Supported Estimation of OS Software Projects

Wieslaw Pietruszkiewicz1 and Dorota Dzega2

1 West Pomeranian University of Technology, Faculty of Computer Science and Information Technology, ul. Zolnierska 49, 71-210 Szczecin, Poland
[email protected]
2 West Pomeranian Business School, Faculty of Economics and Computer Science, ul. Zolnierska 53, 71-210 Szczecin, Poland
[email protected]
Abstract. We present the practical aspects of large-scale AI-based solutions by analysing an application of Artificial Intelligence to the estimation of Open Source projects hosted on the leading Open Source platform, SourceForge.net. We start by introducing the steps of the data extraction task, which transformed tens of tables and hundreds of fields, originally designed to be used by a web-based project collaboration system, into four datasets – dimensions important to project management, i.e. skills, time, costs and effectiveness. We then present the structure and results of experiments performed using various algorithms, i.e. decision trees (C4.5, RandomTree and CART), Neural Networks and Bayesian Belief Networks. Next, we describe how metaclassification algorithms improved the prediction quality and influenced the generalization ability and prediction accuracy. In the final part we evaluate the deployed algorithms from a practical point of view, presenting their characteristics beyond a purely scientific perspective.

Keywords: Classification, Metaclassification, Decision trees, Software estimation, AI usage factors.
1 Introduction
Currently many popular software applications are developed as Free/Libre Open Source Software (OS later in this article). The results achieved by project teams, usually cooperating via web systems, often outperform proprietary software, e.g., web servers as well as Artificial Intelligence or Data Mining software packages. Hence OS projects must be considered strong competitors to classic closed-source proprietary software products. To manage these projects effectively we must develop methods of software management specially tailored to the OS characteristics.
Concerning the basic assumption of project management, we must note that it assumes that the experience and knowledge acquired during previously managed projects will help in effective project management. This supposition is consistent with the basic purpose of AI, which focuses on extracting knowledge from past observations and converting it into forms easily applicable in the future. In this paper we present a large-scale Artificial Intelligence application aimed in two directions. The first was the creation of sets of models supporting OS project management via the prognosis of important project features. The second was the usage of the prepared datasets to examine the practical usefulness of AI methods in a large-scale application. The data source was SourceForge.net, the leading OS hosting platform. The complexity of the data tables, their various interconnections, as well as the number of stored records were a serious test of the capabilities of the AI methods used, which, being applied to this real-life problem, had to prove their practical usefulness, in contrast to the purely scientific usage common to most research papers.
2 Datasets
The data source we used in the experiments was "A Repository of Free/Libre/Open Source Software Research Data", which is a copy of the internal databases used by the SourceForge.net web-based platform [1]. The data source contained a large number of features, and its form of storage was not designed for later use in a knowledge extraction process: it was built to store all the data necessary to run web-based project management services, e.g., web forums, subversion control or task assignments. For this reason the examined problem was a model example of a real-life AI application (oriented towards data mining), i.e., a situation where all meaningful attributes have to be extracted from data repositories and where the scale causes some machine learning algorithms to fail, due to high memory demands, low speed of learning and simulation, or too many adjustable parameters affecting the ease of usage. From a technical perspective the data source contained almost 100 tables and the monthly increase of data was approx. 25 GB. The mentioned data repository was a subject of previous research: e.g., [2] examined how machine learning algorithms like logistic regression, decision trees and neural networks could be used to analyse the success factors of Open Source Software; [3] presented similarity measures for OSS projects and performed clustering; and [4] explained how to predict whether OS project abandonment is likely. Compared with the previous research, which focused only on selected aspects of AI applied to OSS, we conducted a complete analysis of four project dimensions. This was an outcome of our basic premise, i.e., that for successful project management it is more important to forecast a few factors than to predict whether a project will become successful or not (the assumption made in some papers).
Fig. 1. The datasets relating to the project's success
We extracted the data from the databases (some attributes had to be calculated from others) and divided them into four groups, being the most important dimensions of OS projects and relating to the project's success (see Figure 1) [5]:
– project scope Zt – the duration of the project from the moment of project initialization (project registration) till the last published presentation of the project effects; containing 39 attributes, including 8 attributes pertaining to the project field (dp), 28 attributes pertaining to the project resources (zp) and 3 attributes pertaining to project communication (kp),
– project time Ct – the time of task completion expressed in working hours spent on completing a particular task; containing 12 attributes, including 7 attributes pertaining to general conditions of task completion (wt) and 5 attributes pertaining to the resources of persons completing the task (zt),
– project cost Kt – the average number of working hours spent by a particular project contractor on task completion; containing 18 attributes, including 8 attributes pertaining to the participant competence (zu) and 10 attributes pertaining to the participant activity (au),
– project effects Et – the number of completed tasks as of the date of diagnosis; containing 21 attributes, including 16 attributes pertaining to the activity of project execution (ra) and 5 attributes pertaining to communication activity related to project execution (ka).
The numeric characteristics of the created datasets are presented in Table 1. The column Reduced records denotes how many records passed through the filters, e.g., we excluded empty projects or projects with an empty development team. To select the most important attributes we used the Information Gain Ratio. For each dataset we examined various models, where the number of inputs changed from 1 to Di, Di being the number of attributes of dataset i. Figure 2 presents an example of this step of the experiments for the Time dataset. The selected sets of informative attributes for the best prediction models (desired attributes) were:
Table 1. Details for Scope, Time, Costs and Effects datasets

Dataset  Unique records  Reduced records  Objects  Attributes
Scope    167698          167698           2881     39
Time     233139          104912           77592    12
Costs    127208          20353            10889    18
Effects  96830           15492            64960    21
Fig. 2. Prediction accuracy vs. used inputs for the Time dataset (Naive Bayes; accuracy plotted against the attributes w1–z1 for the test set, 10-fold and 5-fold cross-validation)
Zt = {z1, d8, d1, z4, d7, z8, z2, d5, d3}, Ct = {w1, w7, w5, w2, z5, w4, z4, z3, z2}, Kt = {z1, a1, z6, z8, a2, a4}, Et = {r2, r13, r11, r14, r10, r9, r8, r12, k5, k4, r7, k1}.
We found that the number of features may be reduced even further, to smaller subsets, without a significant decrease in accuracy. The selected smaller subsets of informative attributes, required for the prediction of the project features, were: Zt = {z1, d8, d1, z4}, Ct = {w1, w7, w5, w2}, Kt = {z1, a1}, Et = {r2, r13, r11}. These attributes were used in the next stage of the experiments.
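The attribute ranking above relies on the Information Gain Ratio; the following sketch shows one plausible way to compute it for discrete attributes. It is illustrative only (the toy arrays stand in for the extracted project attributes) and is not the tooling used in the study.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(feature, labels):
    """Information gain of a discrete feature divided by its own entropy."""
    h_y, h_x = entropy(labels), entropy(feature)
    cond = sum((feature == v).mean() * entropy(labels[feature == v])
               for v in np.unique(feature))
    gain = h_y - cond
    return gain / h_x if h_x > 0 else 0.0

# Toy data: columns would be the extracted project attributes (w1, z1, ...).
X = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
y = np.array([0, 1, 0, 1])
ranking = sorted(range(X.shape[1]), key=lambda j: gain_ratio(X[:, j], y), reverse=True)
print(ranking)  # attribute indices ordered by decreasing gain ratio
```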
3 Experiments with Classifiers
Practical applications of software estimation often describe software risk or complexity in the form of a label, e.g., high, mid or low. Thus, we decided to use classification as the method of prediction, instead of regression, which often gives large errors for software estimation. To ensure an unbiased environment, each dataset was filtered to form a uniform distribution of its classes. It must be kept in mind that, for 5 uniform classes, the accuracy of a blind choice equals 20%. Table 2 contains a comparison of the accuracy ratios for the C4.5, RandomTree (RT) and Classification and Regression Tree (CART) classifiers; detailed information about these methods may be found in [6], [7], [8] and [9].
Table 2. Comparison of prediction models for the 4 datasets – Scope, Time, Costs and Effects

Classifier  Scope  Time  Cost  Effects
C4.5        97%    68%   78%   55%
RT          99%    72%   92%   77%
CART        96%    65%   75%   70%
The values presented in Table 2 are the best accuracies achieved by each classifier. As can be noticed, the most accurate method for each dataset was RandomTree. There were other important practical issues concerning these algorithms, which will be presented in Section 5.
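As an illustration of this evaluation setting, the sketch below balances classes to a uniform distribution (so a blind choice over five classes scores 20%) and cross-validates a CART-style tree. It is a hedged sketch on synthetic data: the paper's experiments used Weka's C4.5, RandomTree and CART, which scikit-learn does not implement directly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=5, n_clusters_per_class=1,
                           weights=[0.4, 0.3, 0.15, 0.1, 0.05], random_state=0)

# Downsample every class to the size of the rarest one -> uniform class distribution.
rng = np.random.default_rng(0)
n_min = np.bincount(y).min()
idx = np.concatenate([rng.choice(np.flatnonzero(y == c), n_min, replace=False)
                      for c in np.unique(y)])
Xb, yb = X[idx], y[idx]

acc = cross_val_score(DecisionTreeClassifier(random_state=0), Xb, yb, cv=5).mean()
print(f"class counts: {np.bincount(yb)}, accuracy={acc:.2f} (blind choice = 0.20)")
```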
4 Experiments with Metaclassifiers
In the second stage of experiments we tested metaclassifiers, to check whether they were able to increase the performance of the previously examined classifiers. We used two boosting methods, i.e. AdaBoost [10] and LogitBoost [11], and the Bagging metaclassifier [12], [13]. Adaptive Boosting (AdaBoost) is a process of iterative classifier learning, where training sets are resampled according to the weighted classification error: the more often an example is misclassified, the larger the weight it receives. LogitBoost is another variant of boosting, which uses a binomial log-likelihood weighting function. Bagging is an acronym for Bootstrap AGGregatING. Its idea is to create an ensemble of classifiers built by bootstrapping the training dataset; the output of the ensemble is the result of a plurality vote. The internal classifiers can be created in parallel and therefore their outputs are independent. Tables 3 and 4 show the accuracy of the AdaBoost and Bagging methods for different numbers of internal iterations. The decision trees described in the previous section were used as the core classifiers in each metaclassifier. This part of the experiment was performed on the Effects dataset, because it was the most difficult of the four datasets. The accuracy of the C4.5 and CART decision trees increased after the use of metaclassifiers, whereas RandomTree, due to its construction and internal random choices, was not affected by boosting.

Table 3. Prediction accuracy vs. number of iterations for the AdaBoost metaclassifier (Effects dataset)

Iterations  C4.5  RT   CART
2           55%   77%  69%
5           62%   77%  76%
10          64%   77%  76%
Table 4. Prediction accuracy vs. number of iterations for the Bagging metaclassifier (Effects dataset)

Iterations  C4.5  RT   CART
2           58%   72%  53%
5           63%   77%  57%
10          65%   78%  58%
Fig. 3. Classification accuracy for LogitBoost (Effects dataset)

Fig. 4. ROC values for each class vs. no. of iterations for MultiBoost (Effects dataset)
Therefore, we claim that this algorithm is a robust member of the decision tree family. Bagging also increased the accuracy of all analysed classifiers, although with a low number of iterations it caused a slight decrease in accuracy. For another boosting method, LogitBoost, we compared the accuracy of two different variants: this metaclassifier can use either resampling or reweighting during the boosting procedure. Figure 3 presents plots with data series for both variants of LogitBoost.
Fig. 5. Classification accuracy vs. no. of iterations for MultiBoost (Effects dataset)
The core classifier used in LogitBoost was REPTree, a "fast decision tree learner". It must be noted that the value corresponding to 0 iterations is the accuracy of REPTree itself, not of LogitBoost; an increase of accuracy after boosting can thus be noticed. During the evaluation of the results another important characteristic, apart from the accuracy ratio, was examined: the Receiver Operating Characteristic (ROC) [14]. As can be noticed in Figure 4, the ROC value for each class increased with the number of MultiBoost iterations (ROC equals 1 for an ideal classifier). Similarly to the previous experiments, the metaclassifiers were built over REPTree. Figure 5 presents how the classification accuracy increases for MultiBoost; this metaclassifier also achieved better prediction results than its core classifier. Therefore it is possible to claim that metaclassifiers offer the potential to significantly increase the estimation accuracy.
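The metaclassifier stage can be reproduced in spirit with scikit-learn's AdaBoost and Bagging wrappers around a decision tree, varying the number of internal estimators as in Tables 3 and 4. This is a hedged sketch on synthetic data: the paper's experiments used Weka's implementations with C4.5/RT/CART/REPTree as core classifiers, and the wrapper's parameter name differs across scikit-learn versions (older releases use base_estimator instead of estimator).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
base = DecisionTreeClassifier(max_depth=3, random_state=0)  # weak core learner

for n in (2, 5, 10):  # number of internal iterations / estimators
    ada = AdaBoostClassifier(estimator=base, n_estimators=n, random_state=0)
    bag = BaggingClassifier(estimator=base, n_estimators=n, random_state=0)
    print(n,
          "AdaBoost", round(cross_val_score(ada, X, y, cv=5).mean(), 2),
          "Bagging", round(cross_val_score(bag, X, y, cv=5).mean(), 2))
```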
5 Practical Observations
During the presented experiments we observed that the algorithms have different characteristics that influence their practical usefulness. We decided to analyse this because many researchers examine AI methods by evaluating them on academic problems, neglecting their overall usage factors. In our opinion the usage factors of AI methods can be divided into two groups (see Figure 6), i.e., scientific and practical factors. The factors important in a purely scientific application of AI are:
– the quality – the most common factor presented in research papers, which usually evaluate different AI methods from the perspective of the quality of results,
– the stability – meaning that the results should be repeatable or very coherent across multiple runs of the AI method (with the same adjustments).
Fig. 6. Scientific and practical usage of AI methods

Fig. 7. The time of learning for different classifiers (Effects dataset)
We would like to point out the second group of usage factors, which is commonly omitted. The practical usage of AI involves factors like:
– the speed/time of learning – as knowledge induction is not a one-step process, time-consuming methods increase the length of experiments, which usually involve multiple run–test–adjust steps,
– the number of adjustable parameters – influences how easily a method can be used; a higher number of parameters increases the space in which an optimal configuration must be found; moreover, some parameters only fine-tune a method (e.g., the confidence factor of C4.5), while others deeply change its behaviour (e.g., the evaluator of Bayesian Networks),
– the speed/time of simulation – the time a method takes to produce an answer to the asked question (an output for the input data),
– the easiness of understanding the results – for many applications it is desirable that the method returns results in a form easily understandable by humans, which can later be implemented straightforwardly.
The measured times of learning for four methods, i.e. RandomTree, C4.5, CART and Bayesian Belief Networks, run with the number of attributes changing from 1 to 21, are presented in Figure 7.
Table 5. The speed and easiness of application for different classifiers

Method                    Speed    Easiness     Memory  Understandable
RandomTree                Fast     Easy         Low     Easy
C4.5                      Average  Average      Low     Easy
CART                      Slow     Easy         Low     Easy
Bayesian Belief Networks  Fast     Challenging  High    Easy
To reduce possible bias, each value is an average over a 5-times repeated learning process. Our overall opinion about the methods is presented in Table 5. We decided not to include the quality factor, as it may vary in other experiments. Summarising, RandomTree did not require any careful and sophisticated adjustment of parameters, unlike the other methods, and it was the fastest learning algorithm. It must be mentioned that the most popular decision tree classifier, i.e. C4.5, had the largest number of adjustable parameters, which increases the search space in which researchers must look for an optimal configuration.
6 Conclusions
The research presented herein proves that Open Source project management may be supported by data mining methods. As all four datasets presented herein were extracted or calculated from database fields designed to store the data required by a web-based project management platform, it is possible to predict important project factors without the need for any other (external) data sources. The examined classifiers and metaclassifiers showed large differences in performance; e.g., Neural Networks and Support Vector Machines were rejected at an early stage, due to their low accuracy on each dataset, caused by the large number of unordered, labelled attributes. The decision trees showed high accuracy on the examined data and we claim that this makes them suitable for problems with similar datasets. The next stage of experiments, incorporating metaclassifiers, allowed the prediction accuracy to be increased significantly. It must be noted that some classification methods were practically impossible to use due to their low speed or high memory requirements. In our opinion, researchers focus their attention too much on developing new data processing methods that achieve a slight increase in quality, forgetting to check whether these methods work or fail when faced with complicated tasks. In future research, we plan to analyse the dynamics of the modelled process, which could lead to better understanding and more effective decision support for OS projects. We also plan to further investigate the computational complexity, computer resource requirements and the other usage factors of popular machine learning algorithms. These comparisons could be used to establish a comprehensive guideline for AI methods.
References

1. Madey, G.: The SourceForge Research Data Archive (SRDA) (2008), http://zerlot.cse.nd.edu
2. Raja, U., Tretter, M.J.: Experiments with a new boosting algorithm. In: Proceedings of the Thirty-first Annual SAS Users Group International Conference. SAS (2006)
3. Gao, Y., Huang, Y., Madey, G.: Data mining project history in open source software communities. In: North American Association for Computational Social and Organization Sciences (2004)
4. English, R., Schweik, C.M.: Identifying success and abandonment of FLOSS commons: A classification of SourceForge.net projects. Upgrade: The European Journal for the Informatics Professional VIII(6) (2007)
5. Dzega, D.: The method of software project risk assessment. PhD thesis, Szczecin University of Technology (June 2008)
6. Quinlan, R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo (1993)
7. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth International Group, Belmont (1984)
8. Breiman, L.: Random forests. Machine Learning, 5–32 (2001)
9. McColm, G.L.: An introduction to random trees. Research on Language and Computation 1, 203–227 (2004)
10. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: Proceedings of the Thirteenth International Conference on Machine Learning, pp. 148–156. Morgan Kaufmann, San Francisco (1996)
11. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. Annals of Statistics 28, 337–407 (2000)
12. Breiman, L.: Bagging predictors. Machine Learning, 123–140 (1996)
13. Kuncheva, L.: Combining Pattern Classifiers: Methods and Algorithms. Wiley-IEEE, Hoboken (2004)
14. Gonen, M.: Analyzing Receiver Operating Characteristic Curves Using SAS. SAS Press (2007)
Towards the Discovery of Reliable Biomarkers from Gene-Expression Profiles: An Iterative Constraint Satisfaction Learning Approach

George Potamias1, Lefteris Koumakis1,2, Alexandros Kanterakis1,2, and Vassilis Moustakis1,2

1 Institute of Computer Science, Foundation for Research & Technology – Hellas (FORTH), N. Plastira 100, Vassilika Vouton, 700 13 Heraklion, Crete, Greece
2 Technical University of Crete, Department of Production and Management, Management Systems Laboratory, Chania 73100, Greece
{potamias,koumakis,kantale,moustaki}@ics.forth.gr
Abstract. The article demonstrates the use of Multiple Iterative Constraint Satisfaction Learning (MICSL) process in inducing gene-markers from microarray gene-expression profiles. MICSL adopts a supervised learning from examples framework and proceeds by optimizing an evolving zero-one optimization model with constraints. After a data discretization pre-processing step, each example sample is transformed into a corresponding constraint. Extra constraints are added to guarantee mutual-exclusiveness between gene (feature) and assigned phenotype (class) values. The objective function corresponds to the learning outcome and strives to minimize use of genes by following an iterative constraint-satisfaction mode that finds solutions of increasing complexity. Standard (c4.5-like) pruning and rule-simplification processes are also incorporated. MICSL is applied on several well-known microarray datasets and exhibits very good performance that outperforms other established algorithms, providing evidence that the approach is suited for the discovery of biomarkers from microarray experiments. Implications of the approach in the biomedical informatics domain are also discussed. Keywords: Bioinformatics, constraint satisfaction, data mining, knowledge discovery, microarrays.
1 Introduction

The completion of the human genome poses new scientific and technological challenges. In the rising post-genomic era the principal item in the respective research agenda concerns the functional annotation of the human genome. Respective activities signal a major shift toward trans-disciplinary team science and translational research [1], and the cornerstone task concerns the management and the analysis of heterogeneous clinico-genomic data and information sources. The vision is to fight major diseases, such as cancer, in an individualized diagnostic, prognostic and treatment manner. This requires not only an understanding of the genetic background
of the disease but also the correlation of genomic data with clinical data, information and knowledge [2]. The advent of genomic and proteomic high-throughput technologies, such as transcriptomics realized by DNA microarray technology [3], enabled a 'systems level analysis' by offering the ability to measure the expression status of thousands of genes in parallel, even if the heterogeneity of the produced data sources makes interpretation especially challenging. The high volume of data produced by numerous studies worldwide poses the need for a long-term initiative on bio-data analysis [4,5] in the context of 'translational bioinformatics' research [6]. The target is the customization and application, but also the invention, of new data mining methodologies and techniques suitable for the development of disease classification systems and the detection of new disease classes. In a number of gene-expression studies, associations between genomic and phenotype profiles have proved feasible, especially for several types of cancer such as leukemia [7], breast cancer [8], colon cancer [9], central nervous system [10], lung cancer [11], lymphoma [12], ovarian cancer using mass spectrometry sample profiles [13], and other malignancies. This article demonstrates the application of Multiple Iterative Constraint Satisfaction Learning, or MICSL for short [14], on microarray gene-expression data. MICSL integrates iterative constraint satisfaction with zero-one integer programming and is cast in the realm of concept learning from examples [15,16]. The article is organized into separate sections. Section 2 presents the data preprocessing discretization step. Section 3 provides a concise overview of MICSL; Section 4 presents experiments on indicative gene-expression domains and datasets, and overviews the results. Finally, Section 5 incorporates concluding remarks, potentials and limitations, and points to future R&D plans.
2 Discretization of Gene Expression Values

For microarrays the binary constraint is not overly restrictive. Gene expression values are originally numeric, yet their physical meaning corresponds to either low or high, which is of course binary, or to expressed vs. not expressed, which has an identical meaning and is binary too. Of course, the process makes it necessary to transform the original numeric values into binary equivalents; however, this is common practice in microarray data analysis and exploration. We employ an entropy-based binary discretization process in order to transform gene expression values into binary equivalents: high (expressed / up-regulated) or low (not expressed / down-regulated). The transformation of gene expression values into high/low values is rationalized by the two-class gene expression modeling problems we cope with. Our method resembles the Fayyad-Irani approach [17], as well as a similar approach presented in [18]. Both methods employ entropy-based statistics; however, they do not incorporate an explicit parameter to force a binary split, which may result in uncontrolled numbers of discretization intervals (more than two), which would be difficult to interpret in the presence of two classes for our sample cases. Assume a gene expression matrix with M genes (rows) and S samples (columns) where each sample is assigned to one of two (mutually exclusive) classes, P (positive) and N (negative); e.g., for the leukemia domain P may denote the ALL and N the AML leukemia sub-type, respectively (see below). For each gene g, 1 ≤ g ≤ M,
consider the descending ordered vector of its values, Vg = ⟨ng,1, ng,2, ..., ng,S⟩, with ng,i ≥ ng,i+1 for 1 ≤ i ≤ S−1. Each ng,i is associated with one of the classes. We seek a point estimate μg to split the interval [n1, nS] in two parts so that μg discriminates between classes P and N in the best possible way; μg is used to split the elements of vector Vg into high (h) and low (l) values. The binary transformation of Vg into h and l value intervals proceeds via two distinct steps.
Step 1: Calculation of the midpoint values μg,i across the elements of Vg, i.e. μg,i = (ng,i + ng,i+1)/2, and formulation of the descending ordered vector of midpoint values Mg = ⟨μg,1, μg,2, ..., μg,S−1⟩, μg,i ≥ μg,i+1.
Step 2: Assessment of the point estimate μg. For each midpoint μg,i we assess its information gain IG(S, μg,i) with respect to the set of samples S, using the information theoretic model of formula (1) below (also utilized in the standard C4.5 decision tree induction system [19]):

IG(S, μg,i) = E(S) − E(S, μg,i)    (1)
E(S) corresponds to the class-related entropy of the original set of samples S, calculated by formula (2) using an information theoretic model [20]:

E(S) = −(|SP|/|S|)·log2(|SP|/|S|) − (|SN|/|S|)·log2(|SN|/|S|)    (2)
where SP and SN denote the samples from S that belong to class P and N, respectively; |S(·)| denotes set cardinality. For each gene g, E(S, μg,i) corresponds to the conditional entropy when μg,i is used as a split point for the target gene g, and it is computed by formula (3):

E(S, μg,i) = −(|Sh|/|S|)·[(|SP,h|/|Sh|)·log2(|SP,h|/|Sh|) + (|SN,h|/|Sh|)·log2(|SN,h|/|Sh|)]
            − (|Sl|/|S|)·[(|SP,l|/|Sl|)·log2(|SP,l|/|Sl|) + (|SN,l|/|Sl|)·log2(|SN,l|/|Sl|)]    (3)
where SP,h, SP,l ⊆ SP denote the class P samples with high (ng,i ≥ μg,i) and low (ng,i < μg,i) values for the target gene, respectively; SN,h, SN,l are defined analogously. The estimate μg,i which maximizes IG(S, μg,i) is set as the value of μg. This point is used to transform each gene-expression value gi into its binary equivalent v(gi) ∈ {0,1} (0 and 1 represent low (l) and high (h) gene-expression values, respectively). The binary transformation proceeds independently across genes. When all gene expression values have been transformed, a matrix that includes the binary equivalent values replaces the original gene expression matrix. In addition, the optimal midpoint values are stored in order to be used for the binary transformation of the corresponding test (unseen) cases. The C4.5/Rel8 decision tree induction algorithm [19] follows a discretization approach similar to ours, aiming to improve the use of continuous attributes during decision tree induction. Reported results slightly favor a 'local' (i.e., during the growth of the tree) discretization process over a 'global' one. However, running Weka's C4.5/Rel8 implementation (called J48 [21]) on the gene-expression
datasets used in this paper yielded unsatisfactory performance figures (data not shown). This could be attributed to the intrinsic difficulty of decision tree induction approaches in coping with many irrelevant attributes.
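A compact sketch of the discretization step described above is given below: for a single gene, every midpoint between consecutive sorted expression values is scored by information gain and the best one becomes the low/high split point. The toy expression vector and labels are illustrative; the actual pipeline applies this independently to every gene and stores the chosen midpoints for the test samples.

```python
import numpy as np

def class_entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(values, labels):
    """Midpoint mu that maximizes IG(S, mu) for one gene's expression values."""
    order = np.argsort(values)
    v, y = values[order], labels[order]
    midpoints = (v[:-1] + v[1:]) / 2.0
    base = class_entropy(y)
    best_mu, best_gain = None, -1.0
    for mu in np.unique(midpoints):
        low, high = y[v < mu], y[v >= mu]
        cond = (len(low) * class_entropy(low) + len(high) * class_entropy(high)) / len(y)
        gain = base - cond
        if gain > best_gain:
            best_mu, best_gain = mu, gain
    return best_mu

expr = np.array([2.1, 5.3, 0.7, 6.8, 1.9, 7.2])
cls  = np.array([0,   1,   0,   1,   0,   1])   # P/N sample labels (0/1)
mu = best_split(expr, cls)
print(mu, (expr >= mu).astype(int))  # split point and the binary low/high encoding
```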
3 Multiple Iterative Constraint Satisfaction Based Learning

MICSL proceeds by viewing samples as constraints that should be satisfied in the context of a mathematical programming model in which the objective function maps to the rules generated. The specific optimization model is 0-1 linear, which means that all variables are binary (valued either 0 or 1). This is achieved by transforming the original learning problem into a binary equivalent, which in turn implies that all attributes (or features) are valued over nominal scales. To explain MICSL we develop a notation system. We use gi to represent a gene. Then v(gik) represents the binary value equivalent of gene gi with respect to sample k, which means that v(gik) = 0 or v(gik) = 1, where the value 0 means that gik is under-expressed (or 'low') and the value 1 means that gik is over-expressed (or 'high'). A sample is represented as an implication between the conjunction of v(gik) and the class of interest. Microarray samples are assigned to two classes, often tagged as positive or negative, P or N, respectively. Thus a positive sample sk is represented as:
sk ≡ P ← ∧i v(gik)

The representation of a negative sample is identical; only the class assignment changes. Given I example samples, and since v(gik), P, and N are binary, the representation of sk leads to the formation of a non-linear constraint, namely:

P ≥ ∏_{i=1}^{I} v(gik)

The constraint has a linear equivalent [22], which is:

∑_{i=1}^{I} v(gik) − P ≤ I − 1
Additional constraints are introduced, namely P + N = 1, which means that a derived solution, or rule, can point (exclusively) either to the P or to the N class, along with a series of constraints guaranteeing that any v(gik) can be either 0 or 1, that is, constraints of the form [v(gik) = 0] + [v(gik) = 1] ≤ 1, ∀i ≤ I. The objective function is the summation across all v(gik). In order to learn minimal description rules a control parameter R is introduced – it is linked to the available samples I, and helps to form the necessary number of constraints in the following form:
∑_{i=1}^{I} v(gik) − (P, N) ≤ R − 1,   1 ≤ R ≤ I

The parenthesis (P, N) means that for positive samples P is used and for negative samples N is used.
To summarize the model formulation we present, as an example, the model for the breast cancer microarray – we restrict ourselves to the training set of 78 samples and to the selected 70 gene-markers reported in the original publication [8], namely:

min ∑_{i=1}^{70} ∑_{k∈{0,1}} v(gik)    (4)

subject to:

∑_{i=1}^{70} v(gik) − (P, N) ≤ R − 1,  R ≥ 1   (78 such constraints are formed)    (5)

[v(gi) = 0] + [v(gi) = 1] ≤ 1,  ∀i ≤ 70    (6)

P + N = 1    (7)

v(gik) ∈ {0, 1},  ∀i ≤ I    (8)
Constraint (5) is applied iteratively: optimization starts by setting R = 1 and stops when all constraints are satisfied. At each iteration the model produces a solution of the form:

∧i [v(gi) ∈ {0,1}] ⇒ Class

with Class taking the value P or N. In general, at each iteration R (1 ≤ R ≤ I), the process terminates when all solutions are found. Then the value of R is increased, the corresponding constraints are formed, and the process iterates. The system could be parameterized to stop when special criteria are met, e.g., when all examples are covered by the rules formed so far. The union of solutions derived from the optimization model forms the set of knowledge learned, i.e., the induced set of rules. Constraints (6) and (7) are the additional constraints added in order to guarantee the mutual exclusiveness between binary gene-expression values and the mutual exclusiveness between class values, with the extra requirement that at least one class is true (note the equality in constraint (7)). Constraint (8) declares all variables (gene values) as binary. Potamias [14] demonstrates that the optimization model (4)–(8) yields the R minimal consistent models that explain the domain. In addition, there is a limit for the value of R at which the process stops, which means that the optimization converges. Model complexity is linear and relates to the complexity of the binary integer programming model presented above.
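To show the shape of the optimization model (4)–(8), the sketch below builds it with the PuLP modelling library (an assumption; PuLP is not what the authors used). It only constructs and solves the model once for a fixed R; MICSL's iterative increase of R, the enumeration of all solutions at each level, and the pruning/rule-simplification steps are omitted.

```python
import pulp

def micsl_model(samples, classes, R=1):
    """samples: list of binary gene-value vectors; classes: list of 'P' or 'N' labels."""
    n_genes = len(samples[0])
    prob = pulp.LpProblem("MICSL", pulp.LpMinimize)
    v = [[pulp.LpVariable(f"v_{i}_{k}", cat="Binary") for k in (0, 1)]
         for i in range(n_genes)]
    P = pulp.LpVariable("P", cat="Binary")
    N = pulp.LpVariable("N", cat="Binary")
    prob += pulp.lpSum(v[i][k] for i in range(n_genes) for k in (0, 1))      # objective (4)
    for s, c in zip(samples, classes):                                       # constraints (5)
        cls = P if c == "P" else N
        prob += pulp.lpSum(v[i][s[i]] for i in range(n_genes)) - cls <= R - 1
    for i in range(n_genes):                                                 # constraints (6)
        prob += v[i][0] + v[i][1] <= 1
    prob += P + N == 1                                                       # constraint (7)
    return prob, v, P, N                                                     # (8): cat="Binary"

prob, v, P, N = micsl_model([[1, 0, 1], [0, 1, 1]], ["P", "N"], R=1)
prob.solve(pulp.PULP_CBC_CMD(msg=False))
# With R = 1 the empty rule is feasible, so this run mainly shows the model's shape.
print(pulp.LpStatus[prob.status],
      [(i, k) for i in range(3) for k in (0, 1) if v[i][k].value() == 1])
```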
4 Experimental Results

Experimentation follows a 2x2 scheme: different learning approaches are used on a variety of publicly available samples. Nine learning methods were used and their results compared with MICSL. The learning methods are: (1) J48: a variation of the classic C4.5 for generating a pruned or un-pruned C4.5 decision tree [19]; (2) Jrip: a propositional rule learner, Repeated Incremental Pruning to Produce Error Reduction (RIPPER), proposed by Cohen [23]; (3) PART: an algorithm for generating a
PART decision list [24] – it utilizes a separate-and-conquer strategy, builds a partial C4.5 decision tree at each iteration, and makes the "best" leaf into a rule; (4) Conjunctive Rule: this algorithm implements a single conjunctive rule learner that can predict numeric and nominal class labels [25]; (5) Decision Table: an algorithm for building and using a simple decision table majority classifier [26]; (6) DTNB: an algorithm for building and using a decision table equipped with a naive Bayes hybrid classifier [27] – at each point in the search, the algorithm evaluates the merit of dividing the attributes into two disjoint subsets: one for the decision table, the other for naive Bayes; (7) NNge: a nearest-neighbor-like algorithm using non-nested generalized exemplars, which are hyper-rectangles that can be viewed as if-then rules [28]; (8) OneR: an algorithm for building and using a 1R classifier that utilizes the minimum-error (discretised) attribute for prediction [29]; and (9) Ridor: an implementation of a Ripple-Down Rule learner [30] – it generates a default rule first and then the exceptions to the default rule with the least (weighted) error rate. The nine methods plus MICSL were assessed over seven public-domain datasets with the following characteristics:
i. Veer: Veer et al. [8] proposed a signature for breast cancer with 70 genes (gene-signature) using supervised classification, from a training dataset of 78 samples and a test dataset of 19 samples;
ii. West: West et al. [33] used DNA microarray expression data from a series of primary breast cancer samples to discriminate and predict the estrogen receptor status of these tumors, as well as the lymph node status of the patient at the time the tumor was surgically removed. The dataset has 38 samples in the training and 9 in the test set. The proposed gene-signature contains 102 genes;
iii. Golub: The Golub dataset [7] contains data for acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The proposed gene-signature consists of 50 genes; the dataset contains 38 samples in the training and 34 in the test set;
iv. Gordon: The Gordon dataset [11] consists of 32 samples in the training set and 149 samples in the test set, for the distinction between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of the lung. The reported gene-signature contains 8 genes;
v. Singh: Singh et al. [31] provide microarray data for prostate tumors. A set of genes was identified that is strongly correlated with the state of tumor differentiation as measured by the Gleason score. This set consists of 26 genes; the training dataset contains 102 samples and the test dataset contains 34 samples;
vi. Sorace: Sorace et al. [32] proposed a gene profile with 22 genes for the early detection of ovarian cancer. The experimental data collection comes from the Clinical Proteomics Data Bank and contains 125 samples in the training set and 128 samples in the test set;
vii. Alizadeh: Based on the work of Alizadeh et al. [12]; contains data for distinct types of diffuse large B-cell lymphoma. The proposed gene profile contains 380 genes; the training dataset contains 34 samples and the test set 13 samples.

Statistical performance assessment was done using predictive accuracy (PA), sensitivity (SE) and specificity (SP) figures across all published domain test samples. Results are summarized in Table 1. Experimentation was done on the reduced original samples; for instance, the breast-cancer dataset included about 25000 genes, yet experimentation across all learning methods was done using the 70-gene molecular signature suggested and reported in [8]. Thus, by using the genes selected in the respective original study we avoid getting into the medical specifics and focus only on the statistical performance of each learning method, including MICSL.

Table 1. Statistical performance results. Bold entries are used to indicate superior performance; the last column to the right averages results across domains with respect to learning method. ANOVA analysis revealed significant differences across methods and domains (p = 0.001).
Method  Metric  Veer    West    Golub   Gordon  Singh   Sorace  Alizadeh  Average
MICSL   PA      78.9%   77.8%   97.1%   98.7%   88.2%   99.2%   92.3%     90.3%
        SE      78.9%   77.8%   97.1%   97.5%   88.2%   98.4%   92.3%     90.0%
        SP      81.8%   82.2%   95.8%   86.7%   95.8%   98.8%   95.2%     90.9%
J48     PA      63.2%   66.7%   94.1%   98.7%   61.8%   100.0%  92.3%     82.4%
        SE      63.2%   66.7%   94.1%   98.7%   61.8%   100.0%  92.3%     82.4%
        SP      72.6%   68.3%   93.7%   93.9%   86.2%   100.0%  87.7%     86.1%
Jrip    PA      63.2%   66.7%   94.1%   98.7%   76.5%   100.0%  92.3%     84.5%
        SE      63.2%   66.7%   94.1%   98.7%   76.5%   100.0%  92.3%     84.5%
        SP      60.7%   68.3%   93.7%   93.9%   91.5%   100.0%  87.7%     85.1%
PART    PA      73.7%   66.7%   94.1%   98.7%   85.3%   98.4%   92.3%     87.0%
        SE      73.7%   66.7%   94.1%   98.7%   85.3%   98.4%   92.3%     87.0%
        SP      66.8%   68.3%   93.7%   98.7%   94.7%   97.2%   87.7%     86.7%
CR      PA      52.6%   66.7%   94.1%   98.7%   97.1%   99.2%   92.3%     85.8%
        SE      52.6%   66.7%   94.1%   98.7%   97.1%   99.2%   92.3%     85.8%
        SP      48.6%   68.3%   93.7%   88.0%   98.9%   98.6%   95.2%     84.5%
DT      PA      73.7%   66.7%   94.1%   59.0%   97.1%   97.7%   61.5%     78.5%
        SE      73.7%   66.7%   94.1%   98.7%   97.1%   97.7%   61.5%     84.2%
        SP      78.7%   68.3%   93.7%   93.9%   98.9%   98.7%   76.0%     86.9%
DTNB    PA      68.4%   66.7%   94.1%   98.7%   94.1%   97.7%   92.3%     87.4%
        SE      68.4%   66.7%   94.1%   98.7%   94.1%   97.7%   92.3%     87.4%
        SP      75.6%   68.3%   93.7%   88.0%   97.9%   98.7%   87.7%     87.1%
NNge    PA      52.6%   77.8%   94.1%   98.7%   79.4%   100.0%  100.0%    86.1%
        SE      52.6%   77.8%   94.1%   98.7%   79.4%   100.0%  100.0%    86.1%
        SP      66.4%   82.2%   91.6%   88.0%   92.6%   100.0%  100.0%    88.7%
OneR    PA      42.1%   66.7%   94.1%   98.7%   70.6%   99.2%   76.9%     78.3%
        SE      42.1%   66.7%   94.1%   98.7%   70.6%   99.2%   76.9%     78.3%
        SP      60.3%   68.3%   93.7%   93.9%   75.2%   98.6%   85.6%     82.2%
Ridor   PA      73.7%   44.4%   94.1%   98.7%   97.1%   100.0%  38.5%     78.1%
        SE      73.7%   44.4%   94.1%   98.7%   97.1%   100.0%  38.5%     78.1%
        SP      78.7%   50.6%   93.7%   93.9%   98.9%   100.0%  61.5%     82.5%
Based on the average across domains, MICSL outperformed all of the comparison learning methods. In addition, our method exhibited exceptional performance across all selected domains, with the exception of one domain, in which MICSL trailed in performance.
5 Discussion and Concluding Remarks

The overall performance of MICSL is at least as good as the performance of the standard learning methods. It is noteworthy that MICSL performed very well in domains with rather few training samples, while it trailed in the domain with a rather larger sample space. The better performance in smaller sample spaces can be attributed to the fact that MICSL strives to maximize the usage of domain samples with zero assumptions, excluding of course the gene-value discretization, which however is common across methods. For instance, the entropy metric used by J48 brings the assumption that, in Veer, the underlying population is split between 44 (or 54%) patients with good prognosis and 34 (or 44%) with bad prognosis. The assumption implies an almost 50/50 split between bad and good prognosis patients, an assumption which may not always be real. Instead, MICSL needs no such limiting assumption. MICSL did very well in all domains except, slightly, the Singh domain (a fact that cannot be readily explained). In general, as MICSL does very well in domains with few or very few samples, it may provide a useful tool in novel domain investigation, in which the number of samples is small – a common situation in research clinico-genomic trials. MICSL couples learning with optimization. Such coupling places the method alongside support vector methodology, and future research should focus on a hybrid architecture combining the two method families. In addition, in our R&D plans is to port MICSL to a more efficient environment, e.g., ILOG CP (http://ilog.com.sg/products/cp/).

Acknowledgments. Work presented herein was partially supported by the ACTION-Grid EU project (FP7 ICT 224176), as well as by the ACGT (FP6-IST-2005-026996) and GEN2PHEN (European Commission, Health theme, project 200754) projects. The authors hold full responsibility for opinions, results and views expressed in the text.
References

1. Sander, C.: Genomic Medicine and the Future of Health Care. Science 287(5460), 1977–1978 (2000)
2. Sanchez, F.M., Iakovidis, I., et al.: Synergy between medical informatics and bioinformatics: facilitating genomic medicine for future health care. Journal of Biomedical Informatics 37(1), 30–42 (2004)
3. McConnell, P., Johnson, K., Lockhart, D.J.: An introduction to DNA microarrays. In: 2nd Conference on Critical Assessment of Microarray Data Analysis (CAMDA 2001) – Methods of Microarray Data Analysis II, pp. 9–21 (2002)
4. Dopazo, J.: Microarray data processing and analysis. In: 2nd Conference on Critical Assessment of Microarray Data Analysis (CAMDA 2001) – Methods of Microarray Data Analysis II, pp. 43–63 (2002)
5. Piatetsky-Shapiro, G., Tamayo, P.: Microarray Data Mining: Facing the Challenges. ACM SIGKDD Explorations 5(5), 1–5 (2003)
6. Butte, A.J.: Translational Bioinformatics: Coming of Age. J. Am. Med. Inform. Assoc. 15(6), 709–714 (2008)
7. Golub, T.R., Slonim, D.K., et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439), 531–537 (1999)
8. Van't Veer, L.J., Dai, H., et al.: Gene expression profiling predicts clinical outcome of breast cancer. Nature 415(6871), 530–536 (2002)
9. Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissue probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. 96(12), 6745–6750 (1999)
10. Pomeroy, S.L., Tamayo, P., et al.: Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415, 436–442 (2002)
11. Gordon, G.J., Jensen, R.V., et al.: Translation of Microarray Data into Clinically Relevant Cancer Diagnostic Tests Using Gene Expression Ratios in Lung Cancer and Mesothelioma. Cancer Research 62, 4963–4967 (2002)
12. Alizadeh, A.A., Eisen, M.B., et al.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403(6769), 503–511 (2000)
13. Petricoin, E.F., Ardekani, A.M., et al.: Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359(9306), 572–577 (2002)
14. Potamias, G.: MICSL: Multiple Iterative Constraint Satisfaction based Learning. Intell. Data Anal. 3(4), 245–265 (1999)
15. Hunt, E.B., Marin, J., Stone, P.J.: Experiments in Induction. Academic Press, New York (1966)
16. Michalski, R.C.: Concept Learning. Encyclopedia of Artificial Intelligence 1, 185–194 (1986)
17. Fayyad, U., Irani, K.: Multi-interval discretization of continuous-valued attributes for classification learning. In: 13th International Joint Conference on Artificial Intelligence, pp. 1022–1027 (1993)
18. Li, J., Wong, L.: Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns. Bioinformatics 18(5), 725–734 (2002)
19. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Mateo (1993)
20. Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656 (1948)
21. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.E.: The WEKA Data Mining Software: An Update. SIGKDD Explorations 11(1), 10–18 (2009)
22. Bell, C., Nerode, A., Raymond, T.N., Subrahmanian, V.S.: Implementing deductive databases by mixed integer programming. ACM Transactions on Database Systems 21(2), 238–269 (1996)
23. Cohen, W.W.: Fast Effective Rule Induction. In: 12th International Conference on Machine Learning, pp. 115–123 (1995)
24. Frank, E., Witten, I.H.: Generating Accurate Rule Sets Without Global Optimization. In: 15th International Conference on Machine Learning, pp. 144–151 (1998)
25. Pazzani, M.J., Sarrett, W.: A framework for the average case analysis of conjunctive learning algorithms. Machine Learning 9, 349–372 (1992)
26. Kohavi, R.: The Power of Decision Tables. In: 8th European Conference on Machine Learning, pp. 174–189 (1995)
242
G. Potamias et al.
27. Hall, M., Frank, E.: Combining Naive Bayes and Decision Tables. In: 21st Florida Artificial Intelligence Society Conference, pp. 15–17 (2008) 28. Martin, B.: Instance-based learning: nearest neighbor with generalization. Master Thesis, University of. Waikato, Hamilton, New Zealand (1995) 29. Holte, R.C.: Very simple classification rules perform well on most commonly used datasets. Machine Learning 11, 63–91 (1993) 30. Gaines, B.R., Compton, P.: Induction of Ripple-Down Rules. In: 5th Australian Joint Conference on Artificial Intelligence, pp. 349–354 (1992) 31. Singh, D., Febbo, P.G., et al.: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1(2), 203–209 (2002) 32. Sorace, J.M., Zhan, M.: A data review and re-assessment of ovarian cancer serum proteomic profiling. BMC Bioinformatics 4, 24 (2003) 33. West, M., Blanchette, C., et al.: Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl. Acad. Sci. 98(20), 11462–11467 (2001)
Skin Lesions Characterisation Utilising Clustering Algorithms

Sotiris K. Tasoulis 1, Charalampos N. Doukas 2, Ilias Maglogiannis 1, and Vassilis P. Plagianakos 1

1 Department of Computer Science and Biomedical Informatics, University of Central Greece, Papassiopoulou 2–4, Lamia, 35100, Greece
{stas,imaglo,vpp}@ucg.gr
2 Department of Information and Communication Systems Engineering, University of the Aegean, Karlovassi, 83200, Samos, Greece
[email protected]
Abstract. In this paper we propose a clustering technique for the recognition of pigmented skin lesions in dermatological images. Computer vision-based diagnosis systems have been used mostly for the early detection of skin cancer and, more specifically, the recognition of the malignant melanoma tumour. The feature extraction is performed utilising digital image processing methods, i.e. segmentation, border detection, and colour and texture processing. The proposed method belongs to a class of clustering algorithms that are very successful in dealing with high dimensional data, utilising information derived from Principal Component Analysis. Experimental results show the high performance of the algorithm against other methods of the same class.
Keywords: Pigmented Skin Lesion, Image Analysis, Feature Extraction, Classification, Unsupervised clustering, Cluster analysis, Principal Component Analysis, Kernel density estimation.
1 Introduction
Several studies in the literature have shown that the analysis of dermatological images and the quantification of tissue lesion features may be of essential importance in dermatology [9,13]. The main goal is the early detection of the malignant melanoma tumour, which is among the most frequent types of skin cancer, versus other types of non-malignant cutaneous diseases. The interest in melanoma is due to the fact that its incidence has increased faster than that of almost all other cancers, with annual incidence rates increasing on the order of 3–7% in fair-skinned populations in recent decades [10]. Advanced cutaneous melanoma is still incurable, but when diagnosed at early stages it can be cured without complications. However, the differentiation of early melanoma from other non-malignant pigmented skin lesions is not trivial
even for experienced dermatologists. In several cases, primary care physicians underestimate melanoma in its early stage [13]. A promising technique for dealing with this problem seems to be the use of data mining methods. In particular, clustering could be the key step in understanding the differences between the types and subtypes of skin lesions. In this work we focus on a powerful class of algorithms that reduces the dimensionality of the data without significant loss of information. In this class of algorithms, the Principal Direction Divisive Partitioning (PDDP) algorithm is of particular value [1]. PDDP is a “divisive” hierarchical clustering algorithm. Any divisive clustering algorithm can be characterised by the way it chooses to answer the following three questions:
Q1: Which cluster to split further?
Q2: How to split the selected cluster?
Q3: When should the iteration terminate?
The PDDP-based algorithms, in particular, use information from the Principal Component Analysis (PCA) of the corresponding data matrix to provide these answers in a computationally efficient manner. This is achieved by incorporating information from only the first singular vector and not a full-rank decomposition of the data matrix. In this paper, we show the strength of an enhanced version of the PDDP algorithm that uses features extracted from dermatological images, aiming at the recognition of malignant melanoma versus dysplastic nevus and non-dysplastic skin lesions. The paper is organised as follows: in Section 2 we present the image dataset, as well as the preprocessing, segmentation, and feature extraction techniques applied. In Section 3, the proposed clustering algorithm is presented. In Section 4 we illustrate the experimental results and discuss the potential of our approach. The paper ends with concluding remarks and pointers for future work.
2 The Skin Lesions Images Dataset
The image data set used in this study is an extraction of the skin database that exists at the Vienna Hospital, kindly provided by Dr. Ganster. The whole data set consists of 3631 images: 972 of them display nevus (dysplastic skin lesions), 2590 feature non-dysplastic lesions, and the remaining 69 images contain malignant melanoma cases. The number of melanoma images is not small considering the fact that malignant melanoma cases in a primordial state are very rare; it is very common that many patients arrive at specialised hospitals with partially removed lesions. A standard protocol was used for the acquisition of the skin lesion images, ensuring the reliability and reproducibility of the collected images. Reproducibility is considered quite essential for the image characterisation and recognition attempted in this study, since only standardised images may produce comparable results.
2.1 Image Pre-processing and Segmentation
The segmentation of an image containing a cutaneous disease involves the separation of the skin lesion from the healthy skin. For the special problem of skin lesion segmentation, mainly region-based segmentation methods are applied [2,21]. A simple approach is thresholding, which is based on the fact that the values of pixels that belong to a skin lesion differ from the values of the background. By choosing an upper and a lower value it is possible to isolate those pixels that have values within this range. The information for the upper and the lower limits can be extracted from the image histogram, where the different objects are represented as peaks. The bounds of the peaks are good estimates of these limits. It should be noted, though, that simple thresholding as described here cannot be used in all cases, because image histograms of skin lesions are not always multi-modal [7]. In this study, a more sophisticated local/adaptive thresholding technique was adopted, where the window size, the threshold value, and the degree of overlap between successive moving windows were the parameters of the procedure. This algorithm is presented in the flowcharts of Figure 1. The parameters of the proposed thresholding algorithm were tuned so that skin lesion separation was performed satisfactorily.
Fig. 1. (a) The proposed adaptive thresholding algorithm. (b) The object extraction algorithm.
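The following Python sketch illustrates the kind of local/adaptive thresholding described above: a pixel is marked as belonging to the lesion ("dark") when its intensity falls below the average of its local window minus a threshold offset. The window size, offset and overlap step are illustrative assumptions, not the authors' exact implementation.

import numpy as np

def adaptive_threshold(image, window=35, offset=10, step=None):
    """Mark a pixel as lesion when it is darker than its local window
    average minus `offset`; `window`, `offset` and `step` are illustrative."""
    if step is None:
        step = window // 2               # 50% overlap between successive windows
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    half = window // 2
    for y in range(0, h, step):
        for x in range(0, w, step):
            y0, y1 = max(0, y - half), min(h, y + half + 1)
            x0, x1 = max(0, x - half), min(w, x + half + 1)
            block = image[y0:y1, x0:x1]
            mask[y0:y1, x0:x1] |= block < (block.mean() - offset)
    return mask

A connected-component pass (the object extraction step of Fig. 1b) can then group adjacent "dark" pixels into lesion candidates.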
2.2 Utilised Features and Feature Extraction Techniques
In automated diagnosis of skin lesions, feature design is based on the so-called ABCD-rule of dermatology. The ABCD rule, which constitutes the basis for a diagnosis by a dermatologist [11], represents the Asymmetry, Border structure, variegated Colour, and Differential Structures characteristics of the skin lesion. The feature extraction is performed by measurements on the pixels that
represent a segmented object, allowing non-visible features to be computed. Several studies have also proven the efficiency of border shape descriptors for the detection of malignant melanoma in both clinical and computer-based evaluation methods [8,15]. Three types of features are utilised in this study: Border Features, which cover the A and B parts of the ABCD-rule of dermatology, Colour Features, which correspond to the C rule, and Textural Features, which are based on the D rule. More specifically, the extracted features are as follows:
Border features
– Thinness Ratio, which measures the circularity of the skin lesion, defined as TR = 4π·Area/Perimeter².
– Border Asymmetry, computed as the percentage of non-overlapping area after a hypothetical folding of the border around the greatest diameter or the maximum symmetry diameters.
– The variance of the distance of the border lesion points from the centroid location.
– Minimum, maximum, average and variance responses of the gradient operator, applied on the intensity image along the lesion border.
Colour Features
– Plain RGB colour plane average and variance responses for pixels within the lesion.
– Intensity, Hue, Saturation colour space average and variance responses for pixels within the lesion: I = (R + G + B)/3, S = 1 − [3/(R + G + B)]·min(R, G, B), and H = W if G > B, H = 2π − W if G < B, H = 0 if G = B, where W = arccos[(R − (G + B)/2) / ((R − G)² + (R − B)(G − B))^(1/2)].
– Spherical coordinates LAB average and variance responses for pixels within the lesion: L = (R² + G² + B²)^(1/2), AngleA = cos⁻¹(B/L), and AngleB = cos⁻¹[R/(L·sin(AngleA))].
Texture features
– Dissimilarity, d, a measure related to contrast using a linear increase of weights as one moves away from the GLCM diagonal: d = ∑_{i,j=0}^{N−1} P_{i,j}·|i − j|, where i and j denote the rows and columns, respectively, N is the total number of rows and columns, and P_{i,j} = V_{i,j} / ∑_{i,j=0}^{N−1} V_{i,j} is the normalisation equation in which V_{i,j} is the DN value of the cell (i, j) in the image window.
– Angular Second Moment, ASM, a measure related to orderliness, where P_{i,j} is used as a weight to itself: ASM = ∑_{i,j=0}^{N−1} P_{i,j}².
– GLCM Mean, μ_i, which differs from the familiar mean equation in the sense that it denotes the frequency of the occurrence of one pixel value in combination with a certain neighbour pixel value, and is given by μ_i = ∑_{i,j=0}^{N−1} i·P_{i,j}. For the symmetrical GLCM, it holds that μ_i = μ_j.
– GLCM Standard Deviation, σ_i, which gives a measure of the dispersion of the values around the mean: σ_i = (∑_{i,j=0}^{N−1} P_{i,j}·(i − μ_i)²)^(1/2).
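As a concrete illustration of the texture measures above, the following Python sketch computes the dissimilarity, ASM, GLCM mean and GLCM standard deviation from a normalised grey-level co-occurrence matrix; the construction of the GLCM itself (quantisation levels, pixel offsets) is left out, and nothing here is tied to the authors' implementation.

import numpy as np

def glcm_texture_features(glcm):
    """glcm: an N x N co-occurrence matrix; entries are normalised to sum to 1."""
    P = np.asarray(glcm, dtype=float)
    P = P / P.sum()
    N = P.shape[0]
    i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    dissimilarity = np.sum(P * np.abs(i - j))        # contrast-related measure
    asm = np.sum(P ** 2)                             # Angular Second Moment
    mean_i = np.sum(i * P)                           # GLCM mean
    std_i = np.sqrt(np.sum(P * (i - mean_i) ** 2))   # GLCM standard deviation
    return {"dissimilarity": dissimilarity, "ASM": asm, "mean": mean_i, "std": std_i}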
3 The Proposed Clustering Algorithm
To formally describe the manner in which principal direction based divisive clustering algorithms operate, let us assume the data are represented by an n × a matrix D, whose row vectors represent the data samples d_i, for i = 1, . . . , n. Also define the vector b and the matrix Σ to represent the mean vector and the covariance of the data, respectively: b = (1/n) ∑_{i=1}^{n} d_i and Σ = (1/n) (D − e bᵀ)ᵀ (D − e bᵀ), where e is a column vector of ones. The covariance matrix Σ is symmetric and positive semi-definite, so all its eigenvalues are real and non-negative. The eigenvectors u_j, j = 1, . . . , k, corresponding to the k largest eigenvalues are called the principal components or principal directions. The PDDP algorithm and all similar techniques use the projections pr_i = u_1ᵀ(d_i − b), i = 1, . . . , n, onto the first principal component u_1, to initially separate the entire data set into two partitions PT_1 and PT_2. To generate this partitioning, the original PDDP algorithm uses the sign of the projection of each data point (division point 0). However, very often the sign of the projection can lead the algorithm to undesirable cluster splits [3,4,17]. More formally, we can define this splitting criterion as follows:
– (Splitting Criterion – SPC_1): ∀d_i ∈ D, if pr_i ≥ 0, then the i-th data point belongs to the first partition, PT_1 = PT_1 ∪ {d_i}; otherwise, it belongs to the second partition, PT_2 = PT_2 ∪ {d_i}.
To answer the cluster selection question Q1, the PDDP algorithm, as well as all its variations [3,4,12,20], selects the cluster with maximum scatter value SV, defined as SV = ‖D − e bᵀ‖_F, where e is a column vector of ones, the vector b represents the mean vector of D, and ‖·‖_F is the Frobenius norm. This quantity can be used as a measure of coherence and it is the same as the one used by the k-means steered PDDP algorithm [20]. Finally, most of the PDDP variants terminate the clustering procedure when a user-defined number of clusters has been retrieved.
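To make the splitting step concrete, here is a minimal Python sketch of a single PDDP-style split: the data are centred, the first principal direction is taken as the leading right singular vector, and the sign of the projections (criterion SPC_1) decides the two partitions. This is an illustrative reimplementation, not the authors' code.

import numpy as np

def pddp_split(D):
    """Split the rows of D (n x a) by the sign of their projections onto
    the first principal direction (criterion SPC_1)."""
    b = D.mean(axis=0)                       # mean vector
    C = D - b                                # centred data
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    u1 = Vt[0]                               # first principal direction
    proj = C @ u1                            # pr_i = u1^T (d_i - b)
    left = np.where(proj >= 0)[0]            # PT_1
    right = np.where(proj < 0)[0]            # PT_2
    return left, right, proj

dePDDP, described next, replaces this sign-based division point with the minimiser of a kernel density estimate of the projections.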
3.1 dePDDP
The dePDDP algorithm is based on a splitting criterion which suggests that splitting at the minimiser of the estimated density of the data projected onto the first principal component is the best we can do to avoid splitting clusters. The cluster selection criterion and the termination criterion are guided by the same idea. More specifically, the dePDDP algorithm utilises the following criteria:
– (Stopping Criterion – ST_2): Let Π be a partition of the data set D into k sets. Let X be the set of minimisers x*_i of the density estimates fˆ(x*_i; h) of the projections of the data of each C_i ∈ Π, i = 1, . . . , k. Stop the procedure when the set X is empty.
– (Cluster Selection Criterion – CS_2): Let Π be a partition of the data set D into k sets. Let F be the set of the density estimates f_i = fˆ(x*_i; h) at the minimisers x*_i for the projections of the data of each C_i ∈ Π, i = 1, . . . , k. The next set to split is C_j, with j = arg max_i {f_i : f_i ∈ F}.
– (Splitting Criterion – SPC_3): Let fˆ(x; h) be the kernel density estimate of the density of the projections pr_i ∈ PT, and x* its global minimiser. Then construct PT_1 = {d_i ∈ D : pr_i ≤ x*} and PT_2 = {d_i ∈ D : pr_i > x*}.
The dePDDP implementation is shown below:
Function dePDDP(D)
1. Set Π = {D}
2. Do
3.   Select a set C ∈ Π using Cluster Selection Criterion CS_2
4.   Split C into two sub-sets C_1, C_2 using Splitting Criterion SPC_3
5.   Remove C from Π and set Π → Π ∪ {C_1, C_2}
6. While Stopping Criterion ST_2 is not satisfied
7. Return Π, the partition of D into |Π| clusters
To compute the principal vectors, as in the original PDDP algorithm, the Singular Value Decomposition of the data matrix D is employed. This introduces a total worst-case complexity of O(c_max (2 + k_SVD) s_nz n a), where k_SVD is the number of iterations needed by the Lanczos SVD computation algorithm and s_nz is the fraction of non-zero entries in D. For more details refer to [1]. The computational complexity of the kernel density estimation, using a brute-force technique, would be quadratic in the number of samples. However, it has been shown [5,19] that techniques like the Fast Gauss Transform achieve linear running time for kernel density estimation, especially in the one-dimensional case. To find the minimiser we only need to evaluate the density at n positions, in between the projected data points, since those are the only places where valid splitting points can occur. Thus, the total complexity of the algorithm remains O(c_max (2 + k_SVD) s_nz n a). From the termination criterion it is obvious that the algorithm can be used to automatically determine the number of clusters: we can stop the iteration as soon as there does not exist a minimiser for any of the retrieved clusters [16]. In this work we do not focus on this attribute of the algorithm and we consider the termination criterion to be the same as for the PDDP algorithm.
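The Python sketch below illustrates the dePDDP splitting step (criterion SPC_3): the one-dimensional projections are smoothed with a Gaussian kernel density estimate and the lowest interior local minimum is used as the splitting point. The default bandwidth and the grid-based minimiser search are simplifications for illustration, not the authors' exact implementation.

import numpy as np
from scipy.stats import gaussian_kde

def depddp_split(D):
    """One dePDDP-style split: project onto the first principal direction,
    estimate the density of the projections, split at its deepest interior
    local minimum (if any)."""
    C = D - D.mean(axis=0)
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    proj = C @ Vt[0]
    kde = gaussian_kde(proj)                       # Gaussian KDE, default bandwidth
    grid = np.linspace(proj.min(), proj.max(), 512)
    dens = kde(grid)
    minima = [k for k in range(1, len(grid) - 1)
              if dens[k] < dens[k - 1] and dens[k] < dens[k + 1]]
    if not minima:                                 # no minimiser: do not split (ST_2)
        return None
    best = min(minima, key=lambda k: dens[k])
    x_star = grid[best]
    return np.where(proj <= x_star)[0], np.where(proj > x_star)[0], dens[best]

The returned density value at the chosen minimiser can serve as the cluster-selection score of criterion CS_2, while the absence of any interior minimiser corresponds to the stopping criterion ST_2.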
4 Experimental Analysis
As described in Section 2, the image dataset used in our experiments consists of 3631 images: 972 of them display nevus (dysplastic skin lesions), 2590 feature non-dysplastic lesions, and the remaining 69 images contain malignant melanoma cases. The number of melanoma images is not so small considering the fact that malignant melanoma cases in a primordial state are very rare. Our first clustering experiment was on the full dataset containing all classes and then in
our sub-experiments we use a dataset which contains the melanoma class and one of the other classes each time. To determine the quality of the clustering results we use the purity measure, as defined in [6,18]: PU = (1/|TN|) ∑_{i=1}^{k} |NP_i|, where |NP_i| denotes the number of points with the dominant class label in cluster i, |TN| the total number of points, and k the number of clusters. Intuitively, this measures the purity of the clusters with respect to the true cluster (class) labels that are known for our data sets, and can be considered as a clustering version of the classification accuracy. In our experiments, we measure the cluster purity of PDDP, iPDDP [17], dePDDP, and k-means-PDDP [20]. In all our experiments, the number of clusters that the algorithms were required to compute was given as input. Although the dataset contains 3 classes, we force the clustering algorithms to find more clusters in order to better understand the structure of the dataset, since a particular class might be defined by several sub-clusters. For a small value of the cluster number parameter, the structure of the true clusters has very small correspondence to the class labels. As the cluster number is increased, the purity, as expected, also increases. As a result, when we set the cluster number to 25, the dePDDP algorithm accomplishes a very high purity and the points with different class labels are mostly separated into different clusters. Studying the confusion matrix exhibited in Table 1, it is evident that the dePDDP algorithm misassigned only 14 samples of the melanoma class. The non-dysplastic class is clearly separated, with all its samples placed in one big cluster. The samples of the melanoma class are spread through 6 clusters and the rest of the clusters contain the samples of the nevus class. Thus, the structure of the nevus class is not solid, probably because of the existence of many outliers. This can be the reason for the low purity results of the iPDDP algorithm. None of the other algorithms managed to compute clusters with high cluster-class correspondence.

Table 1. Confusion Matrix of the dePDDP algorithm for 25 clusters
            non-dys  nevus  mel                non-dys  nevus  mel
cluster1       2590      7    0   cluster14          0      0    8
cluster2          0    245    0   cluster15          0      0    7
cluster3          0    161    0   cluster16          0      0    9
cluster4          0      9    0   cluster17          0      0    4
cluster5          0     12    0   cluster18          0      0   27
cluster6          0      1    0   cluster19          0    126    0
cluster7          0      6    0   cluster20          0     19    0
cluster8          0     16    0   cluster21          0     16    0
cluster9          0     94    0   cluster22          0     22    0
cluster10         0      4    0   cluster23          0      9    0
cluster11         0     68    0   cluster24          0     10    0
cluster12         0      3    0   cluster25          0      5    0
cluster13         0    139   14
Table 2. Results with respect to the clustering purity for different values of cluster number input at the full dataset

Cluster Number   PDDP    iPDDP   dePDDP  k-means-PDDP
      5          0.7369  0.7168  0.7243     0.9339
     10          0.7369  0.7267  0.8124     0.9471
     15          0.7369  0.7312  0.9339     0.9498
     20          0.9658  0.7469  0.9405     0.9526
     25          0.9765  0.7548  0.9942     0.9553
Table 3. Results with respect to the clustering purity for different values of cluster number input for the non-dysplastic and melanoma classes

Cluster Number   PDDP    iPDDP   dePDDP  k-means-PDDP
      5          0.9740  0.9917  0.9785     0.9740
     10          0.9740  0.9917  0.9928     0.9740
     15          0.9740  0.9917  0.9988     0.9740
     20          0.9740  0.9917  0.9988     0.9740
     25          0.9740  0.9917  0.9988     0.9740
Table 4. Confusion Matrices of the PDDP and dePDDP algorithms for 5 clusters at the non-dysplastic and melanoma dataset

             PDDP                            dePDDP
           non-dys  melanoma               non-dys  melanoma
cluster1        86         0    cluster1        43         0
cluster2        62         0    cluster2      2525         3
cluster3       167         0    cluster3        22        38
cluster4       914         0    cluster4         0         6
cluster5      1361        69    cluster5         0         0
In Table 2 we illustrate the clustering results with respect to purity for different values of the cluster number parameter. Next, we demonstrate how the melanoma class can be separated from each of the other classes. To this end, two new datasets are formed, composed of two classes each. The first contains samples from the non-dysplastic and the melanoma classes, while the second contains samples from the dysplastic (nevus) and the melanoma classes. In Table 3, we present the purity results for the dataset that contains only the non-dysplastic and the melanoma classes. These two classes are successfully separated. The dePDDP algorithm achieves very high purity even for the case of 10 clusters. In Table 4, we exhibit the confusion matrices of this experiment for the PDDP and dePDDP algorithms for the case of 5 clusters. It is evident that even in this case, for which dePDDP does not achieve high purity, the melanoma class can be properly retrieved. The 22 non-dysplastic samples that are classified with the melanoma samples are, in practice, not a major concern, since they represent false positives.
Table 5. Results with respect to the clustering purity for different values of cluster number input for the nevus and melanoma classes

Cluster Number   PDDP    iPDDP   dePDDP  k-means-PDDP
      5          0.9337  0.9942  0.9971     0.9337
     10          0.9337  0.9538  0.9971     0.9337
     15          0.9337  0.9961  0.9971     0.9337
     20          0.9337  0.9961  0.9971     0.9337
     25          0.9951  0.9961  0.9971     0.9337
Table 6. Confusion Matrices of the PDDP and dePDDP algorithms for 5 clusters at the nevus and melanoma dataset

             PDDP                          dePDDP
            nevus  melanoma               nevus  melanoma
cluster1       39         0    cluster1      11         0
cluster2       35         0    cluster2     962         3
cluster3       64         0    cluster3       0        52
cluster4      375         0    cluster4       0         8
cluster5      459        69    cluster5       0         6
In the last experiment we utilise the dataset that contains samples from the nevus and the melanoma classes. Table 5 illustrates the purity results. Again in this case, the dePDDP algorithm managed to achieve very good results, even for a small number of clusters. The confusion matrices for the PDDP and dePDDP algorithms for the case of 5 clusters are presented in Table 6. In this experiment the classes are very well separated and only 3 melanoma samples were misassigned. Analysing all the experimental results, we can come to a very important conclusion. In our first experiment, the dePDDP algorithm managed to effectively separate the non-dysplastic class. As a next step, we could remove the samples of the non-dysplastic class and execute the algorithm to cluster the remaining samples. Then, as shown in Table 6, the melanoma class can be retrieved with high accuracy.
5 Conclusions
Nowadays, image acquisition and processing tools serve as diagnostic adjuncts for medical professionals for the confirmation of a diagnosis, as well as for the training of new dermatologists. The introduction of diagnostic tools based on intelligent decision support systems is also capable of enhancing the quality of medical care, particularly in areas where a specialised dermatologist is not available. The inability of general physicians to provide high quality dermatological services leads them to wrong diagnoses, particularly in evaluating fatal skin diseases such as melanoma. In such cases, an intelligent system may detect the
possibility of a serious skin lesion and warn of the need for early treatment. A promising technique for dealing with such a problem seems to be the use of data mining methods. In particular, clustering could be the key step in understanding the differences between the types of skin lesions. In this paper we propose an advanced algorithmic approach, which incorporates technologies from image analysis, feature extraction, numerical linear algebra, and clustering, in an attempt to efficiently and effectively produce high quality retrieval of melanoma samples in skin lesion images. In future work we intend to extend the proposed approach, focusing on the automatic determination of the number of clusters in the dataset.
References 1. Boley, D.: Principal direction divisive partitioning. Data Mining and Knowledge Discovery 2(4), 325–344 (1998) 2. Chung, D.H., Sapiro, G.: Segmenting skin lesions with partial-differentialequations-based image processing algorithms. IEEE Transactions on Medical Imaging 19(7), 763–767 (2000) 3. Dhillon, I., Kogan, J., Nicholas, C.: Feature selection and document clustering. A Comprehensive Survey of Text Mining, 73–100 (2003) 4. Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 269–274. ACM, New York (2001) 5. Greengard, L., Strain, J.: The fast gauss transform. SIAM J. Sci. Stat. Comput. 12(1), 79–94 (1991) 6. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Computing Surveys 31(3), 264–323 (1999), citeseer.ist.psu.edu/jain99data.html 7. Maglogiannis, I.: Automated segmentation and registration of dermatological images, vol. 2, pp. 277–294. Springer, Heidelberg (2003) 8. Maglogiannis, I., Pavlopoulos, S., Koutsouris, D.: An integrated computer supported acquisition, handling and characterization system for pigmented skin lesions in dermatological images. IEEE Transactions on Information Technology in Biomedicine 9(1), 86–98 (2005) 9. Maglogiannis, I., Doukas, C.: Overview of advanced computer vision systems for skin lesions characterization. IEEE Transactions on Information Technology in Biomedicine 13(5), 721–733 (2009) 10. Marks, R.: Epidemiology of melanoma. Clin. Exp. Dermatol. 25, 459–463 (2000) 11. Nachbar, F., Stolz, W., Merkle, T., Cognetta, A.B., Vogt, T., Landthaler, M., Bilek, P., Braun-Falco, O., Plewig, G.: The abcd rule of dermatoscopy: High prospective value in the diagnosis of doubtful melanocytic skin lesions. J. Amer. Acad. Dermatol. 30(4), 551–559 (1994) 12. Nilsson, M.: Hierarchical Clustering using non-greedy principal direction divisive partitioning. Information Retrieval 5(4), 311–321 (2002) 13. Pariser, R., Pariser, D.: Primary care physicians errors in handling cutaneous disorders. J. Am. Acad. Dermatol. 17(3), 239–245 (1987) 14. Sanders, J.E., Goldstein, B.S., Leotta, D.F., Richards, K.A.: Image processing techniques for quantitative analysis of skin structures. Computer Methods and Programs in Biomedicine 59(3), 167–180 (1999)
15. Stoecker, W.V., Li, W.W., Moss, R.H.: Automatic detection of asymmetry in skin tumors. Computerized Med. Imag. Graph 16(3), 191–197 (1992) 16. Tasoulis, S.K., Plagianakos, V.P., Tasoulis, D.K.: Projection based clustering at gene expression data. In: Computational intelligence methods for bioinformatics and biostatistics, Genova, Italy (2009) 17. Tasoulis, S., Tasoulis, D.: Improving principal direction divisive clustering. In: 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008), Workshop on Data Mining using Matrices and Tensors (2008) 18. Zhao, Y., Karypis, G.: Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning 55(3), 311–331 (2004) 19. Yang, C., Duraiswami, R., Gumerov, N.A., Davis, L.: Improved fast gauss transform and efficient kernel density estimation. In: Proceedings of Ninth IEEE International Conference on Computer Vision 2003, pp. 664–671 (2003) 20. Zeimpekis, D., Gallopoulos, E.: Principal direction divisive partitioning with kernels and k-means steering. In: Survey of Text Mining II: Clustering, Classification, and Retrieval, pp. 45–64 (2007) 21. Zhang, Z., Stoecker, W., Moss, R.: Border detection on digitized skin tumor image. IEEE Transactions on Medical Imaging 19(11), 1128–1143 (2000)
Mining for Mutually Exclusive Gene Expressions

George Tzanis and Ioannis Vlahavas

Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
{gtzanis,vlahavas}@csd.auth.gr
http://mlkd.csd.auth.gr
Abstract. Association rules mining is a popular task that involves the discovery of co-occurrences of items in transaction databases. Several extensions of the traditional association rules mining model have been proposed so far; however, the problem of mining for mutually exclusive items has not been investigated. Such information could be useful in various cases in many application domains like bioinformatics (e.g. when the expression of a gene excludes the expression of another). In this paper, we address the problem of mining pairs and triples of genes, such that the presence of one excludes the presence of the other. First, we provide a concise review of the literature, then we define the problem, we propose a probability-based evaluation metric, and finally we present a mining algorithm that we apply on gene expression data, gaining new biological insights.
1 Introduction

This paper deals with the issue of gene expression analysis. Proteins are the main structural and functional units of an organism's cell, whereas DNA and RNA have the role of carrying the genetic information of the organism. In particular, the genetic information that is coded in the genes of DNA is transcribed into messenger RNA (mRNA) and then translated into a protein. The functions of an organism depend on the abundance of proteins, which is partly determined by the levels of mRNA, which in turn are determined by the expression of the corresponding genes. Changes in gene expression underlie many biological phenomena. The study of gene expression levels may lead to very important findings. SAGE (Serial Analysis of Gene Expression) is a method that provides the quantitative and simultaneous analysis of the whole gene function of a cell [26]. The method works by counting short tags of all the mRNA transcripts of a cell. The set of all tag counts in a single sample is called a SAGE library, and describes the gene expression profile of the sample. An important advantage of the SAGE method, against other methods like microarrays, is that the experimenter does not have to select the mRNA sequences that will be counted in a sample. This is quite important, since the appropriate sequences for studying various diseases such as cancer are not usually known in advance. This advantage of SAGE makes it a fairly promising method, especially for cancer studies such as ours. In this paper we present a method that utilizes the concept of association rules mining for extracting mutually exclusive expressions of genes. This is a new problem that
has not been studied yet. We define the problem of mining for genes with mutually exclusive expressions. We propose two metrics and a mining algorithm that we study on SAGE data. The paper is organized as follows. The next section presents the required background knowledge. Section 3 contains a short review of the related literature. Section 4 contains the description of the proposed approach, definitions of the terms and notions that are used, the proposed algorithm, and the metrics for measuring mutual exclusion. In Section 5 we present our experiments and discuss important issues, and in Section 6 we conclude.
2 Preliminaries

This section provides the necessary background knowledge, including mining for frequent itemsets and contiguous frequent itemsets.

2.1 Frequent Itemsets

The term "frequent itemset" has been proposed in the framework of association rules mining. The association rules mining paradigm involves searching for co-occurrences of items in transaction databases. Such a co-occurrence may imply a relationship among the items it associates. The task of mining association rules consists of two main steps. The first one includes the discovery of all the frequent itemsets contained in a transaction database. In the second step, the association rules are generated from the discovered frequent itemsets. A formal statement of the concept of frequent itemsets is presented in the following paragraph.
Let I = {i1, i2, …, iN} be a finite set of binary attributes, which are called items, and D be a finite multiset of transactions, which is called the dataset. Each transaction T ∈ D is a set of items such that T ⊆ I. A set of items is usually called an itemset. The length or size of an itemset is the number of items it contains. It is said that a transaction T ∈ D contains an itemset X ⊆ I, if X ⊆ T. The support of itemset X is defined as the fraction of the transactions that contain itemset X over the total number of transactions in D:

supp_D(X) = |{T ∈ D | T ⊇ X}| / |D|    (1)
Given a minimum support threshold σ ∈ (0, 1], an itemset X is said to be σ-frequent, or simply frequent in D, if supp_D(X) ≥ σ.

2.2 Contiguous Frequent Itemsets

In the following lines we provide some definitions and formulate the problem of mining contiguous frequent itemsets, as defined in our previous work [5]. Every frequent itemset F ⊆ I divides the search space into two disjoint subspaces: the first consists of the transactions that contain F, and from now on will be called the F-subspace, while the second consists of all the other transactions.
Definition 1. Let F ⊆ I be a frequent itemset in D, and E ⊆ I be another itemset. The itemset F ∪ E is considered to be a contiguous frequent itemset, if F ∩ E = ∅ and E is frequent in the F-subspace. Itemset E is called the locally frequent extension of F.
The term locally is used because E may not be frequent in the whole set of transactions. In order to avoid any confusion, from now on we will use the terms local and locally when we refer to a subset of D, and the terms global and globally when we refer to D. For example, we use the terms global support (gsup = supp_D) and local support (lsup = supp_{FSub ⊂ D}). Two separate thresholds may be set for global and local support. An itemset F that satisfies the minimum global support threshold (min_gsup) is considered to be globally frequent, and an itemset E that is frequent in the F-subspace, according to the minimum local support threshold (min_lsup), is considered to be locally frequent. The local support of an itemset E in the F-subspace can be calculated as in (2):

supp_{FSub}(E) = supp_D(E ∪ F) / supp_D(F)    (2)
Given a finite multiset of transactions D, the problem of mining contiguous frequent itemsets is to generate all itemsets F ∪ E that consist of an itemset F that has global support at least equal to the user-specified minimum global support threshold and an extension E that has local support at least equal to the user-specified minimum local support threshold.
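To make the global and local support notions concrete, here is a small Python sketch that computes supp_D and the local support of an extension E in the F-subspace; the toy transactions are illustrative only.

def global_support(D, X):
    """supp_D(X): fraction of transactions in D that contain itemset X."""
    X = set(X)
    return sum(1 for T in D if X <= set(T)) / len(D)

def local_support(D, F, E):
    """Support of E inside the F-subspace: supp_D(E u F) / supp_D(F)."""
    sF = global_support(D, F)
    return global_support(D, set(F) | set(E)) / sF if sF else 0.0

# Toy example: items are gene (tag) identifiers
D = [{"g1", "g2"}, {"g1", "g3"}, {"g1", "g2", "g3"}, {"g4"}]
print(global_support(D, {"g1"}))          # 0.75
print(local_support(D, {"g1"}, {"g2"}))   # 2/3 within the {g1}-subspace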
3 Related Work

Some recent efforts have utilized data mining methods for analyzing SAGE data. Decision trees (C4.5) and support vector machines were used in [7] to classify the data according to cell state (normal or cancerous) and tissue type (colon, brain, ovary, etc.). Hierarchical clustering of SAGE libraries was also studied [15]. In [24], hierarchical and partitional (K-Means) clustering algorithms as well as various cluster validation criteria were studied. Other approaches have also been applied to SAGE data, including mining of frequent patterns [25], strong emerging patterns [16], association rules [4], and frequent closed itemsets [9]. The effect of dimensionality reduction methods was studied in [3]. Data cleaning was considered in [14], as well as the process of the attribution of a tag to a gene. Finally, various feature ranking, classification, and error estimation methods were presented in [13]. Association rules were first introduced by Agrawal et al. [1] as a market basket analysis tool. Later, Agrawal and Srikant [2] proposed Apriori, a level-wise algorithm, which works by generating candidate itemsets and testing whether they are frequent by scanning the database. Several algorithms have been proposed since then, some improving the efficiency, such as FP-Growth [11], and others addressing different problems from various application domains, such as spatial [12], temporal [6] and intertransactional rules [21]. One of the major problems in association rules mining is the large number of often uninteresting rules extracted. Srikant and Agrawal [18] presented the problem of mining for generalized association rules. Thomas and Sarawagi [20] propose a technique
for mining generalized association rules based on SQL queries. Han and Fu [10] also describe the problem of mining "multiple-level" association rules, based on taxonomies, and propose a set of top-down progressive deepening algorithms. Teng [19] proposes a type of augmented association rules, using negative information, called dissociations. A dissociation is a relationship of the form "X does not imply Y", but it could be that "when X appears together with Z, this implies Y". Another kind of association rules are negative association rules. Savasere et al. [17] introduced the problem of mining for negative associations. They propose a naive and an improved algorithm for mining negative association rules, along with a new measure of interestingness. In a more recent work, Wu et al. [27] present an efficient method for mining positive and negative associations and propose a pruning strategy and an interestingness measure. Their method extends the traditional positive association rules (A ⇒ B) to include negative association rules of the form A ⇒ ¬B, ¬A ⇒ B, and ¬A ⇒ ¬B. The last three rules indicate negative associations between itemsets A and B. A mutual exclusion cannot be expressed by one such rule. If items a and b are mutually exclusive, then {a} ⇒ ¬{b} and {b} ⇒ ¬{a} hold concurrently, which is different from ¬{a} ⇒ ¬{b}. The problem of mining pairs of mutually exclusive items has been recently introduced [22, 23].
4 Our Approach

In this section we provide a detailed description of the proposed approach. Before presenting the basic steps of this approach, we describe the structure of the input data. The data are structured in a gene expression matrix A. The columns of the matrix represent the tags of the genes and the rows represent the different samples (SAGE libraries). The intersection of the i-th row with the j-th column, namely the element a_ij, is the gene expression level of gene j in sample i.

4.1 Discretization

The data that will be used for mining the mutually exclusive gene expressions should contain binary values. Each value denotes whether a gene in a particular SAGE library is expressed or not. We have utilized three methods for discretization that have been proposed in [4]. These methods are presented below (a sketch of the corresponding rules follows the list):
• Max minus x%. This consists of identifying the highest expression value (HV) in any library for each tag, and defining a value of 1 for the expression of that tag in a library when the expression value is above HV minus x% of HV, i.e. above (1 − x/100)·HV. Otherwise, the expression of the tag is assigned a value of 0.
• Mid-range based cutoff. The highest and lowest expression values are identified for each tag and the mid-range value is defined as the arithmetic mean of these two numbers. Then, all expression values below or equal to the mid-range are set to 0, and all values above the mid-range are set to 1.
• X% of highest value. For each tag, we identify the libraries in which its level of expression is within the x% of highest values (e.g. 30%). These are assigned the value 1, and the rest are set to 0.
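The following Python sketch implements the three discretization rules on a samples × tags expression matrix, under the interpretation given above; the threshold handling and parameter names are assumptions for illustration.

import numpy as np

def discretize(A, method="mid-range", x=5):
    """A: samples x tags expression matrix. Returns a binary matrix."""
    A = np.asarray(A, dtype=float)
    if method == "max-minus":                 # "max minus x%"
        hv = A.max(axis=0)
        return (A > (1 - x / 100.0) * hv).astype(int)
    if method == "mid-range":                 # "mid-range based cutoff"
        mid = (A.max(axis=0) + A.min(axis=0)) / 2.0
        return (A > mid).astype(int)
    if method == "top-percent":               # "x% of highest value"
        out = np.zeros_like(A, dtype=int)
        k = max(1, int(round(A.shape[0] * x / 100.0)))
        top = np.argsort(-A, axis=0)[:k]      # the k highest libraries per tag
        for col in range(A.shape[1]):
            out[top[:, col], col] = 1
        return out
    raise ValueError("unknown method")

Each row of the resulting binary matrix can then be converted into a transaction containing the IDs of the tags that have value 1.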
The use of the above methods filters out a large number of infrequent tags that may correspond either to genes that have very low expression levels or to mistakenly counted tags (e.g. a tag with count 1 could be caused by an error in sequencing of the tag). Any of the above methods can be selected depending on the particular study and dataset. For example, if a very small number of expressed genes is desirable, then the "10% of highest value" method would be a good choice. At this step, the dataset is also converted to a transactional format, so that each sample contains the IDs of the genes (tags) that are expressed in the particular sample (SAGE library).

4.2 Definition of Mutual Exclusion

Definition 2. Let D be a finite multiset of transactions (samples or SAGE libraries) and I be a finite set of items (genes or tags). Each transaction T ∈ D is a set of items such that T ⊆ I. If two items i1 ∈ I and i2 ∈ I are mutually exclusive, then there is no transaction T ∈ D such that {i1, i2} ⊆ T. If a gene i is contained in transaction T, then gene i is expressed in SAGE library T.

The above definition of mutual exclusion is strict. However, the inverse of Definition 2 does not hold in general, so it cannot be used to identify mutually exclusive items. Typically, in a SAGE dataset the genes number in the tens of thousands, whereas the SAGE libraries (transactions) number in the hundreds. It is possible that there is a large number of pairs of genes that are never expressed together. According to the inverse of Definition 2, all of these pairs of genes have mutually exclusive expressions, but in fact only a very small number of them are truly mutually exclusive. Mining for mutually exclusive items in a database possibly containing several thousands of different items involves searching a space that consists of all the possible pairs of items, because virtually any pair could contain two items that exclude each other. However, this approach is naive and simplistic and can lead to many reported pairs that are not in fact mutually exclusive. We propose a more intuitive approach, which is based on the assumption/observation that every frequent itemset expresses a certain behavior and therefore can be used to guide our search. Items that appear with high frequency in the subspace of a frequent itemset are more likely to be systematically mutually exclusive, because they follow a pattern and not pure chance or unusual cases. Our approach consists of three steps. In the first step, all the frequent itemsets are mined. Then, the frequent itemsets are used for mining the contiguous frequent itemsets, producing the extensions that will be used in the next step as candidate mutually exclusive items. Any frequent itemset mining algorithm can be used in the first step. Step 2 works in a level-wise manner and requires a number of scans over the database, which is proportional to the size of the extensions discovered. The extensions of the contiguous frequent itemsets mined at the second step are candidates for participating in a pair of mutually exclusive items.

4.3 Mutual Exclusion Metrics and Mining Algorithm

In order to distinguish when two items are mutually exclusive, we need a measure to evaluate the degree of mutual exclusion between them. Initially, we should be able
to evaluate this within the subspace of a frequent itemset (locally), and then it should be evaluated globally, with all the frequent itemsets that support the candidate pair contributing accordingly. For this purpose, we propose the use of a metric we call MEM (Mutual Exclusion Metric) that is calculated in two phases; the first one is local and is required for the second one, which is global.
Local Metric. We propose the following local metric (3), which will be called Local MEM, for the evaluation of a candidate pair of mutually exclusive items that is supported by a frequent itemset I; its range is [0, 1].

LM_I(A, B) = [P(A − B) + P(B − A)] · min[P(A − B | A), P(B − A | B)]
           = [(S_A − S_AB) + (S_B − S_AB)] · min[(S_A − S_AB)/S_A, (S_B − S_AB)/S_B]
           = (S_A + S_B − 2·S_AB) · [1 − S_AB / min(S_A, S_B)]    (3)

In the above formula P(I) = 1, and S_X is the fraction of transactions that contain X over the number of transactions that contain I.
Global Metric. We propose the following global metric (4) for the evaluation of a candidate pair of mutually exclusive items that is supported by a set IS of frequent itemsets:

GM(A, B) = IIF · ∑_{I ∈ IS} S_I · LM_I(A, B)    (4)
IIF stands for Itemset Independence Factor and is calculated as the ratio of the number of distinct items contained in all the itemsets that support a candidate pair over the total number of items contained in these itemsets. For example, the IIF of the itemsets {A, B, C} and {A, D} is 0.8, since there are 4 distinct items (A, B, C and D) over a total of 5 items (A, B, C, A and D). The IIF is used in order to take into account the possible overlapping of the itemsets that support a candidate pair of mutually exclusive items. We do this because the overlapping between the transactions that contain two different itemsets implies overlapping between the transactions that contain the pair. Using the above metrics we have implemented an algorithm for mining pairs of mutually exclusive items [22]. In our study we have extended our algorithm for mining not only pairs, but also triples of mutually exclusive items, and we have adapted our approach for application to the gene expression domain. We have implemented a tool that is available at our group's website: http://mlkd.csd.auth.gr/mutex/index.html.
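A minimal Python sketch of the Local and Global MEM computations follows; the representation of the supporting itemsets and the transaction format are illustrative assumptions, not the interface of the authors' tool.

def local_mem(D_I, a, b):
    """Local MEM of items a, b within the subspace D_I (the transactions
    that contain the supporting frequent itemset I), following equation (3)."""
    n = len(D_I)
    s_a = sum(1 for T in D_I if a in T) / n
    s_b = sum(1 for T in D_I if b in T) / n
    s_ab = sum(1 for T in D_I if a in T and b in T) / n
    if min(s_a, s_b) == 0:
        return 0.0
    return (s_a + s_b - 2 * s_ab) * (1 - s_ab / min(s_a, s_b))

def global_mem(D, supporting_itemsets, a, b):
    """Global MEM: IIF times the support-weighted sum of local MEMs (equation (4))."""
    if not supporting_itemsets:
        return 0.0
    total_items = sum(len(I) for I in supporting_itemsets)
    distinct_items = len(set().union(*supporting_itemsets))
    iif = distinct_items / total_items
    score = 0.0
    for I in supporting_itemsets:
        D_I = [T for T in D if set(I) <= set(T)]     # the I-subspace
        s_I = len(D_I) / len(D)                       # support of I
        score += s_I * local_mem(D_I, a, b)
    return iif * score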
5 Experiments

In this section we describe the dataset and the results of our experiments, and we discuss some important issues.
5.1 Dataset

We have used a SAGE dataset that consists of 90 SAGE libraries and 27679 tags. This dataset has been provided by Dr Olivier Gandrillon's team (Centre de Génétique Moléculaire et Cellulaire de Lyon, France) and has been studied and presented at the ECML/PKDD Discovery Challenge Workshops in 2004 and 2005. The SAGE libraries contained in this dataset have been prepared as of December 2002 [8]. They are collected from various human tissue types (colon, brain, ovary, etc.) and are labeled according to their cell state, which is either normal or cancerous.

5.2 Results

We have conducted a number of experiments in order to evaluate the behavior of our approach. Table 1 presents the mean transaction size of the datasets that are generated using several variants of the three discretization methods described in Section 4.1. As shown, the counts of the tags are not equally distributed. For example, the "max minus 5%" method provides a mean transaction size of 344 genes, while the "5% of highest value" method provides a mean transaction size of 1047 genes. This means that the tags whose count is at most 5% smaller than the maximum count are not the 5% of the highest counted tags, but a much smaller percentage.

Table 1. Mean transaction size for various discretization methods
Discretization Method      Mean Transaction Size
Max minus 5%                                 344
Max minus 10%                                392
Max minus 15%                                440
Max minus 20%                                500
Max minus 25%                                580
Max minus 30%                                658
5% of highest value                         1047
10% of highest value                        1867
15% of highest value                        2636
20% of highest value                        3113
25% of highest value                        3591
30% of highest value                        3897
Mid-range based cutoff                      1273

Fig. 1 presents the number of mutually exclusive pairs and triples of genes that are mined for various values of the minimum local support threshold, when the minimum global support threshold is 0.25 and mid-range cutoff discretization is used. As shown in the graph, the number of mutually exclusive genes grows exponentially with the minimum local support threshold. Moreover, the number of triples is in most cases greater than the number of pairs. The same happens when the minimum global support threshold varies (these results are not presented here, due to space limitations).
Fig. 1. Number of mutually exclusive genes against local support threshold (min_gsup = 0.25, mid-range cutoff discretization)
5.3 Discussion

In our approach, a level-wise technique is applied in order to extract the contiguous frequent sets of genes and then the pairs and triples of genes whose expressions are mutually exclusive. Searching for mutually exclusive pairs of items in a "blind" manner would produce a huge number of candidate pairs. Moreover, most of the discovered mutually exclusive items would be uninteresting. The intuition behind searching for mutually exclusive items among the extensions of frequent itemsets is manifold. First, the search space is reduced considerably. Second, genes that are expressed in particular cases, and thus are expressed in a small number of samples, cannot be mined as globally frequent. This does not mean that these genes are not important; however, they cannot be mined, leading to a possible loss of knowledge. In our approach, these genes may be found as frequent extensions of other frequent sets of genes, recovering the aforementioned loss of knowledge. Third, if a large number of frequent sets of genes share the same extensions and these common extensions are frequent in the subspace of these sets, they are likely to be mutually exclusive and possibly of the same category and the same level of a taxonomy (e.g. same tissue type). As shown by the experiments, all pairs or triples of mutually exclusive genes contain genes with different functions. In some cases, genes or ESTs with unknown functions are found. This suggests an important use of the proposed approach: it could be used as part of a procedure for discovering the function of a gene or EST. For example, if a gene with an unknown function is found to be mutually exclusive with other genes with known functions, then it is very likely that the function of the gene is none of the functions of its mutually exclusive genes. Moreover, genes that suppress each other under some conditions can also be found by the proposed approach.
6 Conclusions and Future Work

In this paper we have presented the novel problem of mining for mutually exclusive gene expressions. When two or more genes have mutually exclusive expressions, this
can be used as a valuable hint when looking for previously unknown functional relationships among them. In such a case, this can be an interesting type of knowledge for the domain expert. For this purpose, we proposed an intuitive approach and formulated the problem, providing definitions of terms and evaluation metrics. We have also developed a mining algorithm. In the future, we would like to extend our algorithm to mine not only pairs and triples of mutually exclusive genes, but sets of arbitrary size. Moreover, we plan to improve the efficiency of our algorithm.
References 1. Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules Between Sets of Items in Large Databases. In: Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 207–216 (1993) 2. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases. In: Proceedings of the International Conference on Very Large Databases, pp. 487– 499 (1994) 3. Alves, A., Zagoruiko, N., Okun, O., Kutnenko, O., Borisova, I.: Predictive Analysis of Gene Expression Data from Human SAGE Libraries. In: Proceedings of the ECML/PKDD Discovery Challenge Workshop, Porto, Portugal, pp. 60–71 (2005) 4. Becquet, C., Blachon, S., Jeudy, B., Boulicaut, J.F., Gandrillon, O.: Strong-associationrule mining for large-scale gene-expression data analysis: a case study on human SAGE data. Genome Biology 3(12) (2002) 5. Berberidis, C., Tzanis, G., Vlahavas, I.: Mining for Contiguous Frequent Itemsets in Transaction Databases. In: Proceedings of the IEEE 3rd International Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (2005) 6. Chen, X., Petrounias, I.: Discovering Temporal Association Rules: Algorithms, Language and System. In: Proceedings of the 16th International Conference on Data Engineering (2000) 7. Gamberoni, G., Storari, S.: Supervised and unsupervised learning techniques for profiling SAGE results. In: Proceedings of the ECML/PKDD Discovery Challenge Workshop, Pisa, Italy, pp. 121–126 (2004) 8. Gandrillon, O.: Guide to the gene expression data. In: Proceedings of the ECML/PKDD Discovery Challenge Workshop, Pisa, Italy, pp. 116–120 (2004) 9. Gasmi, G., Hamrouni, T., Abdelhak, S., Ben Yahia, S., Mephu Nguifo, E.: Extracting Generic Basis of Association Rules from SAGE Data. In: Proceedings of the ECML/PKDD Discovery Challenge Workshop, Porto, Portugal, pp. 84–89 (2005) 10. Han, J., Fu, Y.: Discovery of Multiple-Level Association Rules from Large Databases. In: Proceedings of the 21st International Conference on Very Large Databases, pp. 420–431 (1995) 11. Han, J., Pei, J., Yin, Y.: Mining Frequent Patterns without Candidate Generation. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 1–12 (2000) 12. Koperski, K., Han, J.: Discovery of Spatial Association Rules in Geographic Information Databases. In: Proceedings of the 4th International Symposium on Large Spatial Databases, pp. 47–66 (1995)
13. Lin, H.-T., Li, L.: “Analysis of SAGE Results with Combined Learning Techniques”. In: Proceedings of the ECML/PKDD Discovery Challenge Workshop, Porto, Portugal, pp. 102–113 (2005) 14. Martinez, R., Christen, R., Pasquier, C., Pasquier, N.: Exploratory Analysis of Cancer SAGE Data. In: Proceedings of the ECML/PKDD Discovery Challenge Workshop, Porto, Portugal, pp. 72–77 (2005) 15. Ng, R.T., Sander, J., Sleumer, M.C.: Hierarchical cluster analysis of SAGE data for cancer profiling. In: Proceedings of Workshop on Data Mining in Bioinformatics, pp. 65–72 (2001) 16. Rioult, F.: Mining strong emerging patterns in wide SAGE data. In: Proceedings of the ECML/PKDD Discovery Challenge Workshop, Pisa, Italy, pp. 484–487 (2004) 17. Savasere, A., Omiecinski, E., Navathe, S.B.: Mining for Strong Negative Associations in a Large Database of Customer Transactions. In: Proceedings of the 14th International Conference on Data Engineering, pp. 494–502 (1998) 18. Srikant, R., Agrawal, R.: Mining Generalized Association Rules. In: Proceedings of the 21st VLDB Conference, pp. 407–419 (1995) 19. Teng, C.M.: Learning form Dissociations. In: Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery, pp. 11–20 (2002) 20. Thomas, S., Sarawagi, S.: Mining Generalized Association Rules and Sequential Patterns Using SQL Queries. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, pp. 344–348 (1998) 21. Tung, A.K.H., Lu, H., Han, J., Feng, L.: Efficient Mining of Intertransaction Association Rules. IEEE Transactions on Knowledge and Data Engineering 15(1), 43–56 (2003) 22. Tzanis, G., Berberidis, C.: Mining for Mutually Exclusive Items in Transaction Databases. International Journal of Data Warehousing and Mining 3(3) (2007) 23. Tzanis, G., Berberidis, C., Vlahavas, I.: On the Discovery of Mutually Exclusive Items in a Market Basket Database. In: Proceedings of the 2nd ADBIS Workshop on Data Mining and Knowledge Discovery, Thessaloniki, Greece, September 6 (2006) 24. Tzanis, G., Vlahavas, I.: Mining High Quality Clusters of SAGE Data. In: Proceedings of the 2nd VLDB Workshop on Data Mining in Bioinformatics, Vienna, Austria (2007) 25. Tzanis, G., Vlahavas, I.: Accurate Classification of SAGE Data Based on Frequent Patterns of Gene Expression. In: Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007), Patras, Greece, October 29-31. IEEE, Los Alamitos (2007) 26. Velculescu, V.E., Zhang, L., Vogelstein, B., Kinzler, K.W.: Serial analysis of gene expression. Science 270(5235), 484–487 (1995) 27. Wu, X., Zhang, C., Zhang, S.: Efficient Mining of both Positive and Negative Association Rules. ACM Transactions on Information Systems 22(3), 381–405 (2004)
Task-Based Dependency Management for the Preservation of Digital Objects Using Rules

Yannis Tzitzikas, Yannis Marketakis, and Grigoris Antoniou

Computer Science Department, University of Crete, and Institute of Computer Science, Forth-ICS, Greece
{tzitzik,marketak,antoniou}@ics.forth.gr
Abstract. The preservation of digital objects is a topic of prominent importance for archives and digital libraries. This paper focuses on the problem of preserving the performability of tasks on digital objects. It formalizes the problem in terms of Horn Rules and details the required inference services. The proposed framework and methodology are more expressive and flexible than previous attempts, as they allow expressing the various properties of dependencies (e.g. transitivity, symmetry) straightforwardly. Finally, the paper describes how the proposed approach can be implemented using various technologies.
1 Introduction
The preservation of digital objects is a topic of prominent importance for archives and digital libraries. Supporting digital preservation, i.e. the curation of archives of digital objects, requires tackling several issues. For instance, there is a need for services that help archivists in checking whether the archived digital artifacts remain functional, in identifying hazards and the consequences of probable losses or obsolescence risks. To tackle these requirements, [11] showed how the needed services can be reduced to dependency management services, while [12] extended that model with disjunctive dependencies. The key notions of these works are those of module, dependency and profile. In a nutshell, a module can be a software/hardware component or even a knowledge base expressed either formally or informally, explicitly or tacitly, that we want to preserve. A module may require the availability of other modules in order to function, be understood or managed. We can denote such dependency relationships as t > t', meaning that module t depends on module t'. A profile is the set of modules that are assumed to be known (available or intelligible) by a user (or community of users). Based on this model, a number of services have been defined for checking whether a module is intelligible by a community, or for computing the intelligibility gap of a module. GapMgr (http://athena.ics.forth.gr:9090/Applications/GapManager/) and PreScan (http://www.ics.forth.gr/prescan) [9] are two systems that have been developed based on this model, and have been applied successfully in the context of the EU project CASPAR (http://www.casparpreserves.eu/).
In the current work we extend that framework with task-based dependencies. This extension allows expressing dependencies in a more systematic manner, i.e. each dependency is due to one or more tasks. We base the extended framework on Horn rules and sketch a methodology for applying it. The proposed framework and methodology, apart from simplifying the disjunctive dependencies of [12], are more expressive and flexible, as they allow expressing the various properties of dependencies (e.g. transitivity, symmetry) straightforwardly. The rest of this paper is organized as follows. Section 2 introduces a running example and discusses requirements. Section 3 contains background information on Datalog. Section 4 introduces the proposed approach. Section 5 elaborates on the inference services required for task-performability, risk-detection and computing intelligibility gaps, and Section 6 discusses implementation choices. Section 7 discusses related work, and finally, Section 8 concludes and identifies issues that are worth further research.
2 Motivation and Requirements
Running Example. James has a laptop where he has installed the NotePad text editor, the javac 1.6 compiler for compiling Java programs and JRE1.5 for running Java programs (bytecodes). He is learning to program in Java and C++ and, to this end, he has created through NotePad two files, HelloWorld.java and HelloWorld.cc, the first being the source code of a program in Java, the second of one in C++. Consider another user, say Helen, who has installed in her laptop the Vi editor and JRE1.5. Suppose that we want to preserve these files, i.e. to ensure that in future James and Helen will be able to edit, compile and run these files. In general, to edit a file we need an editor, to compile a program we need a compiler, and to run the bytecodes of a Java program we need a Java Virtual Machine. To ensure preservation we should be able to express the above. To this end we could use facts and rules. For example, we could state: A file is editable if it is a TextFile and a TextEditor is available. Since James has two text files (HelloWorld.java, HelloWorld.cc) and a text editor (NotePad), we can conclude that these files are editable by him. By a rule of the form: If a file is Editable then it is Readable too, we can also infer that these two files are also readable. We can define more rules in a similar manner to express more task-based dependencies, such as compilability, runnability etc. For our running example we could use the following facts and rules:
1. NotePad is a TextEditor
2. VI is a TextEditor
3. HelloWorld.java is a JavaSourceFile
4. HelloWorld.cc is a C++SourceFile
5. javac1.6 is a JavaCompiler
6. JRE1.5 is a JavaVirtualMachine
7. A file is Editable if it is a TextFile and a TextEditor is available
8. A file is JavaCompilable if it is a JavaSourceFile and a JavaCompiler is available
9. A file is C++Compilable if it is a C++SourceFile and a C++Compiler is available
10. A file is Compilable if it is JavaCompilable or C++Compilable
11. If a file is Editable then it is Readable
Lines 1-6 are actually facts, while lines 7-11 define how various tasks are carried out. Notice that some facts are valid for James while some others are valid for Helen (the only fact that is not valid for James is 2, while for Helen only 2 and 6 hold). From these we can infer that James is able to compile the file HelloWorld.java (using the lines 3,5,8,10) but he cannot compile the file HelloWorld.cc (since there is no fact about a C++Compiler for James). If James sends his TextFiles to Helen then she can only edit them but not compile them, since she has no facts about Compilers.
Requirements. In general, we have identified the following key requirements:
– Task-Performability Checking. In most cases, to perform a task we have to perform other subtasks and to fulfil associated requirements for carrying out these tasks (e.g. to have the necessary modules - in our running example the necessary digital files). Therefore, we need to be able to decide whether a task can be performed by examining all the necessary subtasks. For example we might want to ensure that a file is runnable, editable or compilable.
– Risk Detection. The removal of a software module could also affect the performability of other tasks that depend on it and thus break a chain of task-based dependencies. Therefore, we need to be able to identify which tasks are affected by such removals.
– Identification of missing resources to perform a task. When a task cannot be carried out it is desirable to be able to compute the resources that are missing. For example, if James wants to compile the file HelloWorld.cc, his system cannot perform this task since there is no C++Compiler. James should be informed that he should install a compiler for C++ to perform this task.
– Support of Task Hierarchies. It is desirable to be able to define task-type hierarchies for gaining flexibility and reducing the number of rules that have to be defined.
– Properties of Dependencies. Some dependencies are transitive, some are not. Therefore we should be able to define the properties of each kind of dependency.
3 Background: Datalog
Datalog is a query and rule language for deductive databases that syntactically is a subset of Prolog. As we will model our approach in Datalog this section provides some background material (the reader who is already familiar with Datalog can skip this section).
Syntax. The basic elements of Datalog are: variables (denoted by a capital letter), constants (numbers or alphanumeric strings), and predicates (alphanumeric strings). A term is either a constant or a variable. A constant is called ground term and the Herbrand Universe of a Datalog program is the set of constants occurring in it. An atom p(t1 , ..., tn ) consists of an n-ary predicate symbol p and a list of arguments (t1 , ..., tn ) such that each ti is a term. A literal is an atom p(t1 , ..., tn ) or a negated atom ¬p(t1 , ..., tn ). A clause is a finite list of literals, and a ground clause is a clause which does not contain any variables. Clauses containing only negative literals are called negative clauses, while positive clauses are those with only positive literals in it. A unit clause is a clause with only one literal. Horn Clauses contain at most one positive literal. There are three possible types of Horn clauses, for which additional restrictions apply in Datalog: – Facts are positive unit clauses, which also have to be ground clauses. – Rules are clauses with exactly one positive literal. The positive literal is called the head, and the list of negative literals is called the body of the rule. In Datalog, rules also must be safe, i.e. all variables occuring in the head also must occur in the body of the rule. – A goal clause is a negative clause which represents a query to the Datalog program to be answered. In Datalog, the set of predicates is partitioned into two disjoint sets, EP red and IP red. The elements of EP red denote extensionally defined predicates, i.e. predicates whose extensions are given by the facts of the Datalog programs (i.e. tuples of database tables), while the elements of IP red denote intensionally defined predicates, where the extension is defined by means of the rules of the Datalog program. Furthermore, there are built-in predicates like e.g. =, =, <, which we do not discuss explicitly here. If S is a set of positive unit clauses, then E(S) denotes the extensional part of S, i.e. the set of all unit clauses in S whose predicates are elements of EP red. On the other hand, I(S) = S − E(S) denotes the intensional part of S (clauses in S with at least one predicate from IP red). Now we can define a Datalog program P as a finite set of Horn clauses such that for all C ∈ P , either C ∈ EP red or C is a safe rule where the predicate occurring in the head of C belongs to IP red. So far, we have described the syntax of pure Datalog. In order to allow also for negation, we consider an extension called stratified Datalog. Here negated literals in rule bodies are allowed, but with the restriction that the program must be stratified. For checking this property, the dependency graph of a Datalog program P has to be constructed. For each rule in P , there is an arc from each predicate occuring in the rule body to the head predicate. P is stratified iff whenever there is a rule with head predicate p and a negated subgoal with predicate q, then there is no path in the dependency graph from p to q. Semantics. The Herbrand base (HB) of a Datalog program is the set of all possible ground unit clauses that can be formed with the predicate symbols and the constants occurring in the program. Furthermore, let EHB denote the extensional and IHB the intensional part of HB. An extensional database (EBD)
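To make these definitions concrete, the following small Datalog program is our own illustration (it is not part of the paper; the predicates parent/2, ancestor/2, person/1 and unrelated/2 are hypothetical, and the concrete syntax for negation varies between Datalog engines):

    % EDB facts (extensionally defined predicates)
    parent(anna, bob).
    parent(bob, chris).
    person(anna).  person(bob).  person(chris).

    % IDB rules (intensionally defined predicates); both rules are safe,
    % since every variable occurring in a head also occurs in the body
    ancestor(X, Y) :- parent(X, Y).
    ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

    % Stratified negation: unrelated depends negatively on ancestor, and
    % ancestor does not depend on unrelated, so the program is stratified
    unrelated(X, Y) :- person(X), person(Y), not ancestor(X, Y), not ancestor(Y, X).

    % A goal clause (query): ?- ancestor(anna, Y).

Evaluating the program bottom-up first yields ancestor(anna, bob), ancestor(bob, chris) and ancestor(anna, chris), after which the negated subgoals of unrelated can be evaluated safely, as required by stratification.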
is a subset of EHB, i.e. a finite set of positive ground facts. In deterministic Datalog, a Herbrand interpretation is a subset of the Herbrand base HB. For pure Datalog, there is a least Herbrand model such that any other Herbrand model is a superset of this model. Stratified Datalog is based on a closed-world assumption. If we have rules with negation, then there is no least Herbrand model, but possibly several minimal Herbrand models, i.e. there exists no other Herbrand model which is a proper subset of a minimal model. Among the different minimal models, the one chosen is constructed in the following way: When evaluating a rule with one or more negative literals in the body, first the set of all answer-facts to the predicates which occur negatively in the rule body is computed (in case of EDB predicates these answer-facts are already given), followed by the computation of the answers to the head predicates. For stratified Datalog programs, this procedure yields a unique minimal model. The minimum model computed in this way is often called the perfect Herbrand model.
4 Proposed Approach
In brief, digital files are represented by EDB facts. Task-based dependencies (and their properties) are represented as Datalog rules and facts. Profiles (as well as particular software archives or system settings) are represented by EDB facts. Datalog query answering and methods of logical inference (i.e. deductive and abductive reasoning) are exploited for enabling the required inference services (performability, risk detection, etc.).

4.1 Digital Files and Type Hierarchies
Digital files and their types are represented as EDB facts. The two files of our running example will be expressed as:
JavaFile(HelloWorld.java).
C++File(HelloWorld.cc).
The types of the digital files can be organized hierarchically. Such taxonomies can be represented with appropriate rules. For example, to define that every JavaFile is also a UTF8File we must add the following rule:
UTF8File(X) :- JavaFile(X).
Each file can be associated with more than one type. In general we could capture several features of the files (apart from types) using predicates (not necessarily unary), e.g. ReadOnly(HelloWorld.java), LastModifDate(HelloWorld.java, 2008-10-18). Also note that in place of the filenames, we could use any string that can be used as an identity of these files (e.g. file paths, URIs, DOIs, etc.).

4.2 Software Components
Software components can be described analogously, e.g.:
AsciiEditor(NotePad).      |  C++Compiler(gcc).
AsciiEditor(vi).           |  JVM(jre1.5win).
JavaCompiler(javac 1.6).   |  JVM(jre1.6linux).
Again predicates are used for expressing the types of the software components. The above set of facts may correspond to the software components available in a particular computer.

4.3 Task-Dependencies
We will also use (IDB) predicates to model tasks and their dependencies. Specifically, for each real world task we define two intensional predicates: one (which is usually unary) to denote the task, and another one (with arity of at least 2) for denoting the dependencies of the task. For instance, Compile(HelloWorld.java) will denote the task of compiling HelloWorld.java. Since the compilability of HelloWorld.java depends on the availability of a compiler (specifically a compiler for the Java language), we can express this dependency using a rule of the form: Compile(X) :- Compilable(X,Y), where the binary predicate Compilable(X,Y) is used for expressing the appropriateness of a Y for compiling an X. For example, Compilable(HelloWorld.java, javac 1.6) expresses that HelloWorld.java is compilable by javac 1.6. It is beneficial to express such relationships at the class level (not at the level of individuals), specifically over the types (and other properties) of the digital objects and software components, i.e. with rules of the form (an illustrative derivation is given right after the listing):
Compilable(X,Y) :- JavaFile(X), JavaCompiler(Y).
Compilable(X,Y) :- C++File(X), C++Compiler(Y).
Runnable(X,Y)   :- JavaClassFile(X), JVM(Y).
EditableBy(X,Y) :- JavaFile(X), AsciiEditor(Y).
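As an illustration of how these class-level rules combine with the facts of Sections 4.1 and 4.2, consider the following derivation. It is ours, not an excerpt from the paper; in a concrete Datalog or Prolog engine the capitalized predicate names and constants such as HelloWorld.java would need to be written in lowercase or quoted.

    JavaFile(HelloWorld.java).                        % fact from Section 4.1
    JavaCompiler(javac 1.6).                          % fact from Section 4.2
    Compilable(X,Y) :- JavaFile(X), JavaCompiler(Y).  % class-level dependency
    Compile(X) :- Compilable(X,Y).                    % task predicate

    % Bottom-up evaluation first derives Compilable(HelloWorld.java, javac 1.6)
    % from the third clause, and then Compile(HelloWorld.java) from the fourth,
    % so the query ?- Compile(HelloWorld.java) succeeds.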
Relations of higher arity can also be employed according to the requirements, e.g.:
Runnable(X) :- Runnable(X,Y,Z).
Runnable(X,Y,Z) :- JavaFile(X), Compilable(X,Y), JVM(Z).

4.4 Task Hierarchies
We have already seen how file type hierarchies can be expressed using rules. We can express hierarchies of tasks in a similar manner. The motivation is the need for enabling deductions of the form: ”if we can do task A then certainly we can do task B”. For example: Read(X) :- Edit(X). Read(X) :- Compile(X). The first rule means that if we can edit something then certainly we can read it too. Alternatively, or complementarily, we can define such deductions at the ”dependency” level, e.g.:
ReadableBy(X,Y) :- EditableBy(X,Y).
IntelligibleBy(X,Y) :- ReadableBy(X,Y).

4.5 Properties of (Task) Dependencies
We can also express other properties of task dependencies (e.g. transitivity, symmetry, etc.). For example, from Runnable(a.class, JVM) and Runnable(JVM, Windows) we might want to infer that Runnable(a.class, Windows). Such inferences can be specified by a rule of the form:
Runnable(X,Y) :- Runnable(X,Z), Runnable(Z,Y).
As another example,
IntelligibleBy(X,Y) :- IntelligibleBy(X,Z), IntelligibleBy(Z,Y).
This means that if x is intelligible by z and z is intelligible by y, then x is intelligible by y. This captures the assumptions of the dependency model described in [11] (i.e. the transitivity of dependencies).
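As a small illustration of how such a transitivity rule propagates facts (our own example, not from the paper; the module names report.pdf, PDFSpec and ArchivistCommunity are hypothetical):

    IntelligibleBy(report.pdf, PDFSpec).
    IntelligibleBy(PDFSpec, ArchivistCommunity).
    IntelligibleBy(X,Y) :- IntelligibleBy(X,Z), IntelligibleBy(Z,Y).

    % One application of the rule derives IntelligibleBy(report.pdf, ArchivistCommunity):
    % the document is intelligible by the community through the specification it depends on.
    % Such a recursive rule is evaluated bottom-up to a fixpoint in Datalog; in plain Prolog
    % it would need a non-left-recursive formulation in order to terminate.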
4.6 Profiles
A profile is a set of facts, describing the modules available (or assumed to be known) to a user (or community). For example, the profiles of James and Helen can be expressed as two sets of facts:
James                      | Helen
---------------------------+-------------------
AsciiEditor(NotePad).      | AsciiEditor(Vi).
JavaCompiler(javac 1.6).   | JVM(jre1.5Win).
JVM(jre1.5Win).            |
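Putting the profile facts together with the rules of Sections 4.3 and 4.4, the following sketch (ours, not from the paper; the Edit/EditableBy pair simply follows the naming convention described above) shows what can and cannot be derived for Helen:

    % Helen's profile
    AsciiEditor(Vi).
    JVM(jre1.5Win).
    % Files received from James
    JavaFile(HelloWorld.java).
    C++File(HelloWorld.cc).
    % Rules
    EditableBy(X,Y) :- JavaFile(X), AsciiEditor(Y).
    Edit(X) :- EditableBy(X,Y).
    Compilable(X,Y) :- JavaFile(X), JavaCompiler(Y).
    Compile(X) :- Compilable(X,Y).

    % Edit(HelloWorld.java) is derivable (through Vi), but Compile(HelloWorld.java)
    % is not, since Helen's profile contains no JavaCompiler fact -- exactly the
    % situation described in the running example of Section 2.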
Synopsis. Methodologically, for each real world task we define two intensional predicates: one (which is usually unary) to denote the task, and another one (which is usually binary) for denoting the dependencies of the task (e.g. Read, Readable). Figure 1 depicts the partitioning of the various facts and rules. For instance, all services regarding James should be based on James' box plus the boxes at the upper level.
[Fig. 1. Logical partitioning. The figure shows an upper level with three boxes -- the File Type Taxonomy (primitive types: EDB predicates, higher types: IDB predicates), the Task Taxonomy (IDB predicates), and the Task Dependencies and their properties (IDB predicates of arity >= 2) -- and, below them, one box per user (James, Helen, ...), each containing that user's Digital files and Profile.]
5 Reasoning Services
Here we describe how the reasoning services described in Section 2 can be realized in the proposed framework.
– Task-Performability. This service aims at answering whether a task can be performed by a user/system. It relies on query answering over the Profiles of the user. E.g. to check if HelloWorld.cc is compilable we have to check if HelloWorld.cc is in the answer of the query Compile(X).
– Risk-Detection. Suppose that we want to identify the consequences on editability after removing a module, say NotePad. To do so we can do the following (a Prolog sketch of this procedure is given after this list):
1. We compute the answer of the query Edit(X), and let A be the returned set of elements.
2. We delete NotePad from the database and we do the same. Let B be the returned set of elements (in Prolog we could use the retract feature).
3. The elements in A \ B are those that will be affected.
– Computation of Gaps (Missing Modules). There can be more than one way to fill a gap, due to the disjunctive nature of dependencies, since the same predicate can be the head of more than one rule (e.g. the predicate AsciiEditor in the example earlier). The gap is actually the set of facts that are missing and are needed to perform a task. To this end we must find the possible explanations (the possible facts) that entail a consequence (in our case a task). For example, some possible explanations for the compilability of a JavaFile are the available Java compilers. In order to find the possible explanations of a consequence we can use abduction [8,4,5]. Abductive reasoning allows inferring an atom as an explanation of a given consequence. For example, assume that the file HelloWorld.cc is not compilable. Abduction will return all the possible C++ compilers as explanations for the compilability of the file.
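The following Prolog sketch illustrates the three Risk-Detection steps above. It is not code from the GapMgr or PreScan systems; the lowercase predicate names are our own rendering of the paper's predicates, and the sketch assumes the facts are loaded as dynamic clauses:

    :- dynamic asciiEditor/1, javaFile/1.

    asciiEditor(notepad).
    javaFile('HelloWorld.java').
    editableBy(X,Y) :- javaFile(X), asciiEditor(Y).
    edit(X) :- editableBy(X,_).

    % affected_by_removal(+Fact, -Affected): the answers to edit/1 that are lost
    % when Fact is removed from the knowledge base (steps 1-3 above).
    affected_by_removal(Fact, Affected) :-
        findall(X, edit(X), A),      % step 1: answers before the removal
        retract(Fact),               % step 2: remove the module ...
        findall(X, edit(X), B),      %         ... and recompute the answers
        assertz(Fact),               % restore the knowledge base
        subtract(A, B, Affected).    % step 3: the elements of A \ B

    % ?- affected_by_removal(asciiEditor(notepad), L).
    % L = ['HelloWorld.java'].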
6 Implementation and Application Issues
There are several possible implementation approaches. Below we describe them in brief: Prolog is a declarative logic programming language, where a program is a set of Horn clauses describing the data and the relations between them. The proposed approach can be straightforwardly expressed in Prolog. Furthermore, regarding abduction, there are several approaches that either extend Prolog [2] or augment it [3] and propose a new programming language. The Semantic Web Rule Language (SWRL) [7] is a combination of OWL DL and OWL Lite [10] with the Unary/Binary Datalog RuleML (http://ruleml.org). SWRL provides the ability to write Horn-like rules expressed in terms of OWL concepts to infer new knowledge from an existing OWL KB. For instance, each type predicate can be
expressed as a class. Each profile can be expressed as an OWL class whose instances are the modules available to that profile (we exploit the multiple classification of SW languages). Module type hierarchies can be expressed through subclassOf relationships between the corresponding classes. All rules regarding performability and the hierarchical organization of tasks can be expressed as SWRL rules. In a DBMS-approach all facts can be stored in a relational database, while recursive SQL can be used for expressing the rules. Specifically, each type predicate can be expressed as a relational table whose tuples are the modules of that type. Each profile can be expressed as an additional relational table, whose tuples will be the modules known by that profile. All rules regarding task performability, the hierarchical organization of tasks, and the module type hierarchies, can be expressed as datalog queries. Note that there are many commercial SQL servers that support the SQL:1999 syntax regarding recursive SQL (e.g. Microsoft SQL Server 2005, Oracle 9i, IBM DB2). Just indicatively, Table 1 synopsizes the various implementation approaches.
Table 1. Implementation Approaches

What                                        DB-approach                                   Semantic Web-approach
ModuleType predicates                       relational table                              class
Facts regarding Modules (and their types)   tuples                                        class instances
DC Profile                                  relational table                              class
DC Profiles Contents                        tuples                                        class instances
Task predicates                             IDB predicates                                predicates appearing in rules
Task Type Hierarchy                         datalog rules, or isa if an ORDBMS is used    subclassOf
Performability                              datalog queries (recursive SQL)               rules
7 Related Work
There are many dependency management related works but only a few focus on task-based dependencies. Below we discuss in brief some of these works. [1] proposes a static deployment system for ensuring the success of two tasks: installation and deinstallation. It is based on a dependency description language, where the requirements of a service are expressed in first order predicate language in conjunctive normal form. The success of installation guarantees that once a component is installed successfully it will work properly while the success of deinstallation ensures that the system remains safe after the removal of a component. [6] defines four types of dependencies: goal, soft goal, task and resource dependencies. The first three types determine the conditions or the particular ways under which a specific goal or task can be attained. Furthermore it describes several properties, for soft goal dependencies, that determine the best approach to be followed. Finally it categorizes components to light (replaceable
components) and heavy (components on which others strictly depend). In brief, we can say that all these approaches are less flexible and extensible (in terms of task and dependency modeling) than the approach that we propose.
8 Concluding Remarks
We showed how rules can be employed for advancing the dependency management services that have been proposed for digital preservation. We reduced the problem to Datalog-based modeling and query answering. One issue that is worth further research is to investigate whether the way abduction is supported by existing systems (e.g. [2,3]) is adequate for the problem at hand. Other issues that are important for applying this model successfully in real settings are how modularity is supported and what kind of assisting tools are needed for managing and administrating the underlying sets of facts and rules.
References
1. Belguidoum, M., Dagnat, F.: Dependency Management in Software Component Deployment. Electronic Notes in Theoretical Computer Science 182, 17–32 (2007)
2. Christiansen, H., Dahl, V.: Assumptions and abduction in Prolog. In: 3rd International Workshop on Multiparadigm Constraint Programming Languages, MultiCPL, Citeseer, vol. 4 (2004)
3. Christiansen, H., Dahl, V.: HYPROLOG: A new logic programming language with assumptions and abduction. In: Gabbrielli, M., Gupta, G. (eds.) ICLP 2005. LNCS, vol. 3668, pp. 159–173. Springer, Heidelberg (2005)
4. Console, L., Dupre, D.T., Torasso, P.: On the relationship between abduction and deduction. Journal of Logic and Computation 1(5), 661 (1991)
5. Eiter, T., Gottlob, G.: The complexity of logic-based abduction. Journal of the ACM (JACM) 42(1), 3–42 (1995)
6. Franch, X., Maiden, N.A.M.: Modeling Component Dependencies to Inform their Selection. In: 2nd International Conference on COTS-Based Software Systems. Springer, Heidelberg (2003)
7. Horrocks, I., Patel-Schneider, P.F., Boley, H., Tabet, S., Grosof, B., Dean, M.: SWRL: A Semantic Web Rule Language Combining OWL and RuleML (May 2004), http://www.w3.org/Submission/SWRL/
8. Kakas, A.C., Kowalski, R.A., Toni, F.: The Role of Abduction in Logic Programming. In: Handbook of Logic in Artificial Intelligence and Logic Programming: Logic Programming, p. 235 (1998)
9. Marketakis, Y., Tzanakis, M., Tzitzikas, Y.: PreScan: Towards Automating the Preservation of Digital Objects. In: Proceedings of the International Conference on Management of Emergent Digital Ecosystems (MEDES 2009), Lyon, France (October 2009)
10. McGuinness, D.L., van Harmelen, F.: OWL Web Ontology Language Overview (2004), http://www.w3.org/TR/owl-features/
11. Tzitzikas, Y.: Dependency Management for the Preservation of Digital Information. In: Wagner, R., Revell, N., Pernul, G. (eds.) DEXA 2007. LNCS, vol. 4653, pp. 582–592. Springer, Heidelberg (2007)
12. Tzitzikas, Y., Flouris, G.: Mind the (Intelligibility) Gap. In: Kovács, L., Fuhr, N., Meghini, C. (eds.) ECDL 2007. LNCS, vol. 4675, pp. 87–99. Springer, Heidelberg (2007)
Designing Trading Agents for Real-World Auctions Ioannis A. Vetsikas and Nicholas R. Jennings Electronics and Computer Science, Univ. of Southampton, Southampton SO17 1BJ, UK {iv,nrj}@ecs.soton.ac.uk
Abstract. Online auctions have become a popular method for business transactions. The variety of different auction rules, restrictions in supply or demand, and the agents’ combinatorial preferences for sets of goods mean that realistic scenarios are very complex. Using game theory, we design trading strategies for participating in a single auction or group of similar auctions. A number of concerns need to be considered in order to account for all the relevant features of real-world auctions; these include: budget constraints, uncertainty in the value of the desired goods, the auction reserve prices, the bidders’ attitudes towards risk, purchasing multiple units, competition and spitefulness between bidders, and the existence of multiple sources for each good. To design a realistic agent, it is necessary to analyze the multi-unit auctions in which a combination of these issues is present together, and we have made significant progress towards this goal. Furthermore, we use a principled methodology, utilizing empirical evaluation, to combine these results into the design of agents capable of bidding in general real-world scenarios.
1 Introduction

Auctions have become commonplace; they are used to trade all kinds of commodity, from flowers and food to industrial commodities and keyword targeted advertisement slots, from bonds and securities to spectrum rights and gold bullion. Once the preserve of governments and large companies, the advent of online auctions has opened up auctions to millions of private individuals and small commercial ventures. Thus, it is desirable to develop autonomous agents (intelligent software programs) that will aid and represent individuals or companies, thus letting them participate effectively in such settings, even though they do not possess professional expertise in this area. To achieve this, however, these agents should account for the features of real-world auctions that expert bidders take into consideration when determining their bidding strategies. Game theory is widely used in multi-agent systems as a way to model and predict the interactions between rational agents in auctions. However, the models that are canonically analyzed are rather limited, because the work that incorporates features important in real auctions looks at each feature separately and in most cases analyzes single-unit auctions. Now, while this is useful for economists and perhaps expert bidders, who can integrate the lessons learned using human intuition and imagination, an automated agent cannot immediately benefit. In order to design agents that would represent non-expert humans in real auctions effectively, it is necessary to first analyze the strategic behavior of bidders in auctions that incorporate all the features deemed to be relevant and important in real world auction scenarios; this is the first part of our agenda towards
the design of trading agents for realistic auctions. To identify these features we looked at a variety of real auctions and at the related literature. We concluded that the following issues capture the essence of almost any real auction1: (i) budget constraints as, in every practical setting, the participating bidders have a certain pot of money that they can spend for purchases and how much they can spend (and thus bid) is limited by this budget [1,2]; (ii) bidders’ attitudes towards risk, meaning whether bidders are conservative or not and how much they are willing to take risks in order to gain a higher profit, which is described by their utility function, which maps profit into utility [3,4]; (iii) the reserve price of the specific auction, which is the minimum transaction price allowed in this auction [4,5,6]; (iv) uncertainty in the bidders’ valuation for the offered good, as there are many cases when the value of a good or service is not known precisely a priori, i.e. before it is purchased and used, which could be addressed by introspection on the part of the agent [7,8,9]; (v) competition or spite between bidders, especially in scenarios where one bidder wants to drive a competitor out of the market [10]; (vi) auctions with multiple rounds of bidding, because they have multiple possible closing times [11]; (vii) bidders’ desiring to purchase multiple items of the good sold, with a different valuation for each one [12,13]; (viii) the availability of each good from multiple sellers instead of assuming that all of them are sold in a single auction [3,14]; (ix) auctions with asymmetric bidder models, as in reality it could be known that some bidders have higher valuations or budgets, or different risk attitudes [15,16]. Furthermore, we consider the fact that, in practice, agents (or humans) are rarely interested in a single item, but rather wish to bid in several auctions for multiple interacting goods. Therefore, they must bid intelligently in order to get exactly what they need, as, for example, they may need to acquire a whole set of items.2 The second part of our agenda is to design agents for this general case. Unlike pre-existing work, e.g. [17,18], which focused mostly on specific cases or used heuristics to tackle this problem, we want to have a principled methodology for design trading agents, which allows us to use the theoretical results obtained from the theoretical analysis we described earlier. In order to achieve the goals set by our agenda, we first need to analyze an auction where all the identified relevant features are included in our theoretical analysis. In this paper, we present some selected results we have obtained thus far towards this goal: (i) the analysis of sealed-bid auctions when budget constraints, reserve prices, varying risk attitudes and valuation uncertainty are examined together [19,20,21] (in section 3); we identify cases where the analysis of two features together produces completely different behaviour than that observed from the analysis of each feature separately. We also present evidence that our strategies work well even when the opponents do not follow the equilibrium strategy, which is not guaranteed by the equilibrium definition. (ii) the equilibria for auctions with competitive or spiteful bidding [22] (in section 4); there are significant differences between our results and those for single-unit auctions. 1
1 The citations mostly refer to previous work where the specific feature is examined by itself in the context of single-unit auctions.
2 For example, a person may wish to buy a TV and a VCR, but if she does not have a “flexible plan”, she may only end up acquiring the VCR. On the other hand, if she bids for VCRs in several auctions, she may end up with more than one. Goods are called complementary (substitutable) if the value of acquiring two of them is more (less) than the sum of their individual values.
(iii) the equilibria for the case of multi-round auctions [11] (in section 5); this is a completely novel analysis that has been used in our methodology for general settings. Thus, we selected to present these results, because not only do they demonstrate the significant progress already made towards our goal, but they also show that our analysis provides qualitatively different results to the existing literature looking at simpler versions of the same problems. We have also obtained results on auctions with asymmetric models [23], sequential auctions with partially substitutable goods [24], which extends the standard model found in [3], and some initial results on auctions with bidders having multi-unit demand, which we cannot include in this paper due to space constraints. In addition, we present our principled methodology for achieving the second part of our agenda. According to this methodology, the problem is decomposed into several subparts, strategies are generated for each subproblem using game-theoretic analysis and empirical knowledge, and then systematic experimentation is used to determine the strategies that work best for the whole problem in question. We present this work in section 6. We applied this technique to the Trading Agent Competition (TAC) classic game [25]. The resulting agent WhiteBear is the best agent in the history of the competition, which confirms the validity of our approach.
2 The Multi-unit Auction Model Used in Our Analysis In this section we formally describe the auction setting to be analyzed and define the objective function that the agents wish to maximize. We also give the notation that we use in the results included in this paper; some additional notation would be required for the model to incorporate the remaining features, which are not presented in this paper. In particular, we will compute and analyze the symmetric Bayes-Nash equilibria for sealed-bid auctions where m ≥ 1 identical items are being sold; the Bayes-Nash equilibrium is the standard solution concept used in game theory to analyze auctions and other games. These equilibria are defined by a strategy, which maps the agents’ valuations vi to bids bi . The two most common settings in this context are the mth and (m+1)th price auctions, in which the top m bidders win one item each at a price equal to the mth and (m + 1)th highest bid respectively. We assume that there is a reserve price r ≥ 0 in our setting; this means that bidders, who wish to participate in the auction, must place bids bi ≥ r. N indistinguishable bidders (where N ≥ m) participate in the auction and they have a private valuation (utility) vi for acquiring any one of the traded items; these valuations are assumed to be i.i.d. drawn from a distribution with cumulative distribution function (cdf) F (v), which is the same for all bidders. In the case that there is uncertainty about the valuation vi , then we need to extend this model. For the (m + 1)th price multi-unit auction case, we can use the most general model possible: the agent knows that his valuation vi is drawn from distribution Gi (), but not the precise value. As the valuations vi are independent, we can assume that any uncertainty that a bidder has about his own valuation is independent of the uncertainty he has about other agents’ valuations. For the mth price multi-unit auction case, we use a simpler model, because unlike the previous case, no dominant strategies exist in these cases, therefore the strategy used by opponent bidders affects the strategy used by any specific bidder. Thus, we assume that the true valuation v i , which is not known to
bidder i, is drawn from distribution Gvi(), where vi is known as being the mean value of distribution Gvi(); hence, each bidder i knows approximately his own value as being drawn from distribution Gvi() around the value vi (and vi is a known value to bidder i). We also assume that each bidder has a certain budget ci, which is known only to himself and which limits the maximum bid that he can place in the auction. The available budgets of the agents are i.i.d. drawn from a known distribution with cdf H(c). Every agent has a strictly monotonically increasing utility function u() that maps profit into utility and tries to maximize this utility. This function determines the agent’s risk attitude. In the case of spiteful bidding, the objective function that each agent wishes to maximize is given by Ui = (1 − α) · u'i − α · Σj≠i u'j, where α ∈ [0, 1] is a parameter called the spite coefficient, u'i is the utility-mapped gain of agent i (i.e. u'i = u(0), if it does not win any items, and u'i = u(vi − pi), if it does) and pi is the total payment the agent must make to the auctioneer. Finally, in the case of multiple round auctions, we assume that the valuations vi^r at round r are i.i.d. drawn from distribution Fr(u), and the probability that round r is the last round is known to be pr. If more rounds exist, an agent can submit new bids as long as they are greater than or equal to the bid price from the end of the previous round; this is the minimum bid allowed at round r, which is denoted as Qr.
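As a simple worked illustration of this objective (the numbers are ours, not from the paper), suppose N = 2 risk-neutral bidders (u(x) = x), a single item, and that bidder 1 wins at price p1 = 3 with valuation v1 = 5 while bidder 2 loses. Then

\[
U_1 = (1-\alpha)\,u(v_1 - p_1) - \alpha\,u(0) = 2(1-\alpha), \qquad
U_2 = (1-\alpha)\,u(0) - \alpha\,u(v_1 - p_1) = -2\alpha .
\]

So for α = 0 the loser is indifferent to the outcome, whereas for any α > 0 it strictly prefers to drive up the price its competitor pays, which is the intuition behind the spiteful equilibria of Section 4.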
3 Equilibria When Budget Constraints, Reserve Prices, Varying Risk Attitudes and Valuation Uncertainty Are Present Together

In this section, we examine the case where a number of issues are taken into account. More specifically, the agents have budget constraints ci, their risk attitude is described by function u() (not necessarily risk neutral) and the auction has a reserve price r. In addition to these issues, we also consider that the valuations vi are not known precisely, and we use the models of valuation uncertainty described in the previous section. In [21], we proved the following theorems:

Theorem 1. In the case of an mth price sealed-bid auction, with reserve price r ≥ 0, with N participating bidders, in which each bidder i is interested in purchasing one unit of the good for sale, with inherent utility (valuation) for that item whose exact value is unknown to the bidder and is drawn from distribution Gvi(), and which is approximated by a known variable vi, the agent’s “uncertain valuation”, and has a budget constraint ci, where vi and ci are i.i.d. drawn from F(v) and H(c) respectively, and the bidders have a risk attitude which is described by utility function u(), the following bidding strategy constitutes a symmetric Bayes-Nash equilibrium:

bi = min{g(vi), ci}    (1)

where g(v) is the solution of the differential equation:

g'(v_i) = \frac{(1 - H(g(v_i)))\,F'(v_i)}
               {\dfrac{\big(1-(1-F(v_i))(1-H(g(v_i)))\big)\int_{-\infty}^{\infty} u'(x-g(v_i))\,G_{v_i}(x)\,dx}
                      {(N-m)\left(\int_{-\infty}^{\infty} u(x-g(v_i))\,G_{v_i}(x)\,dx - u(0)\right)}
                \;-\; (1-F(v_i))\,H'(g(v_i))}    (2)

with boundary condition g(\hat{r}) = r, where \hat{r} satisfies the equation \int_{-\infty}^{\infty} u(x - r)\,G_{\hat{r}}(x)\,dx = u(0).
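As a sanity check of equation (2) (this reduction is ours and is not stated in the paper), consider the special case with no binding budget constraints (H ≈ 0 over the relevant bid range, so H' ≈ 0), no valuation uncertainty (G_{v_i} concentrated at v_i) and risk-neutral bidders (u(x) = x, u(0) = 0). Then

\[
g'(v_i) \;=\; \frac{F'(v_i)}{\dfrac{F(v_i)}{(N-m)\,(v_i - g(v_i))}}
        \;=\; \frac{(N-m)\,F'(v_i)\,\big(v_i - g(v_i)\big)}{F(v_i)},
\]

which for m = 1 is the classical first-price ODE b'(v) = (N-1) f(v) (v - b(v)) / F(v), with the familiar solution b(v) = v - \int_{v_l}^{v} F(z)^{N-1} dz \, / \, F(v)^{N-1}; the general formula thus collapses to the textbook equilibrium when the extra features are switched off.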
Theorem 2. In an (m + 1)th price auction, with reserve price r, if a bidder has budget constraint ci, he knows only imprecisely his own valuation vi, in that it is drawn from distribution Gi(vi), and his risk attitude is described by utility function ui(), it is a dominant strategy to bid bi = min{βi, ci}, if bi ≥ r, and not to participate otherwise. The variable βi is the solution of the equation:

\int_{-\infty}^{\infty} u(z - \beta_i)\, G_i(z)\, dz = u(0)    (3)
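For intuition (our own special case, not stated explicitly in the theorem): for a risk-neutral bidder, u(x) = x and u(0) = 0, so equation (3) reduces to

\[
\int_{-\infty}^{\infty} (z - \beta_i)\, G_i(z)\, dz = 0
\quad\Longrightarrow\quad
\beta_i = \int_{-\infty}^{\infty} z\, G_i(z)\, dz = \mathbb{E}_{G_i}[z],
\]

i.e. an unconstrained risk-neutral bidder simply bids the mean of its uncertain valuation, which matches the known result recalled in the next paragraph.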
While, sometimes, the bidding behaviour is a combination of those observed when the individual features are present separately, there are cases when this does not occur. For example, it is known that a risk-neutral bidder with uncertainty in his valuation vi bids as if the value was known and equal to the expected value μGi [8]. When bidders are not risk-neutral this is no longer the case. In [21], we prove theoretically, for the case of an (m + 1)th price auction, that the more risk-averse (resp. risk-seeking) the bidders become and the higher the uncertainty, meaning the variance of the valuation distribution, the less (resp. more) they will bid. A similar result holds for the mth price auction as well; in figure 1(a) we graph the bidding strategies of risk-averse bidders, for different degrees of valuation uncertainty. To be more precise, we assume all bidders use utility function u(x) = (x + 1)^0.01, and also that if they have uncertain valuation vi, then their true valuation for the good sold (which is unknown to them) is uniformly distributed in [vi(1 − γ), vi(1 + γ)], thus Gvi(x) = 1/(2γvi) for x ∈ [vi(1 − γ), vi(1 + γ)] and Gvi(x) = 0 otherwise. When γ = 0, there is no valuation uncertainty, and as γ increases so does the uncertainty. We observe that as γ increases, the bidders will indeed bid lower (for any valuation vi). The same happens as the bidders’ risk averseness increases. As can be seen in the graph, while for small values of the parameter γ the bidders bid higher than g(v) = v/2, which is what a risk neutral bidder would bid, on the other hand, for large values of γ, they bid less than that. Therefore, when the valuation uncertainty is significant, the risk-averse bidders bid in the same way as risk-seeking bidders would (when the latter have no valuation uncertainty). Without our complete analysis, it would not be possible to figure out how these two effects factor into the final equilibrium strategy. This validates our claim that it is important to include all features in our analysis of a realistic auction. We also examined what happens when several opponents deviate from the equilibrium strategy; there are no theoretical guarantees that our strategies would outperform the other strategies in that case. We simulated an mth price auction with N = 3 participating bidders, where m = 2 items are sold. The bidders are all risk-averse with utility function u(x) = x^α, α = 0.5, and they have budget constraints ci and valuations vi drawn from the uniform distribution U[0, 1]. We denote the standard equilibrium strategy (given by equation 2) as S and compare it against the following two strategies: (i) NB is the strategy when the agent does not take the budget constraint into account, and (ii) RN is the strategy when the agent does not take the risk attitudes into account (and assumes that everyone is risk neutral). We choose these two strategies because they look at fewer features than strategy S. In this sense, they are strategies which do not take advantage of our full analysis, and yet are reasonable, because they do consider some of the desired features. We compare S against each of these two strategies by running experiments
[Fig. 1. Graphs of the experimental results, as well as the diagram of our methodology for decomposing the general problem into subproblems, each of which can then be analyzed separately. The panels show: (a) the bidding strategy g(v) against the valuation v for γ = 0, 1/3, 2/3, 1; (b) the estimated expected utility ratio EU/EU(3xS) against the reserve price r for the various mixes of the S, NB and RN strategies; (c) the expected revenue against the spite coefficient α for m = 2, 3, 4; (d) the bid g(v) against the utility v for (1−p)/p = 1, 2, 5, 10, 20; (e) the decomposition diagram of the methodology; (f) the first-round bid g1(v,Q) as a surface over the utility v and the starting price Q.]
in which some agents bid according to S and some according to NB (according to RN in the second comparison we did), for various values of the reserve price r. The results are presented in figure 1(b); they are presented as the ratio of the corresponding utility divided by the utility of the case when all agents use strategy S (experiment “3xS”). From this figure we can observe that, in every single instance, an agent using strategy NB (or RN) would always obtain a higher utility by switching to strategy S. This means that strategies NB and RN are dominated by S, when we consider agents who play either NB/RN or S. This fact that the trading strategies we compute analytically perform well even when opponents deviate from the equilibrium strategies has also been observed in experimental simulations we conducted in many other settings.
4 Equilibria in the Case of Competitive or Spiteful Bidding

In [22], we examine the case of spiteful bidding. We define vl and vh to be respectively the lowest and highest value that the bidders’ valuations can take. We proved that:

Theorem 3. In the case of an mth price sealed bid auction with N participating bidders, in which each agent i is interested in purchasing one unit of the good for sale with inherent utility (valuation) for that item equal to vi, the vi are i.i.d. drawn from F(v), and an α-coefficient for outperforming its competition, the following bidding strategy constitutes a symmetric Bayes-Nash equilibrium:

g_\alpha(v) =
\begin{cases}
v - (F(v))^{-\frac{N-m}{1-\alpha m}} \int_{v_l}^{v} (F(z))^{\frac{N-m}{1-\alpha m}}\, dz & \alpha m < 1 \\
v & \alpha m = 1 \\
v + (F(v))^{-\frac{N-m}{1-\alpha m}} \int_{v}^{v_h} (F(z))^{\frac{N-m}{1-\alpha m}}\, dz & \alpha m > 1
\end{cases}    (4)

Theorem 4. In the case of the equivalent (m + 1)th price sealed bid auction, the symmetric Bayes-Nash equilibrium is given by the strategy:

g_\alpha(v) = v + (1 - F(v))^{-\frac{1}{\alpha}} \cdot \int_{v}^{v_h} (1 - F(z))^{\frac{1}{\alpha}}\, dz    (5)
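To see how (4) and (5) specialize (a worked example of ours, not given in the paper), take F uniform on [0, 1], so v_l = 0 and v_h = 1. Equation (4) with α = 0 and m = 1 gives

\[
g_0(v) = v - v^{-(N-1)} \int_0^{v} z^{N-1}\, dz = \frac{N-1}{N}\, v,
\]

the familiar first-price shading, while equation (5) gives

\[
g_\alpha(v) = v + (1-v)^{-1/\alpha} \int_v^{1} (1-z)^{1/\alpha}\, dz = v + \frac{\alpha}{1+\alpha}\,(1-v),
\]

so a spiteful bidder in the (m+1)th price auction bids above its true value for any α > 0, approaching truthful bidding as α → 0 and (1+v)/2 as α → 1.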
We note some important differences when analyzing the multi-unit auction case compared to the existing results for the single-unit auction case, which were presented in [10]. Firstly, bidders in an mth price auction would now bid higher than their true value vi when α > 1/m. An even more interesting result is obtained when examining the revenue of the seller under the two auction formats. In the single-unit case, the (m + 1)th price auction yields more revenue than the equivalent mth price one, therefore the seller should always use this auction. In the extended (multi-unit) setting, for the same coefficient α, the mth and (m + 1)th price auctions yield the same expected revenue, at equilibrium, when α = 0 or α = 1/m, while, for 0 < α < 1/m, the (m + 1)th price auction yields more revenue, and for α > 1/m, the mth price auction yields more revenue. This is shown in figure 1(c), where we graph the revenue for the two auction types when the spite coefficient α is varied; the number of items is m = 2, 3, 4, the number of bidders N = 2m and the distribution F = U[0, 1] (uniform). This validates our claim that it is important to analyze the multi-unit setting of each auction.
5 Equilibria for Multi-round Auctions

In this section, we present the case when an auction does not close at one preset time; we assume that there is a set of possible closing times, each one having a known probability of being selected. We analyze this case in [11]:

Theorem 5. If the starting price of the current round r is Qr ≥ 0, the next round of bidding (r + 1) exists with probability (1 − pr) (pr ≠ 0, 1) and the utility of the agents in round r is drawn from the distribution Fr(v) (and each agent i in fact has utility vi^r of a similar value to the utility vi of the first round), then the equilibrium strategy is the solution of the differential equation:

\big(v_i - g_r(v_i)\big) \cdot \frac{\Phi_r'(v_i)}{g_r'(v_i)} = \Phi_r(v_i) - Y_r(v_i) \cdot \Psi_r\big(v_i, g_r(v_i)\big)    (6)
where \Psi_r(v, x) = 1 - \frac{1-p_r}{p_r} \cdot \frac{\partial U_{r+1}(v,x)}{\partial x}, and Ur+1(vi, Qr+1) is the expected utility at round (r + 1), when the agent’s valuation is vi and the starting price is Qr+1. The boundary condition is g(Qr) = Qr. In addition, the expected utility at round r given this strategy gr(vi) is then:
U_r(v_i, Q_r) = p_r \cdot \Big( (v_i - g_r(v_i)) \cdot \Phi_r(v_i) + \int_{Q_r}^{v_i} Y_r(\omega)\, g_r'(\omega)\, d\omega \Big)
  + (1 - p_r) \cdot \Big( \int_{Q_r}^{v_i} U_{r+1}(v_i, g_r(\omega))\, Y_r'(\omega)\, d\omega
  + U_{r+1}(v_i, g_r(v_i)) \cdot \big(\Phi_r(v_i) - Y_r(v_i)\big)
  + \int_{v_i}^{v_h} U_{r+1}(v_i, g_r(\omega))\, \Phi_r'(\omega)\, d\omega \Big)    (7)
We present this result, because we have used this equilibrium strategy in our methodology, which is discussed in the next section. In figure 1(d), we graph the equilibrium strategy for R = 2 rounds and various values of (1 − p)/p, where p = p1 is the probability of the auction closing after the first round of bids. As the probability (1 − p) of a second round increases, the equilibrium strategy is that the agent should bid progressively less.
6 A Principled Methodology for Designing Trading Agents

In the previous sections, we demonstrated that equilibrium strategies can be quite useful in designing trading agents. However, the game-theoretic analysis has its limitations. It is not possible to fully analyze a setting with various different auction types, each with different rules, selling different commodities, when these commodities are complementary or substitutable to each other. In order to design an agent for such a setting, in [26], we presented a methodology that allows us to use the equilibrium strategies computed in the design of our agents. The high-level description of the proposed methodology is:

A. Decompose the problem into subproblems as shown in figure 1(e):
   1. OPTIMIZER: This component decides the quantities to buy assuming that everything will be bought at current or predicted prices (optimize utility)
   2. CANDIDATE STRATEGIES: For each different auction type (and good) do:
      a. Determine boundary “partial strategies” for this auction
      b. Generate “intermediate” strategies as follows:
         - combine the boundary strategies, or
         - modify them using empirical knowledge from the domain, or
         - use equilibrium strategies (game-theoretic analysis for that auction type)
B. Use rigorous experimentation to explore the space of candidate strategies:
   1. Select one auction type and fix other partial strategies for all agents
   2. Experiment to find the best partial strategy for the specific auction as follows:
      a. Fix the agents using some intermediate candidate strategies
      b. Vary the number of agents using the boundary strategies
   3. Find the best candidate strategy and use it for all agents in the experiment
   4. Repeat step 1 for a different auction type, until the best strategies do not change

The first part of our methodology requires the decomposition of the general problem into several subproblems. The quantities placed in each bid are determined independently by maximizing the utility of the agent assuming that all the goods are bought (or
sold) at some predicted prices and that every unit will be bought instantly. A dedicated module called “the planner” performs this task. How to bid for these goods is determined by the “partial strategies”; one such strategy is used per auction type (single auction or group of similar auctions). For each such auction type, we first compute the boundary strategies that are possible. We then combine parts of the boundary strategies or modify some of their parts to form intermediate strategies that behave between the extreme bounds (e.g. if one boundary strategy would place a bid at price p_low and the other at price p_high in a certain case, then the intermediate strategy should place its bid at a price p with p_low ≤ p ≤ p_high). Another way to generate these intermediate strategies, which is of particular interest, is to use the equilibrium strategy computed for the particular subproblem. Having done this, the second part of our methodology advocates the use of experimentation in order to select the best combination of partial strategies, which is then implemented in the final design of the agent. For additional details, see [26].
Application: designing an agent for the TAC Classic game. We selected the Trading Agent Competition as an application domain for our methodology and designed agent WhiteBear. The equilibrium strategy derived from the game-theoretic analysis of multi-unit, multi-round auctions was computed and incorporated into our agent design. In figure 1(f), we graph this equilibrium bidding strategy for the first round; the priors Fr() we used in Theorem 5 to generate this strategy were obtained by sampling the valuations of the goods from the actual simulations. We then used rigorous experimentation to generate the final agent (more details are given in [11]). The performance in the seeding, semi-final and final rounds of TAC shows that our agent was the best performing agent in the history of the competition (in parentheses we give the difference from the top competing agent), which validates our decision to use a principled methodology that included the use of game-theoretic analysis:
2002: Semi-final 1st (+0.55%), Final 1st (+1.85%).
2003: Seeding 1st (+1.63%), Semi-final 1st (+5.37%), Final 3rd (−1.81%).
2004: Seeding 1st (+3.12%), Semi-final 1st (+6.57%), Final 1st (+7.10%).
2005: Seeding 1st (+2.63%), Semi-final 1st (+1.61%), Final 2nd (−0.50%).
7 Conclusions

In this paper, we described our agenda for designing trading agents which would be able to participate in a large number of auctions, bidding for sets of goods. We identified the relevant features of real-world auctions which are important, and we presented several theoretical results on the analysis of auctions, which lead towards accomplishing this goal. We showed that including all the features in the analysis is necessary, because the behaviour when two features are present can be quite different to the cases when each one is examined independently. For a similar reason, we need to examine multi-unit auctions. In addition, we demonstrated that playing the equilibrium strategy outperforms other strategies even in settings where opponents deviate from the equilibrium. We also presented our methodology that allows the design of trading agents for the general setting, which involves bidding in many different auctions while trying to acquire several goods with combinatorial valuations. The methodology allows the incorporation of theoretical results obtained for a single auction or group of similar auctions.
We showed the usefulness of our methodology by applying it to the design of our agent WhiteBear, which was the most successful agent in the history of TAC Classic. We are currently continuing the theoretical work to analyze cases with bidders with multi-unit demand, asymmetric bidder cases and the presence of multiple auctions (either in sequence or in parallel). Furthermore, we continue our analysis and combine our results towards the final goal of an analytically computed strategy that would incorporate all the important features together. Acknowledgements. This research was undertaken as part of the ALADDIN project and is jointly funded by a BAE Systems and EPSRC strategic partnership (EP/C548051/1).
Scalable Semantic Annotation of Text Using Lexical and Web Resources
Elias Zavitsanos1, George Tsatsaronis2, Iraklis Varlamis3, and Georgios Paliouras1
1 Institute of Informatics & Telecommunications, NCSR “Demokritos”
2 Department of Computer and Information Science, Norwegian University of Science and Technology
3 Department of Informatics and Telematics, Harokopio University
[email protected], [email protected], [email protected], [email protected]
Abstract. In this paper we deal with the task of adding domain-specific semantic tags to a document, based solely on the domain ontology and generic lexical and Web resources. In this manner, we avoid the need for trained domain-specific lexical resources, which hinder the scalability of semantic annotation. More specifically, the proposed method maps the content of the document to concepts of the ontology, using the WordNet lexicon and Wikipedia. The method comprises a novel combination of measures of semantic relatedness and word sense disambiguation techniques to identify the most related ontology concepts for the document. We test the method on two case studies: (a) a set of summaries accompanying environmental news videos, and (b) a set of medical abstracts. The results in both cases show that the proposed method achieves reasonable performance, thus pointing to a promising path for scalable semantic annotation of documents.
1 Introduction
Reasoning about the contents of text documents, as achieved by human readers, constitutes a key challenge for every semantics-aware document management system. Automated reasoning directly from text aims at the automated inference of new knowledge. One step towards this direction is the design and development of new methods that enable the automated annotation of plain text with ontology concepts. Such techniques enable the transfer of useful information from text documents to ontology structures, and vice versa. Motivated by this need, the CASAM research project (CASAM: Computer-Aided Semantic Annotation of Multimedia, http://www.casam-project.eu/) introduces the concept of computer-aided semantic annotation to accelerate the adoption of semi-automated multimedia annotation in the industry. In the context of this work, we present part of the KDTA (Knowledge-driven Text Analysis) module of the
overall project architecture that is responsible for the automated annotation of text documents. In particular, this work presents a new method for the automated annotation of plain text with ontology concepts from a given domain ontology. The method is based on the pre-processing of the input text and the extraction of semantic information (e.g. word senses) from text. The text processing techniques utilize knowledge bases, like the WordNet thesaurus and the Wikipedia electronic encyclopedia (http://www.wikipedia.org/), and combine measures of semantic relatedness and word sense disambiguation (WSD) algorithms to annotate text words with ontology concepts.
The contributions of this work lie in the following: (a) a novel method for semantic annotation of plain texts with ontology concepts, (b) experimental evaluation of the proposed method, by measuring the precision and recall of the annotations in two different data sets, pertaining to the environmental and the bioinformatics domain respectively, and (c) a study of the effects of the various techniques involved on the performance of semantic annotation (e.g. the effect of WSD techniques).
In what follows, we discuss the related work on automated or semi-automated text annotation with ontology concepts, as well as on measures of semantic relatedness and WSD techniques (Section 2). Section 3 introduces the proposed method. Section 4 presents our experimental evaluation, and Section 5 concludes the paper.
2 Related Work
2.1 Automated or Semi-automated Text Annotation with Ontology Concepts
Text annotation with ontology concepts constitutes a fundamental technology for intelligent Web applications, e.g. the Semantic Web. Usually the task is performed in a semi-automated manner, starting from an initial set of manual annotations. An automated system then suggests new annotations to the user and assists in extending the annotation to more fragments of text [6]. In our case, we automatically annotate text with existing ontology concepts without using any type of learning or information extraction. In this direction, Cimiano et al. [3] propose a method for annotating named entities in a document. The method first maps entities into several linguistic patterns, which convey competing semantics, and then selects the top scoring patterns to indicate the meaning of the named entity. Though this procedure may offer high accuracy, it has limited recall, since it annotates only certain kinds of named entities. In [4], the authors propose a method for automated semantic annotation of Web pages, which is based on the existence of data-extraction ontologies that specify formalized semantics for each domain. These ontologies are used to avoid the heuristics of standard information extraction techniques. However, a domain
expert is required to import the formalized semantics of the domain, in order for the system to detect candidate instances to annotate with concepts of the original domain ontology. In the approach presented in [5], the idea of mapping text headings to one or more entries in the ontology is introduced. The mapping is performed with exact matching of the segment titles and the used ontology concepts. N-grams and simple transformations, such as stemming, are employed in order to improve the method’s performance. Finally, in [8] the authors present the Ontea system, which is based on the application of regular expression patterns and methods of lemmatization. In this case the caveat, which prohibits this approach from being applicable to free text, is the need for predefined domain-specific patterns that constitute the basis for the Web document annotation.
2.2 Measures of Semantic Relatedness and Similarity
Semantic relatedness measures estimate the degree of relatedness or similarity between two concepts in a thesaurus (similarity measures use only the hierarchical relations of the thesaurus, whereas relatedness measures employ all the available relations). Such measures can be classified into dictionary-based, corpus-based and hybrid. Among dictionary-based measures, the measures in [1] and [9] consider factors such as the density and depth of concepts in the set, or the length of the shortest path that connects them, or even the maximum depth of the taxonomy. However, in most such measures, it is assumed that all edges in the path are equally important. Resnik’s [13] measure for a pair of concepts A, B is based on the Information Content (IC) of the deepest concept C that can subsume both A and B. The measure combines both the hierarchy of the used thesaurus, and statistical information for concept occurrences measured in large corpora. Recent work includes the measure in [12], which utilizes the gloss words found in the word’s definitions to create WordNet-based context vectors, and several Wikipedia-based measures [7, 11]. We encourage the reader to consult the analysis in [2] for a detailed discussion of relatedness measures. Although any of the aforementioned measures of semantic similarity or relatedness could fit our method, in this work we use the Omiotis measure of semantic relatedness between two words [16, 15], which was shown to provide the highest correlation with human judgments among the dictionary-based measures of semantic relatedness. For the cases where one of the words does not exist in WordNet, we use the Wikipedia-based measure of Milne and Witten [11], since among the offered Wikipedia-based alternatives, this is the fastest, and provides very high correlation with human judgements.
2.3 Word Sense Disambiguation
In the proposed method, we also explore the merits of sense disambiguation prior to computing the semantic relatedness between words. Thus, before computing semantic relatedness between text terms and ontology concepts, we first disambiguate the text terms, so as to compute even more precise relatedness values,
since word-to-word measures of semantic relatedness do not take into account the context of the terms. The WSD method that we are employing is unsupervised. Though supervised methods outperform their unsupervised rivals, they require extensive training in large data sets. Unsupervised approaches comprise corpus-based [17], knowledge-based [10] and graph-based [14] methods. However, the graph-based methods demonstrate high performance and seem to be a promising solution for unsupervised WSD. Such methods rely on the construction of semantic graphs from text. The graphs are consequently processed in order to select the most appropriate meaning of each examined word, in its given context. In this work, we use a graph-based approach that constructs semantic networks and processes them with an altered PageRank formula that takes into account edge weights. The PageRank-based method is described in [14]. Any other WSD approach could have been implemented instead in CASAM. However, the method that we selected has demonstrated high accuracy with full coverage for all parts of speech when tested in benchmark WSD data sets [14].
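To make the graph-based idea more concrete, the following is a minimal sketch of a PageRank iteration that takes edge weights into account. The graph construction (one node per candidate sense, edges weighted by relatedness between senses of neighbouring words) and the constants are illustrative assumptions; this is not the exact formulation of [14].

```python
# Minimal weighted PageRank over a sense graph (illustrative only);
# edges[u] lists (neighbour, weight) pairs.
def weighted_pagerank(nodes, edges, d=0.85, iterations=50):
    out_weight = {u: sum(w for _, w in edges.get(u, ())) for u in nodes}
    score = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iterations):
        incoming = {u: 0.0 for u in nodes}
        for u in nodes:
            if out_weight[u] == 0:
                continue
            for v, w in edges.get(u, ()):
                incoming[v] += score[u] * w / out_weight[u]
        score = {u: (1 - d) / len(nodes) + d * incoming[u] for u in nodes}
    return score

# For each ambiguous word, the candidate sense with the highest score is kept.
```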
3 Semantic-Based Automated Annotation of Text Documents with Ontology Concepts
This section presents the proposed automated semantic annotation method that is followed in CASAM. The overall architecture of the KDTA module is depicted in Figure 1. Given a text document written in natural language, the pre-processing phase starts with the identification of the text language, its translation to English, if necessary, and the application of Part of Speech (POS) tagging and Word Sense Disambiguation (WSD) techniques. Then, the text is semantically annotated with ontology concepts. For this purpose, we calculate the semantic relatedness between candidate keywords of the text and the concepts of the domain ontology, and select for annotation the keywords that are more closely related to ontology concepts than others, in the sense of having higher relatedness values. In addition, KDTA exploits the senses of the ontology concepts, where available, as well as other external resources, such as WordNet and Wikipedia, for the calculation of semantic relatedness. Thus, given a text document, the proposed solution depicted in Figure 1 produces a ranking of proposed annotations of text segments with ontology concepts. The highest ranked proposals can be used for the annotation of the text with ontology concepts. The overall solution can scale up to large document collections, since the language identification, online translation, POS tagging, and WSD modules do not require any type of training or learning, and the computation of semantic relatedness values is supported by a powerful infrastructure [15] that has indexed all pairwise WordNet synset relatedness values in order to accelerate computations. (In the remainder of the paper, the words concept, sense, and synset may be used interchangeably to describe the meaning of a word, among the several offered by a dictionary or a word thesaurus.)
Fig. 1. Overall architecture of the KDTA CASAM Module
KDTA is implemented as a system on the general-purpose text engineering platform Ellogon (http://www.ellogon.org/). Apart from providing basic pre-processing modules, Ellogon facilitates an open and flexible architecture for KDTA and provides efficient handling of text document information.
3.1 Pre-processing Phase
Given an input text, a language identifier is called in order to detect the language of the text. At this stage, KDTA operates on English documents, and thus, in case the text appears in another language, online translation services are exploited to translate the input into English. The next step is the annotation of the text with part-of-speech (POS) tags. The use of such a tagger is important, since the POS tag provides useful information to the disambiguation process and it is also helpful in the identification of candidate keywords to be annotated with ontology concepts. Particularly in CASAM, the domain ontology comprises mainly nouns, and thus, nouns or noun phrases in the input text are more likely to be linked to concepts of the ontology. The last step of the pre-processing phase is the disambiguation of the input text. This process results in finding the correct sense of each word, by consulting WordNet. In particular, we use the PageRank-based method in [14] to find the sense that corresponds to each word.
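The pre-processing chain can be summarised by the following sketch. Every helper function (language detection, translation, POS tagging, WSD) is a placeholder standing in for the corresponding Ellogon/CASAM component, so their names and signatures are assumptions.

```python
# Illustrative pre-processing chain; the passed-in helpers are placeholders,
# not the actual Ellogon/CASAM APIs.
def preprocess(text, detect_language, translate_to_english, pos_tag, disambiguate):
    if detect_language(text) != "en":          # language identification
        text = translate_to_english(text)      # online translation service
    tagged = pos_tag(text)                     # [(token, POS), ...]
    # Nouns and noun phrases are the most likely annotation candidates,
    # since the CASAM domain ontology comprises mainly nouns.
    candidates = [(tok, pos) for tok, pos in tagged if pos.startswith("NN")]
    return disambiguate(candidates)            # attach a WordNet sense per token
```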
3.2 Annotating Text Words with Ontology Concepts
The annotation procedure, as shown in Figure 2, comprises three consecutive steps: exact matching, stem matching and semantic matching (similarity calculation). At the first step of exact matching, the method searches for lexicalizations of concepts inside the input text.
Fig. 2. The proposed annotation method
In case of success, the document is annotated with the corresponding concept, and a relatedness value equal to 1 is assigned to that annotation. If, on the other hand, none of the concepts of the given ontology appears in the text in its original form (i.e. as it appears in the ontology), the second step searches for appearances of its stemmed form. If such a case occurs, the document is annotated with the corresponding concept and a relatedness value equal to 0.9 is assigned to the annotation. The third step is responsible for a more advanced annotation procedure. Four different methods are implemented:
(a) Baseline Ad-Hoc method - When this method is used, KDTA consults WordNet to retrieve a list of synonyms for the lexicalization of each concept of the given ontology. The calculation of relatedness in this method depends to a large extent on the set of the retrieved synonyms. In particular, it assigns high relatedness scores in cases where the semantic distance between a concept and its synonym is small, and lower relatedness scores otherwise. The semantic distance is actually the length of the path in WordNet between the concept and its synonym. Equation (1) incorporates the above constraints in the calculation of SD, the relatedness score between a text keyword and a synonym of an ontology concept:

$$SD = \frac{1}{\log\frac{CS}{NS}\cdot\log NS} \qquad (1)$$

NS is the total number of synonyms of the concept in question and CS is the semantic distance, expressed as the length of the path in WordNet, between the concept and its synonym.
(b) Relatedness-based Annotation with Omiotis - In contrast to the Baseline method, where the relatedness is calculated according to the distance of the synonym from the domain concept, this method relies on the relatedness between two words (in this case between a word of the text and an ontology concept), in order to perform the annotation. Specifically, after exploiting a list of standard English common words to reduce the term space of the input text, the underlying idea is to measure the relatedness between each of the resulting words and each ontology concept. Only words that are related to concepts, in the sense of having a relatedness score greater than zero, are annotated, and in particular, we annotate a specific word with the concept that gives the highest
relatedness score. Regarding the computation of relatedness between two terms, i.e. a candidate word and the lexicalization of a concept, we use the measure of Omiotis [16], which was shown to provide the highest correlation with human judgments among the dictionary-based measures of semantic relatedness.
(c) Relatedness-based Annotation with Omiotis and WSD - This method is an extension of the previous method. It exploits additional information, derived from the pre-processing phase, in order to construct a specific structure for each word, comprising its POS tag and its sense. This structure is further exploited by Omiotis, in order to calculate the semantic relatedness between the word and an ontology concept, and provide a more accurate score. However, this method requires the ontology concepts to be disambiguated as well, and thus its direct application to an arbitrary ontology is not always straightforward. Besides the disambiguation part, the main idea is the same as in (b).
(d) Relatedness-based Annotation with Omiotis and Wikipedia - The last annotation method employs an additional Wikipedia-based measure, in order to handle those cases not supported by Omiotis, i.e., the words that do not appear in WordNet. The method employs the measure of Milne and Witten [11], which is the fastest among several alternatives and provides very high correlation with human judgements.
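Putting the three steps together for a single candidate keyword, a minimal sketch looks as follows. The `stem` and `relatedness` callables stand in for an arbitrary stemmer and for Omiotis (with the Wikipedia-based fallback of method (d)); they are assumptions, not the actual KDTA implementation.

```python
# Sketch of the three-step annotation of one candidate keyword.
def annotate(keyword, ontology_concepts, stem, relatedness):
    # Step 1: exact matching - the concept lexicalisation appears verbatim.
    for concept in ontology_concepts:
        if keyword.lower() == concept.lower():
            return concept, 1.0
    # Step 2: stem matching - compare stemmed forms.
    for concept in ontology_concepts:
        if stem(keyword) == stem(concept):
            return concept, 0.9
    # Step 3: semantic matching - keep the most related concept, if any.
    best_concept, best_score = None, 0.0
    for concept in ontology_concepts:
        score = relatedness(keyword, concept)   # Omiotis or Wikipedia-based fallback
        if score > best_score:
            best_concept, best_score = concept, score
    return best_concept, best_score             # (None, 0.0) if nothing is related
```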
4 Experimental Evaluation
This section presents the empirical evaluation of our semantic annotation method on two datasets. Subsection 4.1 presents evaluation results on the LUSA dataset, regarding the environmental domain, while Subsection 4.2 presents the performance of the method on the GENIA dataset from the molecular biology domain.
4.1 Environmental Domain: LUSA Corpus
The first dataset that was used for the empirical evaluation of the proposed annotation method comprises 51 documents provided by the LUSA Agency (http://www.lusa.pt/), regarding the environmental domain. The corresponding ontology, developed in the CASAM project, comprises 230 concepts, covering environmental concepts, such as “Wind”, “Water”, “Solar Energy”, “Alternative Energy”, etc., entities, such as “Person”, “Profession Name”, etc., and technological concepts, such as “Media Equipment”, “Car”, “Building”, etc. For the evaluation of the proposed method on the given documents, a ground truth dataset was created in CASAM, in order to serve as a gold standard and assist in deriving quantitative results using Macro Average Precision and Recall. The ground truth dataset contains manual annotations of terms residing in the 51 documents, with ontology concepts from the used ontology. Furthermore, the ontology concepts were manually disambiguated with WordNet senses. Table 1 presents the performance of the proposed method for the four alternative approaches of the advanced annotation step, discussed in Subsection 3.2.
Table 1. Evaluation results for the LUSA dataset

                       Baseline   Omiotis   Omiotis&WSD   Omiotis&Wiki
Macro Avg. Precision   0.73       0.51      0.54          0.51
Macro Avg. Recall      0.76       0.57      0.55          0.58
Macro Avg. F-measure   0.73       0.49      0.51          0.50
The best results were achieved with the use of the baseline method. This behavior is explained by the fact that in many cases the manually annotated data set contained cases as simple as the annotation of a term whose stem exists in the ontology. Those cases do not produce high relatedness values, and thus cannot be tracked by the relatedness-based methods. Beyond the baseline method, Omiotis and its enhancement with Wikipedia perform rather similarly. On the other hand, the disambiguation of words seems to help increase the precision of the method by 3 p.p., but decreases recall. The overall F-measure using WSD is 2 p.p. higher than in the simple case, which shows that WSD can help in the computation of more accurate relatedness values. A final point regarding the interpretation of the experimental results is that the domain ontology comprises many concepts regarding entities, such as “Person”, “Person Name”, “Profession Name”, “Organization”, “Date”, etc. In the context of the CASAM project, the proposed method is extended by the recognition of entities, using the Open Calais (http://www.opencalais.com/) service. In this manner, the performance of the method can be improved further by about 20 p.p., achieving nearly perfect results.
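For reference, macro-averaged scores such as those reported in Tables 1 and 2 can be computed as sketched below; the matching granularity assumed here is a per-document comparison of predicted and gold concept sets, which may differ in detail from the CASAM ground-truth protocol.

```python
# Macro-averaged precision/recall/F-measure; `predicted` and `gold` map each
# document id to the set of ontology concepts annotating it.
def macro_prf(predicted, gold):
    precision_sum = recall_sum = 0.0
    for doc, reference in gold.items():
        proposed = predicted.get(doc, set())
        correct = len(proposed & reference)
        precision_sum += correct / len(proposed) if proposed else 0.0
        recall_sum += correct / len(reference) if reference else 0.0
    p = precision_sum / len(gold)
    r = recall_sum / len(gold)
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f
```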
4.2 Molecular Biology Domain: GENIA Corpus
In order to test the applicability of our proposed architecture in a different domain, we also experimented on a dataset from the molecular biology domain. More specifically, we have used the GENIA ontology (http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/genia-ontology.html), comprising 49 concepts, and a set of 2000 MEDLINE abstracts, which have been annotated with GENIA concepts. Since we know the correct annotations per document, we were able to measure the macro-average precision, recall and F-measure, as previously. Table 2 shows the results for the baseline, the Omiotis, and the Omiotis+Wiki approach. From the reported results, we can observe that the baseline achieves a very high precision of almost 72%, but also a very low recall, and a total F-measure of 15%. In contrast, the Omiotis and Omiotis+Wiki approaches increase the recall and the total F-measure by 13 p.p. and 16 p.p. respectively, compared to the baseline. The reason for the low recall of the baseline method stems from the fact that there are rarely exact matchings between terms and ontology concepts in the ground truth answers.
Table 2. Evaluation results for the GENIA dataset

                       Baseline   Omiotis   Omiotis&Wiki
Macro Avg. Precision   0.72       0.30      0.36
Macro Avg. Recall      0.08       0.26      0.27
Macro Avg. F-measure   0.15       0.28      0.31
On the other hand, the few matchings that exist (directly or in stemmed form) are very successfully captured by the baseline, and this explains its very high precision. Regarding the performance of the two relatedness approaches, it is overall lower than in the first data set, due to the very low relatedness values that were calculated. More specifically, in this second dataset, there were many proposals for each annotation, all having close to zero (i.e., between 10^-5 and 3·10^-1) relatedness values. Compared to the baseline, the recall of the relatedness methods improved their overall performance. This is due to the fact that the relatedness methods can capture annotations even between a text segment and an ontology concept that contain different parts of speech, or are connected through a really long path in WordNet or Wikipedia, which is often the case in this dataset. A possible improvement in this case could come from the use of an additional knowledge base that would be more specific to the domain, i.e., a molecular biology lexicon. This would solve the problem of low relatedness values, since for each candidate term, the lemmas from the lexicon could be used for the computation of Omiotis. Omiotis can also compute the relatedness between two sentences, or even between a term - like an ontology concept - and a sentence. Since the relatedness approaches seem to improve the overall performance in this dataset, but mostly due to increased recall, we have also experimented with various thresholds of the Omiotis values, i.e., values below which we do not consider the proposals at all. Our results showed that the macro-averaged precision can reach up to almost 95% for the Omiotis and the combined Omiotis-Wikipedia approaches, but the respective recall drops to almost 3%. The cut-offs that we tested were 10^-3, 10^-2, and 10^-1, with the latter producing the best precision. Further investigation of how to automatically tune the relatedness variants of our approach seems promising and may lead to even more interesting results in the future.
5 Conclusions
This work presented a method for automated semantic annotation of documents with ontology concepts, based on generic lexicons and Web resources. The use of generic lexical and Web resources removes the need for trained semantic classifiers, thus making the method scalable. The proposed method consists of a novel combination of measures of semantic relatedness and word sense disambiguation techniques, in order to identify the most related ontology concepts for a given document. The proposed method forms the basis for the Knowledge-driven Text Analysis (KDTA) module, in the context of the CASAM
project, and we have validated its performance in two case studies, obtaining promising results.
Acknowledgments. This work has been partially funded by the CASAM Project, under the EU FP7 programme (contract number FP7-217061). We would like to thank our partners in CASAM for providing us with the ontology and the data that we used in the first experiment.
References
1. Agirre, E., Rigau, G.: A proposal for word sense disambiguation using conceptual distance. In: International Conference on Recent Advances in NLP (1995)
2. Budanitsky, A., Hirst, G.: Evaluating wordnet-based measures of lexical semantic relatedness. Computational Linguistics 32(1), 13–47 (2006)
3. Cimiano, P., Ladwig, G., Staab, S.: Gimme’ the context: context-driven automatic semantic annotation with c-pankow. In: WWW, pp. 332–341 (2005)
4. Ding, Y., Embley, D.W.: Using data-extraction ontologies to foster automating semantic annotation. In: ICDE Workshops (2006)
5. El-Beltagy, S.R., Hazman, M., Rafea, A.A.: Ontology based annotation of text segments. In: SAC (2007)
6. Erdmann, M., Maedche, A., Schnurr, H.P., Staab, S.: From manual to semi-automatic semantic annotation: About ontology-based text annotation tools. ETAI Journal - Section on Semantic Web 6(2) (2001)
7. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. In: IJCAI, pp. 1606–1611 (2007)
8. Laclavik, M., Seleng, M., Gatial, E., Balogh, Z., Hluchý, L.: Ontology based text annotation - ontea. In: EJC (2006)
9. Leacock, C., Miller, G., Chodorow, M.: Using corpus statistics and wordnet relations for sense identification. Computational Linguistics 24(1), 147–165 (1998)
10. Lesk, M.: Automated sense disambiguation using machine-readable dictionaries: How to tell a pine cone from an ice cream cone. In: SIGDOC (1986)
11. Milne, D., Witten, I.H.: An effective, low-cost measure of semantic relatedness obtained from wikipedia links. In: AAAI Workshop on Wikipedia and Artificial Intelligence (2008)
12. Patwardhan, S., Pedersen, T.: Using wordnet based context vectors to estimate the semantic relatedness of concepts. In: EACL 2006 Workshop Making Sense of Sense - Bringing Computational Linguistics and Psycholinguistics Together (2006)
13. Resnik, P.: Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 11, 95–130 (1999)
14. Tsatsaronis, G., Varlamis, I., Nørvåg, K.: An experimental study on unsupervised graph-based word sense disambiguation. In: CICLing (2010)
15. Tsatsaronis, G., Varlamis, I., Nørvåg, K., Vazirgiannis, M.: Omiotis: A thesaurus-based measure of text relatedness. In: ECML-PKDD (2009)
16. Tsatsaronis, G., Varlamis, I., Vazirgiannis, M.: Text relatedness based on a word thesaurus. Journal of Artificial Intelligence Research 37, 1–39 (2010)
17. Yarowsky, D.: Word-sense disambiguation using statistical models of roget’s categories trained on large corpora. In: Int. Conf. on Computational Linguistics (1992)
A Gene Expression Programming Environment for Fatigue Modeling of Composite Materials
Maria A. Antoniou1, Efstratios F. Georgopoulos1,2, Konstantinos A. Theofilatos1, Anastasios P. Vassilopoulos3, and Spiridon D. Likothanassis1
1 Pattern Recognition Laboratory, Dept. of Computer Engineering & Informatics, University of Patras, 26500, Patras, Greece
2 Technological Educational Institute of Kalamata, 24100, Kalamata, Greece
3 Composite Construction Laboratory (CCLab), Ecole Polytechnique Fédérale de Lausanne (EPFL), Station 16, CH-1015 Lausanne, Switzerland
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract. In the current paper we present the application of a Gene Expression Programming environment to modeling the fatigue behavior of composite materials. The environment was developed using the JAVA programming language, and is an implementation of a variation of Gene Expression Programming. Gene Expression Programming (GEP) is a new evolutionary algorithm that evolves computer programs (they can take many forms: mathematical expressions, neural networks, decision trees, polynomial constructs, logical expressions, and so on). The computer programs of GEP, irrespective of their complexity, are all encoded in linear chromosomes. The linear chromosomes are then expressed or translated into expression trees (branched structures). Thus, in GEP, the genotype (the linear chromosomes) and the phenotype (the expression trees) are different entities (both structurally and functionally). This is the main difference between GEP and classical tree-based Genetic Programming techniques. In order to evaluate the performance of the presented environment, we tested it on fatigue modeling of composite materials.
Keywords: Gene Expression Programming, Genetic Programming, Evolutionary Algorithms, System Modeling, Fatigue Modeling.
1 Introduction
The problem of discovering a mathematical expression that describes the operation of a physical or artificial system using empirically observed variables or measurements is a very common and important problem in many scientific areas. Usually, the observed data are noisy and sometimes missing. Also, it is very common that there is no known way to express the relation in a formal mathematical way. These kinds of problems are called modeling problems, symbolic system identification problems, black box problems, or data mining problems [3].
Most data-driven system modeling or system identification techniques assume an a-priori known model structure and focus mainly on the calculation of the model parameters’ values. But what can be done when there is no a-priori knowledge about the model’s structure? Gene Expression Programming (GEP) is a domain-independent problem-solving technique in which computer programs are evolved to solve, or approximately solve, problems [4][12]. GEP is a member of a broad family of techniques called Evolutionary Algorithms. All these techniques are based on the Darwinian principle of reproduction and survival of the fittest and are similar to the biological genetic operations such as crossover and mutation [11]. GEP addresses one of the central goals of computer science, namely automatic programming: to create, in an automated way, a computer program that enables a computer to solve a problem [1], [2]. In GEP the evolution operates on a population of computer programs of varying sizes and shapes. GEP starts with an initial population of thousands or millions of randomly generated computer programs and then applies the principles of biological evolution to create a new (and often improved) population of programs. The fundamental difference between other Evolutionary Algorithms and GEP is that in GEP there is a distinct discrimination between the genotype and the phenotype of an individual. So, in GEP the individuals are symbolic strings of fixed length representing an organism’s genome (chromosomes/genotype), but these simple entities are encoded as non-linear entities of different sizes and shapes, determining an organism’s fitness (expression trees/phenotype). GEP is a new evolutionary technique and its applications so far are quite limited. However, it has been successfully applied to some real-life problems [5],[6],[7]. In the current paper we present an integrated GEP environment with a graphical user interface (GUI), called jGEPModeling. The jGEPModeling environment was developed using the JAVA programming language, and is an implementation of the steady-state gene expression programming algorithm. In order to evaluate the performance of jGEPModeling we tested it on fatigue modeling of composite materials. Namely, we tried to model the behavior of a composite material subject to stress tests. This material is a fiberglass laminate, which is a typical material for wind turbine blade construction.
2 The Gene Expression Programming Algorithm
Gene Expression Programming is a new Evolutionary Algorithm proposed by Ferreira (2001) as an alternative method to overcome the drawbacks of Genetic Algorithms (GAs) and Genetic Programming (GP) [2][4][10][12]. Similar to GA and GP, GEP follows the Darwinian principles of natural selection and survival of the fittest individual [11]. The fundamental difference between the three algorithms is that in GEP there is a distinct discrimination between the genotype and the phenotype of an individual. In GEP the individuals are symbolic strings of fixed length representing an organism’s genome (chromosomes/genotype), but these simple entities are encoded as non-linear entities of different sizes and shapes, determining an organism’s fitness (expression trees/phenotype) [4].
GEP chromosomes are usually composed of more than one gene of equal length. Each gene is composed of a head and a tail. The head contains symbols that represent both functions and terminals, whereas the tail contains only terminals. The set of functions usually includes any mathematical or Boolean function that the user believes is appropriate to solve the problem. The set of terminals is composed of the constants and the independent variables of the problem. The head length (denoted h) is chosen by the user, whereas the tail length (denoted t) is evaluated by:
t = (n − 1) h + 1    (1)
where n is the number of arguments of the function with most arguments. Despite its fixed length, each gene has the potential to code for Expression Trees (ETs) of different sizes and shapes, the simplest being composed of only one node (when the first element of a gene is a terminal) and the largest composed of as many nodes as the length of the gene (when all the elements of the head are functions with maximum arity). One of the advantages of GEP is that the chromosomes will always produce valid expression trees, regardless of modification, and this means that no time needs to be spent on rejecting invalid organisms, as in the case of GP [12]. In GEP, each gene encodes an ET. In the case of multigenic chromosomes, each gene codes for a sub-ET and the sub-ETs interact with one another using a linking function (any mathematical or Boolean function with more than one argument) in order to fully express the individual. Every gene has a coding region known as an Open Reading Frame (ORF) that, after being decoded, is expressed as an ET, representing a candidate solution for the problem. While the start point of the ORF is always the first position of the gene, the termination point does not always coincide with the last position of a gene. Next follows a full description of the algorithm’s steps (a small decoding sketch is given after the step list):
1. Creation of initial population: The initialization in GEP is a very trivial task; in fact, it is the random creation of the chromosomal structure of the individuals. According to the nature of the problem, we must choose the symbols used to create the chromosomes, that is, the set of functions and terminals we believe to be appropriate to solve the problem. We must also choose the length of each gene, the number of genes per chromosome and how the products of their expression interact with one another.
2. Express chromosomes: The second step is the expression of the chromosome of each individual as an ET. This process is also very simple and straightforward. For the complete expression, the rules governing the spatial distribution of functions and terminals must be followed. First, the start of a gene corresponds to the root of the ET, this node forming the first line of the ET. Second, depending on the number of arguments of each element (functions may have a different number of arguments, whereas terminals have an arity of zero), in the next line are placed as many nodes as there are arguments to the elements in the previous line. Third, from left to right, the new nodes are filled, in the same order, with the elements of the gene. This process is repeated until a line containing only terminals is formed.
3. Execute each program: Having expressed each individual as an ET, it is now easy to find and compute the mathematical (or Boolean) expression it codes. We implement this by a post-order traversal of the ET.
4. Evaluate fitness: One crucial step in GEP is finding a function that performs well for all fitness cases within a certain error of the correct value. In the design of the fitness function, the goal must be clearly and correctly defined in order to make the system evolve in the intended direction.
5. Keep best program: A feature that plays a significant role in GEP is elitism. Elitism is the cloning of the best chromosome(s)/individual(s) to the next population (also called generation). By elitism, we guarantee that at least one descendant will be viable in the next generation, keeping at the same time the best trait during the process of adaptation.
6. Selection: In GEP, individuals are selected according to fitness by the tournament selection method to reproduce with modification. Tournament selection involves running several “tournaments” among a few individuals randomly chosen from the population. The winner of each tournament (the one with the best fitness) is selected for genetic modification. Selection pressure is easily adjusted by changing the tournament size. If the tournament size is larger, weak individuals have a smaller chance to be selected.
7. Reproduction: At this step of GEP we apply the genetic operators of mutation and recombination to the winners of the tournaments. Note that, during reproduction, it is the chromosomes of the individuals, not the ETs, that are reproduced with modification and transmitted to the next generation.
8. Prepare new programs of the next generation: At this step, we replace the tournament losers in the population with the new individuals created by reproduction.
9. Termination criterion: We check whether the termination criterion is fulfilled; if it is not, we return to step 2. As a termination criterion we used the maximum number of 500,000 generations that GEP was left to run.
10. Results: As a result we return the best individual ever found during the evolution process.
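As a concrete illustration of steps 1–3, the sketch below computes the tail length of equation (1) and decodes a gene written in Karva notation into an expression tree by filling the tree level by level, left to right. The arity table and the example gene are illustrative assumptions and do not reproduce jGEPModeling's internal data structures.

```python
# Illustrative Karva-style decoding of a GEP gene into an expression tree;
# a gene is a sequence of symbols (a string works for single-character symbols).
ARITY = {'+': 2, '-': 2, '*': 2, '/': 2, 'a': 0, 'b': 0}

def tail_length(head_length, max_arity):
    return head_length * (max_arity - 1) + 1        # t = h(n - 1) + 1

def decode(gene):
    """Fill the tree level by level, left to right; symbols after the ORF are ignored."""
    root = {'symbol': gene[0], 'children': []}
    level, pos = [root], 1
    while level:
        next_level = []
        for node in level:
            for _ in range(ARITY[node['symbol']]):
                child = {'symbol': gene[pos], 'children': []}
                node['children'].append(child)
                next_level.append(child)
                pos += 1
        level = next_level
    return root

# Example: with head "+*/" and tail "abab" (t = 3*(2-1)+1 = 4), the gene
# "+*/abab" decodes to the expression (a * b) + (a / b).
```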
3 Modeling Experiments
Many experimental data sets on the fatigue behavior of composite laminates can be found in the literature (see [8] and [9] for details). The database used here contains fatigue data of composite materials tested under several fatigue loading conditions of constant amplitude (see [8] for details). Here, the task is to calculate a mathematical expression of the number of cycles to failure N as a function of R and σmax. To our knowledge, this is the first time that Gene Expression Programming is used for modeling the fatigue behavior of composite laminates. Our dataset was split into two subsets, namely a training set consisting of 70% of the initial dataset and a test set consisting of the remaining 30%. The training set was used to find the best model using GEP, and the test set was used to assess its fitness on unseen data. After a certain number of experiments, we were in a position to define standard values for some running parameters of jGEPModeling. Table 1 lists the default parameters we use for our experiments.
Table 1. Default parameters derived by experimentation

Parameter                 Value
Number of Generations     500,000
Function Set              {+, -, *, /, ^, abs, cos, sin, ln, exp, tan, min, max}
Constants Range           [-3, 3]
Head Size                 70
Type of Recombination     two points recombination
Mutation Probability      0.5
Population Size           1500
Tournament Size           20
Table 2. Best model found by GEP logN = (((max(((normsmax) * (abs(abs((abs(abs(normR))) - (abs(min((exp((ln(abs(max((0.7459479346411833 ),( -0.7482939400422683)))))^((exp((normR)^1/2)) * (exp(((exp(cos(((normR)^(-2.526001698730898)) + ((normsmax) (ln(exp((min((2.8275740069528528 ),( 0.06370620951323458))max((2.8275740069528528 ),( 0.06370620951323458))) + (min((-2.8053943509604755 ),( (ln(normsmax))^3))max((2.8053943509604755 ),( (ln(normsmax))^3))))))))))^2) + (normR))))) ),( 2.8275740069528528))max((exp((ln(abs(max((-0.7459479346411833 ),( 0.7482939400422683)))))^((exp((normR)^1/2)) * (exp(((exp(cos(((normR)^(2.526001698730898)) + ((normsmax) - (ln(exp((min((2.8275740069528528 ),( 0.06370620951323458))max((2.8275740069528528 ),( 0.06370620951323458))) + (min((2.8053943509604755 ),( (ln(normsmax))^3))max((-2.8053943509604755 ),( (ln(normsmax))^3))))))))))^2) + (normR))))) ),( 2.8275740069528528))))))) ),( (sin((cos(0.7482939400422683)) * (-0.7459479346411833)))^2))) * (abs(exp(normsmax)))) + (2.8053943509604755))^2
The model described in Table 2 has mean square error equal to 0.2106026823310941 in the test set.
4 Conclusions
The results obtained by applying jGEPModeling to fatigue modeling of composite materials confirm our intuition that Gene Expression Programming is a very effective technique for system modeling. So, we conclude that the jGEPModeling environment can be used in a variety of problems in different scientific areas. Also, it is important to note that the rapid development of very fast computer hardware encourages the use of techniques like GEP, and it is expected that these kinds of techniques will be used more and more for solving difficult problems in the future. Concerning future directions of the presented jGEPModeling environment, these could be:
• The application of the jGEPModeling tool to other system modeling tasks in various scientific areas, such as system identification, time-series prediction, etc.
• The further improvement of the tool in terms of speed, memory management and parallel implementation.
• The implementation of more sophisticated genetic operators, initialization techniques and selection techniques in the basic Gene Expression Programming technique.
Thus, the jGEPModeling environment has proved to be a powerful GEP tool with great expansion capabilities, which we intend to exploit in the near future.
References
1. Koza, J.R.: Genetic Programming: A Paradigm for Genetically Breeding Populations of Computer Programs to Solve Problems. Technical Report STAN-TR-CS 1314, Stanford University (1990)
2. Koza, J.R.: Genetic programming: on the programming of computers by means of natural selection. MIT Press, Cambridge (1992)
3. Winkler, S., Affenzeller, M., Wagner, S.: Identifying Nonlinear Model Structures Using Genetic Programming Techniques. In: Cybernetics and Systems 2004, Austrian Society for Cybernetic Studies, pp. 689–694 (2004)
4. Ferreira, C.: Gene Expression Programming: A New Adaptive Algorithm for Solving Problems. Complex Systems 13(2), 87–129 (2001)
5. Lopez, H.S., Weinert, W.R.: EGIPSYS: An Enhanced Gene Expression Programming Approach for Symbolic Regression Problems. International Journal of Applied Mathematics in Computer Science 14(3), 375–384 (2004)
6. Margny, M.H., El-Semman, I.E.: Extracting Logical Classification Rules with Gene Expression Programming: Microarray case study. In: AIML 2005 Conference, Cairo, Egypt, December 19-21 (2005)
7. Dehuri, S., Cho, S.B.: Multi-Objective Classification Rule Mining Using Gene Expression Programming. In: Third International Conference on Convergence and Hybrid Information Technology (2008)
8. Philippidis, T.P., Vassilopoulos, A.P.: Complex stress state effect on fatigue life of GRP laminates. Part I, experimental. Int. J. Fatigue 24, 813–823 (2002)
9. Nijssen, R.P.L.: OptiDAT – fatigue of wind turbine materials database, http://www.kc-wmc.nl/optimat_blades/index.htm
10. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading (1989)
11. Darwin, C.: On the Origin of Species (1859)
12. Ferreira, C.: Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence, 2nd edn. Springer, Heidelberg (2006)
13. Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press, Cambridge (1996)
A Hybrid DE Algorithm with a Multiple Strategy for Solving the Terminal Assignment Problem
Eugénia Moreira Bernardino1, Anabela Moreira Bernardino1, Juan Manuel Sánchez-Pérez2, Juan Antonio Gómez-Pulido2, and Miguel Angel Vega-Rodríguez2
1 Research Center for Informatics and Communications, Department of Computer Science, School of Technology and Management, Polytechnic Institute of Leiria, 2411 Leiria, Portugal
{eugenia.bernardino,anabela.bernardino}@ipleiria.pt
2 Department of Technologies of Computers and Communications, Polytechnic School, University of Extremadura, 10071 Cáceres, Spain
{sanperez,jangomez,mavega}@unex.es
Abstract. In recent decades, a large amount of interest has been focused on telecommunication network problems. One important problem in telecommunication networks is the terminal assignment problem. In this paper, we propose a Differential Evolution algorithm employing a “multiple” strategy to solve the Terminal Assignment problem. A set of available strategies is established initially. In each generation a strategy is selected based on the amount of fitness improvement achieved over a number of previous generations. We use tournament selection for this purpose. Simulation results with the different methods implemented are compared.
Keywords: Communication Networks, Optimization Algorithms, Differential Evolution Algorithm, Terminal Assignment Problem.
1 Introduction
With the recent growth of communication networks, their increasing complexity and the heterogeneity of the connected equipment, a large variety of combinatorial optimisation problems has appeared. Terminal Assignment (TA) is an important issue in telecommunication network optimisation, aiming to increase network capacity and to reduce cost [1] [2]. The task here is to assign a given set of terminals to a given set of concentrators [3]. The optimisation goals are to simultaneously produce feasible solutions, minimise the distances between concentrators and terminals assigned to them and maintain a balanced distribution of terminals among concentrators. The terminal and concentrator sites have fixed and known locations. The capacity requirement of each terminal is known and may vary from one terminal to another. Each concentrator is limited in the amount of traffic that it can accommodate, and the capacities of all concentrators and the cost of linking each terminal to a concentrator are known. The TA problem can be described as follows: (1) a set N of n distinct terminals; (2) a set M of m distinct concentrators; (3) a vector C, with the capacity required for each concentrator; (4) a vector T, with the capacity required for each terminal; (5) a matrix
CP, with the location (x,y) of each concentrator; and (6) a matrix CT, with the location (x,y) of each terminal. The TA problem is an NP-complete combinatorial optimisation problem. The intractability of this problem is a motivation for the pursuit of a Differential Evolution (DE) algorithm that produces approximate, rather than exact, solutions. DE is an Evolutionary Algorithm (EA) that uses the principle of natural selection to evolve a set of solutions toward an optimal solution. An EA is a subset of evolutionary computation, a population-based metaheuristic optimisation algorithm [4]. Our algorithm is based on the Hybrid DE (HDE) algorithm proposed by Bernardino et al. [5] to solve the TA problem. In our implementation we use a new “multiple” strategy to solve the TA problem (MHDE) and we make some important improvements to the Local Search (LS) procedure. We compare the performance of MHDE with three algorithms used in the literature: the Genetic Algorithm (GA), the Tabu Search (TS) algorithm and the HDE algorithm. The paper is structured as follows. In Section 2 we describe the implemented MHDE algorithm; in Section 3 we discuss the computational results obtained; and in Section 4 we report the conclusions.
2 The Proposed MHDE
MHDE combines a DE with a LS algorithm. The DE was introduced by Storn and Price in 1995 [6]. The DE is a population-based algorithm using crossover, mutation and selection operators. It explores the candidate solutions encoded in chromosomes and exploits those with better fitness iteratively until the stop conditions are reached. The LS by itself explores the solution space, making specific moves in its neighbourhood. The MHDE combines those two aspects, using the chromosomes that are produced by DE and optimising them using a LS. The crucial idea behind DE is a scheme for generating trial parameter vectors. There are several strategies with different approaches [5]. The main steps of the MHDE algorithm are given below:

Initialise Parameters
Create initial Population P
Evaluate Population P
WHILE stop criterion isn't reached
  For each individual in P
    Apply local search: indS = LocalSearch(P(i))
    Apply strategy: indS' = Strategy(indS)
    Evaluate indS'
    Select the best individual: indS'' = P(i) or indS'
    Add indS'' to new Population
Initialise Parameters – The following parameters must be defined by the user (1) ps – population size; (2) CR – crossover probability; (3) F – mutation factor selected between 0 and 2; (4) strategy and (5) mg – number of generations. Create initial Population P – A Greedy algorithm proposed by Abuali et al. [7] is used to create the initial population (P). The solutions are represented using integer vectors. Each position in the vector corresponds to a terminal. The value carried by the position i specifies the concentrator to which terminal i is to be assigned.
Evaluation of Solutions – The fitness function is based on 3 different objectives: (1) the total number of terminals connected to each concentrator (the objective is to guarantee the balanced distribution of terminals among concentrators); (2) the distance between the concentrators and the terminals assigned to them (the objective is to minimise the distances) and (3) the penalisation if a solution is not feasible. M −1
$$\text{fitness} = 0.9 \cdot \sum_{c=0}^{M-1} bal_c \;+\; 0.1 \cdot \sum_{t=0}^{N-1} dist_{t,c(t)} \;+\; \text{Penalisation} \qquad (1)$$

$$bal_c = \begin{cases} 10 & \text{if } total_c = \mathrm{round}(N/M)+1 \\ 20 \cdot \mathrm{abs}\big(\mathrm{round}(N/M)+1-total_c\big) & \text{otherwise} \end{cases}, \qquad total_c = \sum_{t=0}^{N-1} \begin{cases} 1 & \text{if } c(t)=c \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

$$dist_{t,c(t)} = \sqrt{\big(CP[c(t)].x - CT[t].x\big)^2 + \big(CP[c(t)].y - CT[t].y\big)^2}, \qquad \text{Penalisation} = \begin{cases} 0 & \text{if feasible} \\ 500 & \text{otherwise} \end{cases} \qquad (3)$$

where c(t) is the concentrator of terminal t, t is a terminal, c is a concentrator, M is the number of concentrators and N is the number of terminals.
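A direct transcription of this fitness, as reconstructed above, is sketched below. The feasibility test (terminal demands not exceeding concentrator capacities) is assumed to be derived from the C and T vectors, and the rounding follows the reconstruction; both may differ in detail from the authors' implementation.

```python
import math

# Sketch of the fitness of Eqs. (1)-(3); solution[t] is the concentrator
# assigned to terminal t, CP/CT hold (x, y) coordinates, C/T hold capacities.
def fitness(solution, CP, CT, C, T):
    N, M = len(solution), len(CP)
    target = round(N / M) + 1                      # balanced load per concentrator
    totals = [0] * M
    load = [0.0] * M
    for t, c in enumerate(solution):
        totals[c] += 1
        load[c] += T[t]
    feasible = all(load[c] <= C[c] for c in range(M))   # assumed feasibility test
    bal = sum(10 if totals[c] == target else 20 * abs(target - totals[c])
              for c in range(M))
    dist = sum(math.hypot(CP[solution[t]][0] - CT[t][0],
                          CP[solution[t]][1] - CT[t][1])
               for t in range(N))
    penalisation = 0 if feasible else 500
    return 0.9 * bal + 0.1 * dist + penalisation
```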
Apply LS – In our work, the LS method is applied to all individuals in P at the beginning of each generation. In HDE the LS method is applied to trial individuals (after applying a strategy). One neighbour is generated by swapping 2 terminals between 2 concentrators, c1 and c2 (randomly chosen). The algorithm searches for a better solution in the initial set of neighbours. If the best neighbour improves the actual solution, then the algorithm replaces the actual solution with that neighbour. Otherwise, it creates another set of neighbours. In this case, one neighbour results from the assignment of one terminal of c1 to c2 or of c2 to c1 (see [5] for a comprehensive description of the LS). Our LS algorithm has some important improvements compared to the LS proposed by Bernardino et al. [5]. After creating a neighbour we do not perform a full examination to calculate the new fitness value; we just update the fitness value based on the modifications made to create the neighbour.
Apply strategy to create a new individual – At the mutation step, individuals (usually three) are selected from the population [5]. Mutation adds the weighted difference of two (or more) individuals to the third. At the recombination step, new individuals are created by combining the mutated individual with the old individual. Combination takes place according to the applied strategy [8]. After recombination, if a gene of a solution is outside of the allowed range, it is necessary to apply the following transformation:

IF concentrator > M THEN concentrator = concentrator - M
ELSE IF concentrator < 0 THEN concentrator = M + concentrator
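As an illustration of how a trial individual might be produced for this integer encoding, the sketch below follows the common DE/best/1 mutation with exponential crossover (Best1Exp) and applies the wrap-around transformation quoted above. Rounding the real-valued DE arithmetic back to an integer concentrator index is an assumption; the paper does not spell out this detail.

```python
import random

# Sketch of a Best1Exp trial vector for the integer TA encoding (assumed rounding).
def best1exp_trial(target, best, r1, r2, F, CR, M):
    n = len(target)
    trial = list(target)
    j = random.randrange(n)          # random starting gene of the exponential crossover
    copied = 0
    while True:
        value = int(round(best[j] + F * (r1[j] - r2[j])))   # DE/best/1 mutation
        if value > M:                # wrap-around transformation quoted above
            value -= M
        elif value < 0:
            value = M + value
        trial[j] = value
        copied += 1
        j = (j + 1) % n
        if copied == n or random.random() >= CR:
            break
    return trial
```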
In “multiple strategy”, a set of available strategies is initially established. The strategy set contains the 5 exponential strategies applicable to the problem. Bernardino et al. [5] prove that the best strategies applied to the TA problem are the exponential strategies (RandToBest1Exp, Rand1Exp, Best1Exp, Rand2Exp, Best2Exp). In the initialisation phase, each strategy has the same probability of being selected. From thereon and after applying a strategy, a fitness value is assigned to the respective strategy based on its contribution to individuals’ fitness. This fitness value is used for strategy selection. The strategies are selected using a tournament selection. The application chooses 3 random strategies. The strategy with the highest fitness will
win. After a predefined number of generations (NUMPREVGENS) the probabilities of each strategy are updated based on their contribution in the last generations. The “multiple” strategy consists of the following steps (numGensS is the number of times each strategy was applied in the previous generations, fitPrevGens is the total fitness improvement achieved by each strategy in the previous generations, and fitActual is the fitness achieved by each strategy considering numGensS):

fitIni = fitness(individual)
IF (first time) THEN
  initialise numGensS, fitPrevGens, fitActual, numGens
ELSE IF (numGens = NUMPREVGENS) THEN
  FOR i = 1 to NUMSTRATEGIES DO
    fitActual[i] = fitPrevGens[i] / numGensS[i]
  initialise numGensS, fitPrevGens, numGens
numGens = numGens + 1
strat = random(NUMSTRATEGIES)
FOR op = 1 to 3 DO
  op = random(NUMSTRATEGIES)
  IF (fitActual[op] > fitActual[strat]) THEN
    strat = op
SWITCH (strat)
  case 1: st = RandToBest1Exp()
  case 2: st = Rand1Exp()
  case 3: st = Best1Exp()
  case 4: st = Rand2Exp()
  case 5: st = Best2Exp()
run(st, individual)
fitPrevGens[strat] = fitPrevGens[strat] + fitIni - fitness(individual)
numGensS[strat] = numGensS[strat] + 1
Select the best individual – The performance of the child and its parent is compared and the best one is selected.
Stop Criterion – MHDE stops when a maximum number of generations is reached. Further information on DE can be found on the DE homepage [9].
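A compact rendering of the bookkeeping behind the “multiple” strategy is given below. It follows the pseudocode above but is not a line-for-line transcription; the five strategy callables are assumed to be implemented elsewhere, and fitness is minimised.

```python
import random

# Sketch of the adaptive "multiple" strategy selection (minimisation assumed).
class MultipleStrategy:
    def __init__(self, strategies, num_prev_gens):
        self.strategies = strategies              # e.g. the five exponential strategies
        self.num_prev_gens = num_prev_gens
        k = len(strategies)
        self.fit_actual = [0.0] * k               # average improvement per application
        self.fit_prev = [0.0] * k                 # accumulated improvement
        self.n_used = [0] * k                     # number of applications
        self.gens = 0

    def new_generation(self):
        """Refresh the strategy scores every num_prev_gens generations."""
        self.gens += 1
        if self.gens >= self.num_prev_gens:
            self.fit_actual = [f / n if n else 0.0
                               for f, n in zip(self.fit_prev, self.n_used)]
            self.fit_prev = [0.0] * len(self.strategies)
            self.n_used = [0] * len(self.strategies)
            self.gens = 0

    def apply(self, individual, fitness):
        fit_ini = fitness(individual)
        # tournament over three randomly chosen strategies
        strat = max(random.choices(range(len(self.strategies)), k=3),
                    key=lambda s: self.fit_actual[s])
        child = self.strategies[strat](individual)
        self.fit_prev[strat] += fit_ini - fitness(child)    # positive if improved
        self.n_used[strat] += 1
        return child
```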
3 Results
In order to test the performance of our approach, we use a collection of TA instances of different sizes. We take 9 problems from the literature [10]. The best results obtained with MHDE and HDE use crossover probability (CR) and factor (F) between [0.1-0.4] and [0.1-2], respectively. In our experiments we use different population sizes {25, 50, 100, 200}. The results show that the best values are 50 and 100. With these values the algorithm can reach, in a reasonable amount of time, a reasonable number of good solutions. Bernardino et al. [5] have shown with HDE that the best strategies applied to TA are the exponential strategies. For that reason, we only compare the “multiple” strategy with the exponential strategies. Depending on the problem, there are strategies with a similar average fitness (Fig. 1). Despite that, the “multiple” strategy can produce more optimal solutions for all instances (Fig. 1) and can also work better with smaller populations. We did not observe significant changes in execution time using different strategies. The GA was first applied to TA by Abuali et al. [7]. The GA is widely used in the literature to make comparisons with other algorithms [1][2][3]. The GA adopted uses the “one-point” method for recombination, the “change order” method for mutation and the tournament method for selection [5].
Fig. 1. Strategies – Number of optimal solutions/ Average fitness – Problem 7
and Bernardino et al. [10]. We compare our algorithm with the GA, TS and the HDE proposed by Bernardino et al. [10][5] because they use the same test instances. Table 1 presents the best results obtained with MHDE, GA, TS and HDE. The first column gives the number of the problem (Pr) and the remaining columns show the results obtained (BestF – best fitness, Ts – run time). The algorithms were executed on an Intel Core Duo T2300 processor. The run time corresponds to the average time that the algorithms need to obtain the best solution. Table 2 presents the average fitnesses (AvgF) and standard deviations (Std) obtained by HDE and MHDE. To compute the results in Table 2 we use 300 iterations/generations for instances 1–5, 1000 for instance 6, 1500 for instance 7 and 2000 for instances 8–9. The parameters of the DE algorithm are set to CR=0.3, F={0.9, 1.6} and strategy=Best1Exp. The HDE and MHDE were applied to populations of 200 individuals. The values presented have been computed over 50 different executions (the 50 best out of 100 executions) for each test instance.

Table 1. Results

Pr   GA               Tabu Search      HDE              MHDE
     BestF    Ts      BestF    Ts      BestF    Ts      BestF    Ts
1    65,63    <1s     65,63    <1s     65,63    <1s     65,63    <1s
2    134,65   <1s     134,65   <1s     134,65   <1s     134,65   <1s
3    284,07   <1s     270,26   <1s     270,26   <5s     270,26   <1s
4    286,89   <1s     286,89   <1s     286,89   <5s     286,89   <1s
5    335,09   <1s     335,09   <1s     335,09   <5s     335,09   <1s
6    371,48   1s      371,12   <1s     371,12   58s     371,12   <1s
7    401,45   2s      401,49   1s      401,21   118s    401,21   2s
8    563,75   4s      563,34   1s      563,19   274s    563,19   10s
9    703,78   5s      642,86   2s      642,83   456s    642,83   15s

Table 2. Statistics

Pr   HDE              MHDE
     AvgF     Std     AvgF     Std
1    65,63    0,00    65,63    0,00
2    134,65   0,00    134,65   0,00
3    270,35   0,06    270,26   0,00
4    286,97   0,09    286,89   0,00
5    335,42   0,16    335,09   0,00
6    371,60   0,17    371,20   0,06
7    401,58   0,12    401,45   0,10
8    564,03   0,21    563,37   0,07
9    646,65   0,61    643,17   0,14
With the improvement of the LS algorithm described in Section 2 we observe a significant decrease in execution time: the MHDE algorithm is significantly faster than the HDE algorithm. All four algorithms reach feasible solutions for all test instances. The HDE and MHDE algorithms obtain better solutions for the larger instances. The TS is the fastest algorithm, since it finds good solutions in less running time. In HDE and MHDE the crossover probability is applied to each gene, creating several perturbations per generation, which slows the algorithm down. Besides, in HDE and MHDE it is necessary to carry out a concentrator conversion, so that the concentrator obtained always stays inside the defined range. As can be seen, for the larger instances the standard deviations and average fitnesses of the MHDE are smaller. This means that the MHDE is more robust
than the HDE. All the statistics obtained show that the performance of the MHDE is superior to that of the TS, GA and HDE.
4 Conclusions

In this paper we present a new HDE algorithm employing a "multiple" strategy to solve the TA problem. The performance of the different strategies is compared. For the problem studied, the "multiple" strategy provides more optimal solutions. A great advantage of the proposed method is that, during its execution, it selects the strategies that are best adapted to solving the problem. The performance of our algorithm is also compared with the classical GA, the TS algorithm and the HDE algorithm. In comparison with GA, TS and HDE, the MHDE presents better results for larger problems. The MHDE is better than the HDE because it works better with smaller populations and, at the same time, it presents a better average fitness and can produce more optimal solutions in less execution time. Moreover, in terms of standard deviation, the MHDE algorithm also proved to be more stable and robust than the HDE.
References
1. Salcedo-Sanz, S., Yao, X.: A hybrid Hopfield network-genetic algorithm approach for the terminal assignment problem. IEEE Transactions on Systems, Man and Cybernetics, 2343–2353 (2004)
2. Yao, X., Wang, F., Padmanabhan, K., Salcedo-Sanz, S.: Hybrid evolutionary approaches to terminal assignment in communications networks. In: Recent Advances in Memetic Algorithms and related search technologies, vol. 166, pp. 129–159. Springer, Berlin (2005)
3. Khuri, S., Chiu, T.: Heuristic Algorithms for the Terminal Assignment Problem. In: Proc. of the ACM Symposium on Applied Computing, pp. 247–251. ACM Press, New York (1997)
4. Eiben, A.E., Smith, J.E.: Introduction to Evolutionary Computing. Springer, Berlin (2003)
5. Bernardino, E., Bernardino, A., Sánchez-Pérez, J., Vega-Rodríguez, M., Gómez-Pulido, J.: A Hybrid Differential Evolution Algorithm for solving the Terminal assignment problem. In: International Symposium on Distributed Computing and Artificial Intelligence, pp. 178–185. Springer, Heidelberg (2009)
6. Storn, R., Price, K.: Differential Evolution - a Simple and Efficient Adaptive Scheme for Global Optimization over Continuous Spaces. Technical Report TR-95-012, ICSI (1995)
7. Abuali, F., Schoenefeld, D., Wainwright, R.: Terminal assignment in a Communications Network Using Genetic Algorithms. In: Proc. of the 22nd Annual ACM Computer Science Conference, pp. 74–81. ACM Press, New York (1994)
8. Storn, R., Price, K.: Differential Evolution - a Simple and Efficient Heuristic for Global Optimization over Continuous Spaces. Journal of Global Optimization 11, 341–359 (1997)
9. Differential Evolution Homepage, http://www.icsi.berkeley.edu/~storn/code.html
10. Bernardino, E., Bernardino, A., Sánchez-Pérez, J., Vega-Rodríguez, M., Gómez-Pulido, J.: Tabu Search vs Hybrid Genetic Algorithm to solve the terminal assignment problem. In: IADIS International Conference Applied Computing, pp. 404–409. IADIS Press (2008)
11. Xu, Y., Salcedo-Sanz, S., Yao, X.: Non-standard cost terminal assignment problems using tabu search approach. In: IEEE Conference in Evolutionary Computation, vol. 2, pp. 2302–2306 (2004)
Event Detection and Classification in Video Surveillance Sequences Vasileios Chasanis and Aristidis Likas Department of Computer Science, University of Ioannina, 45110 Ioannina, Greece {vchasani,arly}@cs.uoi.gr
Abstract. In this paper, we present a system for event recognition and classification in video surveillance sequences. First, local invariant descriptors of video frames are employed to remove background information and segment the video into events. Next, visual word histograms are computed for each video event and used to define a distance measure between events. Finally, machine learning techniques are employed to classify events into predefined categories. Numerical experiments indicate that the proposed approach provides high event detection and classification rates. Keywords: Video surveillance, Event detection, Dynamic time warping.
1 Introduction
Video surveillance has received much attention over the last years and is a major research topic in computer vision [4]. Typically, the framework of a video surveillance system involves the following stages: background subtraction, environment modeling, object detection, classification and tracking of moving objects and descriptions of behaviors/events. The goal of video surveillance systems is to detect and characterize events as activities using unsupervised or supervised techniques. In [2], a method is presented that integrates audio and visual information for scene analysis in a typical surveillance scenario, using only one camera and one monaural microphone. In [8], a video behavior modeling method is proposed for online normal behavior recognition and anomaly detection. For each video segment, blobs are detected that correspond to scene events. These scene events are clustered into groups using a Gaussian mixture model, producing a behavior representation for the video segment. In our approach, local invariant descriptors are employed to remove background information. Then, by analyzing the number of foreground descriptors, we automatically segment the video surveillance sequence into segments/events, which describe some activity taking place in the room under surveillance. Each video segment/event is represented either by a single (summary) visual word histogram or by a multidimensional signal corresponding to the visual word histograms of its frames. In the second case, the Dynamic Time Warping
Fig. 1. Video frame a) of the background and the location of the extracted descriptors b) of an event with its descriptors c) of an event with unmatched descriptors
distance [7] is employed to define a proper event dissimilarity metric. Finally, supervised and unsupervised techniques are implemented either to classify or to cluster events into categories. The rest of the paper is organized as follows: In Section 2, the procedure of background subtraction is described. In Section 3, the proposed event detection algorithm is presented. In Section 4, we define an event dissimilarity metric and in Section 5 we present numerical experiments for video event classification and clustering into categories. Finally, in Section 6, we provide some conclusions.
2 Background Subtraction
For each frame of the video surveillance sequence, SIFT descriptors are extracted as proposed in [6]. In this work, we concentrate on different individual activities performed in an indoor environment, captured by using a standing camera. Thus, background remains the same and object/event detection relies on foreground detection modules. In order to remove descriptors that correspond to background objects, we compare the descriptors of each frame of the video surveillance sequence with a set of pre-computed descriptors corresponding to frames describing only the background using the comparison approach proposed in [6]. In Fig. 1(a), we present a video frame of the background and the location of the extracted descriptors. In Fig. 1(b) and Fig. 1(c), we present a video event frame with the corresponding SIFT descriptors and the descriptors that do not match with those of the background, respectively.
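As an illustration of this matching step, the following sketch (Python with OpenCV, used here purely for illustration and not by the paper) extracts SIFT descriptors from a frame and keeps those that fail to match the pre-computed background descriptors; the file names and the 0.8 ratio threshold are hypothetical assumptions.

    import cv2

    sift = cv2.SIFT_create()
    bg = cv2.imread("background.jpg", cv2.IMREAD_GRAYSCALE)        # hypothetical file names
    frame = cv2.imread("event_frame.jpg", cv2.IMREAD_GRAYSCALE)

    _, bg_desc = sift.detectAndCompute(bg, None)
    keypoints, desc = sift.detectAndCompute(frame, None)

    # Match every frame descriptor against the background descriptors; a descriptor is
    # considered "unmatched" (foreground) when it fails Lowe's ratio test against them.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(desc, bg_desc, k=2)
    foreground = [keypoints[i] for i, pair in enumerate(matches)
                  if len(pair) == 2 and pair[0].distance >= 0.8 * pair[1].distance]
    print(len(foreground), "unmatched (foreground) descriptors in this frame")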
3 Video Segmentation into Events
After we have subtracted the descriptors corresponding to background, we wish to identify unique events in the video sequence. In our surveillance problem a video event is defined as the time interval where a person performs an activity. Thus, it is expected that when someone enters the room under surveillance, new descriptors will appear that do not correspond to background. In our method we analyze a vector that corresponds to the number of “unmatched” descriptors between each frame and the background. In Fig. 2(a), we present the sequence of “unmatched” descriptors of a video surveillance sequence.
Fig. 2. a) Normal and b) smoothed signal of the number of unmatched descriptors of a video surveillance sequence
In order to detect the beginning and the end of a video event, this vector H is smoothed by:

L_t = \sum_{n=-\infty}^{+\infty} H_n \cdot K_\sigma(t - n),   (1)
where K_σ is a normalized discretized Gaussian kernel with zero mean and standard deviation σ. Furthermore, we discard low values of the smoothed signal to remove noise (background descriptors that have not been removed). In Fig. 2(b) we present the final smoothed signal for the sequence of Fig. 2(a).
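A minimal sketch of this smoothing and thresholding step, assuming the per-frame counts of unmatched descriptors are already available; the smoothing width sigma and the noise threshold below are assumed values, not taken from the paper.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    # H[t] = number of "unmatched" descriptors of frame t (Section 2); placeholder input.
    H = np.loadtxt("unmatched_counts.txt")

    sigma, noise_level = 15, 5.0                   # assumed parameter values
    L = gaussian_filter1d(H.astype(float), sigma)  # discretised Gaussian smoothing, Eq. (1)
    L[L < noise_level] = 0.0                       # discard low values (residual background)

    # Every maximal run of non-zero smoothed values is taken as one event
    # (wrap-around at the sequence borders is ignored in this sketch).
    active = L > 0
    starts = np.where(active & ~np.roll(active, 1))[0]
    ends = np.where(active & ~np.roll(active, -1))[0]
    events = list(zip(starts, ends))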
3.1 Event Representation
After we have segmented the video into N events, we represent each video frame of the event or the whole event with a visual word histogram. More specifically, for each video event E_i, i = 1, ..., N a different number of descriptors is computed that describe certain objects or interest points in the event. Suppose we are given a video event E_i and its corresponding set of n video frames F = {f_1, ..., f_n}. For each video frame f_j, j = 1, ..., n, a set of SIFT descriptors D_{f_j} is extracted using the algorithm presented in [6]. Then, all the sets of descriptors are concatenated to describe the whole event:

D_{E_i} = D_{f_1} \cup \ldots \cup D_{f_n}.   (2)

To extract visual words from the descriptors, the set of descriptors of all N video events, D_V = D_{E_1} \cup D_{E_2} \cup \ldots \cup D_{E_N}, is clustered into K groups {C_1, C_2, ..., C_K} using the k-means algorithm, where K denotes the total visual words vocabulary size. To construct the visual word histogram (bag of visual words) for video frame f_t, each element of the set of descriptors D_{f_t} is assigned to one of the K visual words (clusters), thus resulting in a vector containing the frequency of each visual word in the video frame. Thus, given that frame f_t has D descriptors d_{t1}, ..., d_{tD}, the visual word histogram VHF_t for this video frame is defined as:

VHF_t(l) = #{d_{tj} \in C_l, j = 1, \ldots, D} / |D|,   l = 1, \ldots, K.   (3)
Similarly, a visual word histogram V HE i of an event i is constructed by assigning each descriptor of set DEi to one of the K visual words (clusters).
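The bag-of-visual-words construction of Eqs. (2)–(3) can be sketched as follows; the variable all_frame_desc is a hypothetical list of per-frame SIFT descriptor arrays gathered over all detected events, event_frames is a hypothetical list of frame indices of one event, and scikit-learn's k-means is used here purely for illustration.

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    K = 100                                          # vocabulary size (one of the values tested)
    vocab = MiniBatchKMeans(n_clusters=K, random_state=0).fit(np.vstack(all_frame_desc))

    def visual_word_histogram(desc):
        """Normalised visual word histogram of a descriptor set, as in Eq. (3)."""
        words = vocab.predict(desc)
        return np.bincount(words, minlength=K).astype(float) / max(len(desc), 1)

    frame_histograms = [visual_word_histogram(d) for d in all_frame_desc]   # VHF_t per frame
    event_histogram = visual_word_histogram(                                # summary VHE_i
        np.vstack([all_frame_desc[j] for j in event_frames]))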
4 Event Dissimilarity
In order to proceed with video event classification an event dissimilarity metric must be defined. We consider two approaches. In the first one, to compute a distance value between two events E_i and E_l we compare their corresponding visual word histograms VHE_i and VHE_l. In the second approach, we compare the visual word histograms VHF of their frames. More specifically, suppose that we are given events E_i = {f_1^i, ..., f_{n_i}^i} and E_l = {f_1^l, ..., f_{n_l}^l}. Since in general n_i ≠ n_l, we have to define a proper dissimilarity metric to compare these two events. In our approach, we use the Dynamic Time Warping (DTW) distance, which allows two events with different numbers of frames to be compared. Dynamic Time Warping (DTW) is a well-known technique for finding an optimal alignment between two given time-dependent sequences [7].
4.1 Event Dissimilarity Metric
Each frame f_j^i, j = 1, ..., n_i of event E_i is represented with a visual word histogram VHF_j^i as defined in equation (3). Thus, event E_i is represented by a K-dimensional signal of length n_i:

V_{E_i} = \begin{pmatrix} VHF_1^i(k=1) & \ldots & VHF_{n_i}^i(k=1) \\ \vdots & \ldots & \vdots \\ VHF_1^i(k=K) & \ldots & VHF_{n_i}^i(k=K) \end{pmatrix},   (4)
where K is the vocabulary size employed to create the visual word histograms in Section 3.1. Each row k of matrix V_{E_i} represents the frequency of "visual word" k in the time interval of the event. In order to compute the distance between two video segments/events E_i and E_l we compute the average DTW distance of their K-dimensional signals. More specifically,

D(E_i, E_l) = \frac{1}{K} \sum_{k=1}^{K} DTW(VHF^i(k), VHF^l(k)),   (5)
where VHF^i(k), VHF^l(k) are the k-th rows of the K-dimensional signals V_{E_i}, V_{E_l} representing segments/events E_i and E_l, respectively.
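A straightforward sketch of the distance of Eq. (5), using a textbook dynamic-programming DTW; the arrays VE_i and VE_l stand for the K x n_i and K x n_l signals defined in Eq. (4).

    import numpy as np

    def dtw(a, b):
        """Classic DTW distance between two 1-D sequences (cf. [7])."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    def event_distance(VE_i, VE_l):
        """Average per-visual-word DTW distance of Eq. (5)."""
        K = VE_i.shape[0]
        return sum(dtw(VE_i[k], VE_l[k]) for k in range(K)) / K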
5 Experimental Results

5.1 Video Surveillance Sequence
The video sequence we used comprises more than 25,000 frames and contains different individual and non-overlapping activities performed in an indoor environment, captured by a standing camera. In this video sequence, 20 activities/events are performed that are divided into five categories, as presented in Fig. 3. The result of the automatic segmentation was optimal: no over-segmentation or under-segmentation occurred and all 20 events were detected as unique.

Fig. 3. Sample frames of the background and the five categories of events

Table 1. Classification and Clustering results for the first video sequence

K     1-NN          3-NN          5-NN          SVM           Hierarchical Clustering
      DTW    EV     DTW    EV     DTW    EV     DTW    EV     DTW    EV
10    80%    85%    80%    85%    65%    65%    75%    65%    80%    45%
20    90%    90%    95%    90%    90%    80%    95%    95%    95%    90%
50    95%    95%    95%    95%    95%    90%    100%   95%    100%   90%
100   95%    90%    100%   100%   10%    95%    100%   95%    100%   100%
200   95%    90%    100%   100%   100%   100%   100%   100%   100%   100%
5.2 Classification Results
To classify the 20 events into 5 categories we carried out two experiments. In the first one, we used the nearest neighbor classifier [3] and in the second one we used Support Vector Machines [1]. We implemented the nearest neighbor classifier with 1, 3, and 5 nearest neighbors for both dissimilarity measures defined in Section 4. Comparison between the visual word histograms of events is referred to as EV and comparison between the visual word histograms of the frames of the events is referred to as DTW. In Table 1 we present the numerical results of the experiments for different numbers of visual words K. The classification accuracy was estimated using the leave-one-out (LOO) approach [3]. In the second experiment, Support Vector Machine (SVM) classifiers [1] were employed, again using the leave-one-out (LOO) scheme. In our approach, we employed the typical radial basis function (RBF) kernel and the parameters C, γ were selected through cross-validation. In Table 1, we present the numerical results for the two compared approaches of Section 4 and for different numbers of visual words K. It can be observed that the DTW distance gives results slightly superior to those obtained by the other dissimilarity metric.
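For the EV representation, the SVM experiment with leave-one-out validation can be sketched with scikit-learn as follows; X and y are placeholders for the 20 event histograms and their five category labels, and the C and gamma values shown are assumptions (the paper selects them by cross-validation).

    from sklearn.svm import SVC
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    # X: 20 x K matrix of event visual word histograms; y: category labels (placeholders).
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")
    scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
    print("LOO classification accuracy: %.2f" % scores.mean())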
5.3 Clustering Results
We have also employed an unsupervised method for grouping the video events into categories. More specifically, we performed agglomerative hierarchical
clustering [5], setting the number of clusters to five and using the Ward criterion to select the clusters to be merged at each iteration. In Table 1 we present the clustering accuracy for the two approaches of Section 4 using different numbers of visual words K. It can be observed that the DTW distance provides better results for a small number of visual words.
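For the EV representation this clustering step reduces to a few lines with scikit-learn; X is again the placeholder matrix of event histograms (the DTW variant would instead require a linkage computed over the precomputed distance matrix).

    from sklearn.cluster import AgglomerativeClustering

    labels = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X)
    for c in range(5):
        print("cluster", c, ":", [i for i, l in enumerate(labels) if l == c])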
6 Conclusions
In this paper, we have presented a method for video event detection and classification in video surveillance sequences. For each video frame, local invariant descriptors were computed and compared to a pre-computed set of descriptors from the background frames of the surveillance room. In this way, a number of "unmatched" descriptors was identified that describe foreground objects. By analyzing the number of "unmatched" descriptors, the video sequence was segmented into segments/events. Each video event was represented either by a single (summary) visual word histogram or by a K-dimensional signal corresponding to the visual word histograms of its frames. Thus, two different approaches were followed in order to compare video events. Unsupervised and supervised learning methods were employed to cluster and classify the events into certain categories. Numerical results presented in this paper indicate that our approach achieves high detection, classification and clustering rates.
References
1. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995)
2. Cristani, M., Bicego, M., Murino, V.: Audio-visual event recognition in surveillance video sequences. IEEE Transactions on Multimedia 9(2), 257–267 (2007)
3. Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. Wiley Interscience, Hoboken (2000)
4. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man and Cybernetics 34, 334–352 (2004)
5. Jain, A., Dubes, R.: Algorithms for clustering data. Prentice-Hall, Inc., Upper Saddle River (1988)
6. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
7. Sakoe, H.: Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26, 43–49 (1978)
8. Xiang, T., Gong, S.: Video behavior profiling for anomaly detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(5), 893–908 (2008)
The Support of e-Learning Platform Management by the Extraction of Activity Features and Clustering Based Observation of Users Dorota Dzega1 and Wieslaw Pietruszkiewicz2 1
West Pomeranian Business School, Faculty of Economics and Computer Science, ul. Zolnierska 53, 71-210 Szczecin, Poland
[email protected]
2 SDART Ltd, One Central Park, Northampton Road, Manchester M40 5WW, United Kingdom
[email protected]
Abstract. We present an application of data mining in e-learning, where web platform management was supported by the extraction of users' activity features and further by the clusterisation of users' profiles. By this approach we have identified groups of users with similar activity on the e-learning platform and were able to observe their performance. The experiments presented in this paper were performed on real data coming from a Moodle platform. In contrast to other research in this field, which focuses on the analysis of students, we investigated teachers' behaviour. We have proposed a smoothing model in the form of a dynamic system, which was used to transform the logged events into time series of activities. These series were later used to cluster teachers' performance and to divide them into three groups: active, moderate and passive users. The main aim of our research was to propose and test a data-mining-based approach to supporting e-learning management by the observation of teachers, leading to an increase in process quality. Keywords: Clustering, dynamic systems, smoothing, k-Medoids, Kernel k-Means.
1 Introduction
The management of e-learning involves user modelling and control. Various studies relate to user modelling, e.g. [1] showed how decision trees can be used by a teacher to find relationships between students' marks and their activity. Neural networks were used to predict students' marks in [2], while a comparison of different classifiers used as predictors of students' grades was the subject of the research described in [3]. In other papers, data mining procedures were applied, e.g., to analyse the usage of e-learning courses and support
their improvement [4], to evaluate the quality of e-learning [5], or in case-based reasoning deployed in distance learning [6]. Other techniques, such as clustering, were used to analyse students' patterns of usage [7]. A similar clustering-based analysis of virtual steps was presented in [8], while the identification of incorrect students' behaviour was presented in [9]. An automated mechanism recommending relevant materials for students was proposed in [10] and an analysis of students' learning sequences was introduced in [11]. A more detailed survey of educational data mining can be found in [12]. Analysing previous research in this field, we must point out that it has been limited to students, neglecting teachers. The application of data mining in e-learning is presented in Figure 1. The process of knowledge extraction from data relating to the teachers can be divided into two major blocks having five steps in total (see Fig. 1). These steps will be explained in the following parts of the paper. It must also be noted that the research presented herein focuses on the teachers, but a similar analysis could be conducted for the students.

Fig. 1. The general schema of the management supporting system for e-learning
2 Methodology
The data source for our research was a Moodle platform, used by 43 teachers to teach 105 courses. The data spanned approximately one year. We used this platform to generate the reports and to prepare preliminary files. These were later converted into a well-structured array of attributes representing users' activity in six areas, i.e. All actions, Add, View, Delete, Update and All changes.
Fig. 2. The smoothed time series for the logged events
As there are no indisputable definitions of a teacher's activity in the e-learning process, we proposed two appropriate criteria:
1. The variety and number of actions logged for a teacher stimulate the value of the teacher's activity.
2. Equally spread events are more desired than those grouped only in short sessions.
The reports generated by Moodle (being typical web logs) contain only information about the time of events, not about users' activity occurring between these events. We rejected the naïve approach to this problem (the calculation of an average number of events), as it would not distinguish users whose events are spread equally from users whose events are cumulated in a short time span. To create time series of activity from a set of impulses we have chosen a smoothing form of process equation, written in the form of a smoothing dynamic system. An example of smoothing is graphically presented in Figure 2. Each event (denoted as A, B, C, D or E) stimulated activity at the moment it took place, and the activity value fades out in the following steps. The modelled system was time-discrete, hence a time window was deployed to check if any event occurred during the analysed period. To introduce the smoothing equation, let us assume that X is an activity vector, U is an event vector, and we set a limit on the state variables x \in [0, 1]. Correspondingly, the process equation is:

X_{n+1} = \min( P \cdot X_n + I \cdot U_n, \mathbf{1} ),   (1)

where P = diag(p, ..., p) is a 6x6 fading matrix, I is the 6x6 identity matrix, \mathbf{1} is a vector of ones, and the six components of X_n and U_n correspond to the six activity areas introduced above (All actions, Add, View, Delete, Update and All changes).
At each step the value of each u component was set to 1 if any event occurred for the corresponding activity component; otherwise it was set to 0. As a result, in time frames where an event was logged the activity was set to 1, and later, due to the diagonal matrix P, this value fades towards zero. The value p is the fading ratio. The length of the time frame was set to 15 minutes. Later the series of activity components were averaged, so the results satisfy both criteria introduced earlier in this section.

Fig. 3. The idea of activity clustering

It is important to remember that the courses available on Moodle contained all materials, in SCORM form, necessary for students to learn the course subject. Tasks, tests, games, glossaries, lessons, workshops, quizzes and other activities were additional; thus some teachers were more active, engaging students in different tasks, while others were more passive, i.e. responding to students' problems, moderating forums and chatting with students. Thus, we do not consider passive users to be a negative group, but rather a group that must be observed more carefully and eventually supported technically.
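The smoothing recursion of Eq. (1) applied to raw Moodle timestamps can be sketched as follows; the fading ratio p is an assumed value (the paper does not report it) and the event dictionary is a placeholder.

    import numpy as np

    AREAS = ["All actions", "Add", "View", "Delete", "Update", "All changes"]
    FRAME = 15 * 60          # 15-minute time window, in seconds
    p = 0.9                  # fading ratio (assumed value)

    def activity_series(events, t_start, t_end):
        """events: dict area -> list of event timestamps (seconds).
        Returns the smoothed series x_{n+1} = min(p * x_n + u_n, 1) per area."""
        n_frames = int((t_end - t_start) // FRAME) + 1
        x = np.zeros(len(AREAS))
        series = np.zeros((n_frames, len(AREAS)))
        for n in range(n_frames):
            lo, hi = t_start + n * FRAME, t_start + (n + 1) * FRAME
            u = np.array([any(lo <= t < hi for t in events.get(a, [])) for a in AREAS], float)
            x = np.minimum(p * x + u, 1.0)
            series[n] = x
        return series

    # A teacher's activity profile is then the column-wise average of this series:
    # profile = activity_series(teacher_events, t0, t1).mean(axis=0)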
3 Results of Clustering
In the next step of experiments we clustered the samples describing teachers' average activity in each dimension, i.e. All actions, Add, View, Delete, Update and All changes. The idea of the clusterisation was presented in Figure 3. The clusters were generated using two algorithms, i.e. k-Medoids [13] and Kernel k-Means [14]. We set the number of clusters to 8 and later joined them into three groups with different kinds of activity, i.e. 'Active', 'Moderate' and 'Passive' users. The distributions for 'Add' vs. 'View' and 'Add' vs. 'Delete' for both algorithms are presented in Figure 4. Comparing the two algorithms, we selected k-Medoids, as it was more coherent with the observation of users made by the Moodle administrative staff, and the clusters generated by this algorithm were more easily separable. As an outcome, we were able to check the teachers' behaviour in an automated way, compared to a more subjective and time-consuming human analysis based on talking with the platform users or reading the information posted on the platform.
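The k-Medoids step can be illustrated with a plain PAM-style alternation over a precomputed distance matrix; the Euclidean distance between the six-dimensional activity profiles is an assumption of this sketch, and the subsequent merging of the eight clusters into the three activity groups is done manually, as in the paper.

    import numpy as np

    def k_medoids(D, k, n_iter=100, seed=0):
        """Simple k-medoids clustering on a precomputed distance matrix D."""
        rng = np.random.default_rng(seed)
        medoids = rng.choice(len(D), size=k, replace=False)
        for _ in range(n_iter):
            labels = np.argmin(D[:, medoids], axis=1)
            new_medoids = medoids.copy()
            for c in range(k):
                members = np.where(labels == c)[0]
                if len(members):
                    new_medoids[c] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
            if np.array_equal(new_medoids, medoids):
                break
            medoids = new_medoids
        return labels, medoids

    # profiles: teachers x 6 matrix of averaged activity values (placeholder).
    # D = np.linalg.norm(profiles[:, None, :] - profiles[None, :, :], axis=-1)
    # labels, medoids = k_medoids(D, k=8)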
Fig. 4. Clusters density in ’Add’ vs. ’View’ dimensions for Kernel k-Means (a) and k-Medoids (b) algorithms or in ’Add’ vs. ’Delete’ dimensions for Kernel k-Means (c) and k-Medoids (d) algorithms
4 Summary
In this paper we presented how hidden features may be extracted from Moodle and used to support the management of e-learning. The results of clustering described herein were achieved using attributes pre-processed by a dynamic system smoothing the data coming from Moodle reports. We have proposed activity measures for six areas, which were later used by the k-Medoids and Kernel k-Means algorithms to group users with similar activity profiles. While many researchers focus on modelling students' behaviour, we presented an analysis of teachers. We also think that future extensions of e-learning platforms must support management staff in observing users' behaviour. This process should be done by procedures that observe users' activity, measure it and supply easily accessible and understandable information to e-learning managers. We see the proposed approach as the basis of a supporting mechanism, supplying information that enables the early identification of teachers' problems and transforms e-learning management into a more effective and pro-active process. Our future plans involve the separation of activity components into more detailed subsets. As a result it will be possible to measure more aspects of users' activity, e.g. events can be separated according to their subject, i.e. events occurring for posts, quizzes, chats or notes. Another potential area of further research
is a linkage between the teachers' activity and the students' activity and their grades. This will join previous research done for the students with the research proposed in this paper. Finally, we plan to develop the presented mechanism as a Moodle plug-in supporting effective management of the e-learning platform.
References
1. Ventura, S., Romero, C., Hervás, C.: Analyzing rule evaluation measures with educational datasets: A framework to help the teacher. In: Proceedings of Educational Data Mining 2008: 1st International Conference on Educational Data Mining (2008)
2. Delgado Calvo-Flores, M., Gibaja Galindo, E., Pegalajar Jiménez, M.C., Pérez Piñeiro, O.: Predicting students' marks from Moodle logs using neural network models. FORMATEX, Badajoz (2006)
3. Romero, C., Ventura, S., Espejo, P.G., Hervás, C.: Data mining algorithms to classify students. In: Proceedings of Educational Data Mining 2008: 1st International Conference on Educational Data Mining (2008)
4. Blondet Baruque, C., Amaral, M.A., Barcellos, A., João Carlos da Silva Freitas, J.C., Juliano Longo, C.J.: Analysing users' access logs in Moodle to improve e-learning. In: Proceedings of the 2007 Euro American Conference on Telematics and Information Systems (2007)
5. Balogh, I.: Use of data mining tools in examining and developing the quality of e-learning. In: Proceedings of LOGOS Open Conference on Strengthening the Integration of ICT Research Effort (2009)
6. Shen, R., Han, P., Yang, F., Yang, Q., Huang, J.: Data mining and case-based reasoning for distance learning. Journal of Distance Education Technologies 3, 46–58 (2003)
7. Tang, T.Y., McCalla, G.: Student modeling for a web-based learning environment: a data mining approach. In: Eighteenth National Conference on Artificial Intelligence, pp. 967–968. American Association for Artificial Intelligence, Menlo Park (2002)
8. Mor, E., Minguillón, J.: E-learning personalization based on itineraries and long-term navigational behavior. In: WWW Alt. 2004: Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, pp. 264–265. ACM, New York (2004)
9. Yu, P., Own, C., Lin, L.: On learning behavior analysis of web based interactive environment. In: Proceedings of ICCEE (2001)
10. Markellou, P., Mousourouli, I., Spiros, S., Tsakalidis, A.: Using semantic web mining technologies for personalized e-learning experiences. In: Proceedings of the Web-Based Education (2005)
11. Pahl, C., Donnellan, C.: Data mining technology for the evaluation of web-based teaching and learning systems. In: Proceedings of the Congress E-learning (2003)
12. Romero, C., Ventura, S.: Educational data mining: A survey from 1995 to 2005. Expert Systems with Applications 33(1), 135–146 (2007)
13. Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T.: YALE: Rapid prototyping for complex data mining tasks. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD-06 (2006)
14. Camastra, F., Verri, A.: A novel kernel method for clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 5(25) (2005)
Mapping Cultural Metadata Schemas to CIDOC Conceptual Reference Model Manolis Gergatsoulis1 , Lina Bountouri1 , Panorea Gaitanou1 , and Christos Papatheodorou1,2 1
Database & Information Systems Group (DBIS), Laboratory on Digital Libraries and Electronic Publishing, Department of Archive and Library Sciences, Ionian University, Corfu, Greece 2 Institute for the Management of Information Systems (IMIS), Athena R.C., Athens, Greece {manolis,boudouri,rgaitanou,papatheodor}@ionio.gr
Abstract. Managing heterogeneous data is a challenge for cultural heritage institutions as they host and develop various collections with heterogeneous types of material, described by different metadata schemas. In this paper, we propose an expressive mapping language for the mapping of XML-based metadata schemas to CIDOC CRM. We also show how these mappings can be used to transform metadata to CIDOC CRM data by considering the mapping of VRA Core 4.0 to CIDOC CRM. Keywords: Metadata interoperability, ontologies, mapping languages.
1 Introduction
Managing heterogeneous data is a challenge for cultural heritage institutions, such as archives, libraries, and museums. These institutions host and develop various collections with heterogeneous material, often described by different metadata schemas. Handling these metadata as a unified set is vital in several applications, including information retrieval and (meta)data exchange. To achieve this, interoperability mechanisms [2] should be applied. One of the most widely implemented methods in the interoperability field is Ontology-based Integration [5]. Ontologies [8] have a crucial role in interoperability scenarios, given that they can conceptualize a domain and express its rich semantics and relations in a formal manner. One of their main roles in such scenarios is the promotion of semantic integration, acting as a mediated schema between heterogeneous information systems [1,4]. It is worth noting that various ontologies have lately been created in order to define specific domains, such as CIDOC CRM [3] for the cultural heritage domain. In this paper, we investigate the problem of mapping XML-based metadata schemas to the CIDOC CRM ontology. We propose a mapping language and show how it is used to specify the semantic mappings of XML-based metadata schemas to CIDOC CRM. We also show how these mappings can be used to transform metadata to CIDOC CRM data. The proposed method is applied to map the VRA Core 4.0 schema to the ontology.
2 Preliminaries
2.1 The CIDOC CRM
The CIDOC Conceptual Reference Model (CIDOC CRM) [3] is a formal ontology consisting of a hierarchy of 86 classes and 137 properties. A class (also called an entity) groups items (called class instances) that share one or more common characteristics. A class may be the domain or the range of properties which are binary relations between classes. An instance of a property is a relation between an instance of its domain and an instance of its range. A subclass is a class which specializes another class (its superclass). A class may have one or more immediate superclasses. When a class A is a subclass of a class B then all the instances of A are also instances of B. A subclass inherits the properties declared on its superclasses without exception (strict inheritance) in addition to having none, one or more properties of its own. A sample of CIDOC CRM properties is shown in Table 1.

Table 1. A sample of CIDOC CRM properties

Property id & Name                              Entity - Domain               Entity - Range
P2 has type (is type of)                        E1 CRM Entity                 E55 Type
P3 has note                                     E1 CRM Entity                 E62 String
P14 carried out by (performed)                  E7 Activity                   E39 Actor
P45 consists of (is incorporated in)            E18 Physical Thing            E57 Material
P58 has section definition (defines section)    E18 Physical Thing            E46 Section Definition
P72 has language (is language of)               E33 Linguistic Object         E56 Language
P94 has created (was created by)                E65 Creation                  E28 Conceptual Object
P102 has title (is title of)                    E71 Man-Made Thing            E35 Title
P108 has produced (was produced by)             E12 Production                E24 Physical Man-Made Thing
P128 carries (is carried by)                    E24 Physical Man-Made Thing   E73 Information Object
P131 is identified by (identifies)              E39 Actor                     E82 Actor Appellation
P138 represents (has representation)            E36 Visual Item               E1 CRM Entity
A subproperty is a property that specializes another property (its superproperty). When a property P is a subproperty of a property Q then (1) all instances of P are also instances of Q, (2) the domain of P is either the same or a subclass of the domain of Q, and (3) the range of P is either the same or a subclass of the range of Q. A property can be interpreted in both directions (active and passive voice), with two distinct but related interpretations. Some properties are associated with an additional property (called property of property) whose domain is the property instances and whose range is the class E55 Type. These properties of properties are used to specialize the meaning of their parent properties.
2.2 VRA Core 4.0: A Brief Introduction
VRA Core 4.0 is a metadata schema for the cultural heritage community developed by the Visual Resources Association’s Data Standards Committee1 . VRA Core 4.0 aims to encode descriptive metadata for the documentation of visual objects and consists of a hierarchy of elements, wrapped by the three top class elements work, collection and image. These top level elements may 1
URL: http://www.vraweb.org/projects/vracore4/
contain as subelements the following elements: agent, culturalContext, date, description, inscription, location, material, measurements, relation, rights, source, stateEdition, stylePeriod, subject, technique, textref, title, and worktype. Finally, it is worth noting that different VRA records may be related to each other through the relation subelement. The supported types of relations are: work to work, image to work and image or work to collection.
Example 1. The following VRA document is a simplified version of a VRA document taken from http://www.vraweb.org/projects/vracore4/

    <work source="History of Art Visual Resources Collection, UCB">
      <agentSet>
        <agent>
          <name type="personal">Cropsey, Jasper Francis</name>
          <role>painter</role>
        </agent>
      </agentSet>
      <inscriptionSet>
        <inscription>
          <author>Cropsey, Jasper Francis</author>
          <position>lower center</position>
          <text>Autumn-on the Hudson River/J. F Cropsey/London 1860</text>
        </inscription>
      </inscriptionSet>
      <materialSet>
        <display>oil paint on canvas</display>
        <material type="medium" vocab="AAT" refid="300015050">oil paint</material>
        <material type="support" vocab="AAT" refid="300014078">canvas</material>
      </materialSet>
      <titleSet>
        <title type="inscribed">Autumn - On the Hudson River</title>
      </titleSet>
    </work>
3 The Mapping Description Language (MDL)
The proposed mapping method between the metadata schemas and CIDOC CRM is based on a path-oriented approach. A mapping from a source schema to a target schema transforms each instance of the source schema into a valid instance of the target schema. Hence, we interpret the metadata paths as semantically equivalent CIDOC CRM paths. As we are interested in XML-based metadata schemas, the paths in the source schemas are based on an extension of XPath [9] that uses location paths along with variables and stars. A mapping rule consists of two parts representing paths, the left one in the XML document and the right one in the CIDOC CRM. Variables are used in both parts to declare and refer to branching points, while stars are used to declare the transfer of value from the XML element/attribute to the corresponding class instance. The syntax of the MDL mapping rules is given below in EBNF:

R ::= Left ‘−−’ Right
Left ::= APath | VPath
APath ::= | ‘/’ RPath
RPath ::= L | L ‘*’ | L ‘{’ Vl ‘}’ | L ‘*’ ‘{’ Vl ‘}’
VPath ::= ‘$’ Vl ‘/’ RPath | ‘$’ Vl ‘{’ Vl ‘}’
Right ::= Ee | Ee ‘→’ O | ‘$’ Vc ‘→’ O | ‘$’ Vp ‘→’ ‘E55’
O ::= Pe ‘→’ Ee
O ::= O ‘→’ O
Ee ::= E | E ‘{’ Vc ‘}’ | E ‘{=’ String ‘}’
Pe ::= P | P ‘{’ Vp ‘}’
where L represents the relative location paths (of XPath), E (resp. P) represents class (resp. property) ids of CIDOC CRM, Vl location variables, Vp property variables and Vc class variables. Notice that in the CIDOC CRM paths we use only class/property ids instead of their full names. In Table 2 a fragment of the VRA Core 4.0 to CIDOC CRM mapping is presented. Although the rules of Table 2 define the mapping of the metadata elements/attributes of Example 1, quite similar rules apply for the same subelements of the element image or collection. The difference is in rule R1, where in the case of image (resp. collection) the corresponding CIDOC CRM class is E38 Image (resp. E78 Collection) instead of E24 Physical Man-Made Thing.

Table 2. Mapping rules: Mapping VRA Core 4.0 to CIDOC CRM

R1:  /vra/work{X1}                            –  E24{C1}
R2:  $X1/titleSet/title*{Q1}                  –  $C1→P102{S1}→E35
R3:  $Q1/@type*                               –  $S1→P102.1→E55
R4:  $X1/agentSet{W1}                         –  $C1→P108B→E12{J1}
R5:  $W1/agent[name/@type="personal"]{W5}     –  $J1→P14{S2}→E21{J5}
R6:  $W5/name*                                –  $J5→P131→E82
R7:  $W5/role*                                –  $S2→P14.1→E55
R8:  $W1/agent[name/@type="corporate"]{W10}   –  $J1→P14{S3}→E40{J10}
R9:  $W10/name*                               –  $J10→P131→E82
R10: $W10/role*                               –  $S3→P14.1→E55
R11: $W1/agent[name/@type="family"]{W15}      –  $J1→P14{S4}→E74{J15}
R12: $W15/name*{W17}                          –  $J15→P131→E82{J17}
R13: $W17                                     –  $J17→P2→E55{"family"}
R14: $W15/role*                               –  $S4→P14.1→E55
R15: $X1/materialSet/material*{Z1}            –  $C1→P45→E57{A1}
R16: $Z1/@type*                               –  $A1→P2→E55
R17: $X1/inscriptionSet/inscription{Y1}       –  $C1→P128→E37{D1}
R18: $Y1/@type*                               –  $D1→P2→E55
R19: $Y1/author*                              –  $D1→P94B→E65→P14→E39→P131→E82
R20: $Y1/text*{Y3}                            –  $D1→P138→E33{D3}
R21: $Y3/@type*                               –  $D3→P2→E55
R22: $Y3/@xml:lang*                           –  $D3→P72→E56
R23: $Y1/position*                            –  $D1→P58→E46

4 Mapping VRA Core 4.0 Metadata to CIDOC CRM
This section presents how the mapping rules of Table 2 are applied to transform the VRA document of Example 1 to CIDOC CRM. The CIDOC CRM data generated by this transformation are depicted in Figure 1, where each box represents a CIDOC CRM class instance and is divided into two parts. The upper part indicates the VRA path mapped to the CIDOC CRM class instance included in the lower part of the box. Each instance has an id, denoted by an “o” followed by an integer number. When an instance has a specific value, this appears as an assignment in the lower box. The boxes are linked through arrows representing CIDOC CRM properties. In case a property is used in the inverse direction, its id is followed by the letter “B” (e.g. P108B was produced by).
Fig. 1. Transforming VRA Core 4.0 metadata to CIDOC CRM data
Rule R1 denotes that an element work in a VRA document maps to the class E24 Physical Man-Made Thing. This means that a work is an instance of E24 Physical Man-Made Thing. Thus, R1 creates an instance o1 of E24 Physical Man-Made Thing. Besides, the path /vra/work in the XML tree is marked as X1 and the object o1 in the CIDOC CRM is marked as C1. Observe that the rules being applicable at this point are R2, R4, R15, and R17 and they provide instructions for the mapping of the titleSet, agentSet, materialSet, and inscriptionSet subelements of the element work, respectively. Rule R2 locates the title subelement of the element titleSet. An instance o12 of the class E35 Title is created in the ontology. The value “Autumn - On the Hudson River” of the element title is assigned to the instance o12 (due to the ‘*’ after the title element, meaning data transfer ). R2 also creates an instance of the property P102 has title, which relates o1 with o12. R3 states that the relation P102 has title has a specific type, which is an instance of the class E55 Type getting its value from the attribute type of the title element. The mapping of the agent and its subelements is performed using the rules R4 − R7 (since the value of the attribute type of the agent’s name subelement is “personal”). R4 creates an instance o14 of the class E12 Production (representing the production event responsible for the specific work) and an instance of the property P108B was produced by relating o1 with o14. Similarly, R5 creates an instance o15 of the class E21 Person and an instance of the property P14 carried out by, which relates o14 with o15. Then, R6 creates an instance o16 of E82 Actor Appellation and assigns to it the value “Cropsey, Jasper
Francis” taken from the name subelement of the agent. Finally, R7 creates an instance o17 of E55 Type, assigns the value “painter” to this instance (taken from the subelement role) and creates an instance of the property of property P14.1 has type, which relates the instance of the property P 14 with o17. Figure 1 can be encoded in a PROLOG unit clause representation (or equivalently can be stored in a relational database) as follows: We define a binary predicate classInstance(object id,class id) denoting that an object with id object id belongs to a class with id class id. Then, we define a binary predicate instanceValue(object id, object value), to represent the values assigned to specific objects. Concerning the property instances, we use a ternary predicate propertyInstance(property id,object id1,object id2) declaring that the property with id property id relates the instance object id1 of the property’s domain with the instance object id2 of the property’s range. Finally, a ternary predicate propertyOfPropertyInstance(propertyOfProperty id, propertyInstance, object id) can represent the properties of properties. A prototype implementation of the above transformation is based on [6].
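For illustration, the relational-style alternative mentioned above can be written down for the small fragment of the generated data discussed in this section (the work o1, its title o12 and the P102 link); the tuple layout shown here is an assumption for illustration, not the prototype's actual schema.

    # Each predicate becomes a table of tuples; only a fragment of Figure 1 is encoded.
    class_instance = [("o1", "E24"), ("o12", "E35")]
    instance_value = [("o12", "Autumn - On the Hudson River")]
    property_instance = [("P102", "o1", "o12")]
    # propertyOfPropertyInstance would similarly relate a property instance (such as the
    # P102 tuple above) to the E55 Type instance created by rule R3 for the title type.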
5 Conclusion
The presented mapping methodology is part of a CIDOC CRM-based metadata integration scenario [7], since CIDOC CRM provides definitions and a formal structure for describing the implicit and explicit concepts and relationships of cultural heritage. The methodology transforms widely known metadata schemas, such as VRA and EAD, to CIDOC CRM. Currently, we are investigating the transformation of queries between various metadata schemas and CIDOC CRM.
References
1. Amann, B., Beeri, C., Fundulaki, I., Scholl, M.: Ontology-Based Integration of XML Web Resources. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS, vol. 2342, pp. 117–131. Springer, Heidelberg (2002)
2. Chan, L.M., Zeng, M.L.: Metadata interoperability and standardization - A study of methodology, Part I: Achieving interoperability at the schema level. D-Lib Magazine 12(6) (2006)
3. CIDOC CRM Special Interest Group: Definition of the CIDOC Conceptual Reference Model. Technical report (November 2009)
4. Cruz, I.F., Xiao, H.: The Role of Ontologies in Data Integration. Journal of Engineering Intelligent Systems 13(4), 245–252 (2005)
5. Noy, N.F.: Semantic Integration: a Survey of Ontology-Based Approaches. SIGMOD Record 33(4), 65–70 (2004)
6. Polychronakis, O.: Transformation of XML cultural heritage metadata in the form of CIDOC CRM ontology. Diploma thesis, NTUA, Athens, Greece (2010)
7. Stasinopoulou, T., Bountouri, L., Kakali, C., Lourdi, I., Papatheodorou, C., Doerr, M., Gergatsoulis, M.: Ontology-Based Metadata Integration in the Cultural Heritage Domain. In: Goh, D.H.-L., Cao, T.H., Sølvberg, I.T., Rasmussen, E. (eds.) ICADL 2007. LNCS, vol. 4822, pp. 165–175. Springer, Heidelberg (2007)
8. Uschold, M., Gruninger, M.: Ontologies: principles, methods and applications. Knowledge Engineering Review 11(2), 93–155 (1996)
9. W3C: XML Path Language (XPath) 2.0 (January 2007), http://www.w3.org/TR/xpath20/
Genetic Algorithm Solution to Optimal Sizing Problem of Small Autonomous Hybrid Power Systems Yiannis A. Katsigiannis1, Pavlos S. Georgilakis2, and Emmanuel S. Karapidakis1 1 Department of Environment and Natural Resources, Technological Educational Institute of Crete, Chania, Greece {katsigiannis, karapidakis}@chania.teicrete.gr 2 School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece {[email protected]}
Abstract. The optimal sizing of a small autonomous hybrid power system can be a very challenging task, due to the large number of design settings and the uncertainty in key parameters. This problem belongs to the category of combinatorial optimization, and its solution based on the traditional method of exhaustive enumeration can prove extremely time-consuming. This paper proposes a binary genetic algorithm in order to solve the optimal sizing problem. Genetic algorithms are popular metaheuristic optimization techniques based on the principles of genetics, natural selection and evolution, and can be applied to problems with discrete or continuous solution spaces. The obtained results prove the performance of the proposed methodology in terms of solution quality and computational time. Keywords: Combinatorial optimization, genetic algorithms, metaheuristics, renewable energy sources, small autonomous hybrid power systems.
1 Introduction

A small autonomous hybrid power system (SAHPS) is a system that generates electricity in order to serve a local low energy demand, and it usually operates in areas that are far from the main grid. Renewable energy sources (RES) can often be used as a primary source of energy in such a system, as these systems are usually located in geographically remote and demographically sparse areas. However, since renewable technologies such as wind turbines (WTs) and photovoltaics (PVs) depend on a resource that is not dispatchable, there is an impact on the reliability of the electric energy of the system, which has to be considered. The basic ways to solve this problem are either to use storage as a type of energy-balancing medium, or to install conventional generators in the system, such as diesel generators. The problem of optimal sizing of a SAHPS belongs to the category of combinatorial optimization problems, since the sizes of the system's components, which constitute the input variables, can only take specific values. For the solution of this problem, several methods have been proposed. The most direct method is the complete enumeration method. This approach is used by the HOMER software [1] and ensures that the
best solution is obtained, but it can prove extremely time-consuming. In [2] linear programming techniques are used in order to optimize the design of a hybrid WT-PV system. Heuristic methods have also been applied, as stated in [3]. In recent years, a number of new methods have been developed in order to solve many types of complex problems, particularly those of a combinatorial nature. These methods are called metaheuristics and include genetic algorithms (GAs), simulated annealing (SA), tabu search (TS) and particle swarm optimization (PSO). Metaheuristics combine characteristics of local search procedures and higher-level search strategies in order to create a process capable of escaping from local optima and performing a robust search of the solution space. From the area of metaheuristics, TS [4], PSO [5] and GAs [6]-[8] have been proposed for the solution of optimal SAHPS sizing. Moreover, the HOGA software [9] uses a GA in order to minimize the net present cost of a hybrid power system. This paper proposes the application of a GA to the optimal sizing of a SAHPS. The paper is organized as follows: Section 2 formulates the SAHPS optimal sizing problem. Section 3 describes SAHPS components and modeling and Section 4 presents the main characteristics of the proposed GA methodology. Section 5 presents and discusses the obtained results and Section 6 concludes the paper.
2 Problem Formulation

The objective function to be minimized is the system's cost of energy (COE):

COE = C_{tot,ann} / E_{tot,ann,served},   (1)
where C_{tot,ann} is the total annualized cost and E_{tot,ann,served} is the total annual useful electric energy production. C_{tot,ann} takes into account the annualised capital costs, the annualised replacement costs, the annual operation and maintenance (O&M) costs, and the annual fuel costs (if applicable) of the system's components. The constraints taken into consideration in this paper are:
1. Initial cost constraint: The available budget (total initial cost at the beginning of the system's lifetime) is limited to ICmax.
2. Unmet load constraint: The annual unmet load (load not served due to insufficient generation), expressed as a percentage of the total annual electrical load, cannot exceed a fixed value fULmax.
3. Capacity shortage constraint: The annual capacity shortage fraction, which is the total annual capacity shortage divided by the total annual electric energy demand, cannot exceed a fixed value fCSmax.
4. Fuel availability constraints: The maximum amount of each fuel that is consumed throughout a year cannot exceed a specific limit FCgenmax,ann.
5. Minimum renewable fraction constraint: The portion of the system's total energy production originating from RES technologies must be greater than or equal to a specified minimum limit fRESmin.
6. System component size range: The size of each system component must lie between zero and sizecompmax.
3 System Modeling

The considered SAHPS has to serve an electrical load, and it can contain the following component types: WTs, PVs, diesel generator, biodiesel generator, fuel cells, batteries and converters. Renewable power sources (WTs and PVs) have priority in supplying the electric load. If they are not capable of fully serving the load, the remaining electric load has to be supplied by the generators and/or the batteries. An additional aspect of system operation is whether (and how) the generators should charge the battery bank. Two common control strategies that can be used are the load following (LF) strategy and the cycle charging (CC) strategy [10]. In the LF strategy, the operating point of each generator is set to match the instantaneous required load. In the CC strategy, whenever a generator needs to operate to serve the primary load, it operates at full output power. A setpoint state of charge, SOCa, also has to be set in this strategy: the charging of the battery by the generators will not stop until it reaches the specified SOCa. In this paper, three values of SOCa are considered: 80%, 90% and 100%.
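The two dispatch strategies can be summarised in a simplified simulation step; the battery model, the charging efficiency and all numeric defaults below are assumptions of this sketch, not values used by the paper.

    BATTERY_KWH = 1125.0     # assumed full bank: 150 batteries x 625 Ah x 12 V
    STEP_H = 1.0 / 6.0       # 10-minute simulation step, as in the case study

    def dispatch_step(net_load_kw, soc, strategy, soc_a=0.8, p_gen_max=50.0):
        """One simplified step. net_load_kw = load minus renewable output (kW);
        strategy is "LF" or "CC"; returns (generator output in kW, new battery SOC)."""
        def soc_gain(p_kw):
            return p_kw * STEP_H / BATTERY_KWH
        if net_load_kw <= 0:                 # renewables cover the load; surplus charges the battery
            return 0.0, min(1.0, soc + soc_gain(-net_load_kw))
        if strategy == "LF":                 # load following: the generator matches the deficit
            return min(net_load_kw, p_gen_max), soc
        # cycle charging: run at full output; the surplus charges the battery up to SOC_a
        surplus = max(p_gen_max - net_load_kw, 0.0)
        return p_gen_max, min(max(soc_a, soc), soc + soc_gain(surplus))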
4 Genetic Algorithm Implementation for SAHPS Optimal Sizing

Genetic algorithms mimic natural evolutionary principles to constitute search and optimization procedures, and can be classified into two categories:
1. Binary GAs: they borrow their working principle directly from natural genetics, as the variables are represented by strings of zeros and ones. Binary GAs are preferred when the problem consists of discrete variables.
2. Continuous GAs: although they share the same working principle with binary GAs, the variables here are represented by floating-point numbers over whatever range is deemed appropriate. Continuous GAs are ideally suited to problems with a continuous search space.
The considered sizes of each component can take only discrete values, so the binary GA is selected. Two alternative GA coding schemes are examined: conventional binary coding and Gray coding. In the proposed GA, each chromosome consists of 8 genes, of which the first 7 represent the SAHPS component sizes (WT, PV, diesel generator, biodiesel generator, fuel cell, batteries and converter), while the eighth gene refers to the adopted dispatch strategy (an illustrative decoding sketch is given at the end of this section). For constraint handling, the penalty function approach is adopted, in which an exterior penalty term penalizes infeasible solutions. Since different constraints may take different orders of magnitude, all constraints are normalized prior to the calculation of the overall penalty function. The proposed GA offers the following significant advantages compared to other GA solutions of the SAHPS sizing problem:
1. Additional aspects of GA performance are examined, e.g., coding schemes.
2. In comparison with the GA methodologies of [6] and [7], the proposed GA is used for complex SAHPS that include a large number of conventional as well as renewable technologies.
3. The proposed GA considers the choice of the proper dispatch strategy as a decision variable.
4. Compared to [8], the proposed GA presents the advantage of simplicity, as the choice of optimal sizing and dispatch strategy is implemented in the same GA run.
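As referenced above, the following minimal sketch illustrates how a Gray-coded chromosome of 8 genes could be decoded into the SAHPS decision variables; the gene bit-lengths, option lists and the handling of out-of-range values are illustrative assumptions based on Table 1, not the authors' exact encoding.

    # Sketch of decoding a Gray-coded binary chromosome into SAHPS decision variables.
    # Gene lengths and option lists are illustrative guesses; a real GA would also need
    # a repair/clipping step for decoded values outside the admissible ranges.

    def gray_to_int(bits):
        """Convert a Gray-coded bit list (MSB first) to an integer."""
        value = bits[0]
        out = value
        for b in bits[1:]:
            value ^= b
            out = (out << 1) | value
        return out

    GEN_SIZES = [0, 3, 5, 7.5, 10, 15, 20, 25, 30, 35, 40, 50]   # kW, from Table 1
    DISPATCH = ["LF", "CC-80", "CC-90", "CC-100"]

    def decode(chromosome):
        """chromosome: list of 8 bit lists (7 component-size genes + dispatch gene)."""
        wt       = gray_to_int(chromosome[0])                        # number of 10 kW WTs
        pv       = gray_to_int(chromosome[1])                        # kWp of PV
        diesel   = GEN_SIZES[gray_to_int(chromosome[2]) % len(GEN_SIZES)]
        biodsl   = GEN_SIZES[gray_to_int(chromosome[3]) % len(GEN_SIZES)]
        fc       = 4 * gray_to_int(chromosome[4])                    # kW of fuel cells
        battery  = 10 * gray_to_int(chromosome[5])                   # number of batteries
        conv     = 2 * gray_to_int(chromosome[6])                    # kW of converter
        dispatch = DISPATCH[gray_to_int(chromosome[7]) % len(DISPATCH)]
        return wt, pv, diesel, biodsl, fc, battery, conv, dispatch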
5 Results and Discussion

5.1 Case Study System

In the considered SAHPS, the project lifetime is assumed to be 25 years and the discount rate has been taken equal to 8%. The maximum annual value of the electric load has been set to 50 kW, the time step of the simulation has been taken equal to 10 min (1/6 h), while the wind, solar and temperature data needed for the estimation of WT and PV performance refer to the Chania region, Crete, Greece. The characteristics of the system components are presented in Table 1. For each component, the replacement cost is assumed equal to the capital cost. Moreover, with the exception of the diesel and biodiesel generators, all components have a constant increment of their size, as Table 1 shows. The considered sizes for the generators are 0, 3, 5, 7.5, 10, 15, 20, 25, 30, 35, 40, and 50 kW. The dispatch strategy (LF or CC) also represents an optimization variable, as each component configuration has to be checked for both strategies. Table 2 presents the constraint values for the case study system.

Table 1. SAHPS component characteristics
Component                 sizecompmax  Increment  Capital cost   O&M cost         Fuel cost            Lifetime
WTs (10 kW rated)         10 WT        1 WT       15,000 €/WT    300 €/y          -                    20 y
PVs                       60 kWp       1 kWp      5,000 €/kWp    0                -                    25 y
Diesel generator          50 kW        Variable   200 €/kW       0.01 €/h per kW  1.0 €/L (diesel)     20,000 oper. hours
Biodiesel generator       50 kW        Variable   200 €/kW       0.01 €/h per kW  1.4 €/L (biodiesel)  20,000 oper. hours
Fuel cells                40 kW        4 kW       2,000 €/kW     0.02 €/h per kW  0.8 €/L (methanol)   40,000 oper. hours
Batteries (625 Ah, 12 V)  150 bat.     10 bat.    700 €/bat.     0                -                    9,000 kWh
Converter                 60 kW        2 kW       1,000 €/kW     0                -                    10 y
Table 2. Constraint values for the case study system

Constraint                     Parameter     Value
Initial cost                   ICmax         300,000 €
Unmet load                     fULmax        0.5%
Capacity shortage              fCSmax        1.0%
Fuel availability (diesel)     FCgenmax,ann  No constraint
Fuel availability (biodiesel)  FCgenmax,ann  10,000 L/y
Fuel availability (methanol)   FCgenmax,ann  10,000 L/y
Renewable fraction             fRESmin       50%
For the SAHPS sizing problem of Table 1, the complete enumeration method requires:

    11 (WTs) · 61 (PVs) · 12 (Dsl) · 12 (Bio) · 11 (FCs) · 16 (Bat.) · 31 (Conv.) · 4 (Disp.) ≈ 2.1 · 10^9    (2)
evaluations in order to find the optimal COE. The computational time for each COE evaluation is 3.5 seconds. Consequently, the evaluations of the complete enumeration method would require approximately 234 years.

Fig. 1. Effect of coding type on GA convergence, plotting COE (€/kWh) against the number of generations (Npop=50, gn=50, tournament selection, uniform crossover, 0.01 mutation rate): (a) conventional binary code, (b) Gray code
5.2 Results

The optimum configuration parameters of the proposed GA are: population size Npop=50, number of generations gn=15, Gray coding, tournament selection, uniform crossover, and 0.01 mutation rate. Fig. 1 shows the superiority of Gray coding over conventional binary coding, in terms of both solution quality and convergence speed. The optimal configuration contains 9 WTs, a 25 kW diesel generator, a 5 kW biodiesel generator, 100 batteries, a 38 kW converter, and the LF dispatch strategy, and the resulting COE is 0.22635 €/kWh. The total number of objective function (COE) evaluations performed was 800.
6 Conclusions

This paper proposes the application of a binary genetic algorithm to the problem of optimal sizing of a small autonomous hybrid power system. The main advantage of the proposed methodology is that its calculation time is very short compared to the prohibitive time required by the complete enumeration method. The obtained minimum-cost-of-energy system contains large capacities of wind turbines, batteries and diesel generation, negligible sizes of photovoltaics, fuel cells and biodiesel generation, and it uses load following as its dispatch strategy.
References
1. HOMER, The Micropower Optimization Model, http://www.nrel.gov/homer
2. Chedid, R., Rahman, S.: Unit Sizing and Control of Hybrid Wind-Solar Systems. IEEE Trans. Energy Conv. 12(1), 79–85 (1997)
3. Gavanidou, E., Bakirtzis, A.: Design of a Stand Alone System with Renewable Energy Sources using Trade off Methods. IEEE Trans. Energy Conv. 7, 42–48 (1992)
4. Katsigiannis, Y., Georgilakis, P.: Optimal Sizing of Small Isolated Hybrid Power Systems using Tabu Search. J. Optoel. Adv. Mat. 10(5), 1241–1245 (2008)
5. Hakimi, S.M., Moghaddas-Tafreshi, S.M.: Optimal Sizing of a Stand-alone Hybrid Power System via Particle Swarm Optimization for Kahnouj Area in South-east of Iran. Ren. Energy 34, 1855–1862 (2009)
6. Koutroulis, E., Kolokotsa, D., Potirakis, A., Kalaitzakis, K.: Methodology for Optimal Sizing of Stand-alone Photovoltaic/Wind-Generator Systems Using Genetic Algorithms. Solar Energy 80, 1072–1088 (2006)
7. Senjyu, T., Hayashi, D., Yona, A., Urasaki, N., Funabashi, T.: Optimal Configuration of Power Generating Systems in Isolated Island with Renewable Energy. Ren. Energy 32, 1917–1933 (2007)
8. Dufo-Lopez, R., Bernal-Agustin, J., Contreras, J.: Optimization of Control Strategies for Stand-alone Renewable Energy Systems with Hydrogen Storage. Ren. Energy 32, 1102–1126 (2007)
9. HOGA, Hybrid Optimization by Genetic Algorithms, http://www.unizar.es/rdufo/hoga-eng.htm
10. Barley, C.D., Winn, C.B.: Optimal Dispatch Strategy in Remote Hybrid Power Systems. Solar Energy 58, 165–179 (1996)
A WSDL Structure Based Approach for Semantic Categorization of Web Service Elements Dionisis D. Kehagias, Efthimia Mavridou, Konstantinos M. Giannoutakis, and Dimitrios Tzovaras Centre for Research and Technology Hellas, Informatics and Telematics Institute, 57 001, Thermi, Greece {diok,kgiannou,Dimitrios.Tzovaras}@iti.gr
Abstract. This paper introduces a semantic categorization technique whose goal is to classify web service structural elements into their semantically described counterparts. The semantic representations of the web service elements are provided by a set of predefined ontologies. Our technique proposes a three-layer categorization approach based on the WSDL document structure, i.e., its operations and input/output parameters. By means of the proposed categorization technique, automatic web service invocation is enabled in addition to semantic annotation of web services. Apart from presenting the theoretical details of the semantic categorization technique, this paper conducts a preliminary assessment of its performance in terms of classification accuracy and compares it to one of the best-known and most successful web service classification tools. Keywords: Semantic Web, Knowledge representation and reasoning.
1 Introduction

Semantic categorization of web services (WS) according to a formal conceptualization is the key element in enabling seamless service integration on behalf of WS consumers. In particular, automatic categorization of services into a set of predefined application domains facilitates service advertisement and discovery. The goal of the commonest WS categorization techniques is to classify WS into a set of predefined application domains (e.g., financial, sports, leisure) for the purpose of semantic annotation of services. These techniques operate on WS description documents that comply with the WS Description Language (WSDL) specification. One notable WS classification approach based on schema matching is supported by the MWSAF tool [1], which matches WSDL documents to concepts from an appropriate ontology by converting both WS and ontologies to corresponding common graph-like structures. An improved version of [1] was proposed in [2] that uses a machine learning technique. Another tool, Assam [3], uses an ensemble machine learning approach for WS classification and publicly provides its WS repository, thus allowing comparison to our approach. Other approaches, such as [4], use external data in the form of semantically annotated services in order to create a training dataset.
By exploiting the structure of a WSDL document, i.e., its operations with their input/output (i/o) parameters, we achieve a more detailed degree of categorization. We introduce a three-layer approach for WS classification based on WSDL structural elements. Our approach develops mechanisms for automatic categorization of WSDL elements based on a set of appropriate ontologies. After presenting the technical details of the proposed categorization mechanism, we provide an evaluation framework in order to assess its performance in terms of categorization accuracy.
2 Semantic Categorization Technique

In contrast to the most common available service classification techniques, the work reported in this paper focuses on the classification of all structural elements of a WS, i.e., operations and their input and output parameters, into different classes that describe operations of the same application domain and their parameters, respectively. With respect to the basic hierarchical levels of a WSDL structure, we introduce a semantic categorization schema composed of the following three layers:
1) A first layer whose goal is to semantically categorize the overall WS into one application domain.
2) A second layer that categorizes the WS operations into a set of "ideal" operations defined in the ontology. By using the term "ideal" we distinguish the ontology operations from the real ones (simply called "operations").
3) A third layer that concerns the categorization of i/o operation parameters into ontological concepts.
The problem of WS categorization is formally described as follows. Let A = {a1, ..., am} be a vector of m WS and D = {d1, ..., dn} a vector of n application domains. The categorization mechanism aims to develop an accurate classification model based on the collection A, capable of predicting the application domain dj to which any arbitrary WS ai belongs. The result of the categorization can be expressed by an (m×n) matrix K, where kij = 1 if ai belongs to dj, and kij = 0 otherwise. For each i ≤ m there is exactly one j ≤ n such that kij ≠ 0.
Fig. 1. The creation of the classification model is a five-stage process: the WSDL documents are parsed (1), word tokens (2) and terms (3) are extracted from them, and these are transformed into numerical vectors (4) which are used for training (5) in combination with semantic descriptions of terms in the ontology, producing the classification model K.
Matrix K is the result of a composite process consisting of five stages, represented as boxes in the diagram illustrated in Fig. 1. In the first stage, a collection of WS already categorized into a set of application domains is used as input to the process. In this stage each WSDL document is parsed and WS-specific data are extracted. The extracted data pass to the next stage, where they are split into distinct words. The third stage performs filtering of all extracted word tokens in order to determine the "uniqueness" of each token and its importance regarding the information it encapsulates. For this purpose a stop-list of words is used, containing articles, prepositions, and generally words that appear frequently in operation names or WSDL tags. Our categorization technique discovers all synonym words by use of the WordNet dictionary (http://wordnet.princeton.edu), and performs word stemming by adopting the Porter stemming algorithm [5]. The actions that take place in stage 3 result in a set of terms that are provided as input to stage 4, where each WS is transformed into a term vector of the form v = [w0, w1, ..., wn, c]. Each vector element wi represents a weight associated with the importance of term ti, while the last element c refers to the category to which the WS belongs (the application domain in the first categorization layer). In stage 5, all produced vectors are provided as training data to a learning classification algorithm (e.g., Naïve Bayes) for the generation of an accurate classification model. In this stage the classes that represent the application domains are derived from an ontology formatted in OWL (Web Ontology Language). For the implementation of the second categorization layer, which categorizes WS operations into their semantically described counterparts in the ontology, we re-use the process adopted in the first layer, changing the semantics of the vectors: A now represents the operations included in all WS that belong to the same application domain, D includes all "ideal" operations, i.e., WS operation descriptions defined in the ontology, and the parameter c represents the "ideal" operation to which a real WS operation is categorized. In the third categorization layer, i/o parameters are classified into their ontology counterparts (i.e., concepts). Since the third layer does not contain adequate information, as opposed to the previous two layers, a different classification method was required. Categorization in this layer is handled by a composite scheme for matching the i/o of a WS operation with the i/o of ideal operations from the ontology. The algorithm used consists of three levels of matching: lexicographic, structure and data-type matching. Each i/o operation parameter name is compared to the names of all ideal operation i/o, and the scores are normalized to 1. At the first level of this comparison each tuple of i/o names is tokenized, and the tokens are placed in a bipartite graph. Let Z = {z1, …, zk} and W = {w1, …, wl} be the sets of tokens of the two inputs to be compared. Then, for every tuple (zi, wj) a score (normalized to 1) is assigned, using WordNet::Similarity, cf. [6], and an n-gram model, according to their semantic and lexicographic similarity. The resulting k×l matrix S consists of all score combinations between tokens. The assignment problem of finding the best matches of tokens (represented by matrix S) is solved by the Kuhn-Munkres algorithm (also known as the Hungarian algorithm) in polynomial time, cf. [7].
Then, the score of the comparison between two i/o sets is estimated as the average of the scores of the best assignments. Let us now consider, without loss of generality, the sets X = {x1, …, xn} and Y = {y1, …, ym} representing the inputs of the WS operation and the inputs of the ideal
operation, respectively. The inner process described above forms an n×m matrix Ms with the scores of all possible combinations of xi with yj. At the second level, an n×m matrix Md is produced, whose scores represent the data-type similarity between inputs xi and yj. Similarly, a score matrix Mg is produced by matching complex data types. The final score matrix M is a weighted sum of the three matrices computed above, i.e., M = w1Ms + w2Md + w3Mg, with w1 + w2 + w3 = 1. During our experimentation, we observed that the values w1 = 0.6, w2 = 0.2 and w3 = 0.2 provide efficient i/o matching, while they can be adjusted according to individual needs. Finally, by applying the Hungarian algorithm to matrix M, the best assignments between the inputs or outputs of the operation and those of the ideal operation are computed, and a final matching score is then estimated.
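A minimal sketch of this i/o matching step is given below: the three score matrices (assumed to be precomputed) are combined with the weights reported above and the optimal assignment is found with the Hungarian algorithm, here via SciPy's linear_sum_assignment.

    import numpy as np
    from scipy.optimize import linear_sum_assignment  # Kuhn-Munkres (Hungarian) algorithm

    def io_matching_score(M_s, M_d, M_g, w=(0.6, 0.2, 0.2)):
        """Combine the lexicographic/semantic (M_s), data-type (M_d) and complex-type (M_g)
        score matrices and return the average score of the best i/o assignment.
        All matrices are n x m arrays of scores normalized to [0, 1]."""
        M = w[0] * np.asarray(M_s) + w[1] * np.asarray(M_d) + w[2] * np.asarray(M_g)
        rows, cols = linear_sum_assignment(-M)   # negate to maximize the total score
        return M[rows, cols].mean()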
3 Evaluation

The goal of the evaluation process is to assess the performance of the proposed categorization mechanism and compare it to a well-known tool. Since evaluation data about relevant techniques refer to what we call the first categorization layer, we assess the accuracy of our proposed mechanism only for the first categorization layer. A k-fold validation process was conducted at the beginning of our evaluation procedure, dividing the dataset into k parts. In each iteration we use the data from one part for testing and the data from the remaining k-1 parts for training. For each WS in the test set the classification model predicts the category, i.e., the application domain to which it belongs. We measure the accuracy of the classifier as the proportion of the total number of WS that were correctly classified. For the first layer, our method achieved an average accuracy of 77.4436% on a data set of 266 WSDL documents categorized into 4 domains: A. "Business and Money" (98 instances), B. "Tourism and Leisure" (21 instances), C. "Communications" (68 instances), D. "Geographic Information" (79 instances). The overall performance of the proposed mechanism for each of these domains is evaluated by means of the ROC curves depicted in Fig. 2, where true positive (TP) rates are plotted on the vertical axis vs. false positive (FP) rates on the horizontal axis. A categorization result counts as a TP for a given domain when the corresponding WS is correctly classified into this domain (it belongs a priori to this domain), and as an FP when it is incorrectly classified into the same domain (it belongs a priori to a different domain). The results are shown in Fig. 2 for the four aforementioned domains (A-D). The area under the ROC curve (AUC) measures the discriminating ability of a classification model. Since the AUC is a portion of the area of the unit square, its value always lies between 0 and 1. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. From Fig. 2 we derive that our method achieves a quite high AUC for all classes (0.94, the highest, for "Communications" and 0.87, the lowest, for "Tourism and Leisure"). This means that the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case is relatively high.
Fig. 2. ROC curves for evaluating classification accuracy of WS into four application domains in terms of AUC. Results are: “Business and Money” AUC=0.88, “Tourism and Leisure” AUC=0.87, “Communications” AUC=0.94, “Geographic Information” AUC=0.91.
In order to further evaluate our proposed mechanism we compared it to the Assam tool [3], which has shown good accuracy compared to other classification mechanisms. Since Assam deals only with the classification of WS into application domains (first layer), we reproduced the evaluation process reported in [3] by applying the same datasets to our categorization mechanism. The results are rendered in Fig. 3.

Fig. 3. Comparison of classification accuracy (vertical axis) between Assam and our proposed web service categorization mechanism (WSCM) for various tolerance values (horizontal axis)
Assam achieves an accuracy level of almost 51% when using WSDL documents as the only source of data. A tolerance value t was used for determining the classification accuracy while allowing near misses: the correct category must be included in a sequence of t+1 suggestions. For example, the tolerance value t=1 implies that we allowed the classifier to fail only once. Fig. 3 presents the evaluation results. The horizontal axis corresponds to the tolerance threshold t, while the vertical axis renders classification accuracy. Our mechanism outperforms Assam for t=0 and t=1. This implies that our mechanism predicts the right category of a WS more accurately when this appears in the first two suggestions. When t=2, i.e., when the right category may be the third suggestion, both approaches achieve almost identical accuracy. Hence our categorization mechanism performs better when it is given no second or third opportunity to correct its output.
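The tolerance metric described above amounts to a top-(t+1) accuracy over the classifier's ranked suggestions; the following short sketch illustrates its computation, assuming a class-probability matrix and the true domain labels are available (variable names are illustrative).

    import numpy as np

    def tolerance_accuracy(probas, true_labels, t=0):
        """Fraction of services whose correct domain is among the top t+1 suggestions.
        probas: (num_services, num_domains) class-probability matrix.
        true_labels: array of correct domain indices."""
        ranked = np.argsort(-probas, axis=1)[:, : t + 1]   # best t+1 suggestions per service
        hits = [label in row for row, label in zip(ranked, true_labels)]
        return float(np.mean(hits))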
A similar evaluation process was also followed for the second layer. Specifically, a k-fold validation procedure was performed for each of the domains and the accuracy for each one was calculated. As depicted in Table 1, our technique achieved a sufficient level of accuracy for all domains.

Table 1. Accuracy of the four application domains for the second classification layer

Domain        Business and Money  Communication  Geographic  Tourism and Leisure
Accuracy (%)  62.01               65.05          73.95       60.47
4 Summary and Conclusions

The presented mechanism proposes a three-layer classification schema that encompasses the classification of all structural elements of a WSDL document. In this paper we described the details of the implemented categorization mechanism and provided evaluation results for the first classification layer, based on a set of real WS data. According to the derived results, we conclude that our proposed categorization mechanism shows good accuracy. This is also supported by comparing our mechanism to the one adopted by Assam, one of the most reliable and robust existing WS classification tools; the result of this comparison showed that our proposed mechanism surpasses Assam. Unfortunately, direct comparison with additional tools was deemed unfeasible due to the lack of a common evaluation dataset. Our future plans include extension of the evaluation framework to all three classification layers.
References
1. Oldham, N., Thomas, C., Sheth, A., Verma, K.: METEOR-S web service annotation framework with machine learning classification. In: Cardoso, J., Sheth, A.P. (eds.) SWSWPC 2004. LNCS, vol. 3387, pp. 137–146. Springer, Heidelberg (2005)
2. Corella, M.A., Castells, P.: Semi-automatic semantic-based Web service classification. In: Eder, J., Dustdar, S. (eds.) BPM Workshops 2006. LNCS, vol. 4103, pp. 459–470. Springer, Heidelberg (2006)
3. Heß, A., Kushmerick, N.: Learning to attach semantic metadata to Web services. In: Fensel, D., Sycara, K., Mylopoulos, J. (eds.) ISWC 2003. LNCS, vol. 2870, pp. 258–273. Springer, Heidelberg (2003)
4. Crasso, M., Zunino, A., Campo, M.: AWSC: An approach to Web service classification based on machine learning techniques. Inteligencia Artificial, Revista Iberoamericana de IA 12(37), 25–36 (2008)
5. Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
6. Pedersen, T., Patwardhan, S., Michelizzi, J.: WordNet::Similarity - Measuring the Relatedness of Concepts. In: Proceedings of the Fifth Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2004), Boston, MA, May 3-5, 2004, pp. 38–41 (2004)
7. Burkard, R.E., Dell'Amico, M., Martello, S.: Assignment Problems. SIAM, Philadelphia (2009)
Heuristic Rule Induction for Decision Making in Near-Deterministic Domains Stavros Korokithakis and Michail G. Lagoudakis Intelligent Systems Laboratory Department of Electronic and Computer Engineering Technical University of Crete Chania 73100, Crete, Greece [email protected], [email protected]
Abstract. A large corpus of work in artificial intelligence focuses on planning and learning in arbitrarily stochastic domains. However, these methods require significant computational resources (large transition models, huge amounts of samples) and the resulting representations can hardly be broken into easily understood parts, even for deterministic or near-deterministic domains. This paper focuses on a rule induction method for (near-)deterministic domains, so that an unknown world can be described by a set of short rules with well-defined preconditions and effects given a brief interaction with the environment. The extracted rules can then be used by the agent for decision making. We have selected a multiplayer online game based on the SMAUG MUD server as a model of a near-deterministic domain and used our approach to infer rules about the world, generalising from a few examples. The agent starts with zero knowledge about the world and tries to explain it by generating hypotheses, refining them as they are refuted. The end result is a set of a few meaningful rules that accurately describe the world. A simple planner using these rules was able to perform near optimally in a fight scenario.
1 Introduction
Intelligent agents that learn to make decisions in unknown environments using reinforcement learning [1] focus on the diversity of action effects in different states. The agent explores the world, observing the actions it took at each state and their immediate effects, and eventually learns the action choices that yield the desired long-term outcome. This means that the agent would have to observe more or less every state and perform every action in each state in order to see how they contribute to the desired outcome. This approach quickly becomes intractable in the real world, which nevertheless in many cases exhibits some regularity. Induction learning, on the other hand, aims at deriving generic rules from a few experiences, which can then be applied to previously unseen situations to produce meaningful predictions of the outcomes that will be observed [2]. Multiplayer online games (Multi-User Dungeons or MUDs) represent a large class of moderately complex abstractions of the real world. Input in these games
takes the form of commands, which a player enters to perform actions in the world. Each command effects a change in the state of the world which is reflected on the observed variables, and this change is unlikely to be arbitrary. For example, a fight variable (boolean) may indicate whether the player is currently fighting an enemy and a health variable (integer) may reflect the player’s current health. The player can influence these variables by performing commands, namely strike may cause the player to engage in a fight (fight becomes True) and the enemy to take some amount of damage. A player currently in a fight can expect to take an amount of damage himself from an enemy fighting back, regardless of his actions. The fighting player’s health will diminish and he may die if it reaches 0, however he may increase his health by the heal command, which is subject to constraints in itself, and so on ... Our goal is to design agents which are able to discover enough aspects of the world to achieve a list of desired outcomes. We must, therefore, find a way for the agent to understand the effect that his actions have on the world and on its current state. For example, our agent must learn that healing will increase its health and fighting will decrease it, but will eventually provide him with gold, if he wins. We also want the agent to be able to generalise the rules he discovers and then refine the circumstances he believes they apply to, so that he goes from the general to the specific, instead of the other way around.
2 Decision Making in Near-Deterministic Domains
The world is typically described by a collection of state variables s1, s2, ..., sn; S(t) = (s1(t), s2(t), ..., sn(t)) reflects the state of the world at time t and a(t) is the action taken by the agent at time t. The goal of our agent is to understand how each action impacts the state. We assume an underlying (unknown) transition model T: S(t+1) = T(S(t), a(t)), which is both Markovian (state transitions depend only on the current state and action and not on the past history) and deterministic (there is only one outcome for a given state and action pair). We must, therefore, discover what changes each action imparts on the state in each case, and also when these occur, i.e., the conditions under which they occur. Our goal is to infer a transition model for the world that takes the form of a collection of rules. The general form of these rules is the following: if action a is taken at time t, then a subset of state variables {si, sj, sk, ...} will change by {di, dj, dk, ...} at time t+1, provided that the values of another subset of state variables {sl, sm, sn, ...} are (not) {vl, vm, vn, ...}. Our approach is predicated on the assumption that the world is mostly deterministic (i.e., the same action will not bring radically different results if performed at two different times in the same state), but we must account for noise, as our measurements might be noisy. Another assumption we make is that there are weak dependencies, i.e., an action affects only a small number of state variables,
and that the preconditions under which this happens involve only a small number of state variables as well. Both assumptions are valid for MUDs in general. Related Work. Salzberg [3] developed HANDICAPPER, a system for predicting horse race outcomes using heuristic inductive learning methods, which fares markedly better than human experts and chance. He describes various heuristics at a high level, providing inspiration for our work; however, it was not clear how to combine them at an algorithmic level and apply them to our domain. Amir and Chang [4] developed an exact solution for identifying actions' effects in partially observable STRIPS domains. Their methods apply in other deterministic domains with conditional effects, but may be inexact, as they produce false positives, or inefficient, as the resulting model can grow arbitrarily. The MUD Domain. Our test domain is a subset of the full MUD domain that focuses on combat situations. We have selected the following state variables: Health is an integer in the range of approximately [0, 1000]. It reflects the player's well-being and it is generally desired that its value is as high as possible. If a player's health falls to zero, the player dies. Mana is an integer in the range of approximately [0, 700] and reflects the amount of available "magic power" the player has for casting spells. Fight is a Boolean variable indicating whether the player is in combat or not. When in combat, the player will usually take (and deal) damage. Enemy health is an integer variable in the range of [0, 11], which indicates the current health of the enemy. If it reaches zero, the enemy dies. There are four combat-related actions available to the agent: Pause is a control action that does nothing. We include it so that we can see what happens in the world without our intervention. Strike opponent is an aggressive action that causes damage to the opponent. Additionally, the angry opponent starts a fight and fight changes to True. Heal causes our health and mana to increase by certain amounts. Obviously, if these variables have already reached their maxima, this action has no effect. Cast spell is another aggressive action that causes damage to the opponent. It also turns fight to True and makes mana diminish by some amount.
3 Heuristic Rule Induction
In this section, we propose a heuristic algorithm for learning rules that attempt to capture the preconditions and effects of actions using a series of observed interactions with the world. An agent can then use these rules to plan its actions. Rule Induction. The first step is to construct rules for each action. We want to infer the changes that an action brings about in the world, so it makes sense to start by grouping all the results by action. Since there may be several outcomes for an action, we allow for multiple disjoint pairs of preconditions and effects
update action rules(S(t−1), a(t−1), S(t))
    // S(t−1): previous state, a(t−1): action performed, S(t): current state
    δ(t) ← S(t) − S(t−1)
    if δ(t) ∈ δ_{a(t−1)}                       // if the change is already in the deltas set,
        reinforce(a(t−1), δ(t))                // reinforce the already existing delta
    else
        δ_{a(t−1)} ← δ_{a(t−1)} ∪ δ(t)         // otherwise, add it to the deltas set
    for each δ ∈ δ_{a(t−1)}
        if δ = δ(t)
            pos precon(a(t−1), δ, S(t−1))      // positive precondition
        else
            neg precon(a(t−1), δ, S(t−1))      // negative precondition
for each action. Initially, the outcomes for each action are empty and they are updated as the data is processed. The update step of the algorithm takes as input a state pair and an action. A state pair consists of a state, S(t), and the one immediately preceding it, S(t − 1). The action a(t − 1) is the action that the agent took between those two states. The algorithm then notes the differences between the two states in a vector δ and checks if this particular outcome has been seen before. If so, trust in it is reinforced by increasing a confidence value. If not, it is added to the set of outcomes for the current action. The next step is to infer a priori necessities, which we call preconditions. Briefly, the current state is added as a positive precondition to the current outcome and as a negative precondition to all the others. More specifically, if an action a(t − 1) led to a certain outcome δ(t), while a state variable si (t − 1) had the value v, but never when it had any other value, then we can infer that, for the action a(t − 1) to produce the outcome δ(t), the variable si must have the value v. The opposite is true for negative preconditions, i.e. ones that require that the state value not be v. State variables that do not change between S(t − 1) and S(t) are discarded. The rule induction algorithm, shown in the box above, creates the deltas for each action and a measure of confidence for each rule. Merging. To counter duplicate outcomes that might be produced due to noise, we take a merging step. This step merges any two outcomes of a rule that share the same preconditions. For example, an action might sometimes cause a variable to increase by 10 and sometimes by 5. In this case, the two rules could be merged by taking the average 7.5 as their numeric outcome. It stands to reason that, if a result δi of an action ai has priors pi and another result δj of the same action has the same priors, then the two results must be different instances of the same mechanism. Planning. Initially, since the player will know nothing about the actions, it can fall back on a purely random player, which will, nevertheless, aid in exploration. The agent can then decide to leverage exploitation vs exploration as it generates new rules. One way to plan would be deciding which target state we would like to be in in the long term and using the rules as transformations on the current state to reach our target. This process is known as planning and the action chain is called a plan.
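The update step described above can be sketched in Python as follows; the dictionary-based state representation and the rule record structure are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch of the rule-update step. States are dicts of variable values;
    # exact equality of deltas is assumed (the domain is treated as near-deterministic).

    from collections import defaultdict

    rules = defaultdict(list)   # action -> list of {"delta", "confidence", "pos", "neg"}

    def state_delta(prev, curr):
        """Changes between two states, keeping only variables that changed."""
        return {v: curr[v] - prev[v] for v in curr if curr[v] != prev[v]}

    def update_action_rules(prev_state, action, curr_state):
        delta = state_delta(prev_state, curr_state)
        for outcome in rules[action]:
            if outcome["delta"] == delta:
                outcome["confidence"] += 1                 # reinforce the existing delta
                outcome["pos"].append(dict(prev_state))    # positive precondition sample
            else:
                outcome["neg"].append(dict(prev_state))    # negative precondition sample
        if not any(o["delta"] == delta for o in rules[action]):
            rules[action].append({"delta": delta, "confidence": 1,
                                  "pos": [dict(prev_state)], "neg": []})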
4 Experimental Results
The following results were obtained by first running a random player for 600 actions to collect data and then running our algorithm off-line to learn rules. The learned rules along with the actual rules are shown below. The "pause" action is not a real action, but merely a way of telling what happens if we do nothing. When fighting, the player takes some damage, but nothing happens when not fighting. In this case the preconditions are wrong; this action does not require mana to have any specific value. The erroneous preconditions caused the merging step to err, creating two outcomes instead of one. Note that our agent has not seen any data of "pause" when not in a fight.

Actual: "pause" decreases health by about 40 units when fighting. Derived:
    Confidence 18: Influences: health by -48.61, mana MUST NOT be 1
    Confidence 14: Influences: health by -35.00, mana MUST NOT be 2
The "cast spell" action is a way for the player to cast an offensive spell on the target, draining its health. We can see that the effects are once more derived correctly, but not the priors. This is the result of seeing too few useful states, but this problem can be addressed with directed exploration.

Actual: "cast spell" causes mana to decrease by 8 and the opponent to take damage. Derived:
    Confidence 142: Influences: mana by -8.00, health by -64.57, enemy_health by -1.00
    Confidence 131: Influences: mana by -7.72, enemy_health by -1.0, health MUST NOT be 2
The "heal" action increases health by 200 and mana by 60 (both are capped at their maxima), but we see that the detected health increase is 116 on average, because the enemy deals damage which decreases our health at every turn.

Actual: "heal" causes health to increase by 200 and mana to increase by 60. Derived:
    Confidence 618: Influences: mana by 58.0, health by 116.96, enemy_health MUST NOT be 0
    Confidence 158: Influences: mana by 16.0, health by 116.53, health MUST NOT be 2
The “strike” action deals damage to the opponent and starts a fight if there isn’t one. The rule (not shown due to space limitations) is correctly derived and the change of the state in fighting is correctly interpreted. The decrease in health is, again, because we get dealt damage by the opponent when in a fight. To obtain decision making results, we used a rudimentary planner that acts in very basic ways to ensure its survival, yet makes full use of the rules we have created. It was not within the scope of this paper to utilise a competent planner. At each planning step, the current state is inspected and the actions whose preconditions in the learned rules are not satisfied are culled. Afterwards, the planner decides which action to use according to its directives and the rules. Figure 1 (a) shows the health fluctuations of the random player. Health fluctuates wildly, as we would expect. Figure 1 (b) shows the performance of a player that plans using the rules learned off-line. The health of this player fluctuates minimally, between about 880 and 1000. Figure 1 (c) shows the performance of a player that starts out with no knowledge (essentially a random player), but periodically uses its samples to learn rules. We can see that after only 150 samples it has learnt rules good enough to play near-optimally, and its health never drops below 900 (except for cases of large damage).
Fig. 1. Health fluctuations of a (a) random, (b) learned, and (c) online player
5 Discussion and Conclusion
The proposed algorithm operates at a high level, which means that the rules it generates are very compact. It not only infers which variables changed, but also how they changed, usually with specific values. This offers an obvious advantage for planning, because the planner can predict with greater accuracy how an action is going to alter the current state. In addition, the algorithm starts generating rules as early as the first datum. These rules may later be revised, superseded, or reinforced, but there is virtually no bootstrapping period, as we can begin to perform actions that are more relevant to the environment almost immediately. Finally, it provides the planner with confidence measurements on each rule. This means that the planner can actively decide whether it wants to experiment with another area of the search space. As usual, there are always tradeoffs, and the proposed algorithm has its own limitations. It is not well suited to purely stochastic environments: if the world is purely stochastic, the advantage of precise reporting our algorithm offers will be lost. Furthermore, the algorithm cannot handle multiple variable dependencies: if a result depends on the specific combined value of two or more variables, our algorithm will not detect this correctly. Inspired by the way humans approach problems, we have presented a method to infer rules about the world and generalise from examples of data so that we can apply the actions to unknown circumstances. We have demonstrated the advantages of our algorithm, namely the fact that it is online, is amenable to directed learning, and is fast with low memory requirements. The results show a marked increase in the ability of the player to function in the game, as compared to the random player.
References
1. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4, 237–285 (1996)
2. Wexler, M.: Embodied induction: Learning external representations. In: AAAI Fall Symposium, pp. 134–138. AAAI Press, Menlo Park (1996)
3. Salzberg, S.: Heuristics for inductive learning. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 603–609 (1985)
4. Amir, E., Chang, A.: Learning partially observable deterministic action models. Journal of Artificial Intelligence Research 33, 349–402 (2008)
Behavior Recognition from Multiple Views Using Fused Hidden Markov Models
Dimitrios I. Kosmopoulos (1), Athanasios S. Voulodimos (2), and Theodora A. Varvarigou (2)
(1) N.C.S.R. "Demokritos", Institute of Informatics and Telecommunications, 15310 Aghia Paraskevi, Greece
(2) National Technical University of Athens, School of Electrical and Computer Engineering, 15773 Zografou, Greece
Abstract. In this work, we provide a framework for recognizing human behavior from multiple cameras in structured industrial environments. Since target recognition and tracking can be very challenging, we bypass these problems by employing an approach similar to Motion History Images for feature extraction. Modeling and recognition are performed through the use of Hidden Markov Models (HMMs) with Gaussian observation likelihoods. The problems of limited visibility and occlusions are addressed by showing how the framework can be extended for multiple cameras, both at the feature and at the state level. Finally, we evaluate the performance of the examined approaches under real-life visual behavior understanding scenarios and we discuss the obtained results. Keywords: behavior recognition, Hidden Markov Models, multi-camera classification.
1 Introduction
Motion analysis in video, and particularly human behavior understanding, has attracted many researchers, mainly because of its fundamental applications in video indexing, virtual reality, human-computer interaction and smart monitoring. Smart monitoring is applicable to large-scale enterprises, which have a clear need for automated supervision services to guarantee safety, security, and quality by enforcing adherence to predefined procedures for production or services. Here, we focus on monitoring the production line of an automobile manufacturer, which is a quite structured process. Identified deviations from this process possibly indicate security- and safety-related events and will be automatically highlighted. The complexity of detecting and tracking moving objects under occlusions in a typical industrial environment requires more than a single camera and features that will not result from an error-prone tracker. Furthermore, the high diversity and complexity of the behaviors which need to be monitored requires new learning methods that will be able to fuse information from multiple streams. Based on these observations, our work contributes in the following ways: firstly, we propose a holistic approach for action representation in every video frame, which bypasses the problem of object tracking. In addition, we provide a behavior recognition framework
based on the aforementioned features which is extended to solve the multicamera problem in an endeavor to alleviate visibility problems and occlusions. The rest of this paper is organized as follows: In section 2 we briefly survey the related literature. In section 3 we analyze the fusion frameworks and their applicability for multi-camera behavior recognition. In section 4 we explain our holistic approach for action representation in each frame and describe the feature extraction method. The experimental results are given in section 5. Finally, section 6 concludes the paper.
2 Related Work
Regarding classification, HMMs are an extremely popular means of modeling various kinds of sequences of observations. The popularity of HMMs as a classification means (e.g., [1]) is mainly due to the fact that they can efficiently model stochastic time series at various time scales. However, the possibility to utilize multiple, interdependent streams of data, all corresponding to the same sequence of events, is of crucial importance in several practical applications, as it allows for a significant performance enhancement of the HMM-based pattern recognition algorithm [2]. The related literature comprises an abundance of methodologies proposed as solutions for the multiple-stream setup problem. The first and most straightforward solution is early integration [3]; it consists in merging all the observations related to all the streams into one large stream (frame by frame) and modeling it using a single HMM. Xiang and Gong [4] used Dynamically Multi-Linked HMMs to model actions and interactions between persons; however, when many actors are involved the complexity can become too high. Several fusion schemes using HMMs have also been presented, which were typically used for fusing heterogeneous feature streams, such as audio-visual systems, but can be applied to streams of holistic features from multiple cameras as well. Such examples are the synchronous HMMs [2], the parallel HMMs [5] and the multistream HMMs [6]. The reliability of each stream has been expressed by introducing stream-wise factors in the total likelihood estimation, as in the case of parallel, synchronous or multistream HMMs.
3 Fusion Frameworks for Multi-view Learning
The goal of automatic behavior recognition may be viewed as the recovery of a specific learned behavior (class or visual task) from the sequence of observations O. Each camera frame is associated with one observation vector, and the observations from all cameras have to be combined in a fusion framework to exploit the complementarity of the different views. The sequence of observations from each camera composes a separate camera-specific information stream, which can be modelled by a camera-specific HMM. The HMM framework entails a Markov chain comprising a number of, say, N states, with each state being coupled with an observation emission distribution. Gaussian mixture models are typically used for modeling the observation
Fig. 1. Fusion schemes using the HMM framework for two streams
emission densities of the HMM hidden states. The EM algorithm is very popular for training HMMs under a maximum-likelihood framework. The ultimate aim of multicamera fusion is to achieve behavior recognition results better than those attainable by using the information obtained by the individual data streams (stemming from different cameras) independently of each other. Among existing approaches, feature fusion is the simplest; it assumes that the observation streams are synchronous. This synchronicity is a valid assumption for cameras that have overlapping fields of view and support synchronization. The related architecture is displayed in Fig. 1(a). For streams from C cameras and respective observations at time t given by o_t^1, ..., o_t^C, the proposed scheme defines the full observation vector as a simple concatenation of the individual observations:

    o_t = {o_t^c}, c = 1, ..., C    (1)

Then, the observation emission probability of state s_t = i of the fused model, when considered as a K-component mixture model, yields:

    P(o_t | s_t = i) = Σ_{k=1}^{K} w_ik P(o_t | θ_ik)    (2)

where w_ik denotes the weights of the mixture and θ_ik the parameters of the kth component density of the ith state (e.g., mean and covariance matrix of a Gaussian pdf). Both training and testing are performed in the typical way using the obtained concatenated vectors. The parallel HMM approach (see Fig. 1(b)) assumes that the streams are independent; thus we can train one individual HMM for each stream. This model can be applied to cameras that may not be synchronized and may operate at different acquisition rates. Each stream c may have its own weight r_c, depending on the reliability of the source. Classification is performed by selecting the class that maximizes the weighted sum of the log-likelihoods from the stream-wise HMMs, i.e., by picking the class l̂ with:

    l̂ = argmax_l [ Σ_{c=1}^{C} r_c log P(o_1 ... o_T | λ_l^c) ]    (3)
where λcl are the parameters of the postulated streamwise HMM of the cth stream that corresponds to the lth class. The multistream fused HMM (MFHMM) (Fig. 1(c)) is another promising method for modeling of multistream data [6] with several desirable features: a) state transitions do not necessarily happen simultaneously, which makes the method appropriate for both synchronous and asynchronous camera networks; b) it has simple and fast training and inference algorithms; c) if one of the component HMMs fails, the rest of the constituent HMMs can still work properly; and d) it retains the crucial information about the interdependencies between the multiple data streams. Similar to parallel HMMs, the class that maximizes the weighted sum of the log-likelihoods over the streamwise models is the winner.
4 Feature Extraction
Hereby we describe the feature calculation process. Firstly we perform background modelling. We use the foreground regions to represent the multi-scale spatiotemporal changes at pixel level. For this purpose we use a concept proposed in [4], which is similar to Motion History Images, but has better representation capabilities as shown therein. The Pixel Change History (PCH) of a pixel is defined as:

    P_ς,τ(x, y, t) = min(P_ς,τ(x, y, t−1) + 255/ς, 255)   if D(x, y, t) = 1
                     max(P_ς,τ(x, y, t−1) − 255/τ, 0)     otherwise            (4)

where P_ς,τ(x, y, t) is the PCH for a pixel at (x, y), D(x, y, t) is the binary image indicating the foreground region, ς is an accumulation factor and τ is a decay factor. By setting appropriate values to ς and τ we are able to capture pixel-level changes over time. To represent the resulting PCH images we propose use of Zernike moments. Zernike moments are very attractive because of their noise resiliency, their reduced information redundancy and their reconstruction capability. The complex Zernike moments of order p are defined as:

    A_pq = ((p+1)/π) ∫_0^1 ∫_{−π}^{π} R_pq(r) e^(−jqθ) f(r, θ) r dr dθ    (5)

where r = sqrt(x² + y²), θ = tan⁻¹(y/x), −1 < x, y < 1, and:

    R_pq(r) = Σ_{s=0}^{(p−q)/2} (−1)^s (p − s)! / [ s! ((p+q)/2 − s)! ((p−q)/2 − s)! ] r^(p−2s)    (6)

where p − q is even and 0 ≤ q ≤ p. The higher the order of moments that we employ, the more detailed the region reconstruction will be, but also the more processing power will be required. Limiting the order of moments used is also justified by the fact that the details captured by higher order moments have much higher variability and are more sensitive to noise.
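A minimal NumPy sketch of the PCH update of eq. (4) is given below; the foreground mask is assumed to come from the background-modelling step, and the default factors correspond to the values ς = 10 and τ = 70 used later in the experiments.

    import numpy as np

    def update_pch(pch, foreground, accumulate=10, decay=70):
        """One PCH update step (eq. 4).
        pch: float image with values in [0, 255]; foreground: boolean mask D(x, y, t);
        accumulate/decay correspond to the ς and τ factors."""
        rise = np.minimum(pch + 255.0 / accumulate, 255.0)   # foreground pixels accumulate
        fall = np.maximum(pch - 255.0 / decay, 0.0)          # background pixels decay
        return np.where(foreground, rise, fall)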
5 Experiments and Results
We experimentally verified the applicability of the described methods. For this purpose, we have acquired some very challenging videos from the production line of a major automobile manufacturer. The production cycle on the production line included tasks of picking several parts from racks and placing them on a designated cell. Each of the above tasks was regarded as a class of behavioral patterns to be recognized. The information acquired from this procedure can be used for the extraction of production statistics or anomaly detection. Partial or total occlusions due to the racks made the classification task difficult when using a single camera, so two synchronized, partially overlapping views are used. The workspace configuration and the cameras' positioning are shown in Fig. 2. The work cycle to be modeled, despite noise and outliers (e.g., persons walking into the working cell, vehicles passing by, etc.), remains a structured process and is a good candidate for modeling with holistic methods. The behaviors we aimed to model in this application are briefly the following:
(1) One worker picks part #1 from rack #1 and places it on the welding cell.
(2) Two workers pick part #2a from rack #2 and place it on the welding cell.
(3) Two workers pick part #2b from rack #3 and place it on the welding cell.
(4) A worker picks up parts #3a and #3b from rack #4 and places them on the welding cell.
(5) A worker picks up part #4 from rack #1 and places it on the welding cell.
(6) Two workers pick up part #5 from rack #5 and place it on the welding cell.
(7) Welding: two workers grab the welding tools and weld the parts together.
We have used 20 sequences representing full assembly cycles, each one containing at least one of the seven behaviors. The total number of frames was approximately 80000. The annotation of these frames has been done manually. We have synchronised the cameras using the timestamps embedded by the camera server of our IP cameras, and have used cross-validation by training on all scenarios except for one that was used for testing. For capturing the spatiotemporal variations we have set the parameters to ς = 10 and τ = 70. Furthermore, we have used as feature vector the Zernike moments up to sixth order (excluding four angles that were constant), together with the center of gravity and the area, thus obtaining a good scene reconstruction without too high a dimension (31). Zernike moments have been calculated in rectangular regions of interest of approximately 15000 pixels in each image, to limit processing and allow real-time feature extraction. The processing was performed at a rate of approximately 50-60 fps. We trained our models using the EM algorithm. We used the typical HMM model for the individual streams and three HMM fusion approaches, namely feature fusion, parallel and multistream HMMs. We used the Gaussian observation model and three-state HMMs with a single mixture component per state to model each of the seven tasks described, which is a good trade-off between performance and efficiency. For the mixture model representing the inter-stream interactions in the multistream HMM, we used mixture models of two component distributions. The obtained results are given in Fig. 3. It becomes obvious that the sequences of our features and the respective HMMs represent quite well the assembly
Fig. 2. Depiction of a workcell
Fig. 3. Success rates
process. Information fusion seems to provide significant added value when implemented in the form of the multistream fused HMM, and it yields similar accuracy when using parallel HMMs. However, the accuracy deteriorates when using simple feature-level fusion, reflecting the known restrictions of this approach.
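As a rough illustration of the classification scheme discussed above, the sketch below trains one Gaussian HMM per task and compares feature-level fusion (frame-wise concatenation of the two camera streams) with a simple parallel-style combination that sums per-stream log-likelihoods. It is only a minimal approximation of the setup described here (the multistream fused HMM with its interstream mixture model is not reproduced), and it assumes the third-party hmmlearn library:

```python
# Hedged sketch of per-task HMM classification (not the authors' exact code).
# Assumes hmmlearn; each sequence is an array of per-frame feature vectors
# (e.g. Zernike moments + centroid + area) from one camera.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_task_hmm(sequences, n_states=3):
    """Fit one Gaussian HMM (single component per state) on a list of sequences."""
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=25)
    model.fit(np.vstack(sequences), lengths=[len(s) for s in sequences])
    return model

def fuse_features(seq_cam1, seq_cam2):
    """Feature-level fusion: concatenate the two synchronized streams frame by frame."""
    return np.hstack([seq_cam1, seq_cam2])

def classify_parallel(models_cam1, models_cam2, test_cam1, test_cam2):
    """Parallel-style combination: sum per-stream log-likelihoods (streams assumed
    independent) and pick the task with the highest total score."""
    scores = {task: models_cam1[task].score(test_cam1) +
                    models_cam2[task].score(test_cam2)
              for task in models_cam1}
    return max(scores, key=scores.get)
```

At test time, classification amounts to picking the task whose model (or combination of per-stream models) assigns the highest log-likelihood to the observed sequence.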
6 Conclusions
In this work, we have presented a framework for the fusion of multiple streams and have applied it to the recognition of visual tasks in an industrial environment, using two cameras viewing the workcell from different angles to avoid occlusions. The problem of tracking was bypassed by applying an image-based approach. The proposed classification framework is appropriate for visual behavior recognition tasks and can readily be used to extend existing HMM-based behavior recognition systems into scalable multicamera systems. The main limitation of the holistic feature extraction is that no detailed representations are possible. Finally, the complementarity of multiple views makes it possible to enhance accuracy, provided that the appropriate fusion architecture is selected.
References 1. Ivanov, Y.A., Bobick, A.F.: Recognition of visual activities and interactions by stochastic parsing. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 852–872 (2000) 2. Dupont, S., Luettin, J.: Audio-visual speech modeling for continuous speech recognition. IEEE Trans. Multimedia 2, 141–151 (2000) 3. Stork, D.G., Hennecke, M.E.: Speech reading by humans and machines. NATO. ASI Series F, vol. 150. Springer, Heidelberg (1996) 4. Xiang, T., Gong, S.: Beyond tracking: Modelling activity and understanding behaviour. Int. J. Comput. Vision 67(1), 21–51 (2006) 5. Vogler, C., Metaxas, D.: Parallel HMMs for ASL recognition. In: ICCV 1999 (1999) 6. Zeng, Z., Tu, J., Pianfetti, B., Huang, T.: Audio-visual affective expression recognition through multistream fused hmm. IEEE Trans. Mult. 10(4), 570–577 (2008)
A Machine Learning-Based Evaluation Method for Machine Translation Katsunori Kotani1 and Takehiko Yoshimi2 1
Kansai Gaidai University, 16-1 Nakamiya, Hirakata, 573-1001 Osaka, Japan 2 Ryukoku University, 1-5, Yokotani, Seta Oe, Otsu, 520-2914 Shiga, Japan [email protected], [email protected]
Abstract. Constructing a classifier that distinguishes machine translations from human translations is a promising approach to automatic evaluation of machine-translated sentences. Using this approach, we constructed a classifier using Support Vector Machines based on word-alignment distributions between source sentences and human or machine translations. This paper investigates the validity of the classification-based method by comparing it with well-known evaluation methods. The experimental results show that our classification-based method can accurately evaluate fluency of machine translations. Keywords: automatic machine translation evaluation, machine learning, classification, word-alignment.
1 Introduction
Automatic evaluation of machine translations (MTs) assists in developing MT systems [1, 2, 3, 4, 5, 6, 7, 8]. One of the automatic MT evaluation methods, called here a classification-based method, uses machine learning algorithms to construct a classifier that judges an MT as "HT-like" (human-authored) or "MT-like" (machine-generated) translation. The performance of a classifier depends heavily on the classification features used for constructing the classifier. That is, the classification features should capture the distinctive translation properties that constitute the difference between MTs and human translations (HTs). This study regards the naturalness of literal translation as a classification feature. The naturalness of literal translation can be described in terms of word-alignment distribution, as we will see in Section 3. Our classification-based method is assessed by examining how well classification results correlate with manual evaluation results at the sentence level. A correlation coefficient is computed between the evaluation score of our classification-based method and the median evaluation score of three evaluators. This correlation coefficient is examined to determine whether it is higher than a lower bound, namely the highest correlation coefficient between well-known MT evaluation scores [6, 7, 8], and whether it is close to an upper bound, namely the lowest pair-wise correlation coefficient between the evaluators. Experimental results showed that our
classification-based method yielded a higher correlation coefficient than the lower bound, and this correlation coefficient is close to the upper bound.
2 Related Works Several studies have proposed evaluation methods that use machine learning algorithms to construct either a regression model or a classification model [1, 2, 3, 4, 5]. Constructing the former is more expensive than doing the latter. This section focuses on previous studies on the latter. A classification-based method regards MT evaluation as a classification problem. Corston-Oliver et al. [1] constructed three classifiers with decision trees [9] for Spanish-to-English MTs. These classifiers use classification features derived from perplexity properties and linguistic properties of MTs. Kulesza & Shieber [2] used Support Vector Machines (SVMs) [10] to construct a classifier for Chinese-to-English MTs. Classification features are (i) N-gram precision of MTs compared with HTs, (ii) sentence length of MTs and HTs, and (iii) word error rate of MTs. Kotani et al. [3, 4, 5] constructed a classifier for English-to-Japanese MTs using SVMs. Classification features are word-alignment properties between English sentences and their translations (MTs and HTs). This classifier examines the degree to which MTs contain unnatural literal translations based on aligned pairs and non-aligned words. Kotani et al. [3, 4, 5] also examined the classification-based evaluation method. Kotani et al. [3, 4] determined the validity of the classification-based evaluation method by examining solely the classification accuracy. Even though the classification accuracies of their classifiers exceeded 90%, we consider that the validity of classification-based evaluation method should also be determined based on the comparison with manual evaluation results. Kotani et al. [5] carried out a preliminary experiment of this study, and compared the automatic evaluation results with manual evaluation methods. The size of the test data was only 44 sentences. Their experimental result showed that the classification-based evaluation results were as high as the other automatic evaluation results, i.e., NIST.
3 Constructing a Regression Model-Like Classifier
Our classification-based method uses SVMs for training the classifier. Although an SVM classifier performs binary-class classification, it can output an evaluation score (continuous value) for a test example. As Kulesza & Shieber [2] noted, an evaluation score can be obtained by computing the distance between the separating hyperplane and a test example. Thus, an SVM classifier can apparently be regarded as a regression model. A regression model is more useful for MT evaluation assistance than a classifier. Literal translation often makes MTs unnatural. Therefore, a critical issue in MT development is how to deal with literal translation. If a classifier uses classification features that reveal unnatural literal translations, classification results will reveal problematic MTs. Reducing unnatural literal translations would improve MTs. Distinctions between English-to-Japanese MTs and HTs can be drawn by the naturalness of literal translation, as the following sentence (1) illustrates. Most of the
words in the source sentence (1a) are translated literally in the MT sentence (1b). The English noun phrases "today" and "the sun" are literally translated into Japanese noun phrases "kyoo" and "taiyoo," respectively. The English verb phrase "is shining" is also literally translated into a Japanese verb phrase "kagayai-teiru." By contrast, only the noun phrase "today" is literally translated into a Japanese noun phrase "kyoo" in the HT sentence (1c). (1)
a. Source Sentence: Today, the sun is shining. b. MT: Kyoo taiyoo-wa kagayai-teiru today the-sun-TOP shine-BE(ING) "Today the sun is shining." c. HT: Kyoo-wa seiten-da. today-TOP fine-BE "It's fine today." (BE(ING): gerundive verb form)
Unnatural literal translations can be identified by examining word-alignment distribution between source sentences and their translations. Aligned pairs exhibit the existence of parallel lexical properties between a source sentence and its translation. Literal translations maintain lexical properties such as part-of-speech, but non-literal translations lack parallel lexical properties. Hence, literally translated expressions are more easily alignable than non-literally translated expressions. The MT sentence (1b) and the HT sentence (1c) show different word-alignment distributions. This word-alignment property is empirically confirmed by the word-alignment distribution obtained by a word-aligner that we used in the experiment, as shown in Table 1. Here, "align(A, B)" indicates that an English word "A" and a Japanese word "B" compose an aligned pair, "non-align_eng(C)" means that an English word "C" remains unaligned, and "non-align_jpn(D)" means that a Japanese word "D" remains unaligned. The sentence (1a) is more aligned with the MT sentence (1b) than with the HT sentence (1c). In the MT sentence (1b), the non-aligned words are only functional words such as the topic marker "wa" and the determiner "the." By contrast, the word-alignment distribution between the source sentence (1a) and the HT sentence (1c) has non-alignment not only of the functional word "the" but also of content words. Supposing that more aligned pairs appear in MTs and more non-aligned words occur in HTs, we use word-alignment distribution as a classification feature. Word-alignment distributions between source sentences and their HTs and MTs are regarded as positive and negative examples, respectively.

Table 1. Word-alignment distribution for MT (1b) and HT (1c)

    Word-alignment between (1a) and MT (1b)    Word-alignment between (1a) and HT (1c)
  1 align(today, kyoo [today])                 align(today, kyoo-wa [today-TOP])
  2 align(is, teiru [BE-ING])                  align(is, da [BE])
  3 align(sun, taiyoo [sun])                   non-align_jpn(seiten [fine])
  4 align(shining, kagayai [shine])            non-align_eng(the)
  5 non-align_jpn(wa [TOP])                    non-align_eng(sun)
  6 non-align_eng(the)                         non-align_eng(shining)
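The following sketch illustrates, under stated assumptions, how such alignment statistics could be turned into features and scored with a linear SVM; it substitutes scikit-learn for the TinySVM tool used in the experiments, and the counts, labels and human scores are toy placeholders rather than data from the paper:

```python
# Hedged illustration (not the authors' code): sentence pairs reduced to
# alignment counts, a linear SVM trained on HT (1) vs. MT (0) examples, and the
# signed distance to the hyperplane used as a sentence-level evaluation score.
import numpy as np
from sklearn.svm import LinearSVC
from scipy.stats import spearmanr

def alignment_features(n_aligned, n_unaligned_src, n_unaligned_tgt, src_len, tgt_len):
    """Toy feature vector: raw counts plus length-normalised ratios."""
    return [n_aligned, n_unaligned_src, n_unaligned_tgt,
            n_aligned / max(src_len, 1), n_unaligned_tgt / max(tgt_len, 1)]

# Tiny synthetic training set mirroring the assumption in the text: MTs tend to
# have more aligned pairs, HTs more non-aligned words.
X = np.array([alignment_features(5, 1, 1, 6, 6),    # MT-like
              alignment_features(6, 0, 1, 6, 6),    # MT-like
              alignment_features(2, 3, 2, 6, 5),    # HT-like
              alignment_features(1, 4, 3, 6, 5)])   # HT-like
y = np.array([0, 0, 1, 1])
clf = LinearSVC(C=1.0).fit(X, y)

# Score three hypothetical MT outputs and compare with hypothetical human medians.
test = np.array([alignment_features(4, 1, 1, 6, 6),
                 alignment_features(3, 2, 1, 6, 6),
                 alignment_features(2, 3, 2, 6, 5)])
scores = clf.decision_function(test)
human_medians = [1, 2, 3]
print(scores, spearmanr(scores, human_medians).correlation)
```

The signed distance returned by decision_function plays the role of the continuous evaluation score discussed in Section 3.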
4 Experiments
Experiments were conducted to confirm the validity of our classification-based method by examining how closely its evaluation scores correlate with manual evaluation scores. Training and test data were taken from a parallel corpus of Reuters English news articles and their Japanese translations [12], because there is no publicly available test collection for English-to-Japanese MTs, as far as we know. The Japanese sentences in this parallel corpus were used as HTs. The English sentences in the corpus were translated into Japanese by three major MT systems commercially available in Japan [13, 14, 15], and three sets of MT data were obtained: MT-a, MT-b, and MT-c. Each data set consists of 25,800 sentences (12,900 HT and MT sentences, respectively). On the basis of the general circumstances of MT systems in Japan, these MT systems are likely based primarily on a rule-based translation method and partly use an example-based translation method. Word-alignment distribution was derived by using a statistical word-aligner trained by GIZA++ [11]. The GIZA++ word-aligner was trained using the 12,900 HT and MT sentences, with all parameters kept at their default values. Before operating GIZA++, the Japanese sentences were segmented into words by the morphological analysis system ChaSen [16]. SVM learning was carried out with TinySVM [17]. A first-order polynomial was used as the kernel function, and the other settings were kept at their defaults. The fluency of 500 MT-a sentences was assessed by three human evaluators (not including the authors) whose native language was Japanese and who had an average of 6 years of experience practicing MT development. Each sentence was evaluated and classified as one of four levels of fluency, level 1 being the least fluent and level 4 being the most fluent. An evaluation score for each MT sentence was determined based on the median value of the three evaluators’ results, as shown in Table 2.

Table 2. Fluency evaluation results of MT-a

Fluency   Number of translations   Relative frequency (%)
1         184                      36.8
2         208                      41.6
3          90                      18.0
4          18                       3.6
Our classification-based method was examined at a sentence level by comparison with reference translation-based methods such as BLEU [6], NIST [7], and METEOR [8]. Our classification-based method regards the distance between an MT example and the separating hyperplane as an evaluation score, as mentioned in Section 3. The reference translation-based methods provide evaluation scores based on the similarity of MTs and reference translations. The reference translations for these methods were taken from Japanese sentences in the parallel corpus used in our experiment. Thus, these reference translation-based methods used a single reference translation instead of multiple reference translations. Spearman rank correlation analysis was performed to compute correlation coefficients between the median of the three evaluators’ scores and the classification-based evaluation scores for the 500 sentences. The classification-based evaluation scores were obtained using three classifiers in terms of the three types of classification features: aligned pairs
(A), non-aligned words (NA), and both aligned pairs and non-aligned words (A&NA). Correlation coefficients were also computed between the median of evaluator scores and the reference translation-based evaluation scores. The highest coefficient serves as a lower bound (a baseline) for the assessment of our classification-based method. Inter-evaluator correlations were also computed among the three evaluators (E1, E2, E3). The lowest inter-evaluator correlation coefficient serves as an upper bound for the assessment.

Fig. 1. Correlation of evaluation results with manual evaluation results
[Bar chart comparing BLEU, NIST, METEOR, the A, NA and A&NA classifiers, and the inter-evaluator pairs; vertical axis: correlation coefficient, 0 to 0.6]
Figure 1 shows that both our classification-based method and the reference translation-based methods fail to reach the performance level of manual evaluation. The classification-based method, however, outperforms the reference translation-based methods and makes up approximately half of the gap between the lower bound (METEOR) and the upper bound (E2-3). Correlation coefficients of our classification-based method were statistically compared with the upper bound by a one-tailed test for the significance of the difference between two correlation coefficients using the Fisher Z-transformation. Statistically significant differences were not observed for the correlation coefficients of the A-based classifier and the A&NA-based classifier at the 5% significance level. Hence, the classification results of these classifiers are regarded as similar to the manual evaluation results (E2-3). The classification results of the A-based classifier and the A&NA-based classifier showed a statistically significant difference (p<0.05) in correlation coefficients with the lower bound. Therefore, these classifiers yielded higher correlation coefficients than the reference translation-based method (METEOR). These results lead to the conclusion that the A-based classifier and the A&NA-based classifier will facilitate automatic MT evaluation by examining the naturalness of literal translation.
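For readers who want to reproduce this kind of comparison, the sketch below implements the one-tailed Fisher Z-test for the difference between two correlation coefficients, treating the two samples as independent as the description above does; the correlation values and sample sizes are placeholders, not the figures obtained in the experiment:

```python
# Illustrative sketch of a one-tailed Fisher Z-test comparing two (independent)
# correlation coefficients; the r values below are placeholders.
import math
from scipy.stats import norm

def fisher_z_test(r1, n1, r2, n2):
    """z statistic and one-tailed p-value for H1: rho1 > rho2."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    return z, 1.0 - norm.cdf(z)

# e.g. classifier correlation vs. a reference-based metric, both on 500 sentences
z_stat, p_value = fisher_z_test(0.48, 500, 0.38, 500)
print(f"z = {z_stat:.2f}, one-tailed p = {p_value:.4f}")
```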
5 Conclusion
This study assessed the validity of our classification-based method by comparing it with the reference translation-based methods. The experiments showed that the classification-based evaluation results had higher correlation with manual evaluation than the reference
translation-based evaluation results. Given these findings, we conclude that our method is an inexpensive and effective automatic MT evaluation method. This work leaves several problems unsolved. First, the validity of a classification-based method should be assessed based on a qualitative examination of its classification results. Second, word-alignment features should be examined regarding whether they can be used for training statistical MT systems. Last, whether the validity of our word-alignment-based classification can be maintained for other language pairs must be examined in future research.
References 1. Corston-Oliver, S., Gamon, M., Brockett, C.: A Machine Learning Approach to the Automatic Evaluation of Machine Translation. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France, pp. 148–155 (2001) 2. Kulesza, A., Shieber, S.M.: A Learning Approach to Improving Sentence-level MT Evaluation. In: Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation, Baltimore, Maryland, pp. 75–84 (2004) 3. Kotani, K., Yoshimi, T., Kutsumi, T., Sata, I., Isahara, H.: A Classification Approach to Automatic Evaluation of Machine Translation Based on Word Alignment. Language Forum 34, 153–168 (2008) 4. Kotani, K., Yoshimi, T., Kutsumi, T., Sata, I.: Validity of an Automatic Evaluation of Machine Translation Using a Word-Alignment-Based Classifier. In: Li, W., Mollá-Aliod, D. (eds.) ICCPOL 2009. LNCS(LNAI), vol. 5459, pp. 91–102. Springer, Heidelberg (2009) 5. Kotani, K., Yoshimi, T., Kutsumi, T., Sata, I., Isahara, H.: A Method of Automatically Evaluating Machine Translations Using a Word-alignment-based Classifier. In: Proceedings of the Workshop “Mixing Approaches to Machine Translation” (MATMT), pp. 11–18 (2008) 6. Papineni, K.A., Roukos, S., Ward, T., Zhu, W.-J.: Bleu: A Method for Automatic Evaluation of Machine Translation. Technical Report RC22176. IBM Research Division, Thomas J. Watson Research Center (2001) 7. Doddington, G.: Automatic Evaluation of Machine Translation Quality Using N-gram Cooccurrence Statistics. In: Proceedings of the 2nd Human Language Technology Conference, San Diego, California, pp. 128–132 (2002) 8. Banerjee, S., Alon, L.: METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, Michigan, pp. 65–72 (2005) 9. Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1992) 10. Vapnik, V.: Statistical Learning Theory. Wiley Interscience, New York (1998) 11. Och, F.J., Ney, H.: A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics 29(1), 19–51 (2003) 12. Utiyama, M., Isahara, H.: Reliable Measures for Aligning Japanese-English News Articles and Sentences. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, pp. 72–79 (2003) 13. Sharp, Hon’yaku Kore Ippom (2003) 14. Fujitsu, Atlas Personal Translation (2005) 15. LogoVista, LogoVista X PRO ver.3.0 (2004) 16. Chasen, http://chasen-legacy.sourceforge.jp/ (in Japanese) 17. TinySVM, http://chasen.org/~taku/software/TinySVM/
Feature Selection for Improved Phone Duration Modeling of Greek Emotional Speech Alexandros Lazaridis, Todor Ganchev, Iosif Mporas, Theodoros Kostoulas, and Nikos Fakotakis Artificial Intelligence Group, Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rion-Patras 26500, Greece Tel.: +30 2610 996496; Fax: +30 2610 997336 {alaza,imporas,fakotaki}@upatras.gr, [email protected], [email protected]
Abstract. In the present work we address the problem of phone duration modeling for the needs of emotional speech synthesis. Specifically, relying on ten well known machine learning techniques, we investigate the practical usefulness of two feature selection techniques, namely the Relief and the Correlationbased Feature Selection (CFS) algorithms, for improving the accuracy of phone duration modeling. The feature selection is performed over a large set of phonetic, morphologic and syntactic features. In the experiments, we employed phone duration models, based on decision trees, linear regression, lazy-learning algorithms and meta-learning algorithms, trained on a Modern Greek speech database of emotional speech, which consists of five categories of emotional speech: anger, fear, joy, neutral, sadness. The experimental results demonstrated that feature selection significantly improves the accuracy of phone duration modeling regardless of the type of machine learning algorithm used for phone duration modeling. Keywords: Phone duration modeling, feature selection, emotional speech.
1 Introduction
Prosody refers to the aspects of human speech communication that introduce functions which may not be encoded by grammar, such as emphasis, intent, attitude or emotional state. In speech, prosody is expressed by features such as duration, fundamental frequency and energy [1]. In the present work, we focus on phone duration modeling. In this context, the construction of proper phone duration models is crucial for generating synthetic speech that sounds more natural. The phone duration modeling approaches are divided in two major categories: the rule-based [2] and the data-driven methods [3-8]. Synthesis of emotional speech depends on knowledge extracted from analysis of natural speech, and particularly on knowledge about the influence of various prosodic features on the emotional speech perception. A lot of research has been done in this field [9,10]; nonetheless, we deem that a further study on the phone duration
modeling in the context of emotional speech, along with the analysis of other prosodic features, would contribute both to enhancing the quality of synthesized emotional speech and to achieving a more expressive synthetic speech. In concept-learning problems, in order to represent some data, often many features are used while only a few of them are really related to the target concept. In order to overcome this problem, feature selection algorithms are employed for limiting the computational cost and improving the concept quality and accuracy of the representation [11]. In the present work, we investigate the practical usefulness of two feature selection techniques for improving the accuracy of phone duration modeling in the context of emotional speech. Ten well-known machine learning algorithms are employed, which are members of four distinct categories, namely: linear regression (LR) [12], decision trees (DT) [13-15], lazy-learning algorithms [16,17] and meta-learning algorithms [18,19], serving here as phone duration models (PDMs). Furthermore, seeking for the improvement of the performance of these PDMs, we employed the Relief [11] and the Correlation-based Feature Selection (CFS) [20] feature selection methods.
2 Feature Selection and Phone Duration Modeling Algorithms
In the present work we study the applicability and effectiveness of two dissimilar feature selection algorithms, namely the Relief [11] and the Correlation-based Feature Selection (CFS) [20]. Kira and Rendell described a statistical feature selection algorithm called Relief that uses instance-based learning to assign a relevance weight to each feature in order to achieve maximal margin [11,12]. The advantage of margin-based feature selection methods is that the consideration of the global information leads to optimal weights of features, making these methods efficient even for data with dependent features. The Relief algorithm is used in combination with a ranking algorithm, in our case the Ranker [12]. The Ranker is a search method for ranking individual features according to their individual evaluations. The Correlation-based Feature Selection (CFS) technique [20], used in conjunction with a genetic algorithm (GA) [21], scores and ranks subsets of features rather than individual features. It uses the criterion that a good feature subset for a classifier contains features that are highly correlated with the class variable but poorly correlated with each other [20]. In order to identify the applicability of these feature selection techniques, we utilized various machine learning algorithms as phone duration models. Four categories of machine learning algorithms for building the PDMs are used in our experiments: decision trees, linear regression, lazy-learning and meta-learning algorithms. These are: three decision trees, namely the m5p model tree [13] and two regression trees, the m5pR [14] and the Reduced Error Pruning trees (REPTrees) [15]; two lazy-learning algorithms, the IBK [16] and the Locally Weighted Learning algorithm (LWL) [17]; two meta-learning algorithms, the additive regression (AR) [18] and the bagging algorithm (BG) [19], utilizing two different regression trees (m5pR and REPTrees); and finally the linear regression (LR) [12] algorithm. The implementation of all the machine learning algorithms was done through the WEKA statistical analysis program [12].
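As a brief illustration of the Relief idea described above, the following minimal sketch implements the classification form of the weight update (nearest hit versus nearest miss); it is written from the standard description of the algorithm rather than from the WEKA implementation used here, and for a continuous duration target the regression variant (RReliefF) would be the appropriate choice:

```python
# Minimal Relief sketch (standard classification form, not the WEKA setup):
# each feature's weight is decreased by its distance to the nearest "hit"
# (same class) and increased by its distance to the nearest "miss".
# Features are assumed to be scaled to comparable ranges.
import numpy as np

def relief(X, y, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        i = rng.integers(n)
        same, diff = (y == y[i]), (y != y[i])
        same[i] = False                      # exclude the sampled instance itself
        dist = np.abs(X - X[i]).sum(axis=1)  # L1 distance to every instance
        hit = X[same][np.argmin(dist[same])]
        miss = X[diff][np.argmin(dist[diff])]
        w += np.abs(X[i] - miss) - np.abs(X[i] - hit)
    return w / n_iters

# Example: rank 93 synthetic features and keep the highest-weighted ones.
X = np.random.default_rng(1).random((200, 93))
y = (X[:, 0] + X[:, 5] > 1.0).astype(int)
top10 = np.argsort(relief(X, y))[::-1][:10]
print(top10)
```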
3 Experimental Setup
All experiments are based on a Modern Greek emotional speech database which was purposely designed in support of research on speech synthesis. The database contains emotional speech from the categories: anger, fear, joy, sadness and neutral speech. The contents of the database were extracted from passages and newspapers or were set up by a professional linguist. The database which was utilized for the experiments consisted of 62 utterances, which are pronounced several times with different emotional charge. Moreover, all the utterances were uttered separately in the five emotional styles. The entire database consisted of 4150 words. We used a phone inventory of 34 phones. Furthermore, each vowel class included both stressed and unstressed cases of the corresponding vowel. All utterances were uttered by a professional female actress, speaking Modern Greek. In the present study, we consider a number of features which have been reported successful on the task of phone duration modeling [3,5]. From each utterance of the speech database, we computed 33 features along with the contextual information concerning some of these features. These are: eight phonetic features, three segment-level features, thirteen syllable-level features, two word-level features, one phrase-level feature and six accentual features. The overall size of the feature vector, including the contextual information about the aforementioned features, is 93. The performance of the phone duration prediction models was measured in terms of root mean square error (RMSE) and the correlation coefficient (CC). The RMSE is frequently used as a global measure sensitive to gross errors. The CC measures the statistical correlation between the actual and the predicted values of the segmental duration.
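A compact sketch of this evaluation protocol is given below; it uses synthetic data and a scikit-learn regression tree as a stand-in for the WEKA models, so the numbers it prints are illustrative only:

```python
# Sketch of the 10-fold cross-validation with RMSE and CC (assumed set-up, with
# a scikit-learn regression tree standing in for the WEKA models and synthetic
# data as a placeholder for the 93-dimensional feature vectors).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 93))                            # 93 features, as in the paper
y = 70 + 25 * X[:, 0] + rng.normal(scale=15, size=1000)    # phone durations in ms

rmse_folds, cc_folds = [], []
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    pred = DecisionTreeRegressor(max_depth=5).fit(X[train], y[train]).predict(X[test])
    rmse_folds.append(np.sqrt(np.mean((pred - y[test]) ** 2)))
    cc_folds.append(np.corrcoef(pred, y[test])[0, 1])

print(f"RMSE = {np.mean(rmse_folds):.1f} ms, CC = {np.mean(cc_folds):.2f}")
```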
4 Experimental Results
In order to better utilize the available data, in all the experiments we followed an experimental protocol based on 10-fold cross-validation. In the tables we report results on the phone duration modeling task for ten different machine learning techniques. Furthermore, we present results for the cases without feature selection, denoted in the tables as NoFS, and for the Relief and CFS feature selection algorithms, in terms of RMSE (Table 1) and CC (Table 2). The values of the RMSE are in milliseconds, and those with the lowest RMSE and the highest CC are shown in italics. Bold font indicates the cases where the results with feature selection (Relief or CFS) outperform those with the full set of features, i.e. without feature selection (NoFS). Finally, the best result for each category of emotional speech is indicated in the cell with grey background.

4.1 Phone Duration Modeling with No Feature Selection
As shown in Table 1, for the NoFS case all the evaluated PDMs demonstrated reasonable performance, yielding RMSE values between 19.0 and 30.3 milliseconds. These values are bearable for the needs of emotional speech synthesis, but a more accurate phone duration modeling (prediction error <20ms) would result in a better quality of the synthetic speech [22]. Regarding the CC, the PDMs achieved values in the range 0.54 to 0.83, which is a decent outcome for a model. The highest accuracy for phone
Table 1. RMSE values in milliseconds for the different categories of emotional speech. Each phone duration model was evaluated for the full feature set, i.e. without feature selection (NoFS), and with feature selection based either on the Relief or the CFS methods.
            Anger               Fear                Joy                 Neutral             Sadness
            NoFS Relief CFS     NoFS Relief CFS     NoFS Relief CFS     NoFS Relief CFS     NoFS Relief CFS
AR.M5pR     22.1  22.4  22.7    20.1  20.0  22.4    19.0  19.3  22.2    26.3  26.1  28.1    20.6  20.8  22.0
AR.R.Tr.    23.8  23.6  23.2    21.3  21.1  22.9    20.8  20.9  23.4    26.7  26.4  29.2    22.1  21.9  22.3
BG.M5pR     23.3  23.3  23.6    20.9  20.9  22.6    20.4  20.5  22.8    26.7  26.7  28.0    21.4  21.7  22.5
BG.R.Tr.    28.2  28.0  23.4    22.5  22.5  23.5    22.8  22.9  25.5    27.6  27.1  29.3    24.3  24.2  22.7
IB12        24.7  24.3  23.9    21.8  21.6  21.5    22.2  21.4  21.0    27.5  27.3  26.9    20.6  22.5  23.4
LWL         28.6  28.5  25.0    24.4  24.1  25.1    23.4  23.1  26.1    28.9  28.3  30.9    25.7  25.4  24.2
LR          22.8  23.3  23.8    22.0  19.9  22.1    19.8  20.2  22.7    26.4  26.8  28.6    20.8  21.2  22.4
M5p         21.7  22.2  21.3    20.2  19.8  22.0    19.5  19.5  21.7    26.2  26.0  27.7    20.9  20.9  20.7
M5pR        24.1  23.5  24.4    21.6  21.7  23.2    21.6  21.4  23.4    27.2  27.0  28.7    22.1  22.3  23.2
R.Tr.       30.3  30.2  25.3    24.3  24.4  23.5    24.5  24.2  27.4    29.4  29.2  31.8    26.6  26.6  24.3
Table 2. CC values for the different categories of emotional speech. Each phone duration model was evaluated for the full feature set, i.e. without feature selection (NoFS), and with feature selection based either on the Relief or the CFS methods.

            Anger               Fear                Joy                 Neutral             Sadness
            NoFS Relief CFS     NoFS Relief CFS     NoFS Relief CFS     NoFS Relief CFS     NoFS Relief CFS
AR.M5pR     0.83  0.82  0.82    0.72  0.72  0.63    0.78  0.77  0.69    0.66  0.68  0.61    0.75  0.75  0.71
AR.R.Tr.    0.79  0.79  0.81    0.67  0.68  0.60    0.73  0.73  0.64    0.65  0.66  0.56    0.70  0.71  0.70
BG.M5pR     0.81  0.81  0.80    0.70  0.70  0.63    0.75  0.74  0.66    0.66  0.66  0.61    0.73  0.72  0.69
BG.R.Tr.    0.70  0.70  0.80    0.62  0.62  0.58    0.66  0.66  0.54    0.62  0.63  0.55    0.63  0.63  0.68
IB12        0.78  0.79  0.80    0.66  0.67  0.68    0.69  0.72  0.73    0.63  0.64  0.66    0.75  0.70  0.66
LWL         0.70  0.70  0.77    0.55  0.57  0.52    0.65  0.68  0.54    0.59  0.61  0.51    0.59  0.61  0.64
LR          0.81  0.80  0.79    0.66  0.72  0.64    0.76  0.75  0.66    0.66  0.65  0.58    0.74  0.73  0.69
M5p         0.83  0.82  0.84    0.72  0.73  0.65    0.77  0.77  0.70    0.67  0.69  0.62    0.74  0.74  0.76
M5pR        0.79  0.81  0.78    0.66  0.66  0.59    0.70  0.71  0.64    0.63  0.64  0.58    0.70  0.70  0.67
R.Tr.       0.65  0.65  0.76    0.55  0.55  0.66    0.60  0.62  0.46    0.57  0.57  0.46    0.54  0.54  0.63
duration modeling in all categories of emotional speech was observed for the M5p trees and the meta-learning algorithms using M5pR regression trees as base classifiers (AR.M5pR and BG.M5pR). It was also observed that the LR model although showing higher error rates, has performance, which approaches that of the M5pR regression trees. Moreover, it is interesting to point out that between the two local learning methods that were employed, IB12 rather than LWL, performed better in all categories of emotional speech. On the contrary, REPTrees (R.Tr.) appear to demonstrate the lowest accuracy among all evaluated methods, both as a single model, and as a base classifier for the cases of AR and BG algorithms (AR.R.Tr., BG.R.Tr.). The M5p trees achieved the best performance due to the fact that they adopt a greedy algorithm which constructs a model tree with a non-fixed structure by using a certain stopping criterion, minimizing the error at each interior node, one node at a time, recursively until the best accuracy is achieved. In this way, the computational cost increases, but very robust models are constructed. Moreover, the meta-learning algorithms, processing meta-data and taking advantage of the information that is produced by other
methods (in our experiments the M5pR model) were expected to perform well. However, as we can notice, Additive Regression and Bagging performed better when combined with a robust prediction method such as M5pR, while they did not perform that well when the REPTrees (R.Tr.) were used as base classifier. Regarding the other methods, we noticed that local learning methods or methods that apply a stricter pruning strategy might perform faster, but they do not yield the highest accuracy.

4.2 Phone Duration Modeling Using Feature Selection
It is also interesting to compare the results obtained with the use of the two feature selection algorithms (Relief and CFS) to the ones using the full feature set, i.e. with no feature selection (NoFS). Apart from the category Joy, where the best performance was achieved by the PDM with NoFS (AR.M5pR), in all the other categories (Anger, Fear, Neutral and Sadness) one of the two feature selection algorithms led to the lowest RMSE. As shown in Tables 1 and 2, the PDMs employing the Relief algorithm presented the best results in the categories Fear and Neutral, with a relative reduction of RMSE by 2% in the Fear category and by 0.8% in the Neutral category, and an improvement of the CC by 1.4% and 3%, respectively. The PDMs relying on CFS had the best performance in the categories Anger and Sadness, with a relative reduction of RMSE by 1.8% in the Anger category and by 1% in the Sadness category, and an improvement of the CC by 1.2% and 2.7%, respectively. These results support the proposed additional effort of performing feature selection, since it contributes to the improvement of the accuracy of phone duration modeling. This can be seen in the tables, where the PDMs using feature selection outperformed the respective PDMs with NoFS even if the particular phone duration model did not have the best accuracy among the other PDMs in the particular category of emotional speech.
5 Conclusions
In this work, we study the importance of feature selection for improving the accuracy of phone duration modeling. We experimented with a Modern Greek database of emotional speech, employing duration models based on decision trees, linear regression, lazy-learning and meta-learning algorithms. The experimental results demonstrated that in four out of the five categories of emotional speech, the phone duration models using a feature selection algorithm outperformed the best phone duration model with no feature selection regardless of the machine learning algorithm they utilized. Therefore the results well support the claim that the feature selection contributes to the improvement of the overall accuracy of phone duration modeling.
References 1. Dutoit, T.: An Introduction to Text-To-Speech Synthesis. Kluwer Academic Publishers, Dodrecht (1997) 2. Klatt, D.H.: Synthesis by rule of segmental durations in English sentences. In: Lindlom, B., Ohman, S. (eds.) Frontiers of Speech Communication Research, pp. 287–300. Academic Press, New York (1979)
3. Möbius, B., Santen, P.H.J.: Modeling Segmental duration in German Text-to-Speech Synthesis. In: 4th International Conference on Spoken Language Processing (ICSLP), pp. 2395–2398 (1996) 4. Takeda, K., Sagisaka, Y., Kuwabara, H.: On sentence-level factors governing segmental duration in Japanese. Journal of Acoustic Society of America 6(86), 2081–2087 (1989) 5. Santen, J.P.H.: Contextual effects on vowel durations. Speech Communication 11, 513– 546 (1992) 6. Campbell, W.N.: Syllable based segment duration. In: Bailly, G., Benoit, C., Sawallis, T.R. (eds.) Talking Machines: Theories, Models and Designs, pp. 211–224. Elsevier, Amsterdam (1992) 7. Goubanova, O., King, S.: Bayesian network for phone duration prediction. Speech Communication 50, 301–311 (2008) 8. Lazaridis, A., Zervas, P., Kokkinakis, G.: Segmental Duration Modeling for Greek Speech Synthesis. In: 19th IEEE International Conference of Tools with Artificial Intelligence (ICTAI), pp. 518–521 (2007) 9. Jiang, D.N., Zhang, W., Shen, L., Cai, L.H.: Prosody Analysis and Modeling for Emotional Speech Synthesis. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP 2005), pp. 281–284 (2005) 10. Inanoglu, Z., Young, S.: Data-driven emotion conversion in spoken English. Speech Communication 51, 268–283 (2009) 11. Kira, K., Rendell, L.A.: A practical approach to feature selection. In: 9th International Conference on Machine Learning (ICML), pp. 249–256 (1992) 12. Witten, H.I., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann Publishing, San Francisco (2005) 13. Wang, Y., Witten, I.H.: Induction of model trees for predicting continuous classes. In: 9th European Conf. on Machine Learning, University of Economics, Faculty of Informatics and Statistics, pp. 128–137 (1997) 14. Quinlan, R.J.: Learning with continuous classes. In: 5th Australian Joint Conference on Artificial Intelligence, pp. 343–348 (1992) 15. Kääriäinen, M., Malinen, T.: Selective Rademacher Penalization and Reduced Error Pruning of Decision Trees. Journal of Machine Learning Research 5, 1107–1126 (2004) 16. Aha, D., Kibler, D., Albert, M.: Instance-based learning algorithms. Journal of Machine Learning 6, 37–66 (1991) 17. Atkeson, C.G., Moorey, A.W., Schaal, S.: Locally Weighted Learning. Artificial Intelligence Review 11, 11–73 (1996) 18. Friedman, J.H.: Stochastic gradient boosting. Comput. Statist. Data Anal. 4(38), 367–378 (2002) 19. Breiman, L.: Bagging Predictors. Journal of Machine Learning 2(24), 123–140 (1996) 20. Hall, M.A.: Correlation-based Feature Subset Selection for Machine Learning. PhD thesis, Department of Computer Science, University of Waikato, Waikato, New Zealand (1999) 21. Goldberg, D.E.: Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading (1989) 22. Wang, L., Zhao, Y., Chu, M., Zhou, J., Cao, Z.: Refining segmental boundaries for TTS database using fine contextual-dependent boundary models. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP 2004), pp. 641–644 (2004)
A Stochastic Greek-to-Greeklish Transcriber Modeled by Real User Data Dimitrios P. Lyras, Ilias Kotinas, Kyriakos Sgarbas, and Nikos Fakotakis Artificial Intelligence Group, Wire Communications Lab, Electrical and Computer Engineering Department, University of Patras, Rion, Patras, GR-26500, Greece {dimlyras,ikotinas,sgarbas,fakotaki}@upatras.gr
Abstract. Greek to Greeklish transcription does not appear to be a difficult task since it can be achieved by directly mapping each Greek character to a corresponding symbol of the Latin alphabet. Nevertheless, such transliteration systems do not simulate efficiently the human way of Greeklish writing, since Greeklish users do not follow a standardized way of transliteration. In this paper a stochastic Greek to Greeklish transcriber modeled by real user data is presented. The proposed transcriber employs knowledge derived from the analytical processing of 9,288 Greek-Greeklish word pairs annotated by real users and achieves the automatic transcription of any Greek word into a valid Greeklish form in a stochastic way (i.e. each Greek symbolset corresponds to a variety of Latin symbols according to the processed data), simulating thus human-like behavior. This transcriber could be used as a real-time Greek-to-Greeklish transcriber and/or as a data generator engine used for the performance evaluation of Greeklish-to-Greek transliteration systems. Keywords: Greek, Greeklish, Transcriber, Transliteration.
1 Introduction
Greeklish is a combination of the words Greek and English (also known as Grenglish, Fragolevantinika, Latinoellinika or ASCII Greek), representing a writing style characterized by the use of the Latin alphabet for the writing of Greek texts. This practice, known as Digraphia [1, 2], appears in two forms: synchronic, meaning that two writing systems coexist for the same language, and diachronic, meaning that the writing system has changed over time and has finally been replaced by a new one. Forms of digraphia appear in many languages that do not use the Latin script (e.g. Gulf Arabic, Chinese, Japanese etc.). Serbian is maybe the most prominent example of synchronic digraphia, using simultaneously the Serbian Cyrillic alphabet and an adapted Latin-alphabeted one. Singlish, a word similar to Greeklish, refers to an English-based creole used in Singapore which employs transliteration practices, as well as vocabulary modifications and additions from the English language. Another form of digraphia, mostly based on the phonetic transcription of non-Latin characters, is usually employed in the Arabic writing system, since there is no optical similarity between the Arabic and the English alphabets. Nevertheless, some symbols (e.g. vowels) are transcribed using numbers resembling the optical form of the original letters,
because of the fact that there does not exist any corresponding English character with satisfactory phonetic similarity. Finally, there also exist cases where an initial local script has been completely replaced by a Latin-alphabeted way of writing, such as in the cases of Romanian (which originally used Cyrillic and then changed to Latin); Turkish (Arabic then Latin), and many languages of the former Soviet Central Asia, which abandoned the Cyrillic script after the dissolution of the USSR.
2 Digraphia in Greek A lot of research has been carried out recently with regards to the usage and the acceptance of Greeklish writing. In particular, the extensive research of Jannis Andritsopoulos, has come up with some interesting and descriptive statistical results: 60% of the users have been reported to use Greeklish in over 75% of the contexts they submit; 82% of the users accept Greeklish as an electronic communication tool, 53% consider this writing style unaesthetic, 24% regard it as a “threat” or vandalism of the Greek Language and 46% have reported to face difficulties in the reading of such texts [3]. Regarding the difficulties in the reading of Greeklish texts, other research efforts investigate the response time for the comprehension of words and sentences written in the Greek and Romanized Greek writing systems. The results report that the response time is lower when the text is written in Greek (657ms mean) than when it is written using characters of the Latin alphabet (886ms mean) [4]. Although an official standard, named ELOT 743:1982 [5], has been developed, the Greek to Greeklish conversion seems to be performed in an empirical way since the Greeklish writing style is mainly characterized by subjective transcription preferences and intense inconsistency. For example, according to [6], the word “διεύθυνση” (address), has been transcribed in 23 different ways by approximately 70 users. Moreover, Greeklish writing suffers from the phenomenon of “iotacism” which is characterized by the use of the Latin character “i” for the transliteration of the Greek symbolsets “ι”, “η”, “υ”, “ει”, and “οι” since they are all pronounced as “I” according to the SAMPA phonetic alphabet [7]. Nevertheless, all the existing transliteration norms may be practically grouped in the following four distinct categories i.e., phonetic transcription, optical transliteration, keyboard-mapping conversion and mixed style greeklish writing [8]. All currently known implementations addressing the problem of Greek-toGreeklish conversion, (e.g. ASDA[9], Repchars[10], Translatum converter [11], Greek to Greeklish Converting Utility[12] and el2gr[13]) are based on static or dictionary-defined mapping rules for the Greek to Latin conversion task. Nevertheless, such deterministic transliteration systems are unable to imitate efficiently the humanlike way of Greeklish writing since they cannot simulate the unpredictability and lack of a standard transliteration style. Moreover, although a lot of work has been done in the field of Greeklish-to-Greek conversion (e.g. E-Chaos[14], Innoetics Greek to Greeklish[15] All Greek to me![16], deGreeklish [17]), there still exists the need for large corpora of annotated Greek text and its corresponding Greeklish transliteration (including all the aforementioned transliteration norms) that could be used either as training databases for such systems, and/or as test data in order to measure the
efficiency and the performance of these tools. Towards this direction, in the present paper we propose a stochastic Greek-to-Greeklish transcriber modeled by real user data that can generate multiple transcriptions of any Greek input text by combining all types of Greeklish writing and is therefore able to simulate human-like behavior. The proposed system may appear to be useful to users who communicate with people that work in restricted computer environments with deficient language support and/or could be employed as a data generator engine used for the performance evaluation of Greeklish-to-Greek transliteration systems.
3 Suggested Approach The proposed transcriber was developed in two steps: At first, Greek to Greeklish transcription data, collected and annotated by real users were analytically processed in order to reveal patterns and transliteration pairs that are frequently used by Greeklish users; afterwards, the extracted knowledge derived from the previous step was modeled and integrated into a real-time Greek-to-Greeklish transliteration system that could simulate the human Greeklish type of writing in an efficient way. 3.1 Data Processing and Analysis In this part of the research, a systematic examination of the spelling variation in Latinalphabeted Greek was performed in order to: i) retrieve all the frequently used GreekGreeklish transcription symbolsets (i.e. Greek symbolsets and their corresponding Greeklish transcriptions) and ii) rank the Greek-Greeklish transcription symbolsets according to their frequency of appearance as observed in the collected data. Towards this direction, a freely available database [18] comprised of 9,288 GreekGreeklish word pairs was systematically processed and analyzed. In order for this database to be created, at first 15,929 Greeklish words written using of variety of transliteration norms were collected. Then each one of the 418 registered users was asked to transcribe several Greeklish words (randomly selected) into their equivalent Greek orthographic form, creating thus Greeklish-Greek transcription pairs. Such a transcription pair was considered correct and valid only if at least 3 users had agreed on the same Greek equivalent of the Greeklish input word, resulting thus into the 9,288 Greek-Greeklish word pairs that were used in the present study. In order to automatically retrieve all the frequently used transliteration norms of the Greek symbolsets, a trial-and-test procedure was employed. More specifically, at first, an initial Greek-Greeklish transliteration mapping was assumed in which each Greek symbolset was uniquely transcribed into a Latin-alphabeted equivalent (e.g “ξ”→“ks”, “θ”→”8” etc). According to this mapping, each Greek word contained in the employed database was automatically transcribed into its Greeklish equivalent form and the output was then compared to the one contained in the database. In case they did not match, a supervised examination of the transliteration errors was performed and the dataset containing the Greek – Greeklish transliteration pairs was further enriched including the missing transliteration symbolsets. This procedure was iteratively performed until
all the 9,288 Greek words contained in the employed database were transcribed successfully into their equivalent Greeklish form, resulting thus into finally considering 46 different Greek symbolsets (i.e. characters or pairs of characters) each one being mapped to several Latin-alphabeted symbolsets (e.g. the Greek symbolsets “μπ” could be transliterated as “mp”, “b” or “mb”). After this procedure was completed, a ranking of these Greek-Greeklish transcription symbolsets according to their frequency of appearance as observed in the collected data was performed. Via the aforementioned procedure, the conclusion that there does not exist a widely accepted transliteration norm for the Greeklish type of writing was verified. Out of the 46 considered Greek symbolsets, only 10 are uniquely matched to a Latin equivalent, whereas the remaining 36 may correspond to multiple symbolsets. Another important point worthy to be mentioned is the discovery of the some “unclassifiable” (meaning that they do not clearly belong to any of the 4 transliteration categories listed at Section 2) Greek-Greeklish transliteration pairs (e.g “κ”→“c”), which constitute actual writing patterns in daily Greeklish communication and were consequently taken under consideration in the present study. 3.2 Stochastic Greek–to-Greeklish Transcription The second step of the stochastic Greek-to-Greeklish transcriber involved the integration of the extracted knowledge derived from the Data Processing and Analysis step to an automatic transliteration engine that could be able to produce multiple transcriptions for each input Greek word. Towards this direction, an initial version of the stochastic Greek-to-Greeklish transcriber was developed. In this version, at first all the stresses and the punctuation marks were removed from the input text since they are unimportant with regards to the Greek-Greeklish transliteration task. Afterwards, the input text was tokenized and each derived token underwent a transliteration process which exploits the Greek-Greeklish transliteration pairs and their corresponding frequencies of appearance. During the transliteration process, each token was initially split into its corresponding symbolsets. In cases where there existed more than one ways by which a token may be split, then the one resulting into the least number of symbolsets was preferred. For example, the Greek word “είμαι” (I am) may be split in 4 different ways: {“ει”, “μ”, “αι}, {“ε”, “ι”, “μ”, “αι”}, {“ει”, “ι”, “μ”, “α”, “ι”}, and {“ε”, “ι”, “μ”, “α”, “ι”}. In this case, the first way was preferred as it results into less symbolsets than the rest. Afterwards, for each derived symbolset, the corresponding transliteration model was loaded and the greeklish equivalent for this symbolset was decided over a randomly generated float number ranging from 0.0 to 100.0. At this point of the development phase, a set of preliminary experimental evaluations were performed in order to examine the performance of the proposed transcriber with regards to the Greek-to-Greeklish conversion task. According to these evaluations, the 1,000 most frequent Greek words were provided as input to the initial version of the stochastic transcriber, and all the different greeklish transcriptions for each one of these words were stored in an output file which was later examined by 12 Greeklish users. Each one of these users had to spot all the greeklish transcriptions that he/she either would never use, or has never encountered again in his/her life. The
feedback we received led us to the conclusion that some of the assumed transliteration pairs may appear less often than it was initially presumed and that some of these pairs either could not co-exist or they might appear only under specific circumstances. Therefore, the database used at the Data Processing and Analysis step was reexamined in order to retrieve all the Greek-Greeklish pairs where the minority cases spotted by our annotators appeared in the actual data. The results of this analysis led us to the development of a bias function that is invoked during the transliteration process, aiming to fine-tune the performance of the stochastic transcriber by correcting cases as the ones described earlier. The flowchart of the final version of the proposed Greek-to-Greeklish transcriber is presented at Fig. 1.
Fig. 1. Stochastic Greek-to-Greeklish transcription flowchart
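A minimal sketch of the transcription loop described in Section 3.2 is given below. The mapping table here is a small hypothetical excerpt (the actual system derives 46 Greek symbolsets and their transliteration frequencies from the annotated data), the segmentation uses a greedy longest-match that only approximates the fewest-symbolsets preference, and the bias function is omitted:

```python
# Minimal sketch of the stochastic transcription loop (toy mapping only).
import random

# Hypothetical per-symbolset transliteration frequencies (percentages).
MAPPING = {
    "ει": {"ei": 60.0, "i": 40.0},
    "θ":  {"th": 55.0, "8": 45.0},
    "αι": {"ai": 70.0, "e": 30.0},
    "μ":  {"m": 100.0}, "ε": {"e": 100.0}, "ι": {"i": 100.0}, "α": {"a": 100.0},
}
MAX_LEN = max(len(k) for k in MAPPING)

def segment(word):
    """Greedy longest-match segmentation, approximating the fewest-symbolsets rule."""
    i, out = 0, []
    while i < len(word):
        for size in range(min(MAX_LEN, len(word) - i), 0, -1):
            if word[i:i + size] in MAPPING:
                out.append(word[i:i + size]); i += size; break
        else:                                 # unknown character: keep it unchanged
            out.append(word[i]); i += 1
    return out

def transcribe(word):
    pieces = []
    for sym in segment(word):
        options = MAPPING.get(sym, {sym: 100.0})
        pieces.append(random.choices(list(options), weights=list(options.values()))[0])
    return "".join(pieces)

print(transcribe("ειμαι"))   # e.g. "eimai", "imai", "eime", ...
```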
4 Conclusions
In this paper a stochastic Greek-to-Greeklish transcriber modeled by real user data was presented. By the term stochastic we refer to the fact that the proposed system is able to generate multiple Latin-alphabeted transcriptions of the same Greek input text, thus simulating the human-like behavior of Greeklish writing, as opposed to other transliteration systems that operate in a deterministic way by directly mapping each Greek character to a corresponding symbol of the Latin alphabet. The proposed system was developed in two phases: At first, a freely available annotated database containing 9,288 Greek-Greeklish word pairs was systematically processed and analyzed in order to retrieve all the commonly used transliteration mappings between Greek and Greeklish and their corresponding frequencies of appearance within the processed data. The outcomes of this analysis led to the development of the initial version of the stochastic Greek-to-Greeklish transcriber, which was preliminarily evaluated by 12 Greeklish users with regard to the understandability of the transcribed text. The feedback we received resulted in an improved version of the stochastic transcriber that could take into account the coexistence and/or the position of the transliteration symbolsets within the transcribed words, thus further fine-tuning the performance of the transcriber with regard to the understandability of the transcribed text and to its ability to imitate the human-like behavior of Greeklish writing.
The suggested transcriber could prove useful as a real-time Greek-to-Greeklish conversion tool for users who communicate with people that work in restricted computer environments with deficient language support. Moreover, it could be used as a data generator engine able to produce multiple Greeklish transcriptions of any input Greek text, which could later be used as training and/or test data for Greeklish-to-Greek transliteration systems.
References 1. Palfreyman, D., al Khalil, M.: A funky language for teenz to use: Representing Gulf Arabic in instant messaging. Journal of Computer Mediated Communication (2003) 2. Wikipedia: Digraphia, http://en.wikipedia.org/wiki/Digraphia 3. Androutsopoulos, J.: Latin-Greek spelling in e-mail messages: Usage and attitudes. In: Studies in Greek linguistics, pp.75-86 (2000) (in Greek) 4. Tseliga, T., Marinis, T.: On-line processing of Roman-alphabeted Greek: the influence of morphology in the spelling preferences of Greeklish. In: 6th International Conference in Greek Linguistics, Rethymno, Crete, 18-21 September (2003) 5. ELOT, Greek Organisation of Standardization (1982) 6. Androutsopoulos, J.K.: Από dieuthinsi σε diey8ynsh. Ορθογραφική ποικιλότητα στην λατινική μεταγραφή των ελληνικών .In: 4th International Conference on Greek Linguistics, Cyprus (September 1999) 7. SAMPA, ESPRIT project 1541: SAM (1989), http://www.phon.ucl.ac.uk/home/sampa/index.html 8. Varouta, M.: MyGreeklish to standard Greeklish translator needed, http://www.proz.com/translation-articles/articles/930/ 9. ASDA: Greek-to-Greeklish converter, http://home.asda.gr/active/GrLish2.asp 10. Repchars: Greek-to-Greeklish converter, http://www.code.gr/repchars/ 11. Translatum Greek-Greeklish converted, http://www.translatum.gr/converter/greeklishconverter.htm 12. Greek to Greeklish Converting Utility, http://www.kokoras.com/greektogreeklish/ 13. el2gr: Python Greek to Greeklish Converter, http://betabug.ch/blogs/chathens/135 14. e-Chaos: freeware Greeklish converter, http://www.paraschis.gr/files.php 15. Greek to Greeklish by Innoetics, http://services.innoetics.com/greeklish/ 16. Chalamandaris, A., Protopapas, A., Tsiakoulis, P., Raptis, S.: All Greek to me! An automatic Greeklish to Greek transliteration system. In: Proceedings of the 5th Intl. Conference in Language Resources and Evaluation, pp. 1226–1229 (2006) 17. DeGreeklish, http://tools.wcl.ece.upatras.gr/degreeklish 18. Greeklish Out!, http://greeklishout.gr/main/
Face Detection Using Particle Swarm Optimization and Support Vector Machines Ermioni Marami and Anastasios Tefas Aristotle University of Thessaloniki, Department of Informatics Box 451, 54124 Thessaloniki, Greece [email protected], [email protected]
Abstract. In this paper, a face detection algorithm that uses Particle Swarm Optimization (PSO) for searching the image is proposed. The algorithm uses a linear Support Vector Machine (SVM) as a fast and accurate classifier in order to search for a face in the two-dimensional solution space. Using PSO, the exhaustive search over all possible combinations of the 2D coordinates can be avoided, saving time and decreasing the computational complexity. Moreover, linear SVMs have proven their efficiency in classification problems, especially in demanding applications. Experimental results based on real recording conditions from the BioID database are very promising and support the potential use of the proposed approach in real applications.
1 Introduction
Face detection deals with finding and localizing faces in images and videos [1]. It is by far the most active specialization in object detection, since it is an essential step in most face-analysis applications, such as facial expression analysis for human computer interfaces, face recognition for access control and surveillance, as well as multimedia retrieval. Face detection is a rather difficult task due to the variability of the object of interest itself and the environment. For face detection, the following issues need to be considered: – Size: A face detector should be able to detect faces in different sizes. This is usually achieved by either scaling the input image or the object model. – Position: A face detector should be able to detect faces at different positions within the image. This is usually achieved by sliding a window over the image and applying the detection step at each image position. – Orientation: Faces can appear in different orientations within the image plane depending on the angle of the camera and the face. – Illumination: Varying illumination can be a big problem for face detection since it changes the color and the appearance of the face depending on the color and the direction of the light.
This work has been funded by the Collaborative European Project MOBISERV FP7248434 (http://www.mobiserv.eu), An Integrated Intelligent Home Environment for the Provision of Health, Nutrition and Mobility Services to the Elderly.
In the present paper we propose a novel face detection algorithm that is able to locate frontal faces in a given image, in the two-dimensional search space. This is the case in most applications that use face detection. For example, most of the state-of-the-art face recognition and facial expression recognition methods assume that the face has been correctly and precisely located in the image and that it is in frontal view. The most competitive face detection algorithms search the test image exhaustively in order to localize the face. To avoid the exhaustive search of all possible locations in the image, we propose a face detection algorithm based on swarm intelligence and, more specifically, the particle swarm optimization (PSO) method. Each particle is equipped with a very fast and accurate classifier and cooperates with the other particles in order to form an intelligent swarm that is able to detect faces. Applying a nature-inspired intelligent method not only searches the 2D solution space in an efficient way but also avoids an exhaustive search over all possible combinations of the 2D coordinates. Indeed, only approximately 5% of the possible image positions had to be examined. In order to check whether each image sub-window is a face or not, a very fast and efficient classifier is used, namely a linear support vector machine (SVM) that transforms the detection problem into an inner vector product problem. It is worth noting that the proposed method can be combined with any exhaustive face detection technique so as to speed up the detection process.
2 Particle Swarm Optimization (PSO)
Particle Swarm Optimization (PSO) is a population-based stochastic optimization technique originally proposed by James Kennedy and Russell C. Eberhart in 1995 [2]. PSO is a search algorithm based on the simulation of the behavior of birds within a flock. Definitions of several technical terms commonly used in PSO can be found in [3]. The swarm is a population of particles. Each particle represents a potential solution to the problem being solved. The personal best (pbest) of a given particle is the position of the particle that has provided the greatest success (i.e., the maximum value given by the classification method used). The local best (lbest) is the position of the best particle member of the neighborhood of a given particle. The global best (gbest) is the position of the best particle of the entire swarm. The leader is the particle that is used to guide another particle towards better regions of the search space. The velocity is the vector that determines the direction in which a particle needs to "fly" (move) in order to improve its current position. The inertia weight, denoted by W, is employed to control the impact of the previous history of velocities on the current velocity of a given particle. The learning factor represents the attraction that a particle has toward either its own success (C_1, the cognitive learning factor) or that of its neighbors (C_2, the social learning factor). Both C_1 and C_2 are usually defined as constants. Finally, the neighborhood topology determines the set of particles that contribute to the calculation of the lbest value of a given particle.
The position of each particle is changed according to its own experience (pbest) and that of its neighbors (lbest and gbest). Let z_i(t) denote the position of particle p_i at time step t. The position of p_i is then changed by adding a velocity u_i(t) to the current position, i.e.:

z_i(t) = z_i(t − 1) + u_i(t)    (1)

The velocity vector reflects the socially exchanged information and, in general, is defined in the following way:

u_i(t) = W u_i(t − 1) + r_1 C_1 (z_{pbest_i} − z_i(t)) + r_2 C_2 (z_{leader} − z_i(t))    (2)

where r_1, r_2 ∈ [0, 1] are random values. The presented face detector uses the fully connected graph as its neighborhood topology, where all members of the swarm are connected to one another and leader = gbest in Eq. 2. In the fully connected topology the swarm tends to converge faster than in other topologies.
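The update rules of Eqs. (1)-(2) translate directly into code; the following Python/NumPy sketch performs one synchronous update of the whole swarm, with W, C_1 and C_2 set to illustrative placeholder values rather than the ones tuned in the paper.

import numpy as np

def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One synchronous PSO update (Eqs. 1-2) with gbest acting as the leader."""
    n, d = positions.shape
    r1 = np.random.rand(n, d)          # random factors in [0, 1]
    r2 = np.random.rand(n, d)
    velocities = (w * velocities
                  + r1 * c1 * (pbest - positions)      # attraction to each particle's pbest
                  + r2 * c2 * (gbest - positions))     # attraction to the swarm's gbest
    positions = positions + velocities
    return positions, velocities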
3 Combining PSO and SVMs for Face Detection
We trained SVMs using linear and polynomial kernels [4]. The comparison between them shows that, in our case, the computation for nonlinear SVMs is 1000 times more intensive than for linear SVMs. The linear SVMs give a slightly lower success rate (≈1%) but they are much faster. Nonlinear SVMs are more intensive because we have to compute inner products with all the support vectors, which in our case are more than 1000. Training data for the linear SVM consisted of 2901 grayscale face images and 28121 grayscale non-face images of size 19 × 19. These images are taken from the CBCL Face Database, which is available at http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html. After fine-tuning the training parameters using cross-validation, we trained the SVM on the whole training set and evaluated it on the test set; the success rate for this classifier was 98.74%. In order to eliminate or minimize the effect of different lighting conditions, the images used for training the SVMs were histogram equalized and normalized so that all pixel values lie between 0 and 1. These preprocessing methods were applied to all training images. Normalization and histogram equalization are therefore necessary during detection as well, and they are applied separately to each sub-window investigated. In conclusion, histogram equalization is a required but computationally intensive process. Thus, a process with reduced computational burden (one that avoids the exhaustive search) is of utmost importance for real-time detection. Such a method, which utilizes PSO, is discussed next. Algorithm 1 describes the face detection process using a linear SVM as classifier and the PSO method to decrease the computation time. The general idea, as described previously, is that each particle has its own intelligence, using the SVM classifier to evaluate whether its current position contains a face or not. The particles communicate and inform one another of the most probable face position at each iteration.
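Before turning to Algorithm 1, the per-window evaluation step can be sketched as follows; it preprocesses a 19 × 19 sub-window (histogram equalization followed by scaling to [0, 1]) and scores it with an already trained linear SVM. The weight vector w_svm and bias b_svm are assumed to come from such a training run and are not provided here.

import numpy as np

def score_window(image, x, y, w_svm, b_svm, size=19):
    """Histogram-equalize and normalize a sub-window, then score it with a linear SVM."""
    patch = image[y:y + size, x:x + size].astype(np.float64)
    # Histogram equalization via the empirical CDF of the patch; the CDF also maps values to [0, 1].
    hist, _ = np.histogram(patch, bins=256, range=(0, 256))
    cdf = hist.cumsum() / patch.size
    patch = cdf[patch.astype(np.uint8)]
    x_vec = patch.ravel()
    return float(np.dot(w_svm, x_vec) + b_svm)    # linear SVM decision value (a single inner product)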
Algorithm 1. Face Detector using PSO
  Initialize parameters: inertia, correction factor, maxVelocity, minVelocity
  Load trained linear SVM and read input image
  Initialize particles to random positions (x, y)
  iteration = 0, repeat = 0
  while repeat < R do
    for each particle do
      Update position
      Process this square of the image with histogram equalization
      Evaluate the SVM's result for this position
      Update pbest
    end for
    Update gbest's index
    Update velocities for each particle
    iteration++
    if gbest's position has not changed and iteration > K then
      repeat++
    end if
  end while
  gbest's position = upper-left corner of the detected face
After K iterations we inspect whether the particles converge or not. If for R+1 successive iterations gbest is at the same position, the swarm seems to converge and we assume that this location probably contains a face. We check whether the value of SVM output at this point is above a predetermined threshold. If the value of gbest is larger than this threshold, we terminate the detection procedure. If gbest ’s value is below the threshold, we initialize the particles again at random positions and we repeat the above procedure. After fine-tuning the algorithm, we selected for our experiments the values 5 and 3 for parameters K and R respectively.
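The outer restart logic described above can be sketched as follows; the swarm object is a hypothetical helper that wraps the PSO update and the SVM evaluation of every particle, and the threshold and restart limit are placeholder values rather than the tuned ones.

def detect_face(swarm, K=5, R=3, threshold=0.0, max_restarts=20):
    """Re-run the swarm until gbest converges to a position whose SVM value clears the threshold."""
    for _ in range(max_restarts):
        swarm.randomize()                       # scatter particles at random positions
        iteration, repeat, prev_gbest = 0, 0, None
        while repeat < R:
            swarm.step()                        # PSO update + SVM evaluation for every particle
            iteration += 1
            if swarm.gbest_position == prev_gbest and iteration > K:
                repeat += 1                     # gbest stable: count towards convergence
            prev_gbest = swarm.gbest_position
        if swarm.gbest_value > threshold:       # accept only confident detections
            return swarm.gbest_position         # upper-left corner of the detected face
    return None                                 # no confident detection found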
4 Experimental Results
Frame Detection Accuracy (FDA) is an evaluation metric used to measure the performance of any object detection algorithm. This measure calculates the spatial overlap between the ground truth and the algorithm's output [5]. If the face detection algorithm aims to detect multiple faces, the sum of all the overlaps is normalized over the average number of ground-truth and detected objects. If N_G is the number of ground-truth objects and N_D the number of detected objects, FDA is defined as

FDA = Overlap Ratio / ((N_G + N_D) / 2)    (3)

where

Overlap Ratio = Σ_{i=1}^{N_mapped} (G_i ∩ D_i) / (G_i ∪ D_i)    (4)

N_mapped is the number of mapped object pairs in the image, G_i is the i-th ground-truth object image region and D_i is the i-th detected object image region.
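For clarity, a small Python sketch of Eqs. (3)-(4) is given below; object regions are assumed to be axis-aligned boxes (x1, y1, x2, y2) and matches holds the indices of the mapped ground-truth/detection pairs, both of which are simplifying assumptions for illustration.

def frame_detection_accuracy(gt_boxes, det_boxes, matches):
    """FDA: summed region overlap of matched pairs, normalized by the mean object count."""
    def area(b):                                   # box given as (x1, y1, x2, y2)
        return max(0, b[2] - b[0]) * max(0, b[3] - b[1])
    overlap = 0.0
    for gi, di in matches:                         # indices of mapped ground-truth/detection pairs
        g, d = gt_boxes[gi], det_boxes[di]
        inter = area((max(g[0], d[0]), max(g[1], d[1]), min(g[2], d[2]), min(g[3], d[3])))
        union = area(g) + area(d) - inter
        overlap += inter / union if union > 0 else 0.0
    return overlap / ((len(gt_boxes) + len(det_boxes)) / 2.0)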
The speed performance of the presented face detector is directly related to various parameters, such as the swarm size, the image's initial size and the repeat cycles. The average number of positions examined using PSO was a small percentage of all possible combinations of the 2D coordinates, ranging from 5 to 6%. This demonstrates the significant reduction in the number of possible solutions to which we have to apply the classifier each time. That is, using PSO for searching we are able to reduce the time needed for a detection by a factor of 20 for any given face detection algorithm. Moreover, using linear SVMs instead of nonlinear ones gives another ×10³ boost in the detection speed. We applied the presented algorithm to the BioID Face Database (available at http://www.bioid.com/support/downloads/software/bioid-face-database.html), which consists of 1521 gray-level images with an initial resolution of 384 × 286 pixels. To detect faces of various sizes, prior to applying the algorithm we scaled the initial image using different scaling factors. Emphasis has been placed on real-world conditions and, therefore, the test set features a large variety of illumination, background and face size. Figure 1 shows the output of our face detector on some images from the BioID Face Database along with the overlap and the detector output value. Black rectangles represent the ground truths of every image, while green (lighter gray) windows represent the proposed algorithm's output. Table 1 lists the detection rate for the presented algorithm in comparison with Viola-Jones' state-of-the-art algorithm [6], using an overlap threshold of 25.00. For initial scaling factors 0.19 and 0.24, the images are too small for the Viola-Jones algorithm to detect any faces, while our algorithm gives a very good detection rate.
Fig. 1. Output of our face detector on a number of test images from the BioID Database
Table 1. Detection rates for the presented algorithm (SVM-PSO) in comparison with Viola-Jones' algorithm (OpenCV) for the BioID Face Database. The initial scaling factors for the tested images are given in the first row.

Detector    Scale 0.19    Scale 0.24    Scale 0.25    Scale 0.3
SVM-PSO     93.95%        93.23%        93.82%        94.08%
OpenCV      0%            0%            83.30%        94.21%
So, we applied the algorithms to larger images (scaling factors 0.25 and 0.3), and the detection rates for our algorithm remain very good. The Viola-Jones algorithm gives a relatively good detection rate for scaling factor 0.25, whilst for scaling factor 0.3 it gives a detection rate similar to our algorithm's. We should also mention that the classifier used in the Viola-Jones algorithm is trained using many more training samples than our classifier.
5 Conclusions
We presented a fast and accurate face detection system that searches for frontal faces in the image plane. To avoid an exhaustive search over all possible combinations of coordinates in the 2D space, we used a PSO algorithm. What is more, in order to save time and decrease the computational complexity, we used a linear SVM as classifier. Experimental results demonstrated the algorithm's good performance on a dataset with images recorded under real-world conditions and proved its efficiency. The proposed method can be combined with any face detector, e.g., the one used in OpenCV, to reduce its execution time.
References
1. Goldmann, L., Mönich, U., Sikora, T.: Components and their topology for robust face detection in the presence of partial occlusions. IEEE Transactions on Information Forensics and Security 2(3) (September 2007)
2. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of the 1995 IEEE International Conference on Neural Networks, pp. 1942–1948. IEEE Service Center, Piscataway (1995)
3. Reyes-Sierra, M., Coello, C.C.: Multi-objective particle swarm optimizers: A survey of the state-of-the-art. International Journal of Computational Intelligence Research 2(3), 287–308 (2006)
4. Burges, C.: A tutorial on support vector machines for pattern recognition. Kluwer Academic Publishers, Boston (1998)
5. Kasturi, R., Goldgof, D., Soundararajan, P., Manohar, V., Garofolo, J., Bowers, R., Boonstra, M., Korzhova, V., Zhang, J.: Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2), 319–336 (2009)
6. Viola, P., Jones, M.: Robust real-time face detection. International Journal of Computer Vision 57(2), 137–154 (2004)
Reducing Impact of Conflicting Data in DDFS by Using Second Order Knowledge Luca Marchetti and Luca Iocchi Dpt. of Computer and System Sciences Sapienza, University of Rome {lmarchetti,iocchi}@dis.uniroma1.it
Abstract. Fusing estimation information in a Distributed Data Fusion System (DDFS) is a challenging problem. One of the main issues is how to detect and handle conflicting data coming from multiple sources. In fact, a key to the success of a Data Fusion System is the ability to detect wrong information. In this paper, we propose the inclusion of reliability assessment of information sources in the fusion process. The evaluated reliability imposes constraints on the use of information data. We applied our proposal in the challenging scenario of Multi-Agent Multi-Object Tracking.
1 Introduction
Most of the work on Distributed Data Fusion Systems (DDFS) investigates how to optimize or improve the fusion process by optimistically assuming the correctness of the uncertainty models. The impact of using poor-quality information is not well addressed. In fact, most of the literature focuses its attention on establishing the reliability of the belief computed within the framework of the selected model. This approach fails to produce good estimates in particular situations in which it is not possible to detect errors from inside the fusion process. For example, a fusion process cannot detect ambiguous features, unpredictable systematic errors or conflicting data. In this paper, we address the problem of conflicting data by using Second Order knowledge and introducing the concept of Reliability associated with each source, as the uncertainty of the evaluation of uncertainty [1][2]. Before accepting or rejecting the measurements from a source, the system checks the quality of the source itself. Given a priori information about the environment, the relations among objects and agents, and the introspective analysis of the filtering process, it is possible to measure the reliability of the sources. Therefore, a smarter integration of information can be adopted, and the resulting estimation can be improved.
2 Related Work
The main problems we have investigated are how to fuse data when they contain conflicting information, and how to assess the quality of information sources.
One possibility aims at reducing the drawback of including bad information by weighting the belief among sources. Another way is to throw away bad information and select the most consistent data. All these approaches use information gathered by the filtering process, and the conflicts are evaluated using metrics from inside the filter. A more interesting point of view is detecting and filtering conflicting data by using information gathered externally to the filter. This leads to a Data Fusion framework that takes into account contextual information [3], relations among objects in the environment [4], and consistency measurements of beliefs [5]. This information introduces second order knowledge, i.e., a level of knowledge that cannot be evaluated by using the fused measurements. Using the reliability as the degree of trustworthiness of a source, the measurements can be modified to reflect the guessed quality. As pointed out in [1], two strategies can be adopted to handle reliability information: Discount and Pruning. In this paper, we present a general framework for DDFS that explicitly adds an additional step to a recursive filtering algorithm, providing assessment and handling for this higher-level knowledge.
3 Introducing Reliability in DDFS
In Figure 1, the global view of this proposal is depicted. We implemented a Multiple Hypotheses Kalman Filter for multi-object tracking. However, this approach could be easily extended to any filtering algorithm [6].
Fig. 1. Data flow in MAMOT-R Algorithm
First, a Kalman Filter (KF) predict step is performed. At the same time, in the World Knowledge block, we use the previous state estimation and the current measurements, to update the world model representation. The Perceptions are expressed in terms of < ρ, θ >, where ρ is the relative distance of the tracked object related to the agent coordinates, and θ is the relative angle. The block labelled Evaluation assesses the quality of each information source, as explained in Section 3.1. The reliability coefficients, Rs for each information source s, are evaluated here and then passed to the Handling block, where the measurements
are modified to reflect the quality of the generating source itself. The complete operation is illustrated in Section 3.2. After the KF update step, the new state of the system <x_t, Σ_t> is estimated. In the case of a Multi-Agent System, we propagate the information to the other teammates. In particular, we send the hypotheses (considering <x, Σ> as the mean and covariance of each KF hypothesis), the position of the agents and the updated world representation.

3.1 Evaluating Reliability
We identified four classes of features used to assign a quality value: Filter Introspection (FI; meta-information from reasoning about the characteristics of the fusion results), Relations among Objects (RO; inter-relationships among objects), A Priori Knowledge (AK; contextual information about the environment), and Consensus (CO; agreement among the agents' perceptions). The implemented features, their distance functions δ(f_i, z_t) and their decay constants λ_{f_i} are listed below:

– FI: distance between the observation and the associated track, δ(f_i, z_t) = √(ρ_o² + ρ_t² − 2 ρ_o ρ_t cos(θ_o − θ_t)); λ_{f_i} = 1 m.
– FI: time from the last observation-to-track association, δ(f_i, z_t) = t_o − t_t; λ_{f_i} = 2 sec.
– RO: occupancy map for occluded-object detection (camera FOV: range = [0.3 m, 6 m], angle = [−30°, 30°], area_map = 18.8 m²), δ(f_i, z_t) = area_max − area_occupied; λ_{f_i} = 80% (≈ 15 m²).
– AK: areas with different light conditions (distance between the agent pose and the center of the light area), δ(f_i, z_t) = √((x_a − x_l)² + (y_a − y_l)²); λ_{f_i} = 1.5 m.
– AK: areas with colored patterns (distance between the agent pose and the center of the "colored" area), δ(f_i, z_t) = √((x_a − x_l)² + (y_a − y_l)²); λ_{f_i} = 1.5 m.
– CO: percentage of agreeing agents, δ(f_i, z_t) = #agreeing agents / #agents; λ_{f_i} = 80%.
The reliability coefficients R_1, ..., R_S, for information sources 1, ..., S, are evaluated as follows: for each class feature, we estimate the reliability as a distance δ(f_i, z_t) between the "best" correspondence on a given feature f_i and the current measurements z_t [7]. In order to compute the probability of reliability (p) of an information source, we need a function that maps the distance to a numerical value. This is expressed by

R_{f_i}(z_t) ≡ p(δ(f_i, z_t), λ_{f_i}) = exp(− δ(f_i, z_t) / λ_{f_i}).
The parameter λ_{f_i} represents the rate parameter (or decay constant) characterizing the exponential function. It indicates the scale beyond which the reliability
is considered meaningless: the distance δ(f_i, z_t) represents how close the observation is to the feature. The overall reliability of the source is thus given by the product R_s = Π_{i=1}^{M} R_{f_i}.
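As a concrete illustration, the following Python sketch evaluates the reliability of a source from its feature distances and decay constants; the numeric values in the usage comment are illustrative only and are not taken from the experiments.

import math

def feature_reliability(delta, lam):
    """R_{f_i}(z_t) = exp(-delta(f_i, z_t) / lambda_{f_i})."""
    return math.exp(-delta / lam)

def source_reliability(deltas_and_lambdas):
    """Overall source reliability: product of the per-feature reliabilities."""
    r = 1.0
    for delta, lam in deltas_and_lambdas:
        r *= feature_reliability(delta, lam)
    return r

# Example with illustrative feature distances paired with decay constants:
# R_s = source_reliability([(0.4, 1.0), (1.0, 2.0)])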
3.2 Reliability Handling
Discount Strategy. This policy class embraces most of the methodologies that assign a weight to measurements in relation to their quality [8]. Let Z = {Z_1, ..., Z_S} be the union set of measurements coming from S sources, and let x_s represent the statistics of each source s to be combined. The fusion operator F_R is expressed by the Bayesian rule, which under the condition of source independence reduces to a product [1]:

F_R(x_1, ..., x_S, R_1, ..., R_S) | Z ≡ p(x_s) Π_{s=1}^{S} p(x_s | Z_s)^{R_s} / p(x_s),  ∀s ∈ S,

where x_s = p(x_s | Z_s).

Pruning Strategy. The pruning policy selects the reliable sources using a thresholding mechanism. Using probability measures to represent reliability coefficients, the overall probability distribution will be influenced only by the sources that survive a validation threshold. Thus, the fusion uses the selected sources as

F_R(x_1, ..., x_S, R_1, ..., R_S) | Z ≡ p(x_s) Π_{s=1}^{S} p(x_s | Z_s) / p(x_s),  ∀s ∈ S = {x_s | R_s > threshold}.
For the purposes of this paper, we used a fixed threshold. However, as stated in [9], using variable thresholds could outperform a fixed one.
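A minimal Python sketch of the two handling policies is given below; it assumes discrete distributions over N hypotheses, treats the first source's prior as the common prior p(x), and uses a placeholder threshold, so it illustrates the formulas rather than reproducing the authors' implementation.

import numpy as np

def fuse(priors, likelihoods, reliabilities, policy="discount", threshold=0.5):
    """Combine per-source posteriors p(x|Z_s) under the discount or pruning policy.
    priors, likelihoods: arrays of shape (S, N) over N hypotheses; reliabilities: length S."""
    fused = np.array(priors[0], dtype=float)           # common prior p(x) (simplifying assumption)
    for p_x, p_x_given_z, r in zip(priors, likelihoods, reliabilities):
        if policy == "discount":
            fused *= (p_x_given_z / p_x) ** r          # reliability acts as an exponent (discount)
        elif r > threshold:                            # pruning: keep only reliable sources
            fused *= p_x_given_z / p_x
    return fused / fused.sum()                         # renormalize to a probability distribution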
4 Experimental Results
The experiments described in this paper have been conducted in the scenario of the Cooperative Robots Laser Tag game (http://www.nerdvest.com/robotics/RoboTag/). We modelled the problem of Object Tracking within a simulated world for the Laser Tag game (http://playerstage.sourceforge.net/). Within this scenario, we developed a Multi-Agent Multi-Object Tracking algorithm [10], based on Nearest-Neighbour Multiple Hypotheses Object Tracking. In this scenario, two teams of agents look for the opponents to "tag" them, using a simulated fiducial sensor. A global controller is in charge of detecting and propagating such information to the agents themselves.

4.1 Assumptions
A Priori Errors. To evaluate the performance in the presented scenario, we added artificial errors to the environment.
Fig. 2. Area with a priori errors and an example of the influence of different light conditions
As shown in Figure 2a, in the (A) areas there are different light conditions. We model the spots as lamps with different luminosity: the observations are distorted with systematic errors, in relation to the position of the agent in such areas. Figure 2b presents a real-world example, in which the tracked objects are represented by colored balls. In the areas labelled with (B), instead, we simulate the presence of objects in the environment that are wrongly detected as targets. Error Patterns. We define three error profiles: Reliable, Faulty1 and Faulty2. Each agent has a common zero-mean Gaussian noise on perceptions, described, as stated before, by a tuple <ρ, θ>. The Reliable profile does not have any additional artificial error added: it models, in fact, an agent operating in normal conditions. The Faulty profiles add, respectively, false positives and systematically wrong perceptions when an agent passes through the (A) or (B) areas. We ran the experiments dividing the agents into three sets, corresponding to each error profile.

4.2 Multi-Agent Multi-Object Tracking in Laser Tag Game
We ran the simulated world for about 30 minutes. The simulated robots explore the arena collecting information about the opponents, simulating a "search and tag" match. The frame rate of readings was 10 Hz (resulting in ∼3000 measurements). Reliability Assessment. We measured the ability to correctly detect the agents' quality. In Table 1, we present the percentage of correct assessments (Reliable/Non-reliable) for each profile. It represents the fraction of correct reliable/unreliable assessments over the provided ground truth. In brackets, we also indicate the standard deviation. It is interesting to note that the Reliable agent has better performance. This suggests that it is able to detect faulty agents better because it can use better local information. Towards a more effective Multi-Agent tracking, the faulty agents should be able to detect themselves as faulty and treat their own observations accordingly.
Table 1. Accuracy of reliability assessment

            FI             RO             AK             CO             All
Reliable    75% (±4.6%)    72% (±4.2%)    95% (±2.1%)    87% (±3.3%)    86% (±3.6%)
Faulty1     55% (±5.3%)    68% (±5.0%)    93% (±4.1%)    85% (±4.2%)    83% (±4.8%)
Faulty2     63% (±6.1%)    63% (±5.6%)    92% (±4.5%)    81% (±4.7%)    72% (±5.2%)
Table 2. Least Mean Square Error of the policy rules in the Laser Tag game experiments

            None                 Discount             Pruning
Reliable    1.364 m (±0.512 m)   1.043 m (±0.324 m)   0.732 m (±0.311 m)
Faulty1     2.391 m (±0.883 m)   1.541 m (±0.760 m)   1.699 m (±0.796 m)
Faulty2     1.563 m (±0.739 m)   1.198 m (±0.698 m)   1.252 m (±0.715 m)
Policy Rules. The effectiveness of the different policy rules was evaluated considering all the previously introduced feature classes. We compared the estimation error using the different policy rules. The results in Table 2 show the improvements given by the Second Order knowledge approach. The values indicate the Least Mean Square Error of the trajectory estimation over the ground truth. For the Faulty profiles, the Discount rule performs better than the Pruning one. This could be explained by considering how the rule works. Weighting bad information is better than completely removing it. If an agent assesses a source as "reliable" while it is not, the tracking considers the bad perceptions with a greater variance than in the normal behaviour. By pruning them, the tracking cannot use any information at all, reducing the possibility of tracking dynamic objects (and, thus, reducing the overall tracking accuracy).
5 Conclusions and Future Work
The executed experiments show the importance of using "Second Order knowledge" to handle multiple information sources. We have confirmed, by experiments, this reasonable assumption: if we could know that a source is giving wrong information, it is better to exclude or, at least, discount it. The results suggest that Pruning badly affects the estimation when the number of information sources is small. In such situations, the Discount rule is preferable, because it prevents excluding potentially good information. Despite the promising results, more work has to be done: better methodologies to evaluate the reliability coefficients have to be developed. More interesting is the problem of understanding the model of reliability using information from the environment, such as contextual high-level reasoning. In this paper we introduced this concept by modelling the a priori knowledge as a feature for Reliability Coefficient evaluation. However, a more extensive use of contextual information can contribute to significantly improving the results of a data fusion process.
References
1. Rogova, G., Boss, L.: Information quality effects on information fusion. Technical report, Defense and Research Development Canada (2008)
2. Wang, P.: Confidence as higher level of uncertainty. In: Proc. of Int. Symp. on Imprecise Probabilities and Their Applications (2001)
3. Elmenreich, W.: A review on system architectures for sensor fusion applications. LNCS. Springer, Heidelberg (2007)
4. Guibas, L.J.: Sensing, tracking and reasoning with relations. IEEE Signal Processing Magazine (March 2002)
5. Roli, F., Fumera, G.: Analysis of linear and order statistics combiners for fusion of imbalanced classifiers. In: Roli, F., Kittler, J. (eds.) MCS 2002. LNCS, vol. 2364, p. 252. Springer, Heidelberg (2002)
6. Marchetti, L., Nobili, D., Iocchi, L.: Improving tracking by integrating reliability of multiple sources. In: Proceedings of the 11th International Conference on Information Fusion (2008)
7. Yan, W.: Fusion in multi-criterion feature ranking. In: 10th International Conference on Information Fusion, July 9-12 (2007)
8. Appriou, A.: Uncertain Data Aggregation in Classification and Tracking Processes. Physica-Verlag, Heidelberg (1998)
9. Tumer, K., Ghosh, J.: Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition (1996)
10. Marchetti, L.: To believe or not to believe: Improving distributed data fusion with second order knowledge. PhD thesis, "Sapienza" University of Rome, Department of Computer and System Sciences (2009)
Towards Intelligent Management of a Student's Time Evangelia Moka and Ioannis Refanidis University of Macedonia, Department of Applied Informatics Egnatia str. 156, 54006 Thessaloniki, Greece {emoka,yrefanid}@uom.gr
Abstract. In parallel with studies, a lot of extra activities need to be fitted into a student's schedule. Frequently, excessive workload results in poor performance or in failing to finish the studies. The problem is more severe in lifelong learning, where students are professionals with family duties. So, the ability to make an informed decision as to whether taking a specific course fits into a student's schedule is of great importance. This paper illustrates a system, called EDUPLAN and currently under development, which aims at helping the student to manage her time intelligently. EDUPLAN aims at informing the student as to which learning objects can fit her schedule or not, as well as at organizing her time. This can be achieved using scheduling algorithms and a description of the user's tasks and events. In the paper we also extend the LOM 1484.12.3™-2005 ontology with classes that can be used to describe the temporal distribution of the workload of any learning object. Finally, we provide EDUPLAN's architecture, which is built around the existing SELFPLANNER intelligent calendar application. Keywords: Intelligent systems, scheduling, calendar applications.
1 Introduction

Nowadays, the rapid rhythm of development of societies has led to the growing importance of education. Students are required to fulfill more obligatory activities and studies. Moreover, adults considering lifelong learning have to fit their studies in around their professional and family commitments. Therefore, making informed decisions is important, in order to avoid, whenever possible, anxiety and tension, and to avert missed opportunities or deadlines. This paper illustrates a system under development, called EDUPLAN, that helps the prospective student to avoid taking bad decisions as to whether to attend a course (more generally, a learning object) or not, and to better organize her time. Our proposal is based on our experience with intelligent calendar applications. In particular, SELFPLANNER (http://selfplanner.uom.gr) is a web-based intelligent calendar application [6], which allows the user to specify her commitments by employing a powerful scheduler to put the tasks within the user's calendar [5]. In the paper we propose an ontology for describing the
workload of learning objects. The proposed ontology can be considered an extension of the IEEE LOM 1484.12.3™-2005 model (http://ltsc.ieee.org/wg12/) that characterizes learning objects. The rest of the paper is structured as follows: Section 2 illustrates a typical use case. Section 3 gives a brief presentation of the SELFPLANNER application. Section 4 presents the EDUPLAN Ontology, whereas Section 5 presents the system's architecture. Finally, Section 6 concludes the paper and identifies further work to be done.
2 A Typical Use Case

Perhaps the students of open universities constitute the best example of people who would greatly benefit from EDUPLAN. They are mainly professionals with families and children. At the beginning of each academic year they have to decide whether to undertake two thematic units, which results in full-time studies, or take a single thematic unit, which results in part-time studies. Apparently, a bad decision could be avoided if the students were well informed about the workload each thematic unit incurs. At a very fine-grained level, the student could be informed of the detailed daily program of each unit, such as:

• On March 14th you have to submit the 5th project of the unit. The estimated workload is 12 hours. Having read ch. 23 from [2] is a prerequisite.
• On February 16th, 7 to 8 pm, there is a synchronous tele-lecture. Attending the tele-lecture is optional. If you attend it, then the estimated workload for reading ch. 22 of [2] drops to 3 hours.
• You can attend the tele-lecture of February 16th offline afterwards, provided that you haven't attended it online. It is preferable to attend it online.
• On May 29th, 9 to 12 am, are the final exams. Participation is obligatory.
All this information, in addition to the student's schedule, is necessary for informed decision making. Manual arrangement of this information is impractical. So, an automated scheduler is necessary to solve the computational problem. Our approach applies to all kinds of learning objects, synchronous or asynchronous, simple or composite, covering entities varying from tutorials to a university program, like the scenario presented above.
3 The SELFPLANNER Application

SELFPLANNER is a web-based intelligent calendar application that helps the user to schedule her personal tasks [6]. With the term 'personal task' we mean any activity that has to be performed by the user and requires some of her time. Each task is characterized by its duration and its temporal domain [1]. A domain consists of a set of intervals where the task can be scheduled. A task might be interruptible and/or periodic. A location or a set of locations is attached to each task; in order to execute a task or a part of it, the user has to be in one of these locations. Travelling time between pairs of locations is taken into account when the system
schedules adjacent tasks. Ordering constraints and unary preferences, denoting when the user prefers the task to be scheduled, are also supported by the system. SELFPLANNER utilizes Google Calendar for presenting the calendar to the user, and a Google Maps application to define locations and compute the time the user needs to go from one location to another.
4 The Ontology

This section introduces the EDUPLAN ontology, following a brief introduction of the IEEE LOM 1484.12.3™-2005 model.

4.1 The IEEE LOM 1484.12.3™-2005 Ontology

The IEEE working group that developed the IEEE 1484.12.1™-2002 Standard defined learning objects, for the purposes of the standard, as being "any entity, digital or non-digital, that may be used for learning, education or training". The IEEE LOM 1484.12.3™-2005 Standard defines an XML Schema Binding of the LOM Data Model defined in IEEE Std 1484.12.1™-2002. The purpose of this standard is to allow the creation of LOM instances in XML, which allows interoperability and the exchange of LOM XML instances between various systems. The IEEE LOM data model comprises a hierarchy of elements. At the first level, there are nine categories, each of which contains sub-elements; these sub-elements might either be simple elements that hold data, or be themselves aggregate elements, which contain further sub-elements. All LOM data elements are optional. The data model also specifies the value space and datatype for each of the simple data elements. The value space defines the restrictions, if any, on the data that can be entered for that element. Fig. 1 depicts the structure of the IEEE LOM Ontology.

Fig. 1. A schematic representation of the hierarchy of elements in the LOM data model

4.2 The EDUPLAN Ontology

The IEEE LOM 1484.12.3™-2005 model emphasizes the type and the content of each learning object. On the other hand, the EDUPLAN ontology focuses on the time demands and restrictions of each learning object. The two ontologies are complementary; EDUPLAN individuals may refer to IEEE LOM 1484 objects for content information. The two basic classes of the proposed ontology are learningObject and course. Any individual of learningObject refers to a well-defined learning activity. The class includes a pointer to the IEEE LOM metadata (sourceLO) and carries properties such as title, description, type and expected duration. An additional property of the learningObject class concerns whether an individual is interruptible (e.g., reading a book) or not (e.g., attending a lecture in real time). Furthermore, learning objects might have associated locations, serving the estimation of travelling time. A learningObject individual is not associated with a specific (in time) learning activity. Indeed, reading a book chapter might be optional in one course and obligatory in another, and the two will have different deadlines. The class course serves exactly this purpose and has four subclasses: courseAsynchronous, courseSynchronous,
courseComposite and coursePeriodic, with the first three of them being disjoint to each other. Any individual of coursePeriodic should also belong to one of the other three subclasses. The class course adopts all the properties of learningObject. In addition, the object property refersTo links non-composite course individuals to learning objects. A data property optional is also defined. A courseAsynchronous individual is characterized by its deadline, i.e., a dateTime. On the other hand, a courseSynchronous individual is characterized by its startTime and endTime. A coursePeriodic individual is characterized by its periodType, which takes the literals daily, weekly and monthly as values. There are several ways to define the number of occurrences: specifying firstPeriod and lastPeriod dateTime values, or specifying only a firstPeriod dateTime value accompanied by the number of occurrences. A periodic activity might have exceptions. The object property exceptions, ranging over the course class, is employed to accommodate deviations from the base definition. In this case, an integer data property named occurrence is associated with the course class in order to discriminate between the various occurrences. Moreover, the property missingPeriods, containing a collection of positive integers, is used to indicate the missing occurrences. A course individual might be defined recursively from other simpler course individuals, as shown in Section 2. The class courseComposite serves exactly this purpose. The main property of this class is bagOfCourses, ranging over the entire course class. Several constraints might hold between the various simpler individuals comprising a courseComposite one. The before constraint is a binary constraint with two object properties, firstCourse and secondCourse, ranging over course individuals. It implies the order of the two courses. Another constraint is atLeastOne, which applies over optional courses. The property bagOfCourses is used again to designate the involved course individuals. Other types of constraints can be defined as well.
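For illustration only, the core classes of the ontology can be mirrored in plain Python as follows; this is a simplified rendering of the class hierarchy and properties described above, not the actual OWL/XML binding of the ontology.

from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class LearningObject:
    """Mirror of learningObject: workload-oriented metadata plus a pointer to the LOM record."""
    title: str
    expected_duration_hours: float
    interruptible: bool = True
    source_lo: Optional[str] = None                 # pointer to the IEEE LOM metadata
    locations: List[str] = field(default_factory=list)

@dataclass
class CourseAsynchronous:
    refers_to: LearningObject
    deadline: datetime
    optional: bool = False

@dataclass
class CourseSynchronous:
    refers_to: LearningObject
    start_time: datetime
    end_time: datetime
    optional: bool = False

@dataclass
class CourseComposite:
    bag_of_courses: List[object] = field(default_factory=list)
    before_constraints: List[Tuple[object, object]] = field(default_factory=list)  # (firstCourse, secondCourse) ordering pairs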
Fig. 2. A schematic representation of the EduPlan ontology. Solid lines denote subclass relationships, whereas dashed lines denote object properties.
5 The Overall Architecture

Making an informed decision as to whether to accomplish a learning activity has three requirements: a precise estimate of the workload imposed by the learning activity, a precise estimate of the other duties of the prospective student, and an efficient scheduler. The SELFPLANNER system described in Section 3 covers the last two requirements, provided that users keep their calendars as up to date as possible. Concerning the first requirement, we need an information system that will manage information about a variety of learning objects and offered courses. The ontology presented in the previous section could serve as a basis for this purpose. Taking into account that an intelligent calendar application's independence increases its utility, we consider the two parts of our architecture as separate systems that communicate through web-service invocation. Finally, taking into account SELFPLANNER's architecture, with the intelligent component being distinct from the calendar application (i.e., Google Calendar), a third part of our architecture comprises the user's calendar. Fig. 3 depicts the EDUPLAN architecture.
Fig. 3. EDUPLAN overall architecture
Focusing on the information system, we aim for a more personalized experience. The information system should be able to make personalized estimates of the workload, based on the student's profile. A user profile can be created both explicitly, through direct encoding by the student, and implicitly, by receiving feedback from the user concerning the actual workload. Lazy learning methods, such as the k-nearest neighbor [3], can be used to obtain these estimates. A more elaborate student profile might also retain her already achieved skills. Finally, user preferences on how to schedule asynchronous learning activities should also be manually provided or learnt using reinforcement learning techniques [4].
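A minimal sketch of such a lazy estimate is shown below; it assumes the student profile is encoded as a numeric feature vector and that history holds (profile, actual_hours) pairs collected from user feedback, which are assumptions made here for illustration.

def estimate_workload(student_profile, history, k=3):
    """Estimate workload by averaging the k most similar past (profile, actual_hours) records."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(history, key=lambda rec: distance(student_profile, rec[0]))[:k]
    return sum(hours for _, hours in nearest) / len(nearest)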
6 Conclusions and Further Work

This paper presented a system under development called EDUPLAN, which aims at allowing students to make informed decisions on whether they can afford to attend another learning object or not; furthermore, the system will allow them to schedule all their educational activities within their calendar. Several parts of the overall architecture, mainly the intelligent calendar module, have already been implemented. Apart from the system's architecture, in this paper we presented an ontology that could be used to describe workload aspects of learning objects. The major next step concerns the development of the information system. Finally, the information system has to be integrated with the SELFPLANNER application through the exposition of suitable interfaces.
References
1. Alexiadis, A., Refanidis, I.: Defining a Task's Temporal Domain for Intelligent Calendar Applications. In: 5th IFIP Conference on Artificial Intelligence Applications & Innovations (AIAI 2009), Thessaloniki, pp. 399–406. Springer, Heidelberg (2009)
2. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2009)
3. Cover, T.M., Hart, P.: Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13(1), 21–27 (1967)
4. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4, 237–285 (1996)
5. Refanidis, I.: Managing Personal Tasks with Time Constraints and Preferences. In: Proc. of ICAPS 2007, RI, US, pp. 272–279. AAAI Press, Menlo Park (2007)
6. Refanidis, I., Alexiadis, A.: Deployment and Evaluation of SelfPlanner, an Automated Individual Task Management System. Computational Intelligence (2010) (to be published)
Virtual Simulation of Cultural Heritage Works Using Haptic Interaction Konstantinos Moustakas and Dimitrios Tzovaras Informatics and Telematics Institute / Centre for Research and Technology Hellas, 57001 Thessaloniki, Greece {moustak,tzovaras}@iti.gr
Abstract. This paper presents a virtual reality framework for the modeling and interactive simulation of cultural heritage works with the use of advanced human computer interaction technologies. A novel algorithm is introduced for realistic real-time haptic rendering that is based on an efficient collision detection scheme. Smart software agents assist the user in manipulating the smart objects in the environment, while haptic devices are utilized to simulate the sense of touch. Moreover, the virtual hand that simulates the user’s hand is modeled using analytical implicit surfaces so as to further increase the speed of the simulation and the fidelity of the force feedback. The framework has been tested with several ancient technology works and has been evaluated with visitors of the Science Center and Technology Museum of Thessaloniki. Keywords: virtual reality, cultural heritage, simulation, haptic rendering.
1 Introduction

A recent trend of museums and exhibitions of Ancient Greek Technology is the use of advanced multimedia and virtual reality technologies for improving the educational potential of their exhibitions [1]. In [2] the authors utilize augmented reality technology to present an archaeological site. Another attempt was to visually enhance archaeological walkthroughs through the use of visualization techniques [3]. Even if the acceptance of these applications by museum visitors is considered to be high, there is a clear need for more realistic presentations, including haptic interaction, that should be able to offer the user the capability of interacting with the simulation, achieving in this way enhanced educational and pedagogical benefits. However, these interactive applications involve several time-consuming processes that in most cases inhibit any attempt at real-time simulation, such as collision detection [4], i.e., the identification of colliding parts of the simulated objects, and haptic rendering [5], i.e., the calculation of the force that should be fed back to the user via a haptic device. Moreover, the proper modeling of the scene objects, so as to increase the ease of representation and simulation, is of high importance. In this paper a highly efficient simulator is presented that is based on a robust haptic rendering scheme and interaction agents, so as to provide the necessary force feedback to the user for haptic interaction with the virtual environment in real time.
2 System Overview

The main goal is to enhance the realistic simulation and demonstration of the technology works and to present their educational/pedagogical characteristics. The user is allowed to interact with the mechanisms in the virtual environment either by constructing or by using them via the proposed haptic interface. The application aims to contribute to the development of a new perception of modern-era needs, by making reference to technology evolution, efficiently demonstrating Ancient Greek Technology works, presenting their evolution in time and linking this evolution with corresponding individual and social needs.
Fig. 1. Architecture of the proposed framework
Figure 1 illustrates the general architecture of the proposed framework. A simulation scenario is initially designed using the authoring tool and the available 3D content of the multimedia database. During the interaction with the user the core simulation unit takes as input the user’s actions and the simulation scenario. The software agents perform a high level interpretation of the user’s actions and decide upon the next simulation steps. In parallel, the collision detector checks for possible collision during each simulation step and whenever collision is detected the advanced haptic rendering engine provides the appropriate force feedback to the user that is displayed using either the Phantom or the CyberGrasp haptic device. All aforementioned processes are described in the following sections.
3 Simulation Engine

3.1 Smart Object-Scene Modeling

In order to simplify all underlying simulation and interaction processing, a smart object modeling tool was created that provides an environment for the expert user to manipulate all the necessary data so as to create an educational scenario. The tool provides: a) functionalities for the composition of 3D simulations, b) connection with the VR haptic devices, c) parameterization of the intelligent software agents that simulate
the functionality of parts of Ancient Greek mechanisms, d) composing, processing and storing scenarios, e) integration of various scenarios, and f) modifying simulation, interaction and haptic parameters of the objects. The expected increased complexity of the scenario files led to the adoption of the X3D standard as the scenario format, in order to be able to create more realistic applications. Information that cannot be supported directly by the X3D format is stored as a meta tag of the X3D scenario file. The tool allows the user to select virtual reality agents, associate them with objects in the scene, insert and modify their parameters and provide constraints. The objects may have different characteristics and associations in each step of the scenario according to its needs. The author can control the flow of a scenario using simple arithmetic rules (i.e., <, >, =) in order to trigger the next step in the scenario depending on the actions of the user. Moreover, in the context of the current framework the virtual hand is modeled using superquadrics. All other objects are also modeled using superquadrics augmented with distance maps [7], so as to preserve their accurate geometry.

3.2 Haptic Rendering

A very efficient collision detection scheme presented in [7] is utilized in the proposed framework to resolve collisions and perform realistic simulations. Moreover, a simple and very efficient haptic rendering scheme has been developed that utilizes the superquadric representation of the virtual hand to rapidly estimate the force feedback. Consider that point P is a point of the penetrating object and is detected to lie inside a segment of the virtual hand (Figure 2).
Fig. 2. Force feedback evaluation

Let also S^P_{SQ} represent the distance of point P from the superquadric segment, which corresponds to point P_{SQ} on the superquadric surface, i.e., P_{SQ} is the projection of P onto the superquadric. The amplitude of the force fed onto the haptic device is obtained using a simple spring model, as illustrated in Figure 2. In particular:

F = k · S^P_{SQ}    (1)

where k is the stiffness of the spring. The rest length of the spring is set to zero so that it tends to bring point P onto the superquadric surface. In the present framework, the already obtained superquadric approximation is used in order to rapidly evaluate the force direction. More precisely, the direction of the force feedback is set to be perpendicular to the superquadric surface at point P_{SQ}. In particular, using the parametric representation of the superquadric [6], the normal vector at a point r(η, ω) is defined as the cross product of the tangent vectors along the coordinate curves:

n(η, ω) = t_η(η, ω) × t_ω(η, ω) = s(η, ω) [ (1/a_1) cos^{2−ε_1}η · cos^{2−ε_2}ω,  (1/a_2) cos^{2−ε_1}η · sin^{2−ε_2}ω,  (1/a_3) sin^{2−ε_1}η ]^T    (2)

where

s(η, ω) = −a_1 a_2 a_3 ε_1 ε_2 sin^{ε_1−1}η · cos^{2ε_1−1}η · sin^{ε_2−1}ω · cos^{ε_2−1}ω    (3)

Thus, the resulting force is estimated from the following equation:

F = k · S^P_{SQ} · n(η, ω) / ‖n(η, ω)‖    (4)

A significant advantage of the proposed haptic rendering scheme is that friction and haptic texture can be analytically modeled by modifying the above equation through the addition of a friction component. In particular,

F_friction = −f_C · (1 + k_f · S^P_{SQ}) · n_f(η, ω) / ‖n_f(η, ω)‖    (5)
where f_C is the friction coefficient and n_f the direction of motion of the processed point. The friction force is applied perpendicular to the penetration direction, and the term in parentheses increases its magnitude as the penetration depth of the processed point increases. The factor k_f controls the contribution of the penetration depth to the calculated friction force. Finally, the force fed onto the haptic device results from the addition of the reaction and the friction forces. Following a similar procedure, a force component related to haptic texture can also be modeled.
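A compact Python sketch of Eqs. (4)-(5) is given below; the stiffness and friction constants are placeholder values, and the (unnormalized) surface normal is assumed to have been computed from Eqs. (2)-(3) beforehand.

import numpy as np

def haptic_force(p, p_sq, normal, velocity_dir, k=0.8, f_c=0.3, k_f=0.5):
    """Reaction force (Eq. 4) plus friction (Eq. 5) for a point p penetrating a superquadric.
    p_sq is the projection of p on the superquadric; normal is the (unnormalized) surface normal."""
    depth = np.linalg.norm(p - p_sq)                    # S^P_SQ, the penetration distance
    n_hat = normal / np.linalg.norm(normal)
    f_reaction = k * depth * n_hat                      # spring-like reaction along the surface normal
    n_f = velocity_dir / np.linalg.norm(velocity_dir)   # direction of motion of the processed point
    f_friction = -f_c * (1.0 + k_f * depth) * n_f       # friction grows with penetration depth
    return f_reaction + f_friction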
3.3 Geometry Construction and Haptic Interaction Agents

The geometry construction agent (GCA) is responsible for the construction and assembly of the geometrical objects in the scene. The agent allows the user to insert a variety of objects. Default-sized geometrical objects can be used in order to construct an environment rapidly. The properties of inserted objects can be modified using one or more control points. The GCA is responsible for appropriately checking and modifying the user's actions in order to allow only admissible modifications to the objects. The Haptic Interaction Agent (HIA) is responsible for returning force feedback to the user, providing sufficient data to the Geometry Construction Agent and triggering the appropriate actions according to the user input. The environment supports different layers in order to provide an easier way of interaction to the user. The user can select one layer as active and multiple layers as visible. The HIA returns feedback only when the hand is in contact with objects of visible layers, and the actions of the user modify the active layer. The HIA receives collision information from the collision detection sub-component and is responsible for triggering actions in the haptic environment and sending haptic feedback to the user. Feedback is sent to the fingers that
touch any visible geometry in the scene. The HIA decides when geometries in the scene are grasped or released by the user's hand. To grasp an object, the user must touch it with the thumb and index fingertips. To release an object, the index and thumb fingers should refrain from touching it.
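The grasp/release rule can be sketched as a small state update; the boolean inputs are assumed to come from the collision detector for the thumb and index fingertips.

def update_grasp_state(grasped, thumb_touching, index_touching):
    """Grasp when both thumb and index fingertips touch the object; release when neither does."""
    if not grasped and thumb_touching and index_touching:
        return True       # object becomes grasped
    if grasped and not thumb_touching and not index_touching:
        return False      # object is released
    return grasped        # otherwise keep the previous state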
4 Evaluation and Experimental Results

The proposed framework has been evaluated with simulations of the functionality of ancient technology works and war machines. The evaluation included both the assembly and the functional simulation of the virtual prototypes by several users, including visitors of the Science Center and Technology Museum of Thessaloniki. The virtual prototypes include the Archimedes screw-pump, the Ktisivios pump, single and double pulley cranes, catapults, cross-bows, the sphere of Eolos, the odometer and other ancient machines. Illustrations of some of the aforementioned virtual prototypes are depicted in Figure 3.
Fig. 3. Ancient technology works. Starting from the top left image: Archimedes screw pump, catapult, double pulley crane, odometer, Eolos’ sphere.
The simulation fidelity and efficiency of the ancient technology works were tested in the context of the performed scenarios, and a haptic rendering update rate of 1 kHz can be achieved even for large and detailed virtual environments, which was not possible with state-of-the-art mesh-based approaches. Moreover, the force feedback obtained from the proposed scheme does not suffer from the force discontinuities at the edges of the mesh triangles, contrary to approaches that generate the force feedback directly from the meshes of the colliding objects, and does not produce the over-rounded effect of the force shading method [8]. The system has been evaluated in tests with visitors of the Science Center and Technology Museum of Thessaloniki, in Greece. The test procedure consisted of two phases: In the first phase, the users were introduced to the system and they were asked
to use it. During this phase, they were asked questions that focused on usability issues and on their interest in participating in each test. The questionnaire also contained questions for the test observers, e.g., whether the user performed the task correctly, how long it took him/her to perform the task, etc. The second phase was carried out immediately after the tests, using a post-test questionnaire. Specifically, after finishing all the tests, the users were questioned about general issues such as: (a) the benefits and limitations that they foresee in this technology, (b) the usability of the system in a museum environment, and (c) other tests, applications or technologies that they would like to experiment with, if any. The system evaluation results have shown that users consider it very innovative and satisfactory as a presentation environment for a real museum. The percentage of satisfied users was over 90%.
5 Conclusions
In this paper a novel framework for the simulation of ancient technology works was presented. Novel virtual reality technologies for object modeling and haptic rendering have been proposed that provide realistic interactive simulation using haptic devices. Moreover, a number of simulation scenarios have been developed and evaluated by visitors of the Science Center and Technology Museum in Thessaloniki, Greece. Specifically, the analysis of the basic characteristics of Ancient Greek technologies is presented using virtual reality environments, so that they become easily perceptible even to those who are not familiar with the technology. In this way, the platform contributes substantially to the general effort to promote knowledge of ancient technologies.
References 1. Iliadis, N.: Learning Technology Through the Internet. Kastaniotis Publisher, Athens (2002) 2. Ledermann, F., Schmalstieg, D.: Presenting an archaeological site in the virtual showcase. In: Proceedings of the 2003 conference on Virtual reality, archeology, and cultural heritage. ACM Press, New York (2003) 3. Papaioannou, G., Christopoulos, D.: Enhancing virtual reality walkthroughs of archaeological sites. In: Proceedings of the 2003 conference on Virtual reality, archeology, and cultural heritage. ACM Press, New York (2003) 4. Gottschalk, S., Lin, M.C., Manocha, D.: OBBTree: A Hierarchical Structure for Rapid Interference Detection. In: Proc. ACM SIGGRAPH, pp. 171–180 (1996) 5. McNeely, W.A., Puterbaugh, K.D., Troy, J.J.: Six Degree-of-Freedom Haptic Rendering Using Voxel Sampling. In: Computer Graphics and Interactive Techniques, pp. 401–408 (1999) 6. Solina, F., Bajcsy, R.: Recovery of parametric models from range images: The case for superquadrics with global deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(2), 131–147 (1990) 7. Moustakas, K., Tzovaras, D., Strintzis, M.G.: SQ-Map: Efficient Layered Collision Detection and Haptic Rendering. IEEE Transactions on Visualization and Computer Graphics 13(1), 80–93 (2007) 8. Ruspini, D.C., Kolarov, K., Khatib, O.: The Haptic Display of Complex Graphical Environments. In: Computer Graphics (SIGGRAPH 1997 Conference Proceedings), pp. 345–352 (1997)
Ethnicity as a Factor for the Estimation of the Risk for Preeclampsia: A Neural Network Approach
Costas Neocleous1, Kypros Nicolaides2, Kleanthis Neokleous3, and Christos Schizas3
1 Department of Mechanical Engineering, Cyprus University of Technology, Lemesos, Cyprus [email protected]
2 Harris Birthright Research Centre for Fetal Medicine, King's College Hospital Medical School, Denmark Hill, SE5 8RX, London, United Kingdom [email protected]
3 Department of Computer Science, University of Cyprus, 75 Kallipoleos, 1678, POBox 20537, Nicosia, Cyprus [email protected], [email protected]
Abstract. A large number of feedforward neural structures, both standard multilayer and multi-slab schemes, have been applied to a large database of pregnant women, aiming at generating an early-stage predictor of the risk of preeclampsia occurrence. In this study we have investigated the importance of ethnicity for the classification yield. The database was composed of 6838 cases of pregnant women in the UK, provided by the Harris Birthright Research Centre for Fetal Medicine in London. For each subject, 15 parameters were considered the most influential in characterizing the risk of preeclampsia occurrence, including information on ethnicity. The same data were applied to the same neural architecture after excluding the information on ethnicity, in order to study its importance for the correct classification yield. It has been found that the inclusion of information on ethnicity deteriorates the prediction yield in the training and test (guidance) data sets but not in the totally unknown verification data set. Keywords: preeclampsia, neural predictor, ethnicity, gestational age.
1 Introduction
Preeclampsia is a syndrome that may appear during pregnancy and can cause perinatal and maternal morbidity and mortality. It affects approximately 2% of pregnancies [1; 2]. It is characterized by hypertension and by a significant protein concentration in the urine (proteinuria). Such high blood pressure may result in damage to the maternal endothelium, kidneys and liver [3; 4]. Preeclampsia may occur during the late 2nd or 3rd trimesters. It has also been observed that it is more common in women in their first pregnancy. The prevailing conditions that lead to preeclampsia are not well understood, hence its diagnosis depends on appropriate signs or suitable investigations [5]. The likelihood of developing
preeclampsia is thought to increase with a number of factors in the maternal history, such as nulliparity, a high body mass index (BMI), and a previous personal or family history of preeclampsia. However, screening by maternal history alone will detect only 30% of those who will develop the condition, with a false positive rate of 10%. Thus, the early diagnosis of preeclampsia is a difficult task, and its prediction even more difficult. Attempts at preeclampsia prevention using prophylactic interventions have been rather unsuccessful [6; 7]. Thus, any tool that may improve its detection, such as a reliable predictor or a method for the effective and early identification of the high-risk group, would be of great help to obstetricians and of course to pregnant women. In recent years, neural networks and other computational intelligence techniques have been used as medical diagnosis tools aiming at effective medical decisions incorporated in appropriate medical support systems [8; 9]. Neural networks in particular have proved to be quite effective and have also resulted in some relevant patents [10; 11].
2 Data
The data were obtained from the greater London area and South-East England, from pregnant women who had singleton pregnancies and attended routine clinical and ultrasound assessment of the risk for chromosomal abnormalities. The database was composed of 6838 cases of pregnant women. For each woman, 24 parameters that were presumed to contribute to preeclampsia were recorded. Some of these parameters were socio-epidemiologic, others were records from ultrasound examination and some came from appropriate laboratory measurements. Based on recommendations from medical experts, only 15 parameters were ultimately considered the most influential in characterizing the risk of preeclampsia occurrence, and those were used in the construction of the neural predictor. These are: mean arterial pressure (MAP), uterine pulsatility index (UPI), serum marker PAPP-A, ethnicity, weight, height, smoking (Y/N), alcohol consumption (Y/N), previous preeclampsia, conception (spontaneous, ovulation drug or IVF), medical condition of the pregnant woman, drugs taken by the pregnant woman, gestational age (in days) when the crown rump length (CRL) was measured, crown rump length, and whether the mother had preeclampsia (Y/N). The parameters were encoded on appropriate numerical scales so as to make the neural processing most effective. A network guidance test set of 36 cases was extracted and used to monitor the progress of training. This data set included 16 cases (44%) of women who exhibited preeclampsia. Also, a verification data set of 9 cases, 5 of which (56%) involved preeclampsia, was extracted to be kept totally unknown to the neural network and thus used for checking the prediction capabilities of each attempted network.
3 Neural Predictor
A number of feedforward neural structures, both standard multilayer ones with varying numbers of layers and neurons per layer, and multi-slab ones of different structures, sizes,
and activation functions, were systematically tried for the prediction. The structure ultimately selected and used was a multi-slab neural structure having four slabs, connected as depicted in Figure 1. Based on extensive previous experience of some of the authors, all the weights were initialized to 0.3, while the learning rate was the same for all connections, having a value of 0.1. Similarly, the momentum rate was 0.2 for all links.
[Figure 1 layout: an input slab with the 15 characteristics (MAP, UPI, PAPP-A, ethnicity, weight, height, smoking, alcohol, previous PET, conception, medical condition, drugs, GA in days, CRL, mother's previous PET) and linear activation; two hidden slabs of 100 neurons each, with Gaussian complement and Gaussian activations; and an output neuron for preeclampsia occurrence with logistic activation.]
Fig. 1. The neural structure that was selected and used for the prediction of preeclampsia
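To make the architecture of Fig. 1 more concrete, the following NumPy sketch shows a forward pass through a multi-slab network of this shape: a linear input slab of the 15 encoded characteristics, two parallel hidden slabs of 100 neurons with Gaussian-complement and Gaussian activations, and a logistic output neuron. The wiring is our illustrative assumption (the exact slab connectivity is not fully spelled out here), the constant 0.3 weight initialization follows the text, and training with learning rate 0.1 and momentum 0.2 is not shown.

```python
import numpy as np

# Illustrative forward pass for a multi-slab network in the spirit of Fig. 1;
# connectivity details are assumptions, not the authors' exact implementation.

def gaussian(x):             # Gaussian activation
    return np.exp(-x ** 2)

def gaussian_complement(x):  # Gaussian-complement activation
    return 1.0 - np.exp(-x ** 2)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

n_in, n_hidden = 15, 100
# The text reports that all weights were initialized to 0.3.
W1 = np.full((n_in, n_hidden), 0.3)   # input -> slab 1 (Gaussian complement)
W2 = np.full((n_in, n_hidden), 0.3)   # input -> slab 2 (Gaussian)
Wo = np.full((2 * n_hidden, 1), 0.3)  # both slabs -> output

def forward(x):
    """x: vector of the 15 encoded maternal characteristics."""
    s1 = gaussian_complement(x @ W1)
    s2 = gaussian(x @ W2)
    return logistic(np.concatenate([s1, s2]) @ Wo)  # estimated preeclampsia risk

print(forward(np.zeros(n_in)))
```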
4 Results
Table 1 shows the data characteristics and the overall prediction results when ethnicity was included among the input information.
Table 1. Preeclampsia prediction results when all characteristics were used

                                             TRAINING SET   TEST SET   VERIFICATION SET
No of subjects in the database                       6793         36                  9
No of preeclampsia cases                              116         16                  5
Percentage of preeclampsia cases                      1.7       44.4               55.6
Cases predicted                                      3024         26                  7
Percentage of cases predicted                        44.5       72.2               77.8
Preeclampsia cases predicted                           97         15                  5
Percentage of preeclampsia cases predicted           83.6       93.8                100
5 Conclusion and Future Work
Considering the importance of ethnicity, and contrary to the conclusions of Chamberlain and Steer [6], it has been found that including such information makes the prediction yield worse, as can easily be observed from Table 2. In fact, when the "ethnicity" information is excluded from the network input, the training and test data set predictions improve, especially that of the test data set. Thus, it may be concluded that this information is not needed in order to assure a high prognosis yield. In future work, a sensitivity analysis on other important predictors will be done in order to reach a trimmed network that may effectively predict preeclampsia using as little input information as possible.
Table 2. Preeclampsia prediction results when "ethnicity" information was not used

                                             TRAINING SET   TEST SET   VERIFICATION SET
No of preeclampsia cases                              116         16                  5
Preeclampsia cases predicted                           99         16                  5
Percentage of preeclampsia cases predicted           85.3        100                100
Acknowledgments The FMF foundation is a UK registered charity (No. 1037116). We would also like to kindly acknowledge Dr Leona C. Poon and Dr Panayiotis Anastasopoulos for their contribution to the initial organization of the parameters from the original database.
References 1. World Health Organization. Make Every Mother and Child Count, World Health Report, Geneva, Switzerland (2005) 2. Lewis, G. (ed.): Why Mothers Die 2000–2002: The Sixth Report of Confidential Enquiries Into Maternal Deaths in the United Kingdom, pp. 79–85. RCOG Press, London (2004) 3. Drife, J., Magowan, B. (eds.): Clinical Obst. and Gyn., ch. 39, pp. 367–370. Saunders, Philadelphia (2004) 4. Douglas, K., Redman, C.: Eclampsia in the United Kingdom. Br. Med. J. 309(6966), 1395–1400 (1994) 5. James, D., Steer, P., Weiner, C., Gonik, B. (eds.): High Risk Pregnancy, ch. 37, pp. 639–640. Saunders, Philadelphia (1999) 6. Chamberlain, G., Steer, P.: Turnbull's Obstetrics, ch. 21, pp. 336–337. Churchill Livingstone (2001) 7. Moffett, A., Hiby, S.: How does the maternal immune system contribute to the development of pre-eclampsia? Placenta (2007) 8. Yu, C., Smith, G., Papageorghiou, A., Cacho, A., Nicolaides, K.: An integrated model for the prediction of pre-eclampsia using maternal factors and uterine artery Doppler velocimetry in unselected low-risk women. Am. J. Obstet. Gynecol. 193, 429–436 (2005) 9. Computer-based neural network system and method for medical diagnosis and interpretation. US Patent 5,839,438
A Multi-class Method for Detecting Audio Events in News Broadcasts
Sergios Petridis, Theodoros Giannakopoulos, and Stavros Perantonis
Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, National Center of Scientific Research Demokritos
{petridis,sper}@iit.demokritos.gr, [email protected]
Abstract. We propose a method for audio event detection in video streams from news. Apart from detecting speech, which is obviously the major class in such content, the proposed method detects five non-speech audio classes. The major difficulty of the particular task lies in the fact that most of the non-speech audio events are actually background sounds, with speech as the primary sound. We have adopted a set of 21 statistics computed on a mid-term basis over 7 audio features. A variation of the One Vs All classification architecture has been adopted and each binary classification problem is modeled using a separate probabilistic Support Vector Machine. Experiments have shown that the proposed method can achieve high precision rates for most of the audio events of interest. Keywords: Audio event detection, Support Vector Machines, Semiautomatic multimedia annotation.
1 Introduction
With the huge increase of multimedia content that is made available over the Internet, a number of methods have been proposed for automatic characterization of this content. Especially for the case of multimedia files from news broadcasts, several methods have been proposed for automatic annotation, although only a few of those make extensive use of the audio domain ([1], [2], [3], [4]). In this work, we propose an audio-based algorithm for event detection in real broadcaster videos. This work is part of the CASAM European project (www.casamproject.eu), which aims at computer-aided semantic annotation of multimedia data. Our main goal is to detect (apart from speech) five non-speech sounds that were encountered in our datasets from real broadcasts. Most of these audio events were secondary to the main event, which is obviously speech. The task of recognizing background audio events in news can help in extracting richer semantic information from such content.
2 Audio Class Description
Since the purpose of this work is to analyze audio streams from news, it is expected that the vast majority of the audio data is speech. Therefore, the first
of the audio classes we have selected to detect is speech. Speech tracking may be useful if its results are used by another audio analysis module, e.g. by a speech recognition task. However, the detection of speech as an event is not of major importance in a news audio stream. Therefore, the following more semantically rich audio classes have been selected: music, sound of water, sound of air, engine sounds and applause. In a news audio stream the above events most of the time exist as background events, with speech being the major sound. Hence, the detection of such events is obviously a hard task. It has to be noted that an audio segment can, at the same time, be labeled as speech and as some other type of event, e.g. music.
3 Audio Feature Extraction
3.1 Short-Term and Mid-term Processing for Feature Extraction
In order to calculate any audio feature of an audio signal, a short-term processing technique needs to be adopted. The audio signal is divided into (overlapping or non-overlapping) short-term windows (frames) and the feature value f is calculated for each frame. In this way, an array of feature values F for the whole audio signal is obtained. We have selected a frame size equal to 40 msecs and a step of 20 msecs. The process of short-term windowing described above leads, for each audio signal, to a sequence F of feature values. This sequence can be used for processing / analysis of the audio data. However, a common technique is to process the features on a mid-term basis. According to this technique, the audio signal is first divided into mid-term windows (segments) and then, for each segment, the short-term process is executed. In the sequel, the sequence F extracted for each segment is used for calculating a statistic, e.g., the average value. So finally, each segment is represented by a single value, which is the statistic of the respective feature sequence. We have chosen to use a 2 second mid-term window, with a 1 second step.
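The two-level windowing described above can be sketched as follows; the code is an illustration under the stated frame/segment sizes, with `feature_fn` standing for any of the short-term features of the next subsection (all names are ours, not the authors').

```python
import numpy as np

# Illustrative two-level windowing: 40 ms frames with a 20 ms step inside
# 2 s mid-term segments with a 1 s step, and three statistics per feature.

def mid_term_statistics(signal, fs, feature_fn,
                        frame=0.040, frame_step=0.020,
                        segment=2.0, segment_step=1.0):
    seg_len, seg_step = int(segment * fs), int(segment_step * fs)
    frm_len, frm_step = int(frame * fs), int(frame_step * fs)
    stats = []
    for s in range(0, len(signal) - seg_len + 1, seg_step):
        window = signal[s:s + seg_len]
        # short-term feature sequence F for this segment
        F = np.array([feature_fn(window[f:f + frm_len])
                      for f in range(0, seg_len - frm_len + 1, frm_step)])
        mu, sigma = F.mean(), F.std()
        stats.append((mu, sigma, sigma / mu if mu else 0.0))  # mean, std, std/mean
    return np.array(stats)

# Example with short-term energy as the feature:
fs = 16000
x = np.random.randn(fs * 5)                       # 5 s of dummy audio
energy = lambda frame: np.mean(frame ** 2)
print(mid_term_statistics(x, fs, energy).shape)   # (4, 3): 4 segments, 3 statistics
```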
3.2 Adopted Audio Features and Respective Statistics
We have implemented 7 audio features and, for each feature, three statistics have been computed on a mid-term basis: mean value, standard deviation and std-by-mean ratio. Therefore, in total, each mid-term window is represented by 21 feature values. In the following, the 7 features are presented, along with some examples of their statistics for different audio classes. For more detailed descriptions of the adopted audio features the reader can refer to [5].
Energy. Let x_i(n), n = 1, ..., N, be the audio samples of the i-th frame, of length N. Then, for each frame i the energy is calculated according to the equation E(i) = (1/N) Σ_{n=1}^{N} |x_i(n)|². A statistic that has been used for the case of
discriminating signals with large energy variations (like speech, gunshots etc.) is the standard deviation σ² of the energy sequence.
Zero Crossing Rate. Zero Crossing Rate (ZCR) is the rate of sign-changes of a signal, i.e., the number of times the signal changes from positive to negative or back, per time unit. It can be used for discriminating noisy environmental sounds, e.g., rain. In speech signals, the σ/μ ratio of the ZCR sequence is high, since speech contains unvoiced (noisy) and voiced parts and therefore the ZCR values have abrupt changes. On the other hand, music, being largely tonal in nature, does not show abrupt changes of the ZCR. ZCR has been used for speech-music discrimination ([6]) and for musical genre classification ([7]).
Energy Entropy. This feature is a measure of abrupt changes in the energy level of an audio signal. It is computed by further dividing each frame into K sub-frames of fixed duration. For each sub-frame j, the normalized energy e_j is calculated, i.e., the sub-frame's energy divided by the whole frame's energy. Afterwards, the entropy of this sequence is computed. The entropy of energy of an audio frame is lower if there are abrupt changes present in that audio frame. Therefore, it can be used for the discrimination of abrupt energy changes.
Spectral Centroid. The spectral centroid C_i of the i-th frame is defined as the center of "gravity" of its spectrum. This feature is a measure of the spectral position, with high values corresponding to "brighter" sounds.
Position of the Maximum FFT Coefficient. This feature directly uses the FFT coefficients of the audio segment: the position of the maximum FFT coefficient is computed and then normalized by the sampling frequency. This feature is another measure of the spectral position.
Spectral Rolloff. Spectral rolloff is the frequency below which a certain percentage (usually around 90%) of the magnitude distribution of the spectrum is concentrated. It is a measure of the spectral shape of an audio signal and it can be used for discriminating between voiced and unvoiced speech ([8]).
Spectral Entropy. Spectral entropy ([9]) is computed by dividing the spectrum of the short-term frame into L sub-bands (bins). The energy E_f of the f-th sub-band, f = 0, ..., L−1, is then normalized by the total spectral energy, yielding n_f = E_f / Σ_{f'=0}^{L−1} E_{f'}, f = 0, ..., L−1. The entropy of the normalized spectral energy is then computed by the equation H = −Σ_{f=0}^{L−1} n_f · log₂(n_f).
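For illustration, the following sketch implements three of the listed features for a single frame, following the descriptions above; the FFT handling and the number of sub-bands are our own assumptions rather than the authors' exact settings.

```python
import numpy as np

# Sketch implementations of three of the short-term features, computed on one
# frame x; formulas follow the descriptions above.

def zero_crossing_rate(x):
    return np.mean(np.abs(np.diff(np.sign(x)))) / 2.0   # sign changes per sample

def spectral_centroid(x, fs):
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return np.sum(freqs * mag) / (np.sum(mag) + 1e-12)  # spectral "center of gravity"

def spectral_entropy(x, n_bins=20):
    mag2 = np.abs(np.fft.rfft(x)) ** 2
    bins = np.array_split(mag2, n_bins)                 # L sub-bands
    E = np.array([b.sum() for b in bins])
    n = E / (E.sum() + 1e-12)                           # normalized sub-band energies
    return -np.sum(n * np.log2(n + 1e-12))              # H = -sum n_f log2 n_f

frame = np.random.randn(640)   # one 40 ms frame at 16 kHz
print(zero_crossing_rate(frame), spectral_centroid(frame, 16000), spectral_entropy(frame))
```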
4 Event Detection
As described in Section 3, the mid-term analysis procedure leads to a vector of 21 elements for each mid-term window. In order to classify each audio segment, we
have adopted Support Vector Machines (SVMs) and a variation of the One Vs All classification architecture. In particular, each binary classification task, e.g., 'Speech Vs Non-Speech', 'Music Vs Non-Music', etc., is modeled using a separate SVM. The SVM has a soft output which is an estimate of the probability that the input sample (i.e. audio segment) belongs to the respective class. Therefore, for each audio segment the following soft classification outputs are extracted: P_speech, P_music, P_air, P_water, P_engine, P_applause. Furthermore, a corresponding threshold is defined for each of the six binary classification tasks. In the training stage, apart from the training of the SVMs, a cross-validation procedure is executed for each of the binary classification sub-problems, in order to estimate the thresholds which maximize the respective binary precision rates. For each audio segment the following four possible classification decisions can exist: a) the label Speech can be given to the segment; b) any of the non-speech labels can be given to the segment; c) the label Speech and any of the other labels can be given to the segment; d) the segment can be left unlabeled. In the event detection testing stage, given the six soft decisions from the respective binary classification tasks, for each 1-sec audio segment the following process is executed:
– If P_speech ≥ T_speech, then the label 'Speech' is given to the segment.
– For each of the other labels i, i ∈ {music, air, water, engine, applause}: if P_i < T_i then P_i = 0.
– Find the maximum of the non-speech soft outputs and its label i_max.
– If P_imax > T_imax, then label the segment as i_max.
The above process is repeated for all mid-term segments of the audio stream. As a final step, successive audio segments that share the same label are merged. This leads to a sequence of audio events, each of which is characterized by its label and its time limits.
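A compact sketch of this per-segment decision procedure and the final merging step is given below; the dictionaries `probs` and `thresholds` are illustrative stand-ins for the SVM soft outputs and the cross-validated thresholds, and the code is not the authors' implementation.

```python
# Hedged sketch of the per-segment decision rules and the merging of successive
# segments into events; all names and values are illustrative.

NON_SPEECH = ("music", "air", "water", "engine", "applause")

def label_segment(probs, thresholds):
    labels = []
    if probs["speech"] >= thresholds["speech"]:
        labels.append("speech")
    # suppress non-speech outputs that do not pass their threshold
    candidates = {c: (probs[c] if probs[c] >= thresholds[c] else 0.0) for c in NON_SPEECH}
    c_max = max(candidates, key=candidates.get)
    if candidates[c_max] > 0.0:
        labels.append(c_max)
    return labels            # may be empty, speech-only, non-speech-only, or both

def merge_segments(per_second_labels):
    """Merge successive 1-second segments that share the same label set."""
    events = []
    for t, labels in enumerate(per_second_labels):
        if events and events[-1][2] == labels:
            events[-1] = (events[-1][0], t + 1, labels)   # extend the previous event
        else:
            events.append((t, t + 1, labels))             # (start, end, labels)
    return events

probs = {"speech": 0.8, "music": 0.6, "air": 0.1, "water": 0.2, "engine": 0.3, "applause": 0.0}
thr = dict.fromkeys(probs, 0.5)
print(label_segment(probs, thr))   # ['speech', 'music']
```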
5 Experimental Results
5.1 Datasets and Manual Annotation
For training and testing purposes, two datasets have been populated in the CASAM project: one from a German international broadcaster (DW - Deutsche Welle) and the second from the Portuguese broadcaster Lusa (Agência de Notícias de Portugal). Almost 100 multimedia streams (7 hours total duration) from the above datasets have been manually annotated, using the Transcriber Tool (http://trans.sourceforge.net/). The annotation of the audio stream is carried out on a segment basis. For each homogeneous segment, two labels are defined: the primary label is binary and corresponds to the existence of speech, while the secondary label is related to the type of background sound. In Table 1, a representation of an example of an annotated audio file is shown.
Table 1. Representation example for an annotated audio file

Segment Start   Segment End   Primary Label (speech)   Secondary Label
0               1.2           yes                      engine
1.2             3.3           no                       engine
3.3             9.8           no                       music
...             ...           ...                      ...

Table 2. Detection performance measures

Class names                    Recall (%)   Precision (%)
Speech                         87           80
SoundofAir                     20           82
CarEngine                      42           87
Water                          52           90
Music                          56           85
Applause                       59           99
Average (non-speech events)    45           86
5.2 Method Evaluation
Performance measures. The audio event detection performance measures should differ from the standard definitions used in the classification case. In order to proceed, let us first define an event as the association of a segment s with an element c of a class set: e = {s → c}. Furthermore, let S be the set of all segments of events known to hold as ground truth and S' be the set of all segments of events found by the system. For a particular class label c, let S(c) = {s ∈ S : s → c} be the set of ground truth segments associated with class c, S̄(c) = {s ∈ S : s → c' ≠ c} the set of ground truth segments not associated with class c, S'(c) = {s' ∈ S' : s' → c} the set of system segments associated with class c and S̄'(c) = {s' ∈ S' : s' → c' ≠ c} the set of system segments not associated with class c. In the sequel, let two segments s, s' and a threshold value t ∈ (0, 1) be given. We define the segment matching function g_t : S × S' → {0, 1} as g_t(s, s') = 1 if |s ∩ s'| / |s ∪ s'| > t, and 0 otherwise. For defining the recall rate, let A(c) be the ground truth segments s → c for which there exists a matching segment s' → c, i.e., A(c) = {s ∈ S(c) : ∃s' ∈ S'(c) : g_t(s, s') = 1}. Then, the recall of class c is defined as Recall(c) = |A(c)| / |S(c)|. In order to define the event detection precision, let A'(c) be the system segments s' → c for which there exists a matching segment s → c: A'(c) = {s' ∈ S'(c) : ∃s ∈ S(c) : g_t(s, s') = 1}. Then the precision of class c is defined as Precision(c) = |A'(c)| / |S'(c)|.
Performance results. In Table 2, the results of the event detection process are presented. It can be seen that for most of the audio event types the precision rate is above 80%. Furthermore, the average performance measures for all non-speech events have been calculated. In particular, the recall rate was found
equal to 45%, while precision was 86%. This actually means that almost half of the manually annotated audio events were successfully detected, while 86% of the detected events were correctly classified.
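The segment-based evaluation of Section 5.2 can be sketched as follows: two segments match when their temporal intersection-over-union exceeds the threshold t, and recall and precision count the matched ground-truth and system segments of a class. The code is an illustration with made-up events, not the evaluation script used in the paper.

```python
# Sketch of the segment-based evaluation described above.

def iou(a, b):
    (s1, e1), (s2, e2) = a, b
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0

def recall_precision(ground_truth, detected, cls, t=0.5):
    """ground_truth / detected: lists of (start, end, label) events."""
    gt = [(s, e) for s, e, c in ground_truth if c == cls]
    dt = [(s, e) for s, e, c in detected if c == cls]
    matched_gt = sum(any(iou(g, d) > t for d in dt) for g in gt)   # recalled events
    matched_dt = sum(any(iou(d, g) > t for g in gt) for d in dt)   # correct detections
    recall = matched_gt / len(gt) if gt else 0.0
    precision = matched_dt / len(dt) if dt else 0.0
    return recall, precision

truth = [(0.0, 2.0, "music"), (5.0, 8.0, "music")]
found = [(0.2, 2.1, "music"), (10.0, 11.0, "music")]
print(recall_precision(truth, found, "music"))  # (0.5, 0.5)
```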
6 Conclusions
We have presented a method for automatic audio event detection in news videos. Apart from detecting speech, which is obviously the most dominant class in the particular content, we have trained classifiers for detecting five other types of sounds, which can provide important content information. Our major purpose was to achieve high precision rates. The experiments, carried out over a large dataset from real news streams, indicate that the precision rates are always above 80%. Finally, the proposed method managed to detect almost 50% of all the manually annotated non-speech events, while 86% of the detected events were correct. This is a rather high performance, if we take into consideration that most of these events exist as background sounds to speech in the given content. Acknowledgments. This paper has been supported by the CASAM project (www.casam-project.eu).
References 1. Mark, B., Jose, J.M.: Audio-based event detection for sports video. In: Bakker, E.M., Lew, M., Huang, T.S., Sebe, N., Zhou, X.S. (eds.) CIVR 2003. LNCS, vol. 2728, pp. 61–65. Springer, Heidelberg (2003) 2. Baillie, M., Jose, J.: An audio-based sports video segmentation and event detection algorithm. In: 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 110–110 (2004) 3. Tzanetakis, G., Chen, M.: Building audio classifiers for broadcast news retrieval. In: 5th International Workshop on Image Analysis for Multimedia Interactive Services, Lisboa, Portugal, April 2004, pp. 21–23 (2004) 4. Huang, R., Hansen, J.: Advances in unsupervised audio segmentation for the broadcast news and ngsw corpora. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004, vol. 1 (2004) 5. Giannakopoulos, T.: Study and application of acoustic information for the detection of harmful content, and fusion with visual information. PhD thesis, Dpt. of Informatics and Telecommunications, University of Athens, Greece (2009) 6. Panagiotakis, C., Tziritas, G.: A speech/music discriminator based on rms and zerocrossings 7(1), 155–166 (2005) 7. Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5), 293–302 (2002) 8. Hyoung-Gook, K., Nicolas, M., Sikora, T.: MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval. John Wiley & Sons, Chichester (2005) 9. Misra, H., et al.: Spectral entropy based feature for robust asr. In: ICASSP, Montreal, Canada (2004)
Flexible Management of Large-Scale Integer Domains in CSPs
Nikolaos Pothitos and Panagiotis Stamatopoulos
Department of Informatics and Telecommunications, University of Athens, Panepistimiopolis, 157 84 Athens, Greece
{pothitos,takis}@di.uoa.gr
Abstract. Most research on Constraint Programming concerns the (exponential) search space of Constraint Satisfaction Problems (CSPs) and intelligent algorithms that reduce and explore it. This work proposes a different way, not of solving a problem, but of storing the domains of its variables, an important—and less studied—issue, especially when they are large. The new data structures that are used are proved theoretically and empirically to adapt better to large domains than the commonly used ones. The experiments of this work display the contrast between the most popular Constraint Programming systems and a new system that uses the proposed data structures in order to solve CSP instances with wide domains, such as known Bioinformatics problems. Keywords: CSP domain, Bioinformatics, stem-loop detection.
1 Introduction
Constraint Programming is an Artificial Intelligence area that focuses on solving CSPs in an efficient way. A CSP is a triplet containing variables, their domains (i.e. sets of values) and constraints between variables. The simplicity of this definition makes Constraint Programming attractive to many Computer Science fields, as it makes it easy to express a variety of problems. When it comes to solving a CSP, the main problem that we face is the exponential time needed, in the general case. The space complexity comes in second place, as it is polynomial in the size (usually denoted d) of the largest domain. But is O(d) the best space—and therefore time—complexity we can achieve when we have to store a domain? Is it possible to define a lower bound for this complexity? Memory management is a crucial factor determining a Constraint Programming system's speed, especially when d is very large. Gent et al. have recently described data structures used to propagate the constraints of a CSP [3]. To the best of our knowledge, the representation of a domain itself has not yet been the primary focus of a specific publication in the area. Nevertheless, Schulte and Carlsson, in their Constraint Programming systems survey [7], formally defined the two most popular data structures that can represent a finite set of integers:
Bit Vector. Without loss of generality, we suppose that a domain D contains only positive integer values. Let a be a bit array. Then the value v belongs to D if and only if a[v] = 1. Bit vector variants are implemented in many Constraint Programming solvers [1,2].
Range Sequence. Another approach is to use a sequence of ranges. Formally, D is 'decomposed' into a set {[a_1, b_1], ..., [a_n, b_n]}, such that ∪_i [a_i, b_i] = D. A desired property for this sequence is to be ordered and the shortest possible, i.e. [a_i, b_i] ∩ [a_j, b_j] = ∅, ∀i ≠ j. In this case δ denotes the number of ranges.
A simpler data structure than the two above stores only the bounds of D. E.g., for the domain [1..100000]1 we store only two numbers in memory: 1 and 100000. Obviously, this is an incomplete representation for non-continuous domains (e.g. [1..3 5..9]). It is therefore incompatible with most algorithms designed for CSPs; only specific methodologies can handle it [11]. On the other hand, for the above domain [1..100000], a bit vector would allocate 100,000 bits of memory, although the domain could be represented by a range sequence using only two memory words. A range sequence can be implemented as a linked list or as a binary tree, so it is costlier to search for a value in it. In this work we study the trade-off between memory allocation cost and time-consuming operations on domains. A new way of memory management that seeks to reduce the redundant space is proposed. The new algorithms and data structures are shown to perform well, especially on problems which contain large domains. Such problems notably occur in Bioinformatics, a science that aims at extracting information from large genetic data.
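As a small illustration of the trade-off discussed above, the sketch below builds both representations of the continuous domain [1..100000]: the bit vector needs one entry per potential value, whereas the range sequence needs a single pair of bounds. The Python types are of course only stand-ins for the bit arrays and linked lists of real solvers.

```python
# Illustrative comparison of the two classic domain representations.

domain_lo, domain_hi = 1, 100_000

# Bit vector: one entry (here one bool) per potential value of the domain.
bit_vector = [True] * (domain_hi - domain_lo + 1)      # ~100,000 entries

# Range sequence: an ordered list of disjoint [a, b] ranges.
range_sequence = [(domain_lo, domain_hi)]              # a single pair of integers

def contains(range_seq, v):
    return any(a <= v <= b for a, b in range_seq)

print(len(bit_vector), len(range_sequence), contains(range_sequence, 42))
```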
2 Efficient Domain Implementations
While attempting to reduce the space complexity, we should not neglect time complexity. Apart from memory allocation, a constraint programming system is responsible for two other basic operations that are executed many times on a domain:
1. Search whether a range of values is included in it.
2. Removal of a range of values from the domain.
Note that addition of values is unnecessary; the domain sizes only decrease, due to constraint propagation or assignments. Search or removal of a range of w values costs O(w) time in a bit vector; if w = 1 this structure is ideal. The same operations in a range sequence that has been implemented as a linked list [7] require O(δ) steps, while the space complexity, which is also O(δ), is much lower than the bit vector's O(d). A wiser choice would be to implement the range sequence as a binary search tree, with an average search/removal complexity of O(log δ) and the space complexity left unaffected.
1 [a..b] denotes the integer set {a, a + 1, ..., b}.
However, the subtraction of a range of values from the tree is complicated. (It roughly performs two traversals and then joins two subtrees.) This is undesirable, not only because of the time it takes, but also because of the many modifications made to the structure. The number of modifications is crucial because they are recorded in order to be undone when a Constraint Programming system backtracks, that is, when it restores a previous (or the initial) state of the domains in order to restart the process of finding a solution to a CSP (through other paths).
2.1 Gap Intervals Tree Representation
To make things simpler and more efficient, a binary search tree of gap ranges was implemented. The advantage of this choice is that the subtraction of a range of values is faster, as it affects only one tree node (i.e. it inserts or modifies only one node). For example, the domain [9..17 44..101] is described by three gaps: [−∞..8], [18..43] and [102..+∞]. Figure 1 depicts the gaps of a domain arranged as a binary search tree. A node of the tree contains the first and the last value of a gap, and pointers to the left and right 'children.'
[Figure 1 shows the gaps [−∞..−17], [−5..0], [10..10], [100..102], [999..1050] and [2001..+∞] arranged as a binary search tree.]
Fig. 1. A tree with the gaps of the domain [−16..−6 1..9 11..99 103..998 1051..2000]
2.2 Search/Delete Algorithm
Another advantage of this approach is that the two basic operations on a domain are performed by a single algorithm named SearchGap.2 This function accepts four arguments (gapNode, newStartVal, newEndVal, removeInterval).
– If removeInterval is 1, the range [newStartVal..newEndVal] is deleted from the domain, which is represented by a tree whose root is gapNode.
– If removeInterval is 0, the function returns a node of the tree that contains at least one element of [newStartVal..newEndVal]. If no node meets this criterion, then the function returns an empty node. Thus, in case we want to check whether a range [a..b] belongs to D, we call SearchGap(root, a, b, 0):
• If the returned node is empty, then [a..b] ⊆ D;
• otherwise [a..b] ⊄ D.
The above procedures manipulate the data structure as a normal binary search tree; the insertions of gaps and the searches for specific values are done in logarithmic time, as we traverse a path from the root gapNode to an internal node. While a Constraint Programming system tries to find a solution, it only adds gaps to the tree. During gap insertions, the algorithm seeks to merge as many gap nodes as possible in order to keep the tree short.
2 Available at http://www.di.uoa.gr/~pothitos/setn2010/algo.pdf
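The following sketch mimics the behaviour of the gap-based domain and its two operations. Unlike the paper's SearchGap, which arranges the gaps in a binary search tree for logarithmic access, this illustration keeps them in a sorted, merged list for brevity, so it gives the same answers with worse asymptotic complexity; all names are ours, not the authors'.

```python
import bisect

# Hedged sketch of a gap-based domain: checking whether a range intersects the
# domain and removing a range of values (i.e. inserting/merging a gap).

class GapDomain:
    def __init__(self, lo, hi):
        INF = float("inf")
        # gaps are the closed ranges of values NOT in the domain
        self.gaps = [(-INF, lo - 1), (hi + 1, INF)]

    def contains_some(self, a, b):
        """True if at least one value of [a..b] belongs to the domain."""
        return not any(ga <= a and b <= gb for ga, gb in self.gaps)

    def remove_range(self, a, b):
        """Delete all values of [a..b] from the domain (insert a gap, merging neighbours)."""
        new_gaps, merged = [], (a, b)
        for ga, gb in self.gaps:
            if gb < merged[0] - 1 or ga > merged[1] + 1:
                new_gaps.append((ga, gb))                              # disjoint, keep as is
            else:
                merged = (min(ga, merged[0]), max(gb, merged[1]))      # merge adjacent gaps
        bisect.insort(new_gaps, merged)
        self.gaps = new_gaps

d = GapDomain(1, 2000)
d.remove_range(10, 99)
d.remove_range(100, 102)          # merges with the previous gap
print(d.gaps)                     # [(-inf, 0), (10, 102), (2001, inf)]
print(d.contains_some(50, 60))    # False
print(d.contains_some(103, 103))  # True
```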
3 Empirical Results
Although the above domain implementation is compatible with the ordinary CSP formulation, algorithms and constraint propagation methodologies [6], it is recommended especially when we have to solve problems with large non-continuous domains. Such problems naturally occur in Bioinformatics, so we are going to apply the proposed memory management to them.
3.1 A Sequence Problem
Each human cell contains 46 chromosomes; a chromosome is part of our genetic material, since it contains a sequence of DNA nucleotides. There are four types of nucleotides, namely A, T, G and C (A = adenine, T = thymine, G = guanine, C = cytosine). A chromosome may include approximately 247.2 million nucleotides.
A Simple Problem Definition. Suppose that we want to 'fit' in a chromosome a sequence of four cytosines C1, C2, C3, C4 and a sequence of four guanines G1, G2, G3, G4. Ci and Gi designate the positions of the corresponding nucleotides in the DNA chain; the initial domain for a position is [1..247200000]. We assume that the first sequence grows geometrically, with Ci = Ci+1/99, and that the second sequence is the arithmetic progression Gi+1 = Gi + 99.
Pitfalls While Solving. This naive CSP, which is limited to only eight constraint variables, may become... difficult, if we do not properly manage the domains that contain millions of values. So, we evolved the data structures of an existing Constraint Programming library and observed their behaviour in comparison with two popular systems.3
Naxos. At first, we integrated the gap intervals tree described above into Naxos Solver [5]. Naxos is a library for an object-oriented programming environment; it is implemented in C++. It allows the statement of CSPs having constrained variables with finite domains containing integers. The solution4 for the naive problem described was found immediately, using 3 MB of memory. All the experiments were carried out on a Sun Blade computer with a 1.5 GHz SPARC processor and 1 GB of memory.
ECLiPSe. On the same machine, however, it took three seconds for the constraint logic programming system ECLiPSe version 5.10 [2] to find the same solution, using 125 MB of memory, as it implements a bit vector variant to store the domains. If we add one more nucleotide to the problem (i.e. one more constraint variable), the program is terminated due to a stack overflow. This
3 The datasets and the experiments source code—for each Constraint Programming system we used—are available at http://www.di.uoa.gr/~pothitos/setn2010
4 The first solution includes the assignments C1 = 1, C2 = 99, C3 = 9801, C4 = 970299, G1 = 2, G2 = 101, G3 = 200 and G4 = 299.
5 We used the ECLiPSe library 'ic' that targets 'Interval Constraints.'
Fig. 2. The resources used by Constraint Programming systems as the problem scales: (a) the time (in minutes) needed to find a solution and (b) the memory space (in MB) allocated, plotted against the number of guanines, for ECLiPSe, ILOG and Naxos.
happens because the default stack size is limited, so in order to continue with the following experiments we increased it manually.
Ilog. Ilog Solver version 4.4 [4], a well-known C++ Constraint Programming library, needs three times as long (about ten seconds) as ECLiPSe to find the solution, but it consumes almost the same memory.
Scaling the Problem. A simple way to scale the problem is to add more guanines to the corresponding sequence. Figure 2 illustrates the time and space that each system spends in order to reach a solution. Before even adding a hundred nucleotides, ECLiPSe and Ilog Solver ran out of resources, as they had already used all the available physical and virtual memory. On the other hand, Naxos scales normally, as it benefits from the proposed domain representation and requires orders of magnitude less memory. The lower cost of allocating space makes the difference.
3.2 RNA Motifs Detection Problem
In the previous problem we created a nucleotide sequence, but in Bioinformatics it is more important to search for specific nucleotide patterns/motifs inside genomes, i.e. the nucleotide chains of a specific organism. We can focus on a specific pattern that describes the way an RNA molecule folds back on itself, thus forming helices, also known as stem-loops [10]. A stem-loop consists of a helix and a region with specific characters from the RNA alphabet [9]. In contrast to Ilog Solver, Naxos Solver extended with the proposed memory management is able to solve this problem for the genome of the bacterium Escherichia coli, which is available through the site of MilPat, a tool dedicated to searching for molecular motifs [8].
4 Conclusions and Further Work
In this work, it has been shown that a domain can be represented with far less memory than what Constraint Programming systems actually consume. An improved way of storing a domain, based on new data structures and algorithms, was proposed. This methodology naturally applies to various problems with wide domains, e.g. Bioinformatics problems that come with large genome databases. In the future, hybrid data structures can contribute in the same direction. For example, variable-size bit vectors could be integrated into binary tree nodes. Everything should be designed to be as generic as possible, in order to exploit in every case the plethora of known algorithms for generic CSPs.
Acknowledgements. This work is funded by the Special Account Research Grants of the National and Kapodistrian University of Athens, in the context of the project 'C++ Libraries for Constraint Programming' (project no. 70/4/4639). We would also like to thank Stavros Anagnostopoulos, a Bioinformatics expert, for his valuable help in our understanding of various biological problems and data.
References 1. Codognet, P., Diaz, D.: Compiling constraints in clp(FD). The Journal of Logic Programming 27(3), 185–226 (1996) 2. ECLiPSe constraint programming system (2008), http://eclipse-clp.org 3. Gent, I., Jefferson, C., Miguel, I., Nightingale, P.: Data structures for generalised arc consistency for extensional constraints. In: AAAI 2007: 22nd National Conference on Artificial Intelligence, pp. 191–197. AAAI Press, Menlo Park (2007) 4. ILOG S.A.: ILOG Solver 4.4: User's Manual (1999) 5. Pothitos, N.: Naxos Solver (2009), http://www.di.uoa.gr/~pothitos/naxos 6. Sabin, D., Freuder, E.C.: Contradicting conventional wisdom in constraint satisfaction. In: Borning, A. (ed.) PPCP 1994. LNCS, vol. 874, pp. 125–129. Springer, Heidelberg (1994) 7. Schulte, C., Carlsson, M.: Finite domain constraint programming systems. In: Handbook of Constraint Programming, pp. 495–526. Elsevier Science, Amsterdam (2006) 8. Thébault, P.: MilPat's user manual (2006), http://carlit.toulouse.inra.fr/MilPat 9. Thébault, P., de Givry, S., Schiex, T., Gaspin, C.: Searching RNA motifs and their intermolecular contacts with constraint networks. Bioinformatics 22(17), 2074–2080 (2006) 10. Watson, J., Baker, T., Bell, S., Gann, A., Levine, M., Losick, R.: Molecular Biology of the Gene, ch. 6, 5th edn. Pearson/Benjamin Cummings (2004) 11. Zytnicki, M., Gaspin, C., Schiex, T.: A new local consistency for weighted CSP dedicated to long domains. In: SAC 2006: Proceedings of the 2006 ACM Symposium on Applied Computing, pp. 394–398. ACM, New York (2006)
A Collaborative System for Sentiment Analysis
Vassiliki Rentoumi1,2, Stefanos Petrakis3, Vangelis Karkaletsis1, Manfred Klenner3, and George A. Vouros2
1 Inst. of Informatics and Telecommunications, NCSR "Demokritos", Greece
2 University of the Aegean, Artificial Intelligence Laboratory, Samos, Greece
3 Institute of Computational Linguistics, University of Zurich, Switzerland
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract. In the past we have witnessed our machine learning method for sentiment analysis coping well with figurative language, but determining with uncertainty the polarity of mildly figurative cases. We have shown that for these uncertain cases, a rule-based system should be consulted. We evaluate this collaborative approach on the ”Rotten Tomatoes” movie reviews dataset and compare it with other state-of-the-art methods, providing further evidence in favor of this approach.
1 Introduction
In the past we have shown that figurative language conveys sentiment that can be efficiently detected by FigML [2], a machine learning (ML) approach trained on corpora manually annotated with strong figurative expressions1. FigML was able to detect the polarity of sentences bearing highly figurative expressions, where disambiguation is considered mandatory, such as: (a) "credibility sinks into a mire of sentiments". On the other hand, there exist cases for which FigML provided a classification decision based on a narrow margin between negative and positive polarity orientation, often resulting in erroneous polarity evaluation. It was observed that such cases bear mild figurativeness; according to [4], such senses are synchronically as literal as their primary sense, as a result of standardized usage, as in: (b) "this 10th film in the series looks and feels tired". Here, fatigue as a property of inanimate or abstract objects, although highly figurative, presents an obvious negative connotation, due to the standardized usage of this particular sense, and therefore sentiment disambiguation is not necessary. Such regular cases could be more efficiently treated by a rule-based system such as PolArt [1]. In fact, in this paper we extend the work presented in [8], where we have indeed shown that cases of mild figurative language are better treated by PolArt, while cases of strong figurative language are better handled by FigML. In [8], a novel collaborative system for sentiment analysis was proposed and managed
1 Subsets from the AffectiveText corpus (SemEval'07) and the MovieReviews sentence polarity dataset v1.0, annotated with metaphors and expanded senses: http://www.iit.demokritos.gr/~vrentoumi/corpus.zip
to outperform its two subcomponents, FigML and PolArt, tested on the AffectiveText corpus. Here, we try to verify the validity of this approach on a larger corpus of a different domain and style. In addition, and most importantly, another dimension of complementarity between a machine learning method and a rule-based one is explored: the rule-based approach handles the literal cases and the already introduced collaborative method treats the cases of figurative language. Results show that integrating a machine learning approach with a finer-grained, linguistically-based one leads to a superior, best-of-breed system.
2 Methodology Description
The proposed collaborative method involves four consecutive steps:
(a) Word sense disambiguation (WSD): We chose an algorithm which takes as input a sentence and a relatedness measure [6]. The algorithm supports several WordNet-based similarity measures, among which Gloss Vector (GV) [6] performs best for non-literal verbs and nouns [5]. Integrating GV in the WSD step is detailed in [2].
(b) Sense-level polarity assignment (SLPA): We adopted a machine learning approach which exploits graphs based on character n-grams [7]. We compute models of positive and negative polarity from examples of positive and negative words and definitions provided by an enriched version of the Subjectivity Lexicon2,3. The polarity class of each test sense is determined by computing its similarity with the models, as detailed in [2].
(c) HMM training: HMMs serve two purposes: computing the threshold which divides the sentences into marginal/non-marginal and judging the polarity (positive/negative) of non-marginal sentences. We train one HMM model for each polarity class. The format of the training instances is detailed in [2]. For computing the threshold, the training data are also used as a testing set. Each test instance is tested against both models and the output is a pair of log probabilities that the test instance belongs to the negative or the positive class. For each test instance we compute the absolute difference of the log probabilities. We then sort these differences in ascending order and calculate the first quartile (Q1), which separates the lower 25% of the sample population from the rest of the data. We set this to be the threshold and we apply it to the test instances. Marginal cases are the ones for which the absolute difference of log probabilities is below that threshold. In our experiments we use a 10-fold cross-validation approach to evaluate our results.
(d) Sentence-level polarity detection: The polarity of each sentence is determined by HMMs [2] for non-marginal cases and by PolArt [1] for marginal
2 http://www.cs.pitt.edu/mpqa/
3 For each positive or negative word entry contained in the Subjectivity Lexicon, we extracted the corresponding set of senses from WordNet, represented by their synsets and gloss examples; in this way we tried to reach a greater degree of consistency between the test and the training set.
ones. PolArt employs compositional rules and obtains word-level polarities from a polarity lexicon, as described in detail in [1]. The Collaborative system’s total performance is then given by adding up the performances of FigML and PolArt.
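The marginality test of step (c) can be illustrated with the following sketch: the threshold is the first quartile of the absolute differences between the two HMM log-likelihoods over the training sentences, and any sentence whose difference falls below it is routed to PolArt. The numbers and function names are illustrative, not taken from the paper.

```python
import numpy as np

# Hedged sketch of the Q1-based marginality threshold and the routing decision.

def q1_threshold(train_log_pos, train_log_neg):
    diffs = np.abs(np.array(train_log_pos) - np.array(train_log_neg))
    return np.percentile(diffs, 25)          # first quartile of the sorted differences

def route_sentence(log_pos, log_neg, threshold):
    if abs(log_pos - log_neg) < threshold:
        return "PolArt"                       # marginal case -> rule-based system
    return "positive" if log_pos > log_neg else "negative"   # HMM decision

# Toy log-likelihoods (illustrative numbers only):
thr = q1_threshold([-10.0, -12.5, -9.1, -20.0], [-11.0, -12.0, -15.0, -19.5])
print(thr, route_sentence(-8.0, -8.2, thr), route_sentence(-5.0, -9.0, thr))
```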
3 Experimental Setup
3.1 Resources
We ran our experiments on the MovieReviews corpus4. This corpus was split into different subsets according to our experimental setup in two different ways:
– Expanded Senses/Metaphors/Whole: The corpus was enriched with manually added annotations for metaphors and expanded senses inside sentences. We produced an expanded senses dataset and a metaphorical expressions one. Furthermore, we treated the entire corpus as a third dataset, ignoring the aforementioned annotations. The produced datasets are:
• Expanded senses: 867 sentences, 450 negative and 417 positive ones.
• Metaphors: 996 sentences, 505 negative and 491 positive ones.
• Whole: 10649 sentences, 5326 negative and 5323 positive ones.
– Literal/Non-literal: We group all figurative sentences (metaphors/expanded senses) as the non-literal set. The rest of the sentences we call the literal set.
• Non-literal: 1862 sentences5, 954 negative and 908 positive ones.
• Literal: 8787 sentences, 4372 negative and 4415 positive ones.
We ran numerous variations of PolArt, each time modifying the polarity lexicon it consults:
– SL+: This is the Subjectivity Lexicon6 with manually added valence operators.
– Merged: The FigML system automatically produces sense-level polarity lexica (AutSPs), one for each dataset or subset. For the non-literal, metaphors and expanded senses subsets, these lexica target non-literal expressions, metaphors and expanded senses, respectively. For the entire MovieReviews dataset (Whole), all word senses are targeted. Various Merged lexica are produced by combining and merging the SL+ lexicon with each of the AutSPs.
4 We used the sentence polarity dataset v1.0 from http://www.cs.cornell.edu/People/pabo/movie-review-data/
5 One sentence belonged to both the metaphors and expanded senses subsets, and was included only once here.
6 http://www.cs.pitt.edu/mpqa/
3.2 Collaborative Method Tested on MovieReviews Dataset
We tested our Collaborative method, originally presented and evaluated in [8], on the extended MovieReviews corpus, in order to verify its validity. Table 1 presents scores for each polarity class, for both variants of our method, CollaborativeSL+ (using the SL lexicon) and CollaborativeMerged (using the Merged lexica), across all three datasets. For the majority of cases, CollaborativeSL+ has better performance than CollaborativeMerged. Comparing the performance of CollaborativeSL+ on MovieReviews with that of CollaborativeSL+ on the AffectiveText corpus [8] for the Whole corpus (f-measure: neg: 0.62, pos: 0.59), we noticed that the performance remains approximately the same. This is evidence that the method is consistent across different datasets.
Table 1. MovieReviews: Performance scores for full system runs

                        CollaborativeSL+      CollaborativeMerged
                        neg      pos          neg      pos
Whole   recall          0.682    0.537        0.656    0.536
        precision       0.596    0.628        0.586    0.609
        f-measure       0.636    0.579        0.619    0.570
Met     recall          0.724    0.735        0.697    0.704
        precision       0.737    0.722        0.708    0.693
        f-measure       0.731    0.728        0.702    0.699
Exp     recall          0.640    0.623        0.642    0.623
        precision       0.647    0.616        0.648    0.617
        f-measure       0.643    0.619        0.645    0.620
3.3 The Collaborative Approach Treats Non-literal Cases as a Whole: Complementarity on the Literal/Non-literal Axis
We have so far shown that our Collaborative method performs quite well on the expanded senses and metaphors datasets. Although we consider them distinct language phenomena, they both belong to the sphere of figurative connotation. To support this, we tested our claim collectively, across non-literal expressions in general, by merging these two datasets into one labelled non-literals. As a baseline system for assessing the performance of the collaborative method we use a clean version of PolArt (i.e. without added valence shifters). In Table 2, we compare BaselinePolart with CollaborativeSL+ (using the SL lexicon) and CollaborativeMerged (using the Merged lexica), tested on the non-literals dataset. We observe that our proposed method outperforms the baseline and proves quite capable of treating non-literal cases collectively. By assembling the non-literals into one dataset and treating it with our collaborative method, we set aside its complementary dataset of literals. Since our method is more inclined to treat figurative language, we do not expect it to treat literal cases optimally, or at least as efficiently as a system that is more inclined to treat literal language. Therefore, assigning the literals to PolArt and the non-literals to Collaborative would provide a sounder system architecture and result in better performance for the entire MovieReviews dataset. In Table 3 we present the performance of both variants of the new system architecture (PolartwithCollaborativeSL+, PolartwithCollaborativeMerged). In
Table 2. MovieReviews: Performance scores for the non-literals subset

                           CollaborativeSL+      CollaborativeMerged      BaselinePolart
                           neg      pos          neg      pos             neg      pos
Non-literals  recall       0.710    0.646        0.681    0.644           0.614    0.667
              precision    0.678    0.680        0.668    0.658           0.659    0.622
              f-measure    0.694    0.662        0.674    0.651           0.636    0.644
Table 3. MovieReviews: Performance scores for full system runs

                                      PolartwithCollaborativeSL+      PolartwithCollaborativeMerged
                                      neg      pos                    neg      pos
Literals/non-literals  recall         0.608    0.659                  0.603    0.659
                       precision      0.641    0.627                  0.638    0.624
                       f-measure      0.624    0.642                  0.620    0.641

                                      CollaborativeSL+                CollaborativeMerged
                                      neg      pos                    neg      pos
Whole                  recall         0.682    0.537                  0.656    0.536
                       precision      0.596    0.628                  0.586    0.609
                       f-measure      0.636    0.579                  0.619    0.570
In both versions, pure PolArt treats the literal cases, while CollaborativeSL+ and CollaborativeMerged treat the non-literal cases. This new architecture is compared to the treatment of the whole corpus (Whole) by both variants of the proposed method (CollaborativeSL+, CollaborativeMerged). We observe that the performance of this modified system is better in the majority of cases. This leads us to the conclusion that a system which treats sentiment in a more language-sensitive way can exhibit improved performance.
We further compared our system with a state-of-the-art system by Andreevskaia and Bergler [3], tested on the MovieReviews corpus. Their system employs a Naive Bayes classifier for polarity classification of sentences, trained with unigrams, bigrams or trigrams derived from the same corpus. The accuracy of this state-of-the-art system was reported to be 0.774, 0.739 and 0.654 for unigrams, bigrams and trigrams, respectively. Our two alternative system architectures, CollaborativeSL+ and PolartwithCollaborativeSL+, scored 0.609 and 0.633, respectively. The performance of both our alternatives is clearly lower than the state-of-the-art system's when the latter is trained with unigrams or bigrams, but gets closer when it is trained with trigrams. The main point is that the CollaborativeSL+ method performs quite well even on a corpus containing mainly literal language. We expect CollaborativeSL+ to perform optimally when applied to a corpus consisting mainly of non-literal language. It is also worth noting that since PolArt deals with the majority of cases, it is bound to heavily affect the overall system performance. Additionally, PolArt's dependency on its underlying resources, and especially on the prior polarity lexicon, is a crucial performance factor. Thus, the observed moderate performance of the system can be attributed to PolArt's moderate performance, probably due to the incompatibility of the Subjectivity Lexicon with the idiosyncratic/colloquial language of the MovieReviews corpus.
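The literal/non-literal division of labour behind the PolartwithCollaborative variants amounts to a simple dispatch step. The sketch below assumes hypothetical is_literal, polart and collaborative callables; neither system's internals are reproduced here.

    def combined_polarity(sentence, is_literal, polart, collaborative):
        """Route literal sentences to the rule-based PolArt and non-literal
        (figurative) sentences to the Collaborative method."""
        if is_literal(sentence):
            return polart(sentence)       # rule-based, suited to literal language
        return collaborative(sentence)    # WSD/ML-based, suited to figurative language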
All in all, the overall performance is still quite satisfactory. Moreover, if PolArt were provided with a more appropriate lexicon, we would expect a further performance boost.
4 Conclusions and Future Work
In this paper we further extend and examine the idea of a sentiment analysis method which exploits, in a complementary fashion, two language-specific subsystems: a rule-based system (PolArt) for mild figurative language and a machine learning system (FigML) for strong figurative language phenomena [8]. By further examining the validity of such an approach on a larger corpus from a different domain (the MovieReviews corpus), in which strong figurative language co-exists with mild figurative language, we observed that the Collaborative method is consistent. We also explored another dimension of complementarity, concerning literal/non-literal cases of language, where PolArt treats the literal cases and the Collaborative method the non-literal cases. The performance obtained provides empirical support that utilizing the special virtues of the participating subsystems can be a cornerstone in the design and performance of the resulting system. We will test the collaborative method on a more extensive corpus bearing figurative language. We also intend to dynamically produce sense-level polarity lexica exploiting additional machine learning approaches (e.g., SVMs).
References
1. Klenner, M., Petrakis, S., Fahrni, A.: Robust compositional polarity classification. In: Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria (2009)
2. Rentoumi, V., Giannakopoulos, G., Karkaletsis, V., Vouros, G.: Sentiment analysis of figurative language using a word sense disambiguation approach. In: Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria (2009)
3. Andreevskaia, A., Bergler, S.: When specialists and generalists work together: overcoming domain dependence in sentiment tagging. In: Proceedings of ACL 2008: HLT, pp. 290–298 (2008)
4. Cruse, D.A.: Meaning in Language. Oxford University Press, Oxford (2000)
5. Rentoumi, V., Karkaletsis, V., Vouros, G., Mozer, A.: Sentiment Analysis Exploring Metaphorical and Idiomatic Senses: A Word Sense Disambiguation Approach. In: International Workshop on Computational Aspects of Affectual and Emotional Interaction, CAFFEi 2008 (2008)
6. Pedersen, T., Banerjee, S., Patwardhan, S.: Maximizing Semantic Relatedness to Perform Word Sense Disambiguation. Supercomputing Institute Research Report UMSI, vol. 25 (2005)
7. Giannakopoulos, G., Karkaletsis, V., Vouros, G., Stamatopoulos, P.: Summarization system evaluation revisited: N-gram graphs. ACM Transactions on Speech and Language Processing (TSLP) 5 (2008)
8. Rentoumi, V., Petrakis, S., Klenner, M., Vouros, G., Karkaletsis, V.: A Hybrid System for Sentiment Analysis. To appear in LREC 2010 (2010)
Minimax Search and Reinforcement Learning for Adversarial Tetris
Maria Rovatsou and Michail G. Lagoudakis
Intelligent Systems Laboratory
Department of Electronic and Computer Engineering
Technical University of Crete
Chania 73100, Crete, Greece
[email protected], [email protected]
Abstract. Game playing has always been considered an intellectual activity requiring a good level of intelligence. This paper focuses on Adversarial Tetris, a variation of the well-known Tetris game, introduced at the 3rd International Reinforcement Learning Competition in 2009. In Adversarial Tetris the mission of the player to complete as many lines as possible is actively hindered by an unknown adversary who selects the falling tetraminoes in ways that make the game harder for the player. In addition, there are boards of different sizes and learning ability is tested over a variety of boards and adversaries. This paper describes the design and implementation of an agent capable of learning to improve his strategy against any adversary and any board size. The agent employs MiniMax search enhanced with Alpha-Beta pruning for looking ahead within the game tree and a variation of the Least-Squares Temporal Difference Learning (LSTD) algorithm for learning an appropriate state evaluation function over a small set of features. The learned strategies exhibit good performance over a wide range of boards and adversaries.
1 Introduction
Skillful game playing has always been considered a token of intelligence; consequently, Artificial Intelligence and Machine Learning exploit games in order to exhibit intelligent performance. A game that has become a benchmark, exactly because it involves a great deal of complexity along with very simple playing rules, is the game of Tetris. It consists of a grid board in which four-block tiles, chosen randomly, fall from the top, and the goal of the player is to place them so that they form complete lines, which are eliminated from the board, lowering all blocks above. The game is over when a tile reaches the top of the board. The fact that the rules are simple should not give the impression that the task is simple. There are about 40 possible actions available to the player for placing a tile and about 10^64 possible states that these actions could lead to. These magnitudes are hard to deal with for any kind of player (human or computer). Adversarial Tetris is a variation of Tetris that introduces adversity in the game, making it even more demanding and intriguing; an unknown adversary tries to
hinder the goals of the player by actively choosing pieces that augment the difficulty of line completion and by even “leaving out” a tile from the entire game, if that suits his adversarial goals. This paper presents our approach to designing a learning player for Adversarial Tetris. Our player employs MiniMax search to produce a strategy that accounts for any adversary and reinforcement learning to learn an appropriate state evaluation function. Our agent exhibits improving performance over an increasing number of learning games.
2 Tetris and Adversarial Tetris
Tetris is a video game created in 1984 by Alexey Pajitnov, a Russian computer engineer. The game is played on a 10 × 20 board using seven kinds of simple tiles, called tetraminoes. All tetraminoes are composed of four colored blocks (minoes) forming a total of seven different shapes. The rules of the game are very simple. The tiles fall down one-by-one from the top of the board and the user rotates and moves them until they rest on top of existing tiles in the board. The goal is to place the tiles so that lines are completed without gaps; completed lines are eliminated, lowering all the remaining blocks above. The game ends when a resting tile reaches the top of the board. Tetris is a very demanding and intriguing game. It has been proved [1] that finding a strategy that maximizes the number of completed rows, or maximizes the number of lines eliminated simultaneously, or minimizes the board height, or maximizes the number of tetraminoes placed in the board before the game ends is an NP-hard problem; even approximating an optimal strategy is NP-hard. This inherent difficulty is one of the reasons this game is widely used as a benchmark domain.

Tetris is naturally formulated as a Markov Decision Process (MDP) [2]. The state consists of the current board and the current falling tile, and the actions are the approximately 40 placement actions for the falling tile. The transition model is fairly simple; there are seven equiprobable possible next states, since the next board is uniquely determined and the next falling piece is chosen uniformly. The reward function gives positive numerical values for completed lines and the goal is to find a policy that maximizes the long-term cumulative reward.

The recent Reinforcement Learning (RL) Competition [3] introduced a variation of Tetris, called Adversarial Tetris, whereby the falling tile generator is replaced by an active opponent. The tiles are now chosen purposefully to hinder the goals of the player (completion of lines). The main difference in the MDP model of Adversarial Tetris is the fact that the distribution of falling tiles is non-stationary and the dimensions of the board vary in height and width. Furthermore, the state is produced like the frames of the video game, as it includes the current position and rotation of the falling tile in addition to the configuration of the board, and the player can move/rotate the falling tile at each frame. The RL Competition offers a generalized MDP model for Adversarial Tetris which is fully specified by four parameters (the height and width of the board and the adversity and type of the opponent). For the needs of the competition, 20 instances of this model were specified with widths ranging from 6 to 11, heights ranging from 16 to 25, and different types of opponents and opponent adversity.
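For concreteness, the core of the reward structure just described (a positive value for each completed line) can be sketched as follows; this is an illustrative helper operating on a 0/1 board representation, not the competition's actual simulator code.

    def clear_lines(board):
        """board: list of rows (top to bottom), each a list of 0/1 cells.
        Removes full rows, inserts empty rows at the top (as in standard
        Tetris) and returns the new board and the number of cleared lines."""
        width = len(board[0])
        remaining = [row for row in board if not all(row)]
        cleared = len(board) - len(remaining)
        new_board = [[0] * width for _ in range(cleared)] + remaining
        return new_board, cleared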
3 Designing a Learning Player for Adversarial Tetris
Player Actions. In Adversarial Tetris the tile falls one step downwards every time the agent chooses one of the 6 low-level actions: move the tile left or right, rotate it clockwise or counterclockwise, drop it, or do nothing. Clearly, there exist various alternative sequences of these actions that achieve the same placement of the tile; this freedom yields repeated board configurations that lead to an unnecessary growth of the game tree. Also, playing at the level of the 6 low-level actions ruins the idea of a two-player alternating game, as the opponent's turn appears only once after several turns of the player. Lastly, the branching factor of 6 would lead to an intractable game tree, even before the falling tile reaches a resting position in the board. These observations led us to consider an abstraction of the player's moves, namely high-level actions that bring the tile from the top of the board directly to its resting position using a minimal sequence of low-level actions planned using a simple look-ahead search. The game tree now contains alternating plies of the player's and the opponent's moves, as in a true two-player alternating game; all unnecessary intermediate nodes of the player's low-level actions are eliminated. The actual number of high-level actions available in each state depends on the width of the board and the number of distinct rotations of the tile itself, but it will be at most 4 × wb, where wb is the width of the board (wb columns and 4 rotations). Similarly, the opponent chooses not only the next falling tile, but also its initial rotation, which means that he has as many as 4 × 7 = 28 actions. However, not all these actions are needed to represent the opponent's moves, since in the majority of cases the player can use low-level actions to rotate the tile at will. Thus, the initial rotation can be neglected, reducing the branching factor at opponent nodes from 28 to just 7. In summary, there are about 4 × wb choices for the player and 7 choices for the opponent.

Game Tree. The MiniMax objective criterion is commonly used in two-player zero-sum games, where any gain on one side (Max) is equal to the loss on the other side (Min). The Max player tries to select its best action over all possible Min choices in the next and future turns. In Adversarial Tetris, our player is taken as Max, since he is trying to increase his score, whereas the adversarial opponent is taken as Min, since he is trying to decrease our player's score. We adopted this criterion because it is independent of the opponent (it produces the same strategy irrespective of the competence of the opponent) and protects against tricky opponents who may initially bluff. Its drawback is that it does not take risks and therefore cannot exploit weak opponents. The implication is that our agent should be able to play Tetris well against any friendly, adversarial, or indifferent opponent. The MiniMax game tree represents all possible paths of action sequences of the two players playing in alternating turns. Our player forms a new game tree from the current state, whenever it is his turn to play, to derive his best action choice. Clearly, our player cannot generate the entire tree, therefore expansion continues up to a cut-off depth. The utility of the nodes at the cut-off depth is estimated by an evaluation function described below. MiniMax is aided by Alpha-Beta pruning, which prunes away nodes and subtrees that do not contribute to the root value and to the final decision.
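A compact sketch of the resulting search is given below; player_actions, apply_placement, apply_tile_choice, the evaluate function and the game_over flag are placeholders for the game-specific routines described in the surrounding text, while the alpha-beta logic itself is standard.

    import math

    def minimax(state, depth, alpha, beta, maximizing, evaluate,
                player_actions, apply_placement, apply_tile_choice):
        """Alpha-beta search over alternating player (Max) and opponent (Min)
        plies; states at the cut-off depth are scored by the learned
        evaluation function."""
        if depth == 0 or state.game_over:
            return evaluate(state)
        if maximizing:                              # player: choose a tile placement
            value = -math.inf
            for action in player_actions(state):    # at most 4 * board width actions
                child = apply_placement(state, action)
                value = max(value, minimax(child, depth - 1, alpha, beta, False,
                                           evaluate, player_actions,
                                           apply_placement, apply_tile_choice))
                alpha = max(alpha, value)
                if alpha >= beta:
                    break                           # beta cut-off
            return value
        value = math.inf
        for tile in range(7):                       # opponent: choose the next tetramino
            child = apply_tile_choice(state, tile)
            value = min(value, minimax(child, depth - 1, alpha, beta, True,
                                       evaluate, player_actions,
                                       apply_placement, apply_tile_choice))
            beta = min(beta, value)
            if alpha >= beta:
                break                               # alpha cut-off
        return value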
Evaluation Function. The evaluation of a game state s, as being in favor of or against our agent, is done by an evaluation function V(s), which also implicitly determines the agent's policy. Given the huge state space of the game, such an evaluation function cannot be computed or stored explicitly, so it must be approximated. We use a linear approximation architecture formed by a vector of k features φ(s) and a vector of k weights w. The approximate value is computed as the weighted sum of the features, V(s) = Σ_{i=1}^{k} φ_i(s) w_i = φ(s)^T w. We have devised two possible sets of features, which eventually lead to two different agents. The first set includes 6 features characterizing the board: a constant term, the maximum height, the mean height, the sum of absolute column differences in height, the total number of empty cells below placed tiles (holes), and the total number of empty cells above placed tiles up to the maximum height (gaps). The second set uses a separate block of these 6 features for each one of the 7 tiles of Tetris, giving a total of 42 features. This is proposed because with the first set the agent can learn which boards and actions are good for him, but cannot associate them with the falling tiles that these actions manipulate. The same action on different tiles, even if the board is unchanged, may have a totally different effect; ignoring the type of tile leads to less effective behavior. The second set of features alleviates this problem by simply weighing the 6 base features differently for different falling tiles. Note that only one block of size 6 is active in any state, the one corresponding to the current falling tile.

Learning. In order to learn a good set of weights for our evaluation function we applied a variation of the Least-Squares Temporal Difference Learning (LSTD) algorithm [4]. The need for modifying the original LSTD algorithm stems from the fact that the underlying agent policy is determined through the values given to states by our evaluation function, which are propagated to the root; if these values change, so does the policy, therefore it is important to discard old data and use only recent data for learning. To this end, we used the technique of exponential windowing, whereby the weights are updated at regular intervals called epochs; each epoch may last for several decision steps. During an epoch the underlying value function and policy remain unchanged, so that consistent evaluation data are collected, and only at the completion of the epoch are the weights updated. In the next epoch, data from the previous epoch are discounted by a parameter μ. Therefore, past data are not completely eliminated, but are weighted less and less as they become older. Their influence depends on the value of μ, which ranges from 0 (no influence) to 1 (full influence). A value of 0 leads to singularity problems due to the shortage of samples within a single epoch, whereas a value around 0.95 offers a good balance between recent and old data with exponentially decayed weights. A full description of the modified algorithm is given in Algorithm 1 (t indicates the epoch number). In order to accommodate a wider range of objectives we used a rewarding scheme that encourages line completion (positive reward), but discourages loss of a game (negative reward). We balanced these two objectives by giving a reward of +1 for each completed line and a penalty of −10 for each game lost.
We set the discount factor to 1 (γ = 1), since rewards/penalties do not lose value as time advances.
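The two feature sets and the linear evaluation can be sketched as follows; the height, hole and gap computations below assume a simple binary board array (rows listed top to bottom) and are an illustration rather than the authors' exact implementation.

    import numpy as np

    def board_features(board):
        """The 6 base features: constant term, maximum height, mean height,
        sum of absolute column height differences, holes (empty cells below
        the top of their column) and gaps (empty cells above the column tops,
        up to the maximum height)."""
        board = np.asarray(board)
        n_rows, _ = board.shape
        tops = np.argmax(board, axis=0)             # first filled row per column
        heights = np.where(board.any(axis=0), n_rows - tops, 0)
        holes = sum(int(np.sum(board[n_rows - h:, c] == 0))
                    for c, h in enumerate(heights))
        max_h = int(heights.max())
        gaps = int((max_h - heights).sum())
        col_diffs = int(np.abs(np.diff(heights)).sum())
        return np.array([1.0, max_h, heights.mean(), col_diffs, holes, gaps])

    def evaluate(board, falling_tile, w, per_tile=False):
        """Linear value V(s) = phi(s)^T w.  With per_tile=True the 6 base
        features occupy the block of the current falling tile (42 weights)."""
        phi6 = board_features(board)
        if not per_tile:
            return float(phi6 @ w)                  # w has length 6
        phi = np.zeros(42)                          # w has length 42
        phi[6 * falling_tile: 6 * falling_tile + 6] = phi6
        return float(phi @ w)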
Algorithm 1. LSTD with Exponential Windowing

(w_t, A_t, b_t) = LSTD-EW(k, φ, γ, t, D_t, w_{t-1}, A_{t-1}, b_{t-1}, μ)
  if t == 0 then
      A_t ← 0;  b_t ← 0
  else
      A_t ← μ A_{t-1};  b_t ← μ b_{t-1}
  end if
  for all samples (s, r, s′) ∈ D_t do
      A_t ← A_t + φ(s) (φ(s) − γ φ(s′))^T
      b_t ← b_t + φ(s) r
  end for
  w_t ← (A_t)^{-1} b_t
  return w_t, A_t, b_t
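A direct NumPy rendering of Algorithm 1 might look as follows; the epoch loop that collects the sample set D_t and calls this routine between epochs is assumed to exist elsewhere.

    import numpy as np

    def lstd_ew(k, phi, gamma, t, samples, A_prev, b_prev, mu):
        """LSTD with exponential windowing: statistics of earlier epochs are
        decayed by mu, the accumulators are updated with this epoch's samples
        (s, r, s_next), and the weights are re-solved."""
        if t == 0:
            A, b = np.zeros((k, k)), np.zeros(k)
        else:
            A, b = mu * A_prev, mu * b_prev
        for s, r, s_next in samples:
            phi_s, phi_next = phi(s), phi(s_next)
            A += np.outer(phi_s, phi_s - gamma * phi_next)
            b += phi_s * r
        w = np.linalg.solve(A, b)   # w_t = A_t^{-1} b_t (A_t assumed non-singular)
        return w, A, b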
Related Work. There has been a lot of work on Tetris in recent years. Tsitsiklis and Van Roy applied approximate value iteration, whereas Bertsekas and Ioffe tried policy iteration, and Kakade used the natural policy gradient method. Later, Lagoudakis et al. applied a least-squares approach to learning an approximate value function, while Ramon and Driessens modeled Tetris as a relational reinforcement learning problem and applied a regression technique using Gaussian processes to predict the value function. Also, de Farias and Van Roy used the technique of randomized constraint sampling in order to approximate the optimal cost function. Finally, Szita and Lőrincz applied the noisy cross-entropy method. In the 2008 RL Competition, the approach of Thiéry [5], based on λ-Policy Iteration, outperformed all previous work at the time. There is only unpublished work on Adversarial Tetris from the 2009 RL Competition, where only two teams participated. The winning team from Rutgers University applied look-ahead tree search; the opponent in each MDP was modeled as a fixed probability distribution over falling tiles, which was learned using the cross-entropy method.
4 Results and Conclusion
Our learning experiments were conducted over a period of 400 epochs of 8,000 game steps each, giving a total of 3,200,000 samples. The weights are updated at the end of each learning epoch. Learning is conducted only on MDP #1 (out of the 20 MDPs of the RL Competition), which has board dimensions that are closer to those of the original Tetris. Learning takes place only at the root of the tree in each move, as learning at the internal nodes leads to a great degree of repetition, biasing the learned evaluation function. Agent 1 (6 features) learns by backing up values from depth 1 (or any other odd depth). This set of features ignores the choice of Min, and thus it would be meaningless to expand the tree one more level deeper at Min nodes, which are found at odd depths. The second agent (42 features) learns by backing up values from depth 2 (or any other even depth). This set of basis functions takes the action choice of Min explicitly into account and thus it makes sense to cut off the search at Max nodes, which are found at even depths. The same cut-offs apply to testing.
[Fig. 1 plots not reproduced: six panels showing the L2 change in weights, the steps per game, and the average lines per game, each as a function of the training epoch, for Agent 1 (top row) and Agent 2 (bottom row).]
Fig. 1. Learning curve, steps and lines per update for Agents 1 (top) and 2 (bottom)
Learning results are shown in Figure 1. Agent 1 clearly improves with more training epochs. Surprisingly, Agent 2 settles at a steady low level, despite an initial improvement phase. In any case, the performance of the learned strategies is well below expectations compared to the current state of the art. A deeper look into the problem indicated that the opponent in Adversarial Tetris is not very aggressive after all, while the MiniMax criterion is too conservative, as it assumes an optimal opponent. In fact, it turns out that an optimal opponent could actually make the game extremely hard for the player; this is reflected in the game tree and therefore our player's choices are rather mild, in an attempt to avoid states where the opponent could give him a hard time. Agent 1 avoids this pitfall because it searches only to depth 1, where it cannot "see" the opponent, unlike Agent 2. Nevertheless, the learned strategies are able to generalize consistently to the other MDPs (recall that training takes place only on MDP #1). For each learned strategy, we played 500 games on each MDP to obtain statistics. Agent 1 achieves 574 steps and 44 lines per game on average over all MDPs (366 steps and 16 lines on MDP #1), whereas Agent 2 achieves 222 steps and 11 lines (197 steps and 5 lines on MDP #1). Note that our approach is off-line; training takes place without an actual opponent. It remains to be seen how it will perform in an on-line setting facing the exploration/exploitation dilemma.
References
1. Breukelaar, R., Demaine, E.D., Hohenberger, S., Hoogeboom, H.J., Kosters, W.A., Liben-Nowell, D.: Tetris is hard, even to approximate. International Journal of Computational Geometry and Applications 14(1-2), 41–68 (2004)
2. Tsitsiklis, J.N., Van Roy, B.: Feature-based methods for large scale dynamic programming. Machine Learning, 59–94 (1994)
3. Reinforcement Learning Competition (2009), http://2009.rl-competition.org
4. Bradtke, S.J., Barto, A.G.: Linear least-squares algorithms for temporal difference learning. Machine Learning, 22–33 (1996)
5. Thiéry, C.: Contrôle optimal stochastique et le jeu de Tetris. Master's thesis, Université Henri Poincaré – Nancy I, France (2007)
A Multi-agent Simulation Framework for Emergency Evacuations Incorporating Personality and Emotions
Alexia Zoumpoulaki¹, Nikos Avradinis², and Spyros Vosinakis¹
¹ Department of Product and Systems Design Engineering, University of the Aegean, Hermoupolis, Syros, Greece
{azoumpoulaki,spyrosv}@aegean.gr
² Department of Informatics, University of Piraeus, Greece
[email protected]
Abstract. Software simulations of building evacuation during emergencies can provide rich qualitative and quantitative results for safety analysis. However, the majority of them do not take into account recent surveys of human behavior under stressful situations, which highlight the important role of personality and emotions in crowd behavior during evacuations. In this paper we propose a framework for designing evacuation simulations that is based on a multi-agent BDI architecture enhanced with the OCEAN model of personality and the OCC model of emotions.
Keywords: Multi-agent Systems, Affective Computing, Simulation Systems.
1 Introduction

Evacuation simulation systems [1] have been accepted as very important tools for safety science, since they help examine how people gather, flow and disperse in areas. They are commonly used for estimating factors like evacuation times, possible areas of congestion, and the distribution among exits under various evacuation scenarios. Numerous models for crowd motion and emergency evacuation simulations have been proposed, such as fluid or particle analogies, mathematical equations estimated from real data, cellular automata, and multi-agent autonomous systems. Most recent systems adopt the multi-agent approach, where each individual agent is enriched with various characteristics and their motion is the result of rules or decision-making strategies [2, 3, 4, 5].
Modern surveys indicate that there is a number of factors [8, 9] influencing human behavior and social interactions during evacuations. These factors include personality traits, individual knowledge and experience, and situation-related conditions like building characteristics or crowd density, among others. Contrary to common belief, people do not immediately rush towards the exits but take some time before they start evacuating, performing several tasks (e.g., gathering information, collecting items) and observing the behavior of others in order to decide whether to start moving or not. Route and exit choices also depend on familiarity with the building. Preexisting relationships among the individuals also play a crucial role in behavior, as members of the same
group, like friends and members of a family, will try to stay together, move with similar speeds, help each other and aim to exit together. Additionally, emergency evacuations involve complex social interactions, where new groups form and grow dynamically as the egress progresses. New social relations arise as people exchange information, try to decide between alternatives and select a course of action. Some members act as leaders, committed to helping others by shouting instructions or leading towards the exits, while others follow [10]. Although individuals involved in evacuations continue to be social actors, which is why under non-immediate danger people try to find friends, help others evacuate or even collect belongings, stressful situations can result in behaviors like panic [11]. During an emergency, the nature of the information obtained, time pressure, the assessment of danger, the emotional reaction and the observed actions of others are elements that might result in catastrophic events, such as stampedes.
The authors claim that the above factors and their resulting actions should be modeled in order for realistic behaviors to emerge during an evacuation simulation. The proposed approach takes into consideration recent research not only on evacuation simulation models but also on multi-agent system development [7], cognitive science, group dynamics and surveys of real situations [8]. In our approach, decision making is based on emotional appraisal of the environment, combined with personality traits, in order to select the behavior best suited to the agent's psychological state. We introduce an EP-BDI (Emotion Personality Beliefs Desires Intentions) architecture that incorporates computational models of personality (OCEAN) and emotion (OCC). The emotion module participates in the appraisal of obtained information, decision making and action execution. The personality module influences emotional reactions, indicates tendencies towards behaviors and helps address issues of diversity. Additionally, we use a more meaningful mechanism for social organization, where groups form dynamically and roles emerge due to knowledge, personality and emotions. We claim that these additions may provide the necessary mechanisms for simulating realistic human-like behavior under evacuation. Although the need for such an approach is widely accepted, to our knowledge no other evacuation simulation framework has been designed to incorporate fully integrated computational models of emotion and personality.
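Schematically, the EP-BDI operation cycle just outlined can be written as the skeleton below; the method names are placeholders rather than the authors' actual API, and the domain-specific pieces are left abstract.

    from abc import ABC, abstractmethod

    class EPBDIAgent(ABC):
        """Skeleton of the EP-BDI cycle: perceive, update beliefs, appraise
        (personality-weighted emotion update), select a desire and an
        intention, act."""

        def __init__(self, personality, beliefs):
            self.personality = personality     # OCEAN trait vector
            self.emotions = {}                 # OCC-style emotion intensities
            self.beliefs = dict(beliefs)

        def step(self, environment):
            percepts = self.perceive(environment)       # possibly filtered by emotion
            self.beliefs.update(percepts)               # percepts as belief updates
            self.appraise(percepts)                     # updates self.emotions
            desire = self.select_desire()               # personality + emotion scoring
            intention = self.select_intention(desire)   # grounded in current beliefs
            self.act(intention, environment)

        @abstractmethod
        def perceive(self, environment): ...

        @abstractmethod
        def appraise(self, percepts): ...

        @abstractmethod
        def select_desire(self): ...

        @abstractmethod
        def select_intention(self, desire): ...

        @abstractmethod
        def act(self, intention, environment): ...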
2 The Proposed Framework

[Fig. 1 diagram not reproduced: Perception feeds the Beliefs (knowledge of the world, emotional state of others, group status, physical status, desire status); Appraisal combines Personality (OCEAN) and Emotion (emotional state); Decision Making selects among Desires (evacuation, threat avoidance, group related) and Intentions, which lead to Action in the simulation environment, alongside other agents.]
Fig. 1. The proposed agent architecture

The proposed agent architecture (Fig. 1) is based on the classic BDI (Beliefs-Desires-Intentions) architecture, enriched with the incorporation of Personality and Emotions. The agent's operation cycle starts with the Perception phase, where the agent acquires information on the current world state through its sensory subsystem. Depending on the agent's emotional state at the time, its perception may be affected and some information may be missed. The newly acquired information is used to update the agent's Beliefs. Based upon its new beliefs, the agent performs an appraisal process, using its personality and its knowledge about the environment in order to update its emotional state. The agent's Decision making process follows, where current conditions, personality and the agent's
own emotional state are synthesized in order to generate a Desire. This desire is fulfilled through an appropriate Intention, which is executed as a ground action in the simulation environment.
The personality model adopted in the proposed framework is the Five Factor Model [12], also known as OCEAN after the initials of the five personality traits it defines: Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism. Every agent is considered to possess these characteristics in varying degrees and is assigned a personality vector, instantiated with values representing their intensity. It has been shown in psychology research that personality and emotion, although distinct concepts, are closely interdependent and that there is a link between personality and emotion types [13]. Based on this premise, the proposed architecture follows an original approach for evacuation simulation systems, closely intertwining the functions of emotion and personality in practically every step of the operation cycle.
The emotion model adopted is based on the OCC model, and particularly on its revised version as presented in [14]. In the current approach we model five positive/negative emotion types, the first of which is an undifferentiated positive/negative emotion, as coupled emotion pairs: Joy/Distress, Hope/Fear, Pride/Shame, Admiration/Reproach and SorryFor/HappyFor. The first three emotions concern the agent itself, while the last two focus on other agents. Each agent is assigned a vector representing its emotion status at a specific temporal instance of the simulation.
Agents can perceive objects, events and messages through sensors and update their beliefs accordingly. Their initial beliefs include at least one route to the exit, i.e. the route they followed when entering the building, and, besides immediate perception, they may acquire knowledge about other exits or blocked paths through the exchange of messages. Agents can also perceive the emotional state of others, which may impact their own emotions as well. The agent's own emotional state may influence perception, affecting
an agent's ability to notice an exit sign or an obstacle. Relationships between agents, such as reproach or admiration, may cause a communication message to be ignored or accepted as true, respectively. Perceived events, actions and information from the environment are appraised according to their consequences for the agent's goals and for the well-being of itself as well as of other agents. All events, negative or positive, affect one or more of the agent's emotions to a varying degree, according to its personality. The level of influence a particular event may have on the agent's emotional status depends on its evaluation, the agent's personality, and an association matrix that links personality traits to emotions. This drives the agent into an intermediate emotional state that affects the agent's appraisal of the actions of other agents. Attribution-related emotions (pride/shame, admiration/reproach) are updated by evaluating the current desire's achievement status with respect to an emotional expectation that is associated with each desire. Finally, the agent's emotional state is updated with the calculated impact. This process is repeated for all newly perceived events.
Every agent has a number of competing desires, each of which is assigned an importance value. This value is affected by the agent's emotional state, its personality, and its beliefs about the state of the environment and of its group. These desires have been determined by surveys on human action during emergency situations [8] and include: a) move towards an exit, b) receive information, c) transmit information, d) join a group, e) maintain a group, f) expand a group and g) avoid threat. Each of these is assigned a set of activation conditions and can become active only if these conditions are met. Once the decision process starts, the activation conditions of all desires are checked and the valid desires are determined. The agent, in every cycle, will try to accomplish the desire with the highest importance value. This value is calculated as the weighted sum of two independent values, one derived from the agent's personality and one from its current emotional status. The first is produced using an association matrix that relates specific OCEAN profiles to personality-based importance values for each desire. The relative distance of the agent's personality vector to the profiles in the association matrix determines the personality-based importance value that will be assigned to the active desires. On the other hand, emotion-based importance values are assigned according to the agent's current emotional state and the expected emotional outcome if the desire is fulfilled.
Once an agent is committed to pursuing a desire, a list of possible intentions for its fulfillment becomes available. For example, the "evacuate" desire can be translated to either "move to known exit" or "search for exit" and "follow exit sign". The selection of the most appropriate intention depends on the current knowledge of the world. Choosing an intention translates to basic actions like walk, run or wait, which are affected by the emotional state. For example, agents in a state of panic will be less careful in terms of keeping their personal space and will not decrease their speed significantly when approaching other agents, leading to inappropriate and dangerous behaviors, such as pushing. Social interactions are modeled through group dynamics.
There are two types of groups: static groups, representing families and friends, which do not change during the simulation, and emergent groups. The latter are formed during the simulation based on the agents' personality, evacuation experience, message exchange, and the relationships
established between agents. Once established, such a relationship is evaluated in terms of achieving a goal, keeping safe and maintaining personal space. The size of the groups is also an important factor influencing the merging of nearby groups.
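To make the desire-selection mechanism of this section concrete, the sketch below scores the active desires as a weighted sum of a personality-based term (similarity of the agent's OCEAN vector to a per-desire profile) and an emotion-based term (alignment of the expected emotional outcome with the current emotional state). The profiles, outcome vectors and weights are invented placeholders; the paper does not publish its association matrices.

    import numpy as np

    OCEAN_TRAITS = ["openness", "conscientiousness", "extraversion",
                    "agreeableness", "neuroticism"]
    EMOTION_PAIRS = ["joy_distress", "hope_fear", "pride_shame",
                     "admiration_reproach", "happyfor_sorryfor"]

    def select_desire(personality, emotions, active_desires,
                      profiles, expected_outcome, w_pers=0.5, w_emo=0.5):
        """personality, emotions: NumPy vectors over OCEAN_TRAITS / EMOTION_PAIRS;
        profiles[d], expected_outcome[d]: per-desire vectors of the same sizes.
        Returns the active desire with the highest importance value."""
        scores = {}
        for d in active_desires:                  # e.g. "evacuate", "join_group", ...
            pers_score = 1.0 - (np.linalg.norm(personality - profiles[d])
                                / np.sqrt(len(OCEAN_TRAITS)))
            emo_score = float(np.dot(expected_outcome[d], emotions))
            scores[d] = w_pers * pers_score + w_emo * emo_score
        return max(scores, key=scores.get)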
3 Simulation Environment

The authors have set up a simulation environment of fire evacuation as an implementation of the proposed framework. The environment is a continuous 2D space in which all static elements are represented as polygonal obstacles and the fire is modeled as a set of expanding regions. The initial agent population, the demographic and personality distributions, and the position and spread parameters of the fire are user-defined. Agents have individual visual and aural perception abilities and can detect the alarm sound, other agents, the fire, exit signs and exits. They are equipped with a short-term memory, which they use to remember the last observed positions of people and elements that are no longer in their field of view. The visual and aural perception abilities of each agent can be temporarily reduced due to its current emotional state and crowd density.
The agents can demonstrate a variety of goal-oriented behaviors. They can explore the environment in search of the exit, a specific group or a specific person; they can move individually, such as following an exit sign or moving to a known exit, or they can perform coordinated motion behaviors, such as following a group or waiting for slower group members to follow. These behaviors are selected according to the agent's desire with the highest priority and the associated intentions it is committed to. Agents may get injured or die during the simulation if they are found in areas of great congestion or if they find themselves very close to the fire.
The authors ran a series of scenarios under a variety of initial conditions to test the simulation results and to evaluate the proposed framework. The initial tests showed a number of promising results. Emergent groups were formed during evacuation, due to agents taking the role of a leader and inviting other agents to follow. Some members abandoned their groups because of an increase in anger towards the leader, e.g., due to a series of observed negative events, such as the injury of group members or close proximity to the fire. The sight of fire and the time pressure caused an increase in negative emotions, such as fear and distress, and some agents demonstrated non-adaptive pushing behavior. This behavior was appraised negatively by other observer agents, causing distress to spread through the crowd population and leading to an increased number of injuries. Furthermore, the perception of the alarm sound caused agents to seek information about the emergency and to exchange messages with each other about exit routes and the fire location. Missing members of preexisting groups caused other group members to search for them, often ignoring bypassing groups and moving in opposite directions.
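As a small illustration of the emotion-modulated perception mentioned at the start of this section, the sketch below shrinks an agent's visual range and field of view as its fear intensity grows; the scaling factors are arbitrary placeholders.

    import math

    def sees(agent_pos, agent_heading, target_pos, base_range, base_fov_deg, fear):
        """Visibility test in the 2D environment: effective range and field of
        view decrease with the agent's fear intensity (0..1)."""
        eff_range = base_range * (1.0 - 0.5 * fear)
        eff_fov = math.radians(base_fov_deg * (1.0 - 0.3 * fear))
        dx, dy = target_pos[0] - agent_pos[0], target_pos[1] - agent_pos[1]
        if math.hypot(dx, dy) > eff_range:
            return False
        bearing = math.atan2(dy, dx) - agent_heading
        bearing = abs((bearing + math.pi) % (2 * math.pi) - math.pi)  # wrap to [0, pi]
        return bearing <= eff_fov / 2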
4 Conclusions and Future Work

We presented a simulation framework for crowd evacuation that incorporates computational models of emotion and personality in order to generate realistic behaviors in emergency scenarios. The proposed approach is based on research results
about the actual crowd responses observed during real emergency situations or drills. The initial implementation results demonstrated the ability of the simulation platform to generate a variety of behaviors consistent with real-life evacuations. These include emergent group formation, bi-directional motion, altruistic behaviors and emotion propagation. Future work includes further research on emotion models and appraisal theories to formalize the decision-making mechanism under evacuation scenarios. Further study of the complex social processes characterizing group dynamics is also needed. Furthermore, we are planning to run a series of case studies using various age and personality distributions and to compare the results with published data from real emergency evacuations in order to evaluate the validity of the proposed framework.
References
1. Still, G.K.: Review of pedestrian and evacuation simulations. Int. J. Critical Infrastructures 3(3/4), 376–388 (2007)
2. Pelechano, N., Allbeck, J.M., Badler, N.I.: Virtual Crowds: Methods, Simulation and Control. Morgan & Claypool, San Francisco (2008)
3. Pan, X., Han, C.S., Dauber, K., Law, K.H.: Human and social behavior in computational modeling and analysis of egress. Automation in Construction 15 (2006)
4. Musse, S.R., Thalmann, D.: Hierarchical model for real time simulation of virtual human crowds. IEEE Transactions on Visualization and Computer Graphics, 152–164 (2001)
5. Luo, L., et al.: Agent-based human behavior modeling for crowd simulation. Comput. Animat. Virtual Worlds 19(3-4), 271–281 (2008)
6. Helbing, D., Farkas, I., Vicsek, T.: Simulating dynamical features of escape panic. Nature, 487–490 (2000)
7. Shao, W., Terzopoulos, D.: Autonomous pedestrians. In: Proc. ACM SIGGRAPH, pp. 19–28 (2005)
8. Zhao, C.M., Lo, S.M., Liu, M., Zhang, S.P.: A post-fire survey on the pre-evacuation human behavior. Fire Technology 45, 71–95 (2009)
9. Proulx, G.: Occupant Behavior and Evacuation. In: Proceedings of the 9th International Fire Protection Symposium, Munich, May 25-26, 2001, pp. 219–232 (2001)
10. Turner, R.H., Killian, L.M.: Collective Behavior, 3rd edn. Prentice-Hall, Englewood Cliffs (1987)
11. Chertkoff, J.M., Kushigian, R.H.: Don't Panic: The Psychology of Emergency Egress and Ingress. Praeger, Westport (1999)
12. Costa, P.T., McCrae, R.R.: Normal personality assessment in clinical practice: The NEO Personality Inventory. Psychological Assessment, 5–13 (1992)
13. Ortony, A.: On making believable emotional agents believable. In: Trappl, R., Petta, P., Payr, S. (eds.) Emotions in Humans and Artifacts. MIT Press, Cambridge (2003)
14. Zelenski, J., Larsen, R.: Susceptibility to affect: a comparison of three personality taxonomies. Journal of Personality 67(5) (1999)