tsb = <Smart, Mercedes, black>, tmr = <Mini, Austin, red>, tmb = <Mini, Austin, black>. A preference for red cars implies, in the totalitarian semantics, that all the red cars are preferred to all the black cars, regardless of the other features of the cars: tsb
of product indicates that the price of product p is pr euros. Assume that a user is interested in the stores which have ordered about 50 copies of at least all the products priced under 20 euros. With a regular DBMS and query language (typically SQL), it is mandatory to translate the term “about 50 copies” in terms of a Boolean condition, for instance, an interval of the type [50 – a, 50 + a]. However, it is worth noticing that a small increase (respectively decrease) of the interval (variation of a) may lead to an undesirable behavior: elements initially selected (respectively discarded) are rejected (respectively accepted) due
to a larger (respectively smaller) divisor. Calling on preferences may be a convenient way to counter this abrupt behavior. So, instead of the interval [45, 55], one will specify that 49, 50, and 51 are ideal values, 48 and 52 very satisfactory, …, 44 and 56 borderline, and the others unacceptable. In general, preferences may apply to both the dividend and the divisor relations, and it is of particular interest to study their impact on the resulting relation, as will be discussed in the next section. Now, let us suppose that the divisor relation involves 20 elements (a1, …, a20) and that the dividend relation contains the pairs: {<x, a1>, <x, a6>, <x, a20>,
Neither x, nor y, nor z is satisfactory as to the division of r by s. Nevertheless, it seems legitimate to think that if x is definitely inadequate, y and z
Versatility of Fuzzy Sets for Modeling Flexible Queries
are almost satisfactory since they are associated with respectively 19 and 18 of the ai of the divisor. Thus, one may be interested in distinguishing between these quite different situations through tolerance to exceptions. An "all or nothing" approach will accept y and z provided that a 10% ratio of exceptions is allowed. It is also possible to adopt a graded view according to which exceptions are a matter of preferences and then their ratio (or number) a matter of degree. For instance, full satisfaction is maintained if exceptions are under 8%; above 15%, it becomes zero; and in-between, satisfaction decreases in a linear way. In the preceding case, exceptions are treated on a quantitative basis, that is, according to their number. So, it is impossible to compensate a large number of small exceptions, that is, to account for the notion of low-level exceptions which may occur in the context of fuzzy relations. For instance, let us consider the fuzzy relations:

r = {1/…

With Zadeh's (all or nothing) inclusion of E in F defined as:

E ⊆ F ⇔ ∀x ∈ X, µE(x) ≤ µF(x)   (4)

we could say that s is almost included in Ωr(x) since the grades almost agree on the previous condition. Of course, the notion of qualitative exceptions may be dealt with in a gradual way (i.e., an exception is more or less a low-level one) so as to prevent a sharp behavior of the tolerance mechanism. Tolerance may also come into play in the following case. Let us assume dividend and divisor relations where the common attribute is provided with a resemblance (or proximity) relation. For instance, the divisor contains value a, while the dividend does not involve the pair <x, a>, but <x, b> where a and b are close to each other. In such a case, one might consider that <x, b> is a somewhat acceptable substitute for <x, a>, which is required for a strict division.

Tolerant strategies, which are studied in more depth later, may be adopted either directly or because the initial query (with a nontolerant division) has led to an empty answer. In the latter case, the new (tolerant) division represents a weakened form of the regular one, and a nonempty answer can then be expected.

Division of Fuzzy Relations

Objectives and Basic Tools

Now, one considers fuzzy relations in the sense given previously, that is, whose tuples are weighted. In this context, one can envisage queries similar to that of example 7, for instance: "to what extent does each candidate have all the highly important skills required for the position with a medium level."

If selection and projection are those defined before, this query can be algebraically expressed thanks to a division of fuzzy relations as:

div(project(select(c, level = "medium"), {#i, apt}), project(select(p, importance = "high"), {apt}), {apt}, {apt})
provided that this operation is appropriately extended (i.e., is defined when the operand relations become fuzzy). Such an extension is based on an adaptation of formula (2) inside which the Boolean inclusion is replaced so that (1) its arguments may be fuzzy sets and (2) its result is graded (i.e., it delivers a degree of inclusion). Several ways of extension can be devised, among which:

deg(E ⊆ F) = minx ∈ X µE(x) ⇒f µF(x)   (5)
where ⇒f denotes a fuzzy implication intended to generalize the regular one, that is, a mapping from [0, 1] × [0, 1] into [0, 1] obeying a number of properties, among which: (1) 0 ⇒f a = 1, (2) a ⇒f 1 = 1, (3) 1 ⇒f a = a, and (4) decreasing (respectively increasing) monotonicity with respect to the first (respectively second) argument;

deg(E ⊆ F) = card(E ∩ F)/card(E) = Σx ∈ X ⊤(µE(x), µF(x)) / Σx ∈ X µE(x)   (6)

where ⊤ is a triangular norm. Formula (5) conveys a logical view, whereas formula (6) is cardinality-based; the extended divisions issued from these two approaches are presented hereafter.
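To make the two graded inclusions concrete, here is a small illustrative sketch (our code, not the authors'; fuzzy sets are represented as dictionaries of membership grades) of formulas (5) and (6):

```python
# Illustrative sketch of the two graded inclusions of formulas (5) and (6).
# A fuzzy set is a dict mapping elements to membership grades in [0, 1].

def godel_impl(p, q):
    # Goedel R-implication: maximal as soon as the conclusion reaches p.
    return 1.0 if p <= q else q

def kleene_dienes_impl(p, q):
    # Kleene-Dienes S-implication: max(1 - p, q).
    return max(1.0 - p, q)

def incl_logical(E, F, impl=godel_impl):
    # Formula (5): deg(E incl F) = min over x of impl(mu_E(x), mu_F(x)).
    return min(impl(mu, F.get(x, 0.0)) for x, mu in E.items())

def incl_cardinality(E, F, tnorm=min):
    # Formula (6): relative cardinality of E inter F in E (t-norm = min here).
    return sum(tnorm(mu, F.get(x, 0.0)) for x, mu in E.items()) / sum(E.values())

E = {"a": 1.0, "b": 0.6}
F = {"a": 0.8, "b": 1.0}
print(incl_logical(E, F))                # Goedel: min(0.8, 1) = 0.8
print(round(incl_cardinality(E, F), 3))  # (0.8 + 0.6) / 1.6 = 0.875
```

Note how (5) is driven by the worst element only, whereas (6) lets good elements partially compensate for bad ones.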
The Logical Approach

If r is a fuzzy relation, from:

Ωr(x) = {µ/a | µ/<a, x> ∈ r}

and formula (2), the definition of the division of fuzzy relations obtained is:

∀x ∈ supp(project(r, X)), µdiv(r, s, A, B)(x) = mina ∈ supp(s) µs(a) ⇒f µr(a, x)   (7)

where supp(E) denotes the support of a fuzzy set E, that is, the set of elements with a strictly positive degree. Two types of fuzzy implications are considered in the rest of this chapter due to their clear meaning and properties: R-implications and S-implications. Let us recall that these two types of implications can both be written in a "residuated" form (Dubois & Prade, 1984) as:

p ⇒f q = sup {u ∈ [0, 1] | cnj(p, u) ≤ q}   (8)

where cnj(a, b), the generator of the considered implication, is a continuous triangular norm for an R-implication and a continuous noncommutative conjunction for an S-implication. One can notice that the regular division (formula (2)) is recovered from formula (7) in the presence of regular relations, due to the fact that any fuzzy implication coincides with the usual one in that case (in particular 1 ⇒f 0 = 0 and 1 ⇒f 1 = 1). Such a definition guarantees that the result obtained by the division is a quotient (Bosc, Pivert, & Rocacher, 2007). In effect, using this generator, the Cartesian product of the divisor and the result t of the division is included (in Zadeh's sense) in the dividend r and it is maximal, that is:

∀x, x ∈ project(supp(t), X) and µt(x) = d ⇒ s × {d/<x>} ⊆ r   (9a)
∀x, x ∈ project(supp(r), X) and µt(x) = d and d1 > d ⇒ s × {d1/<x>} ⊄ r   (9b)

Of course, the use of an R-implication or an S-implication in formula (7) has a strong impact on the semantics of the obtained division. It turns out that R-implications can be rewritten as:

p ⇒R-i q = 1 if p ≤ q, f(p, q) otherwise

where f(p, q) accounts for the penalty applied when the conclusion q does not reach the antecedent p. It is worth noticing that, using an R-implication, if E is included in F in Zadeh's sense, formula (5) returns the maximal degree of inclusion. The degree attached to an element of the divisor (relation s) plays the role of a threshold: the higher it is in a tuple of the divisor, the higher it should be in the corresponding tuple of the dividend in order for x to get the maximal grade 1. Similarly, S-implications can be alternatively formulated as:

p ⇒S-i q = ⊥(1 – p, q)   (10)

where ⊥ stands for a triangular co-norm extending the usual disjunction. As a consequence, the antecedent p may be considered as playing the role of a degree of importance, and (1 – p) is a guaranteed level of satisfaction. Here, one may remark that, in general, the maximal degree of inclusion is not obtained with formula (5) when E is included in F
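As a concrete illustration of formula (7), here is a sketch (our code, under the chapter's definitions; the data are those of Tables 10a and 10b used in example 8 below):

```python
# Division of fuzzy relations, formula (7):
#   mu_div(x) = min over a in supp(s) of (mu_s(a) =>f mu_r(a, x)).

def godel_impl(p, q):
    # Goedel R-implication: the divisor grades act as thresholds.
    return 1.0 if p <= q else q

def fuzzy_division(r, s, impl):
    # r maps (a, x) pairs to grades, s maps a-values to grades.
    xs = sorted({x for (_, x) in r})
    return {x: min(impl(mu_a, r.get((a, x), 0.0)) for a, mu_a in s.items())
            for x in xs}

c_rl = {("A", "c1"): 1.0, ("B", "c1"): 0.6, ("C", "c1"): 0.4,
        ("A", "c2"): 0.8, ("B", "c2"): 1.0}          # Table 10a
p_hw = {"A": 1.0, "B": 0.8, "C": 0.2}                # Table 10b

print(fuzzy_division(c_rl, p_hw, godel_impl))
# {'c1': 0.6, 'c2': 0.0}: c2 fails because <C, c2> is absent (0.2 => 0 gives 0)
```

With Gödel implication, a missing pair of the dividend immediately yields 0, which illustrates why tolerant (approximate) divisions are introduced later.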
in the usual sense. In fact, the notion of inclusion conveyed by an S-implication is more drastic. For instance, with Kleene-Dienes implication (p ⇒K-D q = max(1 – p, q)) or Reichenbach implication (p ⇒S-i q = 1 – p + pq), 1 is obtained if the support of E (elements with a positive degree) is included in the core (elements with the degree 1) of F. Note that this does not mean that a degree of inclusion based on an S-implication is less than one based on an R-implication, as illustrated in example 8. In this context, the higher the degree of a in the divisor, the more important the degree of <a, x> in the dividend (i.e., the more this degree influences the grade assigned to x).

Example 8. Let us consider the relations curriculum and profile of example 7 and the query looking for the candidates who have all the highly weighted skills with a reasonable level. This query leads to the division of the two relations c-rl and p-hw and, with the extensions of Tables 10a and 10b, one gets: 0.6/<c1>, … is the generator of Kleene-Dienes implication) leads to: pc2 = {0.6/…
Table 10a. Relation c-rl

    #i    skill    µ
    c1      A      1
    c1      B      0.6
    c1      C      0.4
    c2      A      0.8
    c2      B      1

Table 10b. Relation p-hw

    skill    µ
      A      1
      B      0.8
      C      0.2
… provided that the containment operator is appropriately parameterized.♦
The Cardinality-Based Approach

As mentioned above, a degree of inclusion can be built from formula (6), which, when used in the definition of the division, yields:

∀x ∈ supp(project(r, X)), µdiv(r, s, A, B)(x) = Σa ∈ supp(s) ⊤(µr(a, x), µs(a)) / Σa ∈ supp(s) µs(a)
where ⊤ denotes a triangular norm. While such an approach seems legitimate, its definite validation depends on whether or not the result is a quotient. It turns out that this is not the case. In effect, it may happen (Bosc et al., 2007) that: (1) the Cartesian product (using the smallest triangular norm) of the divisor and the smallest result of such a division (i.e., using the smallest norm in the above expression) is not included in the dividend, and (2) the Cartesian product (using the largest noncommutative conjunction) of the divisor and the largest result of such a division (i.e., using the largest norm, min, in the above expression) is not maximal.
Exception-Based Approximate Division As mentioned earlier, an approach to extending the division consists in the tolerance to exceptions, which leads to an approximate division. This can be understood in two different ways depending on the exceptions which can be of a quantitative or qualitative nature. Obviously, the result delivered by any approximate division must be a superset of the one returned by the regular division.
Quantitative Approach

The definition of a quantitative approximate division is based on the allowance for the existence of some elements of the divisor (s) not connected in the dividend (r) with the value x under consideration. In other words, a certain number of values
of s can be more or less ignored depending on the authorized level of relaxation. The principle adopted is to weaken the universal quantifier into the relative fuzzy quantifier "almost all" (Kerre & Liu, 1998; Zadeh, 1983), modeled as a function from the unit interval into itself. In so doing, degrees are associated with the weakened quantifier and the result is a fuzzy relation, although the input relations may be nonfuzzy ones. These grades convey a natural semantics, namely the degree of satisfaction obtained when a given number of values are ignored. Let us remark that a similar approach is adopted in Galindo et al. (2001) in the context of a division of relations involving imprecise data represented as possibility distributions.
Quantitative Approximate Division of Regular Relations

Such an approximate division is a way of answering queries like: "to what extent does each candidate possess, with a level over 3, almost all the skills whose importance is greater than 4." The definition of a quantitative approximate division of regular relations is based on the allowance for some elements of the divisor (s) not (at all) connected in the dividend r (i.e., absent) with the value x under consideration. In other words, a certain number of values of s can be more or less ignored depending on the authorized level of relaxation. The grades are defined as follows: µalmost all(0) = 0, µalmost all(1) = 1, and µalmost all(1 – i/n) = wi expresses the degree of satisfaction when i out of the n elements of the referential are ignored. By definition:

1 = w0 ≥ w1 ≥ … ≥ wn = 0

and if we denote k1 = sup {j | wj = 1}, k2 = sup {j | wj > 0},
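The weights wi and the bounds k1 and k2 can be sketched as follows (our code; the quantifier shape is the one used in example 9 below: 0 under 75%, 1 above 95%, linear in-between):

```python
# Sketch of the weights w_i attached to a relative quantifier "almost all",
# together with the bounds k1 and k2 defined above.

def almost_all(f):
    if f <= 0.75:
        return 0.0
    if f >= 0.95:
        return 1.0
    return (f - 0.75) / 0.2

def weights(mu_q, n):
    # w_i = mu_q(1 - i/n), for i = 0 .. n ignored elements.
    return [round(mu_q(1 - i / n), 2) for i in range(n + 1)]

w = weights(almost_all, 10)
k1 = max(j for j, wj in enumerate(w) if wj == 1)   # total ignorance bound
k2 = max(j for j, wj in enumerate(w) if wj > 0)    # partial ignorance bound
print(w[:4], k1, k2)  # [1.0, 0.75, 0.25, 0.0] 0 2
```

With this quantifier and a 10-element divisor, up to two exceptions can be partially ignored, which is exactly what example 9 exploits.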
the quantifier allows for the total ignorance of k1 exceptions (and the partial ignorance of up to k2 exceptions). The quantitative approximate division of relation r(A, X) by s(B) is obtained by:

∀x ∈ project(r, X), µquant-app-div(r, s, A, B)(x) = wi   (11)

with i = card({a | <a> ∈ s and <x, a> ∉ r}) and n = card(s). If the quantifier "almost all" is Boolean, the result obtained is a Boolean one, since then one is completely satisfied for a number of exceptions under (or equal to) k1 = k2. Moreover, formula (11) generalizes formula (2), which is recovered with the universal quantifier characterized by w0 = 1, w1 = … = wn = 0, with which no exception is accepted. The resulting relation t can be shown to be a quotient according to the following characterization formulas:

∀x, x ∈ project(t, X) and µt(x) = d ⇒ s × {d/<x>} ⊆a r   (12a)
∀x, x ∈ project(r, X) and µt(x) = d and d1 > d ⇒ s × {d1/<x>} ⊄a r   (12b)

where the inclusion operator used (⊆a) is an approximate one accounting for the tolerance which took place during the division (see Bosc et al., 2007, for details).

Example 9. Let us take the relations r and s respectively defined over the schemas LessThan3km(#hotel, #site) and Guide(#site) with the following extensions:

r = {<h1, s1>, …, <h1, s8>, <h2, s1>, …, <h2, s9>, <h3, s1>, …, <h3, s10>}
s = {s1, …, s10}
and the quantifier “almost all” is defined as: µalmost all (f) = 0 if f ∈ [0, 0.75], µalmost all (f) = 1 if f ∈ [0.95, 1], µalmost all (f) linearly increasing if f ∈ [0.75, 0.95].
The query looking for hotels located at less than 3 km from almost all sites described in the guide is based on the approximate division of r by s. In r, hotel h1 is associated with 8 sites of s, h2 with 9, and h3 with all of the 10 sites of s. Since the referential has 10 elements, the quantifier allows for somewhat ignoring up to 2 elements (k1 = 0, k2 = 2). So, the result t of the approximate division of r by s using formula (11) is:

µquant-app-div(r, s, A, B)(h1) = 0.25, µquant-app-div(r, s, A, B)(h2) = 0.75, µquant-app-div(r, s, A, B)(h3) = 1,

while that of the regular division involves only h3. Clearly, the Cartesian product of s by t ({0.25/<h1, s1>, …, 0.25/<h1, s10>, 0.75/<h2, s1>, …, 0.75/<h2, s10>, 1/<h3, s1>, …, 1/<h3, s10>}) is not included in r, since some exceptions occurred for h1 and h2, which have been partly ignored. However, one may observe that the two (respectively one) extra elements relative to h1 (respectively h2) have a degree which corresponds to w2 (respectively w1), that is, the ignorance of 2 (respectively 1) elements, and this is the rationale of the approximate inclusion used in formulas (12a) and (12b).♦
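The computation of example 9 can be replayed with a short sketch of formula (11) (our code, illustrative only):

```python
# Formula (11) on the data of example 9: h1, h2, h3 are within 3 km of
# 8, 9 and 10 of the 10 guide sites, respectively.

def almost_all(f):
    if f <= 0.75:
        return 0.0
    if f >= 0.95:
        return 1.0
    return (f - 0.75) / 0.2

sites = ["s%d" % i for i in range(1, 11)]
r = ({("h1", a) for a in sites[:8]} |
     {("h2", a) for a in sites[:9]} |
     {("h3", a) for a in sites})

def quant_app_div(r, s, mu_q):
    n = len(s)
    hotels = sorted({x for (x, _) in r})
    # grade of x is w_i, i being its number of exceptions among s
    return {x: round(mu_q(1 - sum((x, a) not in r for a in s) / n), 2)
            for x in hotels}

print(quant_app_div(r, sites, almost_all))
# {'h1': 0.25, 'h2': 0.75, 'h3': 1.0}
```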
Quantitative Approximate Division of Fuzzy Relations

The objective of the approximate division of fuzzy relations is to answer queries such as: "to what extent does each candidate possess with a medium level almost all the fairly important skills." As in the previous subsection, one considers that the presence of values of the divisor insufficiently (or not at all) connected with x in the dividend r may be more or less compensated by the weakening of the universal quantifier into "almost all." The idea is to ignore to a certain extent the result of a fuzzy implication provided that there is a strictly greater grade issued from the quantifier. Here again, the reasoning is made at a quantitative level, and it is based on a number of more or less acceptable exceptions (to the complete inclusion according
to the type of fuzzy implication used). It follows that the quantitative approximate division of the fuzzy relation r(A, X) by the fuzzy relation s(B) is defined as:

∀x ∈ proj(supp(r), X), µquant-app-div(r, s, A, B)(x) = infi max(αi, wi)   (13)

where αi is the ith smallest implication degree (µs(aj) ⇒f µr(aj, x)) and wi = µalmost all(1 – i/n) is the degree of ignorance issued from the quantifier for relation s, whose cardinality is n; that is, n implication values intervene in formula (13). It is of particular interest to notice that this formula is recovered if one adopts the view according to which one searches for the best k such that k elements of the divisor s are connected with x in the dividend and k is compatible with "almost all." One observes that an insufficiently satisfactory implication value is replaced by the degree of satisfaction corresponding to the number of values ignored so far (according to "almost all"). When wi is 1, total ignorance takes place, whereas if wi is 0, the associated value αi is completely taken into account as such. It appears that degrees of ignorance act inversely with respect to levels of importance (used for instance in the weighted conjunction discussed in Dubois & Prade, 1986). They define degrees of guaranteed satisfaction, that is, if p implication values are ignored, the satisfaction level is at least wp. Furthermore, associating the largest ignorance degree with the smallest implication value, the second largest ignorance degree with the second smallest implication value, and so on, is optimal as to the grade assigned to an element x in the resulting relation. Obviously, the user can choose the fuzzy implication to be used in formula (13) so as to specify the role played by the degrees of the divisor (threshold or importance). Furthermore, if the quantifier Q1 is included in Q2 (according to formula (4)), the result of the approximate division founded on Q1 is included in that of the approximate division based on Q2. In the presence of regular relations, formula
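Formula (13) can be sketched as follows (our code; the implication degrees used are those computed for hotel h1 in example 10 below):

```python
# Formula (13): the i-th smallest implication degree alpha_i is
# lower-bounded by the ignorance degree w_i issued from "almost all".

def almost_all(f):
    if f <= 0.75:
        return 0.0
    if f >= 0.95:
        return 1.0
    return (f - 0.75) / 0.2

def quant_app_div_fuzzy(impl_degrees, mu_q):
    n = len(impl_degrees)
    alphas = sorted(impl_degrees)                   # alpha_1 <= ... <= alpha_n
    w = [mu_q(1 - i / n) for i in range(1, n + 1)]  # w_1 .. w_n
    return min(max(a, wi) for a, wi in zip(alphas, w))

# implication degrees of example 10 for hotel h1 (Goedel implication):
h1 = [0.1, 0.1, 0.2, 0.5, 0.7, 1, 1, 1, 1, 1]
print(quant_app_div_fuzzy(h1, almost_all))  # 0.2
```

Sorting the implication degrees pairs the largest ignorance degrees with the smallest implication values, which the text notes is the optimal association.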
(13) delivers the same result as formula (11), and if the universal quantifier (w0 = 1, w1 = … = wn = 0) is used in formula (13), formula (7) is recovered. Last, it is proved in Bosc et al. (2007) that the result of this division is a quotient provided that: (1) an appropriate approximate Boolean inclusion is used in the characterization formulas, and (2) either the divisor is a normalized relation (i.e., at least one element has the maximal degree 1), or the relaxed quantifier is nonfuzzy.

Example 10. Let r with schema Close(#hotel, #site) and s with schema InterestingSites(#site) be two fuzzy relations. The tuple µ/<st, ht> of relation r gives the extent to which site st is close to hotel ht. Similarly, the tuple µ/<st> expresses the interestingness of site st. One considers the quantitative approximate division of r by s with the quantifier "almost all" of example 9, that is, one looks for hotels more or less close to almost all sites of interest. In this perspective, the degrees issued from the quantifier are once again w1 = 0.75, w2 = 0.25, w3 = … = w10 = 0 if s contains 10 tuples. Using Gödel implication and the following extensions of r and s:

r = {0.1/<…>, 0.1/<…>, 0.2/<…>, 0.5/<…>, 0.7/<…>, 0.9/<…>, 1/<…>, 1/<…>, 0.2/<…>, 0.5/<…>, 0.3/<…>, …}
the result of the division is:

for h1: min(max(0.1, 0.75), max(0.1, 0.25), max(0.2, 0), max(0.5, 0), max(0.7, 0), max(1, 0), …, max(1, 0)) = 0.2
for h2: min(max(0.3, 0.75), max(0.3, 0.25), max(1, 0), …, max(1, 0)) = 0.3.

For h1, the two implication values 0.1 are ignored thanks to w1 and w2. For h2, only the first implication value 0.3 is ignored (thanks to w1). The result of the approximate division of r by s is t = {0.2/<h1>, 0.3/<h2>}, while that of the regular one is t' = {0.1/<h1>, 0.3/<h2>}, strictly included in t.♦
Qualitative Approach

In the previous subsection, exceptions have been dealt with in a quantitative way. In this context, the quantitative inclusion of E in F expresses that "almost all elements of E are included in F according to the chosen implication." An alternative approach is to take a qualitative view and to consider a qualitative inclusion operator expressing "all elements of E are almost included in F according to the chosen implication." Exceptions are then taken into account according to the idea of "almost inclusion," which leads to a qualitative view. In other words, the idea is to (more or less) compensate the initial value of the implication when it expresses a sufficiently "low-level" exception. Intuitively, the idea is to consider exceptions with respect to the inclusion in the following sense: if one looks for the inclusion of E in F, compensation takes place for an element x such that µE(x) and µF(x) are sufficiently close to the situation of inclusion. It seems reasonable to consider that closeness in this situation is a matter of degree rather than based on a crisp boundary. For instance, one may think that if we must be close to a, a ± 0.1 is totally acceptable, a shift beyond 0.3 cannot be tolerated, and the satisfaction is linear in-between. Of course, this does not make sense for regular relations, for which exceptions correspond to the case where µE(x) equals 1 and µF(x) is zero. Due to their specificity, R-implications and S-implications must be considered separately.
Use of an S-Implication

Looking at formula (10), one may observe that any S-implication is all the more satisfied as the antecedent is close to 0 or the conclusion close to 1, since the co-norm generalizes the notion of disjunction. From this, the situation of inclusion can be formulated as follows: the antecedent is close to 0 or the conclusion is close to 1. Then, the qualitative approximate division is expressed similarly to formula (7). One could think that the way compensation acts is similar to what is expressed
by formula (13). It turns out that this does not work, in the sense that it is not possible to characterize the resulting relation in terms of a quotient (this holds regardless of the type of implication used). This is why the qualitative approximate division is defined as:

∀x ∈ project(r, X), µqual-Si-app-div(r, s, A, B)(x) = mina ∈ supp(s) (µs(a) – δ1) ⇒S-i (µr(a, x) + δ2)   (14)

where
δ1 = µs(a) if µs(a) ≤ α, 0 if µs(a) ≥ β, linear in-between,
δ2 = 1 – µr(a, x) if µr(a, x) ≥ 1 – α, 0 if µr(a, x) ≤ 1 – β, linear in-between,

letting α and β be the lower and upper bounds of an interval of [0, 1]. It is easy to see that with such a division, it is possible to characterize the resulting relation as a quotient provided that the characterization accounts for the tolerance introduced in the division. This leads to two conditions like (9a) and (9b), namely:

∀x, x ∈ project(supp(t), X) and µt(x) = d ⇒ s' × {d/<x>} ⊆ r'   (15a)
∀x, x ∈ project(supp(r), X) and µt(x) = d and d1 > d ⇒ s' × {d1/<x>} ⊄ r'   (15b)

where r' and s' are defined as follows:

µr'(u) = 1 if µr(u) ≥ 1 – α, µr(u) if µr(u) ≤ 1 – β, linear in-between,
µs'(u) = 0 if µs(u) ≤ α, µs(u) if µs(u) ≥ β, linear in-between.

Example 11. Let us take relations r and s with respective schemas (A, X) and (B) and their extensions given in Tables 11a and 11b. If the values of the lower and upper bounds are α = 0.1 and β = 0.25, the qualitative approximate division (formula (14)) of r and s with Kleene-Dienes implication returns:
Table 11a. Relation r of example 11

    A     X    µ
    a1    y    0.8
    a2    y    1
    a3    y    0.4

Table 11b. Relation s of example 11

    B     µ
    a1    1
    a2    0.7
    a3    0.08
µqual-Si-app-div(r, s, A, B)(y) = min(1 ⇒K-D (0.8 + 0.033), 0.7 ⇒K-D 1, (0.08 – 0.08) ⇒K-D 0.4) = min(0.833, 1, 1) = 0.833,
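This computation can be checked with a small sketch of formula (14) (our code; α = 0.1, β = 0.25, Kleene-Dienes implication, data of Tables 11a and 11b):

```python
# Qualitative approximate division based on an S-implication, formula (14).
ALPHA, BETA = 0.1, 0.25

def kleene_dienes(p, q):
    return max(1.0 - p, q)

def delta1(p, a=ALPHA, b=BETA):
    # reduction of the divisor grade: p if p <= alpha, 0 if p >= beta
    if p <= a:
        return p
    if p >= b:
        return 0.0
    return a * (b - p) / (b - a)        # linear in-between

def delta2(q, a=ALPHA, b=BETA):
    # enlargement of the dividend grade: 1-q if q >= 1-alpha, 0 if q <= 1-beta
    if q >= 1 - a:
        return 1.0 - q
    if q <= 1 - b:
        return 0.0
    return a * (q - (1 - b)) / (b - a)  # linear in-between

def qual_si_app_div(r, s, x):
    return min(kleene_dienes(mu - delta1(mu),
                             r.get((a, x), 0.0) + delta2(r.get((a, x), 0.0)))
               for a, mu in s.items())

r = {("a1", "y"): 0.8, ("a2", "y"): 1.0, ("a3", "y"): 0.4}  # Table 11a
s = {"a1": 1.0, "a2": 0.7, "a3": 0.08}                      # Table 11b
print(round(qual_si_app_div(r, s, "y"), 3))  # 0.833
```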
while the result of the regular division would assign the grade 0.8 to y. It is easy to check that the Cartesian product of s' ({1/<a1>, 0.7/<a2>}) by the result is included (in Zadeh's sense) in r'.♦
Use of an R-Implication

When an R-implication comes into play, the general idea of qualitative exception (then of compensation) is the same as for S-implications. This is still a matter of intensity of the exceptions. Let us recall that any R-implication is completely satisfied if the conclusion attains the antecedent. Consequently, the intensity of an exception depends on the difference between µs(a) and µr(a, x). Intuitively, if this difference is positive but small enough, that is a low-intensity exception, which is somewhat tolerable. So, the definition of the qualitative approximate division in this case is analogous to formulas (7) and (14) (see Equation 16). The principle adopted here consists in splitting the compensation mechanism between the antecedent and the consequent part of the R-implication used. This general form is motivated by its similarity with what is done for S-implications in formula (14), where in fact both the divisor and the dividend are susceptible to be modified. Special cases are obtained letting δ = δ1, δ2 = 0, or δ = δ2, δ1 = 0; the latter choice is the one suggested in Bosc and Pivert (2006). Expressions similar to formulas (15a) and (15b) can be pointed out in order to characterize the result of this approximate division as a quotient.

Example 12. Let us take the fuzzy relations with the same schemas as in example 11 with the extensions of Tables 12a and 12b. The usual division of these two relations using Gödel implication delivers an empty result (since the tuple <a2, y> is absent from r).
Equation 16.

∀x ∈ project(r, X), µqual-Ri-app-div(r, s, A, B)(x) = mina ∈ supp(s) (µs(a) – δ1) ⇒R-i (µr(a, x) + δ2)   (16)

where δ = δ1 + δ2 = 0 if µs(a) – µr(a, x) ≥ β, µs(a) – µr(a, x) if µs(a) – µr(a, x) ≤ α, linear in-between.
Table 12a. Relation r of example 12

    A     X    µ
    a1    y    0.7
    a3    y    0.4

Table 12b. Relation s of example 12

    B     µ
    a1    1
    a2    0.1
    a3    0.6
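The computation of example 12 can be replayed with a short sketch of Equation (16) with Gödel implication (our code; the compensations δ are taken as they appear in the result given in the text, applied on the dividend side):

```python
# Equation (16) with Goedel implication on the data of Tables 12a/12b.
def godel(p, q):
    return 1.0 if p <= q else q

r = {("a1", "y"): 0.7, ("a3", "y"): 0.4}     # Table 12a
s = {"a1": 1.0, "a2": 0.1, "a3": 0.6}        # Table 12b
delta = {"a1": 0.0, "a2": 0.1, "a3": 0.05}   # compensations used in the text

grade = round(min(godel(s[a], r.get((a, "y"), 0.0) + delta[a])
                  for a in sorted(s)), 2)
print(grade)  # min(0.7, 1, 0.45) = 0.45
```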
t2 = {min(1 ⇒Gö 0.7, 0.1 ⇒Gö (0 + 0.1), 0.6 ⇒Gö (0.4 + 0.05))/<y>} = {0.45/<y>}.

Here (as well as in the previous example), as expected, the result delivered by the approximate division is a superset of that returned by the regular division.♦

Proximity-Based Approximate Division

This approach is suggested in Bosc et al. (2007) in the context of scalar continuous domains, where a pair of inverse operations, dilation and erosion, is defined. The key points of these operators are presented hereafter.

Principle

If one looks at formulas (14) and (16) used for the definition of the qualitative approximate division, it appears that the underlying graded inclusion obeys the following template:

deg(s ⊆app Ωr(x)) ⇔ deg(red(s) ⊆ enl(Ωr(x)))

where ⊆app denotes an approximate inclusion, red(F) is a reduction of F (thus delivering a subset of F), and enl(F) is an enlargement of F (thus delivering a superset of F). One can see the approximate inclusion as a regular inclusion whose arguments (fuzzy sets) are modified. This leads to think of a general family of approximate inclusions based on dilation and erosion operations, provided that these operations have a sound rationale. One such rationale is based on the use of a proximity relation expressing that elements of a domain are (more or less, if the relation is a graded one) close to each other from a semantic viewpoint. The dilation of F is naturally definable in terms of the adjunction of any element of the referential which is close to an element of F. This is the idea put forward in one of the motivating examples of the second section in order to consider that an element <x, a> missing in the dividend can be replaced by another pair <x, b> which is present and such that b is close to a. From this starting point, one can define the erosion as the inverse operation, such that: ero(dil(F)) = dil(ero(F)) = F.

Proximity-Based Dilation and Erosion

A proximity relation (Dubois, Hadjali, & Prade, 2001) is a fuzzy relation R on a scalar domain U, such that:

• ∀u ∈ U, R(u, u) = 1 (reflexivity),
• for any pair (u, v) ∈ U × U, R(u, v) = R(v, u) (symmetry).

The quantity R(u, v) can be viewed as a grade of "approximate equality" of u with v. An absolute proximity relation is an approximate equality relation E which can be modeled by a fuzzy relation of the form: E: U × U → [0, 1], (u, v) → E(u, v) = Z(u − v), which only depends on the value of the difference (u − v). Z, called a tolerance indicator, is a fuzzy interval centered in 0, such that:
• Z(r) = Z(−r), i.e., Z = −Z,
• Z(0) = 1,
• the support of Z is bounded and of the form [−A, A], where A is a positive real number.
In terms of a trapezoidal membership function (t.m.f.), the parameter Z can be expressed by (−z, z, δ, δ) with A = z + δ, and [−z, z] represents the core of Z. From a fuzzy set F on the scalar domain U and the absolute proximity relation E[Z], where Z is a tolerance indicator, it is possible to define a superset of F by dilation and a subset of F by erosion in the following way. The dilation operation delivers the set dil(F), which gathers the elements of F and those outside of F which are somewhat close to an element of F in the sense of the proximity relation E[Z]:

µdil(F, Z)(x) = supy ∈ U ⊤(µF(y), µE[Z](x, y)) = supy ∈ U ⊤(µF(y), µZ(x − y))   (17)
where ⊤ is a triangular norm. The erosion operation is defined as the inverse of the dilation:

µero(F, Z)(x) = infy ∈ U µE[Z](x, y) ⇒⊤ µF(y) = infy ∈ U µZ(x − y) ⇒⊤ µF(y)   (18)
where ⇒⊤ is the R-implication generated by the norm ⊤ used in formula (17). The set ero(F) involves any element of F such that all of its "neighbors," that is, those which are somewhat close to it, are in F.

Example 13. Let us consider the fuzzy set F over the real line represented by the t.m.f. (45, 50, 5, 5). This means that values of the interval [45, 50] are full members of F and that values between 45 and 40 on the one hand and 50 and 55 on the other hand have linearly decreasing membership grades in F. Let us take the proximity relation represented by (−1, 1, 3, 3), which expresses that if |u − v| is at most 1, u and v are completely close to each other and that if |u − v| is between 1 and 4, they are somewhat neighbors. Using the norm minimum in (17) and Gödel implication in (18), one has:

F1 = dil(F, Z) = (44, 51, 8, 8), F2 = ero(F, Z) = (46, 49, 2, 2).

It is easy to check that: ero(F1, Z) = dil(F2, Z) = F.♦
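For trapezoidal membership functions, and with the choices of example 13 (norm min in (17), Gödel implication in (18)), dilation and erosion amount to widening or narrowing the core and the spreads; a sketch (our code, under those assumptions):

```python
# Dilation/erosion of a t.m.f. (a, b, l, r): core [a, b], support
# [a - l, b + r]. With tolerance indicator Z = (-z, z, d, d), the norm min
# and Goedel implication, they shift the core by z and the spreads by d.

def dilate(tmf, z, d):
    a, b, l, r = tmf
    return (a - z, b + z, l + d, r + d)

def erode(tmf, z, d):
    a, b, l, r = tmf
    return (a + z, b - z, l - d, r - d)

F = (45, 50, 5, 5)      # fuzzy set of example 13
print(dilate(F, 1, 3))  # (44, 51, 8, 8) = F1
print(erode(F, 1, 3))   # (46, 49, 2, 2) = F2

# on the real line, erosion inverts dilation:
assert erode(dilate(F, 1, 3), 1, 3) == F == dilate(erode(F, 1, 3), 1, 3)
```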
Towards a Proximity-Based Approximate Division

The concepts of dilation and erosion presented before can serve as a basis for a new kind of tolerant division, provided that (at least) one proximity relation over the domain subject to the division is available. Due to the fact that in practice the domains used for divisions are rarely numeric, it is mandatory to go beyond the case of scalar domains and to work with any type of domain. Moreover, it must be noticed that with discrete domains, the property that dilation and erosion are inverses of each other is generally lost. If r and s are two fuzzy relations of respective schemas R(A, X) and S(B), two basic types of proximity-based approximate divisions of r(A, X) by s(B) can be devised (combinations of them can also be thought of), one where the divisor is eroded:

µprox-app-div1(r, s, A, B)(x) = mina ∈ supp(s) µero(s)(a) ⇒f µr(a, x)   (19)

and the other where the dividend is dilated:

µprox-app-div2(r, s, A, B)(x) = mina ∈ supp(s) µs(a) ⇒f µdil(r)(a, x)   (20)

These two approaches are not equivalent, and they have fairly different semantics. In the first
Versatility of Fuzzy Sets for Modeling Flexible Queries
case, the idea is to keep in the (final) divisor only the values which are "strong representatives" (in the sense that the initial divisor contains all of their close elements according to the proximity relation). The approximate division of r(A, X) by s(B) then looks for the X-values associated in r with all the "strong" members of the divisor. The more demanding the fuzzy implication used in formula (18), the smaller the final divisor obtained, and thus the greater the result of the approximate division. In the second situation, the dividend is extended with all the elements somewhat close (in the sense of the proximity relation) to the values initially present, thus compensating for initially missing elements, which are introduced thanks to the proximity mechanism. The approximate division of r(A, X) by s(B) then looks for the X-values associated in r with all the elements of the divisor or their substitutes. The less demanding the norm used in formula (17), the larger the final dividend obtained, and thus the greater the result of the approximate division. Due to the very nature of formulas (19) and (20), the proximity-based approximate division returns a quotient. In effect, one may observe that this division comes down to a regular division where either the divisor or the dividend is modified. For instance, the result t obtained with formula (19) can be characterized as a quotient since it is obviously maximal and the Cartesian product of ero(s) and any element x of t is included in r, provided that the conjunction used is the generator of the fuzzy implication used. Example 14. Let us consider the proximity relation defined over colors shown in Table 13 and the fuzzy relations: r = {1/
A tuple d/
Table 13. The proximity relation of example 14

             red    carmine    vermillion    orange
red          1      0.7        0.2           0
carmine      0.7    1          0             0
vermillion   0.2    0          1             0.8
orange       0      0          0.8           1
two tuples of s represent reference colors which are of interest for the user. Using Gödel implication, the erosion of s returns the empty relation, whereas if Lukasiewicz implication (p ⇒Lu q = 1 if p ≤ q, 1 – p + q otherwise) is used, the result is: ero(s) = {0.3/carmine, 0.2/orange}.
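The erosion step and formula (19) can be sketched on discrete domains as follows. The proximity degrees are those of Table 13; since the chapter's relations r and s are not fully reproduced here, the divisor s and the dividend r below are invented (s is chosen so that its Lukasiewicz erosion matches the stated result {0.3/carmine, 0.2/orange}).

```python
# Discrete-domain sketch of erosion and of the first proximity-based
# approximate division (formula (19)). Relations r and s are assumed examples.

COLORS = ['red', 'carmine', 'vermillion', 'orange']
PROX = {('red', 'carmine'): 0.7, ('red', 'vermillion'): 0.2,
        ('vermillion', 'orange'): 0.8}          # Table 13, nonzero off-diagonal entries

def prox(a, b):
    # symmetric, reflexive proximity relation
    if a == b:
        return 1.0
    return PROX.get((a, b), PROX.get((b, a), 0.0))

def godel(p, q):
    return 1.0 if p <= q else q

def lukasiewicz(p, q):
    return 1.0 if p <= q else 1.0 - p + q

def erode(s, imp):
    """mu_ero(s)(a) = min_b  prox(a, b) => mu_s(b)."""
    return {a: min(imp(prox(a, b), s.get(b, 0.0)) for b in COLORS) for a in s}

def div1(r, s, imp):
    """Formula (19): divide r by the erosion of s."""
    es = erode(s, imp)
    xs = {x for (_, x) in r}
    return {x: min(imp(es[a], r.get((a, x), 0.0)) for a in es) for x in xs}

s = {'carmine': 1.0, 'orange': 1.0}                       # assumed divisor
r = {('carmine', 'x'): 0.8, ('orange', 'x'): 1.0,         # assumed dividend
     ('carmine', 'y'): 0.4}
```

With Gödel implication, erode(s) comes out with all grades 0 (the empty relation), as stated in the example; with Lukasiewicz, it is {0.3/carmine, 0.2/orange}, so x fully satisfies the division while y satisfies it only partially.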
The dilation of r with the norm minimum (the largest one) yields: dil(r) = {1/
The proximity-based approximate division of r by the previous erosion of s with Gödel (respectively Lukasiewicz) implication in formula (19) delivers an empty result (respectively t1 = {0.8/
Conclusion

In this chapter, the use of fuzzy sets for modeling flexible queries has been advocated. After a brief overview of the algebraic framework that can be used to this end and of an SQL-like query language, the focus has been put on the division operation. This operator has the advantage of being nontrivial, thus opening the door to a variety of extensions intended to overcome the limitations sometimes observed when regular queries are used, in particular the "all or nothing" effect. Beyond the case where the operand relations become fuzzy due to the use of fuzzy predicates to define the divisor and/or the dividend, diverse types of approximate division operators have been envisaged. The key idea is to introduce some tolerance with respect to the usual crisp requirement of connection between the set of values associated with a value of the dividend and the set of values present in the divisor. One line is to allow for exceptions, either in a quantitative or a qualitative way. Another approach is to introduce a proximity relation used to erode the divisor or to dilate the dividend.

Throughout the chapter, particular attention has been paid to whether the result produced by the various families of extended divisions qualifies as a quotient. This property has been sought in a reasonable way, in the sense that the characterization formulas remain close to the usual ones and merely account for the changes made in the division. Furthermore, the fact that this property is not satisfied has led us to reject some lines of extension. Future research mainly concerns: (1) the implementation of these operators and the measurement of their performance compared with the usual division, and (2) the design of new semantically founded extensions, for instance, mixing quantitative and qualitative approaches to exceptions or refining the erosion and dilation mechanisms.
References

Börzsönyi, S., Kossmann, D., & Stocker, K. (2001). The Skyline operator. Paper presented at the 17th International Conference on Data Engineering (pp. 421-430).

Bosc, P., Dubois, D., Pivert, O., & Prade, H. (1997). Flexible queries in relational databases: The example of the division operator. Theoretical Computer Science, 171, 281-301.

Bosc, P., Hadjali, A., & Pivert, O. (2007). On a proximity-based tolerant inclusion. Paper presented at the Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT'07).

Bosc, P., & Liétard, L. (2004). Non monotonic aggregates applying to fuzzy sets in flexible querying. Paper presented at the Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (pp. 11-18).

Bosc, P., Liétard, L., & Pivert, O. (2003). Sugeno fuzzy integral as a basis for the interpretation of flexible queries involving monotonic aggregates. Information Processing and Management, 39, 287-306.

Bosc, P., & Pivert, O. (1992). Some approaches to relational databases flexible querying. Journal of Intelligent Information Systems, 1, 323-354.

Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17.

Bosc, P., & Pivert, O. (2006). On a qualitative approximate inclusion: Application to the division of fuzzy relations. Paper presented at the International Workshop on Flexible Database and Information Systems Technology (FlexDBIST'06) (pp. 430-434).

Bosc, P., Pivert, O., & Rocacher, D. (2007). About quotient and division of crisp and fuzzy relations. Journal of Intelligent Information Systems, 29, 185-210.

Bouchon-Meunier, B., & Yao, J. (1992). Linguistic modifiers and imprecise categories. International Journal of Intelligent Systems, 7, 25-36.

Bruno, N., Chaudhuri, S., & Gravano, L. (2002). Top-k selection queries over relational databases: Mapping strategies and performance evaluation. ACM Transactions on Database Systems, 27, 153-187.

Chang, C. L. (1982). Decision support in an imperfect world (Research Rep. No. RJ3421). San José, CA: IBM.

Chomicki, J. (2003). Preference formulas in relational queries. ACM Transactions on Database Systems, 28, 427-466.

Cubero, J. C., Medina, J. M., Pons, O., & Vila, M. A. (1994). The generalized selection: An alternative way for the quotient operations in fuzzy relational databases. Paper presented at the Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (pp. 23-30).

Dubois, D., Hadjali, A., & Prade, H. (2001). Fuzzy qualitative reasoning with words. In P. P. Wang (Ed.), Computing with words (vol. 3, pp. 347-366). John Wiley & Sons.

Dubois, D., Nakata, M., & Prade, H. (2000). Extended divisions for flexible queries in relational databases. In O. Pons, M. A. Vila, & J. Kacprzyk (Eds.), Knowledge management in fuzzy databases (pp. 105-121). Physica-Verlag.

Dubois, D., & Prade, H. (1984). A theorem on implication functions defined from triangular norms. Stochastica, 8, 267-279. Also in D. Dubois, H. Prade, & R. R. Yager (Eds.). (1993). Readings in fuzzy sets for intelligent systems (pp. 105-112). Morgan Kaufmann.

Dubois, D., & Prade, H. (1986). Weighted minimum and maximum operations in fuzzy set theory. Information Sciences, 39, 205-210.

Dubois, D., & Prade, H. (1996). Semantics of quotient operators in fuzzy relational databases. Fuzzy Sets and Systems, 78, 89-94.

Fodor, J., & Yager, R. R. (1999). Fuzzy-set theoretic operators and quantifiers. In D. Dubois & H. Prade (Eds.), Fundamentals of fuzzy sets: The handbook of fuzzy sets series (pp. 125-193). Kluwer Academic Publishers.

Friedman, J. H., Baskett, F., & Shustek, L. J. (1975). An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 1001-1006.

Galindo, J., Medina, J. M., Cubero, J. C., & Garcia, M. T. (2001). Relaxing the universal quantifier of the division in fuzzy relational databases. International Journal of Intelligent Systems, 16, 713-742.

Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Idea Group Publishing.

Ichikawa, T., & Hirakawa, M. (1986). ARES: A relational database with the capability of performing flexible interpretation of queries. IEEE Transactions on Software Engineering, 12, 624-634.

Kacprzyk, J., & Ziolkowski, A. (1986). Database queries with fuzzy linguistic quantifiers. IEEE Transactions on Systems, Man and Cybernetics, 16, 474-478.

Kerre, E. E., & Liu, Y. (1998). An overview of fuzzy quantifiers: Interpretations. Fuzzy Sets and Systems, 95, 1-22.

Kießling, W., & Köstler, G. (2002). Preference SQL: Design, implementation, experiences. Paper presented at the 28th Conference on Very Large Data Bases (pp. 990-1001).

Lacroix, M., & Lavency, P. (1987). Preferences: Putting more knowledge into queries. Paper presented at the 13th Conference on Very Large Data Bases (pp. 217-225).

Motro, A. (1988). VAGUE: A user interface to relational databases that permits vague queries. ACM Transactions on Office Automation Systems, 6, 187-214.

Mouaddib, N. (1993). The nuanced relational division. Paper presented at the 2nd IEEE International Conference on Fuzzy Systems (pp. 419-424).

Sugeno, M. (1974). Theory of fuzzy integrals and its applications. Doctoral thesis, Tokyo Institute of Technology.

Tahani, V. (1977). A conceptual framework for fuzzy query processing: A step toward very intelligent database systems. Information Processing and Management, 13, 289-303.

Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy-relational model of fuzzy data. Journal of Intelligent Information Systems, 3, 7-28.

Yager, R. R. (1988). On ordered weighted averaging operators in multicriteria decisionmaking. IEEE Transactions on Systems, Man and Cybernetics, 18, 183-190.

Yager, R. R. (1991). Fuzzy quotient operators for fuzzy relational databases. Paper presented at the International Fuzzy Engineering Symposium (pp. 289-296).

Zadeh, L. A. (1983). A computational approach to fuzzy quantifiers in natural languages. Computer Mathematics with Applications, 9, 149-183.
Key Terms

Approximate Division: Extended version of the division where some idea of tolerance is introduced.
Flexible Querying: Approach where users include preferences in their queries so as to get a result made of discriminated elements.

Fuzzy Implication: Operator generalizing the usual material implication, whose arguments and result are valued in the unit interval [0, 1].

Fuzzy Querying: Fuzzy set-based querying approach where each element of the result is assigned a degree of satisfaction valued in the unit interval [0, 1].

Fuzzy Relation: Relation whose members have a grade of membership expressing the extent to which they comply with the concept conveyed by the relation.

Proximity Relation: Binary relation expressing the extent to which two values are approximately equal.

Regular Relational Database: Database where information is precise and modeled in a relational way.

Relational Division: Binary operation whose arguments are relations with a common attribute and which delivers a result having the property of a quotient.
Chapter VII
Flexible Querying Techniques Based on CBR

Guy de Tré, Ghent University, Belgium
Marysa Demoor, Ghent University, Belgium
Bert Callens, Ghent University, Belgium
Lise Gosseye, Ghent University, Belgium
Abstract

In case-based reasoning (CBR), a new untreated case is compared to cases that have been treated earlier, after which data from the similar cases (if found) are used to predict the corresponding unknown data values for the new case. Because case comparisons will seldom result in an exact-similarity matching of cases and the conventional CBR approaches do not efficiently deal with such imperfections, more advanced approaches that adequately cope with these imperfections can help to enhance CBR. Moreover, CBR in its turn can be used to enhance flexible querying. In this chapter, we describe how fuzzy set theory can be used to model a gradation in similarity of the cases and how the inevitable uncertainty that occurs when predictions are made can be handled using possibility theory, resulting in what we call flexible CBR. Furthermore, we present how and under which conditions flexible CBR can be used to enhance flexible querying of regular databases.
Introduction

In case-based reasoning (CBR), knowledge is deduced from the characteristics of a collection of past cases, rather than induced from a set of knowledge rules that are stored in a knowledge
base. In this way, CBR can be applied to find the solution of a given problem on the basis of the known solutions to similar problems. Of course, this can only be done if it holds that: ‘Similar problems have similar solutions’
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
This statement is the underlying hypothesis of CBR (Aamodt & Plaza, 1994). Without this hypothesis, CBR cannot be used. Starting from this hypothesis, the problem of finding the solution or outcome of a new case is solved by matching the characteristics of this case against those of similar cases. The result of this matching process then allows predicting the solution or outcome. At a later stage, when more information becomes available, the predictions could be tested as to their correctness, and if necessary, the process to find similar cases could be revised. The hypothesis thus allows solving problems without necessitating the explicit modeling of expert knowledge because in CBR this knowledge is implicit in the solution. CBR techniques have been successfully applied in many application fields. Under specific circumstances, they can also be applied to enhance and enrich database querying (de Calmès, Dubois, Hüllermeier, Prade, & Sedes, 2003; Ellman, 1995; Shimazu, Kitano, & Shibata, 1993). This chapter deals with the application of CBR in order to enhance the querying and accessibility of regular databases. A precondition for applying CBR techniques in database querying is that the database must contain comparable descriptions of real cases that all relate to the same topic and have common characteristics. In the remainder of the chapter, it is, without a loss of generality, assumed that each case description consists of attribute values that each describe a characteristic of the case. For new cases, some of the attribute values might be unknown. This can, for example, be due to the fact that the case was not completely handled/described at the time when it was initially stored in the database.
If the underlying hypothesis of CBR holds for the problem of value prediction for an attribute, CBR techniques can also be used to predict the unknown values for that attribute. This capability to predict unknown attribute values can be fully exploited to enhance flexible querying of both regular and fuzzy databases. If users are looking for case descriptions for which an attribute takes a given value or has a value that is within a given range of values, then value
prediction also allows finding those cases which might have the requested value in the future. Of course, this kind of flexibility requires an enriched querying mechanism that allows the modeling of uncertain query results (uncertain because they are obtained from a prediction). Because similarity between two cases is rarely a matter of all or nothing, but rather a matter of degree, such an enriched CBR mechanism or CBR-based querying mechanism should also support the modeling of imprecision stemming from case comparison. Fuzzy set theory (Dubois & Prade, 2000; Pedrycz & Gomide, 1998; Zadeh, 1965) can be used to model such kinds of imprecision. This is especially the case because there is a close connection between fuzzy-set based approximate reasoning and the underlying inference principle of CBR (Dubois, Esteva, Garcia, Godo, Lopez de Mantaras, & Prade, 1998; Yager, 1997). Moreover, using fuzzy set theory also has the advantage that the related possibility theory (Dubois & Prade, 1988; Zadeh, 1978) can be used to model the uncertainty that is inherent to prediction (Dubois, Hüllermeier, & Prade, 2000). In this chapter, we describe an enhanced CBR-based approach for flexible querying of regular databases. The approach is based on fuzzy set theory and possibility theory and enables the prediction of unknown attribute values of case descriptions that are inserted in a database system, on condition that the underlying hypothesis of CBR holds in the context of the prediction problem. As an example, consider a regular relational database in which information about juridical complaints, as registered by lawyers after interaction with the aggrieved party, is stored. In a juridical context, it holds that 'similar complaints must be dealt with in a similar way' and thus, by consequence, the underlying hypothesis of CBR also holds.
The user has to select the 'descriptive' attributes on which the case comparison will be based. In the complaint database, the 'descriptive' attributes could be the attributes that are used to classify the complaint, the age of the victim, the gender of the victim, the address of the victim, the job of the victim, the marital status of the victim, and so forth. The values of the descriptive attributes will be used in the search process for similar cases. The approach itself is also flexible because a flexible similarity range is associated with each 'descriptive' attribute at the initialization of the approach. Such a similarity range restricts the domain values that are considered to be close enough to an attribute value. For example, a range '±3' could be associated with the 'descriptive' attribute representing the age of the victim. This means that all cases for which the age differs by not more than 3 years from the age in the case under consideration will be considered as having a similar attribute value for 'age of the victim.' Furthermore, a weight is associated with each 'descriptive' attribute. Together, these weights indicate the relative importance of the attributes within the comparison process. The attributes 'classification' and 'gender of the victim' could be modeled so that they are more important than the attribute 'age of the victim' for case comparison purposes. Based on the results of the case comparison process, the prediction process predicts the selected 'predictable' unknown attribute values of the case under consideration. In the example, a 'predictable' attribute could be the attribute that describes the judicial decision. Finally, there is some revision mechanism that compares the predicted values with the actual values when these become available. Revision information is then used to relax or to strengthen the similarity ranges in order to improve predictions for new cases. Although all phases of our approach, namely, case description, case comparison, prediction, and revision, are described in the chapter, the main focus is on the case comparison phase, as this is the phase where our approach mainly distinguishes itself from other approaches. The remainder of the chapter is organized as follows.
In the next section, some preliminaries are set. A short overview of related work and the current state of the art is given. In A Flexible CBR Approach for Information Retrieval section, we describe the different phases and aspects of our CBR
approach. In the Enhancing Flexible Database Querying section, we describe how flexible database querying can be extended with CBR techniques. In A Real-World Application: The Gender Claim Database section, the practical usefulness of our approach is illustrated by means of a brief description of a real-world application. Thus, attention is paid to the description of the application field, the problem description, and the proposed solution to the problem. Finally, the achieved results are summarized, some conclusions are given, and some ideas for future research are discussed.
Some Preliminaries

Current State of the Art and Related Work

CBR is a methodology which solves new problems by investigating, adapting, and reusing solutions to previously solved, similar problems. As such, CBR has been applied in a wide range of real-world applications in fields such as text retrieval, health sciences, system diagnosis, Web searches, and database querying. Of special interest with respect to the work presented in this chapter is the initial work on the enhancement of database querying, which is, for example, proposed by Shimazu et al. (1993), where the tuples in the result set of a relational database query are ordered on the basis of their similarity with a given target tuple, and by Ellman (1995), where CBR techniques are applied in an object-oriented telecommunication service database to find service objects that match users' service requirements. Most of the applications, especially database applications, deal with real-world information which is often imperfect due to, for example, imprecision, uncertainty, and incompleteness. As a consequence, the search for similar problems in CBR often needs to cope with imperfections. Recently, the CBR community has recognised that advanced techniques for efficiently handling imperfections will be beneficial for the development of better CBR methods (Richter, 2006). Work has already been done studying the use of Dempster-Shafer theory (Richter, 1995) and the use of probability theory (Faltings, 1997) in CBR. Furthermore, some pioneering work on the applicability of fuzzy set theory for dealing with imperfections in CBR has been presented (Dubois et al., 1998; Plaza, Esteva, Garcia, Godo, & López de Màntaras, 1996; Yager, 1997). In Dubois et al. (2000) and de Calmès et al. (2003), an approach to enhance flexible database querying with CBR techniques is presented. Here, fuzzy set theory is used to handle imprecision, and possibility theory is used to model uncertainty. The work presented in this chapter builds further on these results by refining the CBR process and using extended possibilistic truth values (EPTVs) instead of possibility and necessity measures to model uncertainty in the result sets of flexible database queries. The latter allows for explicitly modeling the situations where cases exist in which some attribute values are missing because they do not exist or apply.
EPTVs

The concept 'extended possibilistic truth value' is defined as an extension of the concept 'possibilistic truth value' (PTV), which was originally introduced by Prade (1982) and further developed in de Cooman (1995) and de Cooman (1999). EPTVs provide an epistemological representation of the truth of a proposition, which allows reflecting knowledge about the actual truth. Their semantics is defined in terms of a possibility distribution (de Tré, 2002). In the remainder of this subsection, we describe the basics of EPTVs because these are used as the underlying framework for flexible querying in the section titled Enhancing Flexible Database Querying.

Consider the three truth values 'T' (true), 'F' (false), and '⊥' (undefined). With the understanding that $P$ represents the universe of all propositions and $\tilde{\wp}(I^*)$ denotes the set of all ordinary fuzzy sets in the universe $I^* = \{T, F, \bot\}$, the EPTV $\tilde{t}^*(p)$ of a proposition $p \in P$ is formally defined by the mapping:

$$\tilde{t}^*: P \to \tilde{\wp}(I^*) \qquad (1)$$

which associates a fuzzy set $\tilde{t}^*(p)$ with each $p \in P$. The semantics of the associated fuzzy set $\tilde{t}^*(p)$ are defined in terms of a possibility distribution, with the understanding that:

$$t^*: P \to I^* \qquad (2)$$

is the mapping which associates the value T with p if p is true, the value F with p if p is false, and the value ⊥ with p if (some of) the elements of p are not applicable, undefined, or not supplied. This means that:

$$(\forall x \in I^*)(\pi_{t^*(p)}(x) = \mu_{\tilde{t}^*(p)}(x)), \text{ that is, } (\forall p \in P)(\pi_{t^*(p)} = \tilde{t}^*(p)) \qquad (3)$$

where $\pi_{t^*(p)}(x)$ denotes the possibility that the value of $t^*(p)$ conforms to $x$, and $\mu_{\tilde{t}^*(p)}(x)$ is the membership grade of $x$ within the fuzzy set $\tilde{t}^*(p)$. In general, an EPTV has the following format:

$$\tilde{t}^*(p) = \{(T, \mu_{\tilde{t}^*(p)}(T)), (F, \mu_{\tilde{t}^*(p)}(F)), (\bot, \mu_{\tilde{t}^*(p)}(\bot))\} \qquad (4)$$

Hereby $\mu_{\tilde{t}^*(p)}(T)$ denotes the possibility that p is true, $\mu_{\tilde{t}^*(p)}(F)$ the possibility that p is false, and $\mu_{\tilde{t}^*(p)}(\bot)$ the possibility that some elements of p are not applicable, undefined, or not supplied. An EPTV $\tilde{t}^*(p)$ is normalized if at least one of the membership grades $\mu_{\tilde{t}^*(p)}(T)$, $\mu_{\tilde{t}^*(p)}(F)$, and $\mu_{\tilde{t}^*(p)}(\bot)$ is equal to 1. Due to the possibilistic interpretation, normalization implies that at least one of the considered truth values should be completely possible. Special cases of EPTVs are presented in Table 1:
Table 1. Special cases of EPTVs

  $\tilde{t}^*(p)$                  Interpretation
  {(T, 1)}                          p is true
  {(F, 1)}                          p is false
  {(T, 1), (F, 1)}                  p is unknown
  {(⊥, 1)}                          p is inapplicable
  {(T, 1), (F, 1), (⊥, 1)}          information about p is not available
These cases are verified as follows:

• If it is completely possible that the proposition is true and no other truth values are possible, then it means that the proposition is true.
• If it is completely possible that the proposition is false and no other truth values are possible, then it means that the proposition is false.
• If it is completely possible that the proposition is true, it is completely possible that the proposition is false, and it is not possible that the proposition is inapplicable, then it means that the proposition is applicable, but unknown. This truth value will shortly be called 'unknown.'
• If it is completely possible that the proposition is inapplicable and no other truth values are possible, then it means that the proposition is inapplicable.
• If all truth values are completely possible, then this means that no information about the truth of the proposition is available. The proposition might be inapplicable, but might also be true or false. This truth value will shortly be called 'unavailable.'
This interpretation and verification is in accordance with the findings of Umano and Fukami (1994).
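The rules for NOT, AND, and OR given below can be derived mechanically. The sketch that follows uses an assumed representation of EPTVs as dictionaries mapping the truth values to possibility grades, applies Zadeh's extension principle to the strong Kleene operators, and reproduces, for instance, the collapse of 'unknown AND false' to 'false.'

```python
# EPTV connectives via the extension principle over strong Kleene logic
# (illustrative representation; dictionaries map 'T'/'F'/'⊥' to grades).

def kleene_and(x, y):
    if 'F' in (x, y):
        return 'F'
    return '⊥' if '⊥' in (x, y) else 'T'

def kleene_or(x, y):
    if 'T' in (x, y):
        return 'T'
    return '⊥' if '⊥' in (x, y) else 'F'

def kleene_not(x):
    return {'T': 'F', 'F': 'T', '⊥': '⊥'}[x]

I = ('T', 'F', '⊥')

def ext_unary(op, v):
    """Extension principle for a unary Kleene operator."""
    out = {w: 0.0 for w in I}
    for x in I:
        out[op(x)] = max(out[op(x)], v.get(x, 0.0))
    return out

def ext_binary(op, u, v):
    """Extension principle for a binary Kleene operator (min for conjunction of grades)."""
    out = {w: 0.0 for w in I}
    for x in I:
        for y in I:
            out[op(x, y)] = max(out[op(x, y)], min(u.get(x, 0.0), v.get(y, 0.0)))
    return out

unknown = {'T': 1.0, 'F': 1.0}     # the 'unknown' EPTV of Table 1
true_ = {'T': 1.0}
false_ = {'F': 1.0}
```

Evaluating `ext_binary(kleene_and, unknown, false_)` yields the EPTV of a false proposition, matching the truth-functional behaviour described in the text.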
New propositions can be constructed from existing propositions using logical operators. A unary operator '$\tilde{\neg}$' is provided for the negation (NOT) of a proposition, and binary operators '$\tilde{\wedge}$,' '$\tilde{\vee}$,' '$\tilde{\Rightarrow}$,' and '$\tilde{\Leftrightarrow}$' are respectively provided for the conjunction (AND), disjunction (OR), implication (IF THEN), and equivalence (IFF) of propositions. The arithmetic rules to calculate the EPTV of a composite proposition and the algebraic properties of EPTVs are presented in de Tré (2002). The rules for negation, conjunction, and disjunction can be summarized as:

• Rule for negation: $\forall p \in P: \tilde{t}^*(\text{NOT } p) = \tilde{\neg}(\tilde{t}^*(p))$, where

$$\tilde{\neg}: \tilde{\wp}(I^*) \to \tilde{\wp}(I^*): \tilde{V} \mapsto \tilde{\neg}(\tilde{V}) \qquad (5)$$

is defined by
  o $\mu_{\tilde{\neg}(\tilde{V})}(T) = \mu_{\tilde{V}}(F)$
  o $\mu_{\tilde{\neg}(\tilde{V})}(F) = \mu_{\tilde{V}}(T)$
  o $\mu_{\tilde{\neg}(\tilde{V})}(\bot) = \mu_{\tilde{V}}(\bot)$

• Rule for conjunction: $\forall p, q \in P: \tilde{t}^*(p \text{ AND } q) = \tilde{t}^*(p) \,\tilde{\wedge}\, \tilde{t}^*(q)$, where

$$\tilde{\wedge}: \tilde{\wp}(I^*) \times \tilde{\wp}(I^*) \to \tilde{\wp}(I^*): (\tilde{U}, \tilde{V}) \mapsto \tilde{U} \tilde{\wedge} \tilde{V} \qquad (6)$$

is defined by
  o $\mu_{\tilde{U}\tilde{\wedge}\tilde{V}}(T) = \min(\mu_{\tilde{U}}(T), \mu_{\tilde{V}}(T))$
  o $\mu_{\tilde{U}\tilde{\wedge}\tilde{V}}(F) = \max(\min(\mu_{\tilde{U}}(F), \mu_{\tilde{V}}(T)), \min(\mu_{\tilde{U}}(F), \mu_{\tilde{V}}(F)), \min(\mu_{\tilde{U}}(F), \mu_{\tilde{V}}(\bot)), \min(\mu_{\tilde{U}}(T), \mu_{\tilde{V}}(F)), \min(\mu_{\tilde{U}}(\bot), \mu_{\tilde{V}}(F)))$
  o $\mu_{\tilde{U}\tilde{\wedge}\tilde{V}}(\bot) = \max(\min(\mu_{\tilde{U}}(T), \mu_{\tilde{V}}(\bot)), \min(\mu_{\tilde{U}}(\bot), \mu_{\tilde{V}}(T)), \min(\mu_{\tilde{U}}(\bot), \mu_{\tilde{V}}(\bot)))$

• Rule for disjunction: $\forall p, q \in P: \tilde{t}^*(p \text{ OR } q) = \tilde{t}^*(p) \,\tilde{\vee}\, \tilde{t}^*(q)$, where

$$\tilde{\vee}: \tilde{\wp}(I^*) \times \tilde{\wp}(I^*) \to \tilde{\wp}(I^*): (\tilde{U}, \tilde{V}) \mapsto \tilde{U} \tilde{\vee} \tilde{V} \qquad (7)$$

is defined by
  o $\mu_{\tilde{U}\tilde{\vee}\tilde{V}}(T) = \max(\min(\mu_{\tilde{U}}(T), \mu_{\tilde{V}}(T)), \min(\mu_{\tilde{U}}(T), \mu_{\tilde{V}}(F)), \min(\mu_{\tilde{U}}(T), \mu_{\tilde{V}}(\bot)), \min(\mu_{\tilde{U}}(F), \mu_{\tilde{V}}(T)), \min(\mu_{\tilde{U}}(\bot), \mu_{\tilde{V}}(T)))$
  o $\mu_{\tilde{U}\tilde{\vee}\tilde{V}}(F) = \min(\mu_{\tilde{U}}(F), \mu_{\tilde{V}}(F))$
  o $\mu_{\tilde{U}\tilde{\vee}\tilde{V}}(\bot) = \max(\min(\mu_{\tilde{U}}(F), \mu_{\tilde{V}}(\bot)), \min(\mu_{\tilde{U}}(\bot), \mu_{\tilde{V}}(F)), \min(\mu_{\tilde{U}}(\bot), \mu_{\tilde{V}}(\bot)))$

These rules are obtained by applying Zadeh's (1975) extension principle to the operators of the strong three-valued Kleene logic (Resher, 1969). Kleene logics are truth-functional, which means that according to these systems, the behaviour of a logical operator is mirrored in a logical function combining Kleene truth values. Therefore, the extended truth value of every composed proposition can be calculated as a function of the extended truth values of its original propositions.

A Flexible CBR Approach for Information Retrieval

Like most CBR methodologies, the flexible CBR approach for information retrieval presented in this chapter consists of four main processes: case description, case comparison, prediction, and revision. These processes are all depicted in Figure 1 and briefly introduced below.

• Case description. Here, the so-called 'descriptive' attributes of the case descriptions are identified and their associated weights are determined. These are the attributes that are identified by the user as being relevant for the case comparison process. For each 'descriptive' attribute, a specification of the range of values that are considered to be compatible with the actual attribute value (similarity range) is given. To start, the user must provide the necessary initialization parameter values. Furthermore, the 'predictable' attributes must also be identified. These are the attributes for which no data are available in new cases and for which the data will be predicted by the prediction process.
• Case comparison. This process requires the description of a new case and is responsible for the retrieval of the relevant 'descriptive' attribute values of similar cases in the database. Here, the weights and similarity ranges provided by the case description process are applied: the database is queried for cases with
attribute values that are within the accepted similarity ranges. Next, the (global) similarity of each of the retrieved cases is calculated (aggregated), using the provided weights. Finally, the fuzzy set of similar cases is built on the basis of the global similarities. If no similar cases are found, this is communicated to the revision process.
• Prediction. Based on the data in the fuzzy set of similar cases, a prediction model is built for each of the 'predictable' attribute values. Each prediction model represents the predicted approximation of the future value of its associated 'predictable' attribute. These models are forwarded to the user and to the revision process.
• Revision. This process is activated when no similar cases are found in the case comparison or when the actual values for the attributes involved in the prediction process become available. The latter typically occurs when the case has been further processed by the users and the new data have been entered in the database. All extra information is processed. Eventually, a request to modify the parameter settings (similarity ranges and weights) is generated and sent to the case description process.

Figure 1. Processes involved in the presented CBR approach for information retrieval
[Diagram: the case description process, fed by initialization parameters and by revision, supplies weights and similarity ranges to case comparison; the input of a new case goes to case comparison, which queries the database and passes the similar cases to prediction; prediction outputs the predicted values; revision receives the actual values when they become available.]

In the following subsections, each of these processes is described in more detail and illustrated with some examples.

Case Description
Without loss of generality, we consider that the case database is a relational database (Codd, 1970), which means that its database schema consists of a finite number r of relations:
R_i(A_{i1}:T_{i1}, A_{i2}:T_{i2}, ..., A_{im_i}:T_{im_i}), 1 ≤ i ≤ r   (8)
where A_{ij}:T_{ij}, 1 ≤ j ≤ m_i are the attributes of the relation R_i, with A_{ij} being the attribute's name and T_{ij} being the data type of the attribute. The domain dom_{T_{ij}} of the data type T_{ij} defines the allowed values for the attribute. To simplify the notations, it is assumed that each attribute name A_{ij} is unique within the database. The relations R_i, 1 ≤ i ≤ r are interrelated with each other via foreign keys and together contain all the data in the case database.

With respect to the example of a relational database for juridical complaints that was briefly introduced in the introduction, the relation schemes in Exhibit 1 could, among others, be considered. The primary keys of the relations are respectively {ComplaintID} and {PersonID}. Relation 'Complaint' has two foreign keys, {VictimID} and {SuspectedPersonID}, that both refer to relation 'Person.'

For the case description process, the user has to select a finite number of attributes from the relations of the database scheme. These attributes must allow the system to identify and describe a case, and all data for these attributes must be available for the new case that will be used as input for the case comparison process. To distinguish them from the other attributes in the relations of the database, they are in this chapter called the 'descriptive' attributes of the case. A 'descriptive' attribute is either a regular attribute stored in the database or a derived attribute, whose value can be calculated from the regular attributes. For derived attributes, the calculation method for the values must be
Exhibit 1.
Complaint(ComplaintID:varchar, Category:varchar, Victim:varchar, SuspectedPerson:varchar, RegistrationDate:date, JudicialDecision:varchar, DecisionDate:date)
Person(PersonID:varchar, Name:varchar, Birthdate:date, Gender:char(1), Address:varchar, Job:varchar, MaritalStatus:varchar)
Exhibit 2. D = {Category:varchar, VictimAge: year(Registration) – year(Birthdate), Gender:char(1), Job:varchar, MaritalStatus:varchar}
Exhibit 3. P = {JudicialDecision:varchar, Duration: DecisionDate – RegistrationDate}
specified. After the identification, a finite set D of 'descriptive' attributes is obtained:

D = {A_1:e_1, A_2:e_2, ..., A_m:e_m}   (9)
where for each attribute A_j:e_j, 1 ≤ j ≤ m, A_j is the attribute's name and e_j either denotes the data type of the attribute or the expression to calculate the attribute's value. For the example database for juridical complaints, a possible choice for the set D could be as shown in Exhibit 2.

Additionally, the user must also indicate the attributes for which the future values must be predicted. This results in a finite set of 'predictable' attributes:

P = {A'_1:e'_1, A'_2:e'_2, ..., A'_p:e'_p}   (10)
As is the case with the 'descriptive' attributes, the 'predictable' attributes could also be derived. For example, the time it took to treat and finish a case can be irrelevant to the description of the case itself, but it might be very useful for predicting the possible duration of new cases. If this is the case,
a derived ‘predictable’ attribute ‘duration’ could be considered for which the values are obtained by calculating the time distance between the moment that a new case is entered in the database and the moment that the case is assigned a status ‘finished.’ For the example database for juridical complaints, a possible choice for the set P could be as shown in Exhibit 3.
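As a small illustration (a sketch of our own, not code from the chapter), such a derived 'Duration' value can be computed from the two date attributes:

```python
from datetime import date

def duration_days(registration: date, decision: date) -> int:
    """Derived 'Duration' attribute: DecisionDate - RegistrationDate,
    expressed as a number of days."""
    return (decision - registration).days

# A complaint registered on 2005-01-10 and decided on 2005-11-06:
duration_days(date(2005, 1, 10), date(2005, 11, 6))  # 300 days
```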
Similarity Ranges

For each attribute A:e ∈ D, the user must provide an initial similarity range Range_A:e. This range defines the acceptable values for the attribute, which are considered to be similar to the actual value v_A^new of the attribute in the description of the new case C^new. Clearly, all the allowed values must be elements of the domain dom_e of the data type of e. For a regular attribute A, this is the associated data type of A; for a derived attribute A, this is the data type of the result of the evaluation of the expression e. Because, in general, some of these allowed values could be considered more similar to v_A^new than others, the similarity ranges will be
defined by means of fuzzy set theory. As v_A^new is usually unknown at the moment of initialisation, it is important to provide relative range definitions, from which the absolute range can be derived at the moment that v_A^new is given. For the range determination, different situations are distinguished, depending on the data type of e:

• If e has an alphanumeric type or an enumeration type, the similarity range Range_A:e is defined as a fuzzy similarity relation over dom_e, that is:

Range_A:e : dom_e × dom_e → [0,1]   (11)

which satisfies the following properties:

  o Reflexivity. ∀x ∈ dom_e : Range_A:e(x, x) = 1.   (12)
  o Symmetry. ∀x, y ∈ dom_e : Range_A:e(x, y) = Range_A:e(y, x).   (13)
  o Transitivity. ∀x, z ∈ dom_e : Range_A:e(x, z) ≥ max{min(Range_A:e(x, y), Range_A:e(y, z)) | y ∈ dom_e}.   (14)

Hereby, the grade Range_A:e(x, y) denotes the degree of similarity between the domain values x and y (Plaza et al., 1996).

• If e has a numeric type, the similarity range Range_A:e is defined by a fuzzy set over the domain dom_e. For the sake of simplicity, only trapezoidal membership functions are considered. Such membership functions are determined by four parameters a, b, c, and d, where [b, c] defines the core of the membership function and [a, d] defines its support. Because of the need for a relative range definition, the range Range_A:e is defined by the four relative distances d(a, v_A^new), d(b, v_A^new), d(v_A^new, c), and d(v_A^new, d). Once the actual value of v_A^new is known, the absolute similarity range can be determined from these distances as illustrated in Figure 2. (With the notation Range_A:e(v_A^new, .), it is denoted that the membership function of Range_A:e is parameterised by v_A^new.)

Figure 2. Similarity ranges for numeric types (trapezoidal membership function Range_A:e(v_A^new, .) over dom_e, determined by the relative distances d(a, v_A^new), d(b, v_A^new), d(v_A^new, c), and d(v_A^new, d))

Note that, for the sake of simplicity, attributes with a text type are in this chapter excluded from the set D. In a more advanced approach, text attributes could also be allowed and the similarity between two texts could be measured by applying text comparison techniques as can be found in (fuzzy) information retrieval approaches (Baeza-Yates & Ribeiro-Neto, 1999).

The ranges for the 'descriptive' attributes 'Category:varchar,' 'Gender:char(1),' and 'VictimAge: year(Registration) – year(Birthdate)' of the example could be defined as:

1. Attribute 'Category:varchar':

   RangeCategory:varchar   verbal violence   stalking   sexual violence   injuries
   verbal violence         1                 0.6        0                 0
   stalking                0.6               1          0                 0
   sexual violence         0                 0          1                 0
   injuries                0                 0          0                 1

2. Attribute 'Gender:char(1)':

   RangeGender:char(1)   M   F
   M                     1   0
   F                     0   1

3. Attribute 'VictimAge: year(Registration) – year(Birthdate)':

   d(a, v_VictimAge^new) = 3, d(b, v_VictimAge^new) = 0, d(v_VictimAge^new, c) = 0, and d(v_VictimAge^new, d) = 3

This means that the fuzzy set has a triangular membership function with its top in the actual value v_VictimAge^new of the attribute in the new complaint description. With respect to similarity, a deviation of at most 3 years for this value is allowed.

Analogously as with the elements of D, a similarity range Range_A':e' is associated with each attribute A':e' ∈ P. These ranges will mainly be used to determine the similarity between the predicted values and the actual values, when these become known. The ranges for the 'predictable' attributes 'JudicialDecision:varchar' and 'Duration: DecisionDate – RegistrationDate' of the example could be defined as:

1. Attribute 'JudicialDecision:varchar':

   RangeJudicialDecision:varchar   insusceptible   fine   provisional imprisonment   effective imprisonment
   insusceptible                   1               0      0                          0
   fine                            0               1      0.6                        0.1
   provisional imprisonment        0               0.6    1                          0.3
   effective imprisonment          0               0.1    0.3                        1

2. Attribute 'Duration: DecisionDate – RegistrationDate':

   d(a, v_Duration^new) = 300, d(b, v_Duration^new) = 50, d(v_Duration^new, c) = 50, and d(v_Duration^new, d) = 300

This means that the fuzzy set has a trapezoidal membership function with, as central value, the new value v_Duration^new that will be obtained once the new complaint has been completely dealt with. A deviation of 50 days will be considered completely similar, whereas a deviation of 300 days is the maximum deviation allowed.
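The relative trapezoidal range definition and its instantiation for the 'VictimAge' example can be sketched in Python; the function names are illustrative, not taken from the chapter:

```python
def make_absolute_range(d_a, d_b, d_c, d_d):
    """Build Range_{A:e}(v_new, .) from the four relative distances
    d(a, v_new), d(b, v_new), d(v_new, c), and d(v_new, d)."""
    def range_for(v_new):
        # Absolute trapezoid parameters derived from the actual value.
        a, b = v_new - d_a, v_new - d_b   # left support / core bounds
        c, d = v_new + d_c, v_new + d_d   # right core / support bounds
        def mu(x):
            if x < a or x > d:
                return 0.0                    # outside the support [a, d]
            if b <= x <= c:
                return 1.0                    # inside the core [b, c]
            if x < b:
                return (x - a) / (b - a)      # rising edge
            return (d - x) / (d - c)          # falling edge
        return mu
    return range_for

# The 'VictimAge' example: d(a,v) = 3, d(b,v) = 0, d(v,c) = 0, d(v,d) = 3
# yields a triangular function with its top in the actual value.
victim_age_range = make_absolute_range(3, 0, 0, 3)
mu = victim_age_range(33)   # absolute range once v_new = 33 is known
```

With v_new = 33, mu yields grade 1 at 33 and roughly 0.66 and 0.33 at one and two years' deviation, matching the fuzzy set computed later in the chapter.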
Weights

Finally, the user must assign an importance weight w_A:e ∈ [0,1] to each attribute A:e ∈ D. This weight denotes the relative importance of the attribute within the similarity determination process. A weight w_A:e = 0 denotes 'not important at all,' whereas conversely w_A:e = 1 means 'fully important.' To make sense, the weights must satisfy the following semantic conditions (Dubois, Fargier, & Prade, 1997):

• In order to have an appropriate scaling, max_i w_{A_i:e_i} = 1 must hold.
• If w_A:e = 1, that is, the weight is fully important, and the actual value v_A^new of the new case is not 'similar' at all to the stored value v_A^stored of the case in the database, then the weight may not influence the similarity between v_A^new and v_A^stored.
• If w_A:e = 1, that is, the weight is fully important, and the actual value v_A^new is completely 'similar' to the stored value v_A^stored, then again the weight may not influence the similarity between v_A^new and v_A^stored.
• Last, if w_A:e = 0, that is, the weight is not important at all, then the result of the weighting should be the neutral element of the used aggregation operator. This is true (or completely similar) for conjunction (∧) and false (or not similar at all) for disjunction (∨).

For example, in order to reflect that the attributes 'Category:varchar' and 'Gender:char(1)' are more important than the other 'descriptive' attributes in the case comparison, the following weights could be associated with the elements of D:

Category: weight = 1, VictimAge: weight = 0.7, Gender: weight = 1, Job: weight = 0.7, MaritalStatus: weight = 0.7

As with the similarity ranges, a weight w_A':e' ∈ [0,1] must also be provided for each attribute A':e' ∈ P. For example, in order to reflect that the attribute 'JudicialDecision:varchar' is more important than the attribute 'Duration: DecisionDate – RegistrationDate' in the comparison for revision purposes, the following weights could be associated with the elements of P:

JudicialDecision: weight = 1 and Duration: weight = 0.4

Case Comparison

This process is responsible for the retrieval of similar cases from the database. To do this, it considers the 'descriptive' attributes:

A_1:e_1, A_2:e_2, ..., A_m:e_m   (15)

Because some of these attributes might be related to each other via a (primary key, foreign key)-relationship, there could be a one-to-many correspondence between their actual values. Therefore, in general, the actual values of the 'descriptive' attributes of a single case have to be modeled by a non-normalized (derived) relation:

R_D(A_1:e_1, A_2:e_2, ..., A_m:e_m)   (16)

with tuples (rows)

t_j = <v_{A_1,j}, v_{A_2,j}, ..., v_{A_m,j}>, 1 ≤ j ≤ n, where v_{A_i,j} ∈ dom_{e_i}, 1 ≤ i ≤ m   (17)

Together, these tuples model the case. Because all tuples in R_D relate to the same case, a set of relevant attribute values can be considered for each attribute A_i. These sets are defined by:

V_{A_i} = {v_{A_i,j} | 1 ≤ j ≤ n}, 1 ≤ i ≤ m   (18)

These considerations hold for each new case C^new, as well as for all the cases C_j, 1 ≤ j ≤ l stored in the database: for each attribute A_i, the sets of attribute values

V_{A_i}^{C^new}, resp. V_{A_i}^{C_j}, 1 ≤ i ≤ m, 1 ≤ j ≤ l

and their corresponding similarity ranges and weights

Range_{A_1:e_1}, Range_{A_2:e_2}, ..., Range_{A_m:e_m} and w_{A_1:e_1}, w_{A_2:e_2}, ..., w_{A_m:e_m}

can be determined. In our simplified juridical complaint database example, the normalized relation has the relation schema shown in Exhibit 4 and consists of the single tuple:

t = <'Stalking', 33, 'F', 'Housewife', 'Divorced'>

Because there are no one-to-many relationships in the database, this single tuple describes a new complaint C^new, and in this simple case, the sets of relevant attribute values are all singletons, that is:

V_Category^{C^new} = {'Stalking'}, V_VictimAge^{C^new} = {33}, V_Gender^{C^new} = {'F'}, V_Job^{C^new} = {'Housewife'}, and V_MaritalStatus^{C^new} = {'Divorced'}.
Similarity Between the Values for Individual Attributes

The first step in the case comparison is the determination of the similarity between the values of V_A^{C^new} and the values of V_A^{C_j} of the attribute A of a stored case C_j. To take into account the attribute's similarity range Range_A:e, the set V_A^{C^new} is replaced
Exhibit 4. R_D(Category:varchar, VictimAge: year(Registration) – year(Birthdate), Gender:char(1), Job:varchar, MaritalStatus:varchar)
by a fuzzy set Ṽ_A^{C^new}:

• If e has an alphanumeric type or an enumeration type:

Ṽ_A^{C^new} = {(x, max_{v ∈ V_A^{C^new}} Range_A:e(x, v)) | x ∈ dom_e ∧ max_{v ∈ V_A^{C^new}} Range_A:e(x, v) > 0}   (19)

For example, the fuzzy set for the alphanumeric attribute 'Category:varchar' becomes:

Ṽ_Category^{C^new} = {('stalking', 1), ('verbal violence', 0.6)}.

• If e has a numeric type, for each v ∈ V_A^{C^new}, the absolute similarity range S̃_v is determined from the relative distances of Range_A:e (cf. Figure 2). Ṽ_A^{C^new} is then obtained as the union of these ranges, that is:

Ṽ_A^{C^new} = {(x, max_{v ∈ V_A^{C^new}} S̃_v(x)) | x ∈ dom_e ∧ max_{v ∈ V_A^{C^new}} S̃_v(x) > 0}   (20)

Taking into account the corresponding ranges, the fuzzy set for the numeric attribute 'VictimAge: year(Registration) – year(Birthdate)' becomes:

Ṽ_VictimAge^{C^new} = {(31, 0.33), (32, 0.66), (33, 1), (34, 0.66), (35, 0.33)}.

Furthermore, the set V_A^{C_j} is replaced by its fuzzy counterpart:

Ṽ_A^{C_j} = {(x, 1) | x ∈ V_A^{C_j}}   (21)

The similarity between V_A^{C^new} and V_A^{C_j} is calculated by:

sim(V_A^{C^new}, V_A^{C_j}) = |Ṽ_A^{C^new} ∩ Ṽ_A^{C_j}| / |Ṽ_A^{C^new} ∪ Ṽ_A^{C_j}|
                            = Σ_{x ∈ dom_e} min(μ_{Ṽ_A^{C^new}}(x), μ_{Ṽ_A^{C_j}}(x)) / Σ_{x ∈ dom_e} max(μ_{Ṽ_A^{C^new}}(x), μ_{Ṽ_A^{C_j}}(x))   (22)
This calculation is based on the fuzzification of the Jaccard similarity measure as described in (Miyamoto, 2000).
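A minimal Python sketch of this fuzzified Jaccard measure (Equation 22), representing a fuzzy set as a dict from domain values to membership grades (the representation is our choice, not the chapter's):

```python
def fuzzy_jaccard(va_new, va_j):
    """Similarity between two fuzzy value sets (Equation 22):
    sum of pointwise min grades over sum of pointwise max grades."""
    domain = set(va_new) | set(va_j)
    num = sum(min(va_new.get(x, 0.0), va_j.get(x, 0.0)) for x in domain)
    den = sum(max(va_new.get(x, 0.0), va_j.get(x, 0.0)) for x in domain)
    return num / den if den else 0.0

# 'Category' example: fuzzy set of the new case vs. a crisp stored value.
v_new = {'stalking': 1.0, 'verbal violence': 0.6}
v_stored = {'stalking': 1.0}
sim = fuzzy_jaccard(v_new, v_stored)   # 1.0 / 1.6 = 0.625
```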
Similarity Between Two Cases

The similarity between two cases C^new and C_j is obtained by the weighted aggregation of the similarities of all 'descriptive' attributes, taking into account the weights that are associated with the attributes. In this chapter, we only consider conjunctive aggregation. Therefore, we can apply an implicator operator for the modeling of the impact of the weights (de Tré, de Caluwe, Tourné, & Matthé, 2003):

f_im^∧ : [0,1] × [0,1] → [0,1]
(w_A:e, sim(V_A^{C^new}, V_A^{C_j})) ↦ max(1 − w_A:e, sim(V_A^{C^new}, V_A^{C_j}))   (23)

Note that with this definition, the semantic conditions for weights as proposed in Dubois et al. (1997) are satisfied. The similarity between the two cases is then obtained by:

sim(C^new, C_j) = min_{A:e ∈ D} f_im^∧(w_A:e, sim(V_A^{C^new}, V_A^{C_j}))   (24)
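Equations (23) and (24) can be sketched as follows; the attribute similarities used in the usage example are made-up illustration values, not results computed in the chapter:

```python
def f_im(weight, similarity):
    """Implicator-based weighting (Equation 23)."""
    return max(1.0 - weight, similarity)

def case_similarity(attr_sims, weights):
    """Similarity between two cases (Equation 24): conjunctive
    (minimum) aggregation of the weighted attribute similarities."""
    return min(f_im(weights[a], s) for a, s in attr_sims.items())

# Illustrative attribute similarities and the example weights for D.
sims = {'Category': 0.6, 'VictimAge': 0.2, 'Gender': 1.0,
        'Job': 0.0, 'MaritalStatus': 1.0}
ws = {'Category': 1.0, 'VictimAge': 0.7, 'Gender': 1.0,
      'Job': 0.7, 'MaritalStatus': 0.7}
sim = case_similarity(sims, ws)   # min(0.6, 0.3, 1.0, 0.3, 1.0) = 0.3
```

Note how a weight of 0 turns an attribute's contribution into 1, the neutral element of the minimum, as required by the semantic conditions of Dubois et al. (1997).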
Retrieval of Similar Cases

The cases in the database that are similar to the case C^new are retrieved by calculating the similarity:

sim(C^new, C_j), 1 ≤ j ≤ l   (25)

for all stored cases C_j. In order to retrieve only the most similar cases, the user can provide a threshold value τ: only cases for which the similarity is not lower than τ are provided in the result. Finally, the fuzzy set S̃_{C^new} of cases that are similar to the case C^new is obtained by:

S̃_{C^new} = {(C_j, sim(C^new, C_j)) | 1 ≤ j ≤ l ∧ sim(C^new, C_j) ≥ τ}   (26)
If this fuzzy set is empty, a message will be sent to the revision process.
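A sketch of this retrieval step (Equations 25-26); the stored cases and the one-attribute similarity function in the example are invented for illustration:

```python
def retrieve_similar(new_case, stored_cases, similarity, tau):
    """Fuzzy set of similar cases (Equations 25-26): keep every stored
    case whose similarity to the new case reaches the threshold tau."""
    result = {}
    for case_id, case in stored_cases.items():
        s = similarity(new_case, case)
        if s >= tau:
            result[case_id] = s
    return result   # an empty dict signals the revision process

stored = {'C01': {'VictimAge': 33}, 'C02': {'VictimAge': 60}}
sim_by_age = lambda a, b: max(0.0, 1 - abs(a['VictimAge'] - b['VictimAge']) / 3)
similar = retrieve_similar({'VictimAge': 32}, stored, sim_by_age, tau=0.5)
# only 'C01' passes the threshold
```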
Prediction

The prediction process aims to predict the values for the 'predictable' attributes:

P = {A'_1:e'_1, A'_2:e'_2, ..., A'_p:e'_p}

of the new case. For the same reason as with the 'descriptive' attributes, each 'predictable' attribute A' has an associated set of actual values for each case C_j, 1 ≤ j ≤ l stored in the database (cf. Equations [17]-[18]):

V_{A'}^{C_j}   (27)

Furthermore, the prediction process will associate a fuzzy set of predicted values:

Ṽ_{A'}^{C^new}   (28)
with the attribute. Possibility theory is used to determine the possible elements of this set and thus defines the semantics of the membership grades ~ new of VAC′ as degrees of uncertainty. Here, the CBR hypothesis is interpreted as (Dubois et al., 1998,
2000): 'the more similar two cases are, the more possible it is that their corresponding "predictable" attribute values are similar.'

In a straightforward approach, Ṽ_{A'}^{C^new} can be obtained by:

Ṽ_{A'}^{C^new} = {(x, μ_{Ṽ_{A'}^{C^new}}(x)) | x ∈ dom_{e'} ∧ μ_{Ṽ_{A'}^{C^new}}(x) > 0}   (29)

where

μ_{Ṽ_{A'}^{C^new}}(x) = max {μ_{S̃_{C^new}}(C) | μ_{S̃_{C^new}}(C) > 0 ∧ C[A'] = x}
Hereby, C[A'] denotes the actual value of attribute A' for case C. By using Equation (29), each distinct value of attribute A' that occurs in a case C that is an element of the fuzzy set S̃_{C^new} of similar cases is considered to be a possible value for A' in C^new. Its degree of possibility is obtained as the maximum of the membership grades μ_{S̃_{C^new}}(C) of all cases C in S̃_{C^new} that have the value as attribute value for A'. More advanced techniques that also deal with the similarities between the values of A' in the similar cases of S̃_{C^new} can be used here (Dubois et al., 2000). These topics are outside the scope of this chapter.
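The straightforward prediction of Equation (29) can be sketched as follows; the case identifiers, grades, and decisions are invented for illustration:

```python
def predict_values(similar_cases, stored_values):
    """Straightforward prediction (Equation 29): each value of the
    'predictable' attribute occurring in a similar case is possible to
    the degree of the maximal membership grade among the cases having it."""
    possibility = {}
    for case_id, grade in similar_cases.items():
        x = stored_values[case_id]          # C[A'] for this case
        possibility[x] = max(possibility.get(x, 0.0), grade)
    return possibility

# Fuzzy set of similar cases and their stored 'JudicialDecision' values.
similar = {'C01': 1.0, 'C07': 0.6, 'C09': 0.6}
decisions = {'C01': 'fine', 'C07': 'fine', 'C09': 'provisional imprisonment'}
predicted = predict_values(similar, decisions)
# {'fine': 1.0, 'provisional imprisonment': 0.6}
```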
Revision

The revision process gets input from both the case comparison and the prediction processes. On the one hand, it can occur that the fuzzy set S̃_{C^new} is empty. This means that no cases with similar characteristics are found in the database when considering the given similarity ranges, weights, and threshold value τ. Because too stringent conditions might have caused the empty query result, feedback from the user is necessary. On the other hand, it might also be the case that the predicted values prove to be incorrect, which might be caused by conditions that are too
soft. Therefore, as soon as the actual values of the 'predictable' attributes in P become available and are entered in the system, the new values are compared with the predicted values by calculating their similarity. Hereby, the same techniques as in the case comparison process can be applied, but now using the following ranges and weights:

Range_{A'_1:e'_1}, Range_{A'_2:e'_2}, ..., Range_{A'_p:e'_p} and w_{A'_1:e'_1}, w_{A'_2:e'_2}, ..., w_{A'_p:e'_p}

The similarity between the completed 'predicted' case C^new_completed and the case C^new as originally entered in the database is obtained by the following counterpart of Equation (24):

sim(C^new_completed, C^new) = min_{A':e' ∈ P} f_im^∧(w_{A':e'}, sim(V_{A'}^{C^new_completed}, V_{A'}^{C^new}))   (30)

If this similarity is lower than the threshold value τ, then the prediction is considered inadequate, and again feedback from the user is necessary. If user feedback is necessary, the process will interact with the user and provide all the information that is available. More specifically, the cause of the interaction will be communicated, together with information about all attribute range values, weights, and (intermediate) results in the calculation of the similarities. By comparing the intermediate results, the process can determine for which 'descriptive' attribute(s) the values have the highest and the lowest similarity. This information, together with the feedback of the user, can help the user to decide to adapt (some of) the parameters of the case description process. Alternatively, the parameters can also be automatically adapted by the process. This can be done, for example, by proportionally decreasing or increasing the weight and/or the relative distances of the definition of the similarity range of the attribute that performs best or worst. More details on this are outside the scope of this chapter.

Enhancing Flexible Database Querying

In this section, we describe how the CBR approach presented in the previous section can be applied to enhance the flexible querying of (conventional) relational databases.

Flexible Querying

For many years, an emphasis has been put on research that aims to make database systems more flexible and better accessible. An important aspect of flexibility is the ability to deal with imperfections of information, like imprecision, vagueness, uncertainty, or incompleteness. Imperfection of information can be dealt with at the level of data modeling, the level of database querying, or both. The key idea in flexible querying is to introduce preferences inside database queries (Bosc, Kraft, & Petry, 2005). This can be done at two levels: inside elementary query conditions and between query conditions. Preferences inside query conditions allow for expressing that some values are more adequate than others, whereas preferences between query conditions are used to associate different levels of importance with the conditions.

To support preferences, query languages like SQL and OQL and their underlying algebraic frameworks have been generalized. Hereby, the possible extensions and flexible counterparts of the algebraic data manipulation operators have been studied (Bosc & Pivert, 1992, 1995; de Tré, Verstraete, Hallez, Matthé, & de Caluwe, 2006; Galindo, Medina, Pons, & Cubero, 1998; Galindo, Urrutia, & Piattini, 2006; Umano & Fukami, 1994; Zadrozny & Kacprzyk, 1996). As the main objective of flexible querying is to refine Boolean conditions, which are either completely true or completely false, it is sufficient that the underlying logical framework supports some notion of 'degree of satisfaction.' Alternatively, an underlying logical framework based on possibility and necessity measures can be used to express certainty about query satisfaction. This approach,
as originally presented in Prade and Testemale (1984), does not discuss the inapplicability of information at the logical level, nor does it offer a formal framework for coping with it together with other null values, despite the fact that inapplicability is handled with a special domain value (⊥) in the data model. As illustrated in de Tré and de Caluwe (2003), extended possibilistic truth values (EPTVs) can be used to express (un)certainty about query satisfaction in flexible database querying: the EPTV representing the extent to which it is (un)certain that a given database record belongs to the result of a flexible query can be obtained by aggregating the calculated EPTVs that denote the extents to which it is (un)certain that the record satisfies the different criteria imposed by the query. Moreover, the logical framework based on EPTVs extends the approach presented in Prade and Testemale (1984) and explicitly deals with the inapplicability of information during the evaluation of the query conditions: if some part of the query conditions is inapplicable, this will be reflected in the resulting EPTV. An extension of SQL that copes with EPTVs has been described in de Tré et al. (2006).
Extending Flexible Querying Systems with Extra CBR Facilities

In a first approach, a flexible querying system could be extended with a CBR system for instance-based prediction (Dubois et al., 2000). Such an extra facility additionally allows users to examine the database for predicted values for a set of given attributes. Of course, in order to be usable, the underlying CBR hypothesis, 'the more two database entities are similar, the more possible the similarity of associated attribute values,' must hold. By using the facility, a CBR technique as described in the section titled 'A Flexible CBR Approach for Information Retrieval' will be applied. After having initialised the CBR system, the user has to enter the relevant attribute values describing the case under consideration. The unknown attribute values will then be predicted by the prediction process and returned to the user by the CBR system.
Embedding CBR Facilities in a Flexible Querying Language

Rather than being provided as an extra stand-alone facility, CBR can also be embedded in existing (flexible) querying systems. To do this, the query language must be extended with an extra facility 'PREDICT' that allows the prediction of the unknown values of specified attributes. Without such a facility, those unknown values will in most systems be represented by the pseudo-description null (Codd, 1979; Vassiliou, 1979). In the next subsections, we describe such a predict facility that could be embedded in a flexible querying language for conventional, relational databases supported by a logical framework of EPTVs.
Flexible Querying Using a Logical Framework of EPTVs

In order to use EPTVs for expressing query satisfaction in flexible querying of regular relational databases, the relational model and relational algebra (Codd, 1972) must be extended with some additional facilities. To start with, the definition of a relation R is extended so that it contains an extra attribute 'Contains:T_EPTV' with a corresponding data type T_EPTV that has EPTVs as allowed values. As such, each relation R_i, 1 ≤ i ≤ r in a database schema has the following schema:
R_i(A_{i1}:T_{i1}, A_{i2}:T_{i2}, ..., A_{im_i}:T_{im_i}, Contains:T_EPTV)   (31)
The extra attribute with name ‘Contains’ is used to express the extent to which the tuples of the relation belong to the relation. Hereby, it is implicitly assumed that the schema of a relation corresponds to a predicate and all tuples that belong to the relation are propositions that should not evaluate to false, that is, that have an associated EPTV that differs from {(F, 1)}. If no more information is available, for simplification, it can be assumed that all tuples initially have the value {(T, 1)} as associated with
EPTV. The value {(T, 1)} could, for example, be the default value that is assigned to the tuple on insertion. In a more general approach, users might be allowed to assign their own truth values, hereby expressing that the tuple belongs to the relation only to the given extent.

In order to guarantee the relational closure property of the set of relational algebra operators (Codd, 1972), the definitions of the operators must also be extended such that the results of the queries are also extended relations that have an extra attribute 'Contains.' The value of the extra attribute then expresses the extent to which a tuple belongs to the answer set of the query. In fact, the EPTV expresses the certainty about the compatibility of the tuple with the results expected by the user. This certainty is calculated during query processing, as is presented below for the selection, projection, and join operators (de Tré et al., 2006).

Illustrative database. The relational database used to illustrate the proposed flexible querying approach is a simplification of the one introduced in the section titled 'A Flexible CBR Approach for Information Retrieval.' It consists of two relations named 'Victim' and 'Complaint,' as shown in Figure 3. Each tuple in Victim represents information about a victim of some crime for which an official complaint is registered in the database and is characterized by a unique victim ID (VID), which is the primary key attribute, and an age attribute (Age). Each tuple in Complaint represents information
Figure 3. Example of the relations Complaint and Victim (relation Complaint has the attributes CID, VID, Duration, and Contains; relation Victim has the attributes VID, Age, and Contains; all tuples initially have the EPTV {(T, 1)} as 'Contains' value, and the Duration of one complaint is the undefined value ⊥Integer)
about a juridical complaint and is characterized by a unique complaint ID (CID), which is the primary key attribute; the corresponding victim ID of the victim (VID), which is a foreign key that refers to relation Victim; and the total duration of the complaint handling (Duration).

The associated domains dom_T of the considered attributes contain a domain-specific 'undefined' element ⊥_T that is used to model cases where a regular domain value is not applicable (cf. Prade & Testemale, 1984). In this way, the attribute domain of Duration contains an element ⊥_Integer which denotes that a regular value for the duration is not applicable, which could be due to the fact that the complaint has been withdrawn. For example, this is the case for complaint 'C04'.

The selection operation. In relational algebra (Codd, 1972), the selection operation, also called the restriction operation, is written in the following general format:

a WHERE e   (32)
where a denotes a database relation and e is a truth-valued function, also called the restriction condition, whose parameters are some subset of the attributes of a. The selection operation restricts relation a by discarding all tuples of a that do not satisfy e at all, that is, that have a calculated truth value equal to false. The resulting relation contains the same attributes as relation a.

In the proposed extension, the truth-valued function e is further generalised to a function that evaluates to an EPTV. Examples of such functions are the 'IS' function and the generalisations of the comparison operators like '=,' '≠,' '<,' and '>.' As an illustration, only the definition of the 'IS' function is described below. Definitions for the comparison operators are given in de Tré and de Caluwe (2004). With the understanding that v is the crisp stored value of attribute A and μ_L is the membership function of the fuzzy set L that represents the values desired by the user, the EPTV of the proposition 'A IS L' is defined by:
{(T ,
T
max(
T,
F,
⊥)
), ( F ,
max(
F
T,
F,
⊥)
), (⊥,
⊥
max(
T,
where •
T
•
F
•
⊥
ed:
=
L
F,
⊥)
)}
(33)
(v )
1 − L (v ) if v ≠⊥ T = if v =⊥T 0 1 − L (⊥T ) if v =⊥T = 0 if v ≠⊥ T
With Equation (33)��������������������������� , the following is reflect-
• • •
The resulting EPTV must be normalized. This is guaranteed by the division by max( T , F , ⊥ ) . If v is inapplicable (� v =⊥T ) and the fuzzy set L refers to the value ⊥T , the truth value T is possible to the calculated extent. The possibility of the truth value ⊥ is 1, if the
Figure 4. Resulting relations of the considered queries Complaint {CID, Duration} Complaint where Duration IS Long CID
VID
CID
Duration
Contains
Duration
Contains
C0
{(T,)}
C0 V0
{(T,)}
C0
{(T,)}
C0 V0 C0 V0
{(T,0.0),(F,)}
C0
0
{(T,)}
0
{(T,),(F,0.)}
C0
⊥Integer
{(T,)}
C0 V0
⊥Integer
{(⊥,)}
C0
{(T,)}
(a)
In practice, the fuzzy set L can be labeled by a linguistic term. The result set of the selection operation is obtained by evaluating e and by calculating the EPTVs that are associated with the resulting tuples. Hereby, it is important and necessary that the EPTVs of the original relation a are appropriately dealt with. Therefore, the conjunction operator ∧̃, presented in the preliminaries, is applied to the original EPTV and the EPTV that is obtained from the evaluation of e. Only tuples with a resulting EPTV that differs from {(F, 1)} belong to the resulting relation. As an example, consider the following query: Complaint WHERE Duration IS Long This query selects all 'Complaint'-tuples with a Duration that is compatible with the fuzzy set that is labeled with the linguistic term 'Long.' The membership function of this fuzzy set is given as:

μLong: domInteger → [0, 1]

μLong(x) = 0 if x = ⊥Integer
μLong(x) = 0 if x < 300
μLong(x) = (x − 300) / 300 if 300 ≤ x ≤ 600
μLong(x) = 1 if x > 600
The tuples belonging to the result set of the query are given in Figure 4(a). For every tuple in this result set, the corresponding EPTV is calculated as the conjunction of the EPTV associated with the tuple in the original relation and the EPTV that is obtained by applying the 'IS' function as defined above in Equation (33), that is:
• Tuple 'C01':
{(T, 1)} ∧̃ {(T, 1/max(1, 0, 0)), (F, 0/max(1, 0, 0)), (⊥, 0/max(1, 0, 0))} = {(T, 1)} ∧̃ {(T, 1)} = {(T, 1)}

• Tuple 'C02':
{(T, 1)} ∧̃ {(T, 0.04/max(0.04, 0.96, 0)), (F, 0.96/max(0.04, 0.96, 0)), (⊥, 0/max(0.04, 0.96, 0))} = {(T, 1)} ∧̃ {(T, 0.04), (F, 1)} = {(T, 0.04), (F, 1)}

• Tuple 'C03':
{(T, 1)} ∧̃ {(T, 0.8/max(0.8, 0.2, 0)), (F, 0.2/max(0.8, 0.2, 0)), (⊥, 0/max(0.8, 0.2, 0))} = {(T, 1)} ∧̃ {(T, 1), (F, 0.25)} = {(T, 1), (F, 0.25)}

• Tuple 'C04':
{(T, 1)} ∧̃ {(T, 0/max(0, 0, 1)), (F, 0/max(0, 0, 1)), (⊥, 1/max(0, 0, 1))} = {(T, 1)} ∧̃ {(⊥, 1)} = {(⊥, 1)}

• Tuple 'C05':
{(T, 1)} ∧̃ {(T, 0/max(0, 1, 0)), (F, 1/max(0, 1, 0)), (⊥, 0/max(0, 1, 0))} = {(T, 1)} ∧̃ {(F, 1)} = {(F, 1)}
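The worked results above can be reproduced mechanically. The operator ∧̃ is defined in the chapter's preliminaries, which are not part of this excerpt; the sketch below assumes a common definition, namely the extension principle applied to Kleene's strong three-valued conjunction, followed by renormalization:

```python
def kleene_and(x, y):
    # Kleene's strong conjunction: F dominates, T is neutral, otherwise ⊥ (written 'U')
    if x == "F" or y == "F":
        return "F"
    return "T" if x == "T" and y == "T" else "U"

def conj(p, q):
    """Extension-principle conjunction of two EPTVs given as {truth value: possibility}."""
    out = {}
    for x, a in p.items():
        for y, b in q.items():
            z = kleene_and(x, y)
            out[z] = max(out.get(z, 0.0), min(a, b))
    norm = max(out.values())  # renormalize so the largest possibility is 1
    return {z: d / norm for z, d in out.items() if d > 0}

# tuple 'C02' from the running example
print(conj({"T": 1.0}, {"T": 0.04, "F": 1.0}))  # {'T': 0.04, 'F': 1.0}
```

Because every original tuple carries the EPTV {(T, 1)} in this example, the conjunction simply reproduces the EPTV of the 'IS' evaluation, as in the five cases above.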
EPTVs allow the modeling of the partial satisfaction of a flexible query condition (tuples ‘C02’ and ‘C03’). It also might be the case that the flexible condition is completely satisfied (tuple ‘C01’) or not satisfied at all (tuple ‘C05’); this tuple is not in the result set of the query. If some part of the data is not defined, for example, due to the fact that the complaint has been withdrawn (tuple ‘C04’), this is explicitly reflected in the associated EPTV. The projection operation. The algebraic projection operation is written in the following general format (Codd, 1972): a {X, Y, …, Z}
(34)
where a denotes a database relation and X, Y, …, Z are regular attributes of a. The result of the projection operation is a relation with a heading derived from the heading of a by removing all attributes not mentioned in the set {X, Y, …, Z} and a body consisting of all tuples of a, restricted to the values of the attributes {X, Y, …, Z}. Hereby, repeated tuples are deleted. In the proposed extension, the extra attribute Contains is also added to the resulting relation. For each tuple t in the body of the resulting relation, the corresponding EPTV is calculated as the disjunction of all EPTVs that are associated with the tuples t′ in a that have the same attribute values for the attributes in {X, Y, …, Z}, that is:

t(Contains) = ∨̃ { t′(Contains) | t′ ∈ a ∧ t(X, Y, …, Z) = t′(X, Y, …, Z) }
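A minimal sketch of this extended projection: tuples collapsing to the same projected value get their EPTVs merged with a disjunction, assumed here (as for the conjunction) to be the extension principle applied to Kleene's strong disjunction. The row layout and all names are illustrative:

```python
def kleene_or(x, y):
    # Kleene's strong disjunction: T dominates, F is neutral, otherwise ⊥ ('U')
    if x == "T" or y == "T":
        return "T"
    return "F" if x == "F" and y == "F" else "U"

def disj(p, q):
    """Extension-principle disjunction of two EPTVs, renormalized."""
    out = {}
    for x, a in p.items():
        for y, b in q.items():
            z = kleene_or(x, y)
            out[z] = max(out.get(z, 0.0), min(a, b))
    norm = max(out.values())
    return {z: d / norm for z, d in out.items() if d > 0}

def project(rows, attrs):
    """Projection on attrs; duplicate tuples merge and their EPTVs are OR-ed."""
    merged = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        merged[key] = disj(merged[key], row["Contains"]) if key in merged else dict(row["Contains"])
    return [dict(zip(attrs, key), Contains=eptv) for key, eptv in merged.items()]

rows = [
    {"CID": "C01", "VID": "V01", "Contains": {"T": 0.4, "F": 1.0}},
    {"CID": "C01", "VID": "V02", "Contains": {"T": 1.0}},
]
print(project(rows, ["CID"]))  # one merged tuple; the stronger evidence wins
```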
In this way, the relational closure property is guaranteed with respect to the projection operator. As an example, consider the query: Complaint {CID, Duration} This query selects all 'Complaint'-tuples, but restricts their tuple values to the values of the attributes CID and Duration. The corresponding EPTVs of the original Complaint relation are copied to the resulting relation. The tuples belonging to the result set of the query are given in Figure 4(b). The join operation. Consider the two relations a and b with respective attribute sets {XY{Contains_a}} and {YZ{Contains_b}}, where X = {X1, X2, …, Xm}, Y = {Y1, Y2, …, Yn}, and Z = {Z1, Z2, …, Zp}.
This means that the attributes Y1, Y2, …, Yn are common to the two relations, X1, X2, …, Xm, Contains_a are the other attributes of a, and Z1, Z2, …, Zp, Contains_b are the other attributes of b. Contains_a is the extra attribute for the associated EPTVs in a, whereas Contains_b is the extra attribute for the associated EPTVs in b. The algebraic (natural) join operation is written in the following format (Codd, 1972): a JOIN b
(35)
The resulting relation is a relation with heading {XYZ{Contains}} and a body consisting of all tuples that can be obtained by 'combining' tuples of a and b that have the same values for all the attributes they have in common, that is, the attributes in Y. Within the extended approach, the associated EPTV in Contains is calculated by aggregating (combining) the corresponding EPTVs in Contains_a and Contains_b using the conjunction operator ∧̃ that is presented in the preliminaries. As an example, consider the query: (Complaint WHERE Duration IS Long) JOIN Victim
This query joins the relations (Complaint WHERE Duration IS Long), presented in Figure 4(a), and Victim; the resulting EPTV of each tuple in the result is calculated by applying the conjunction operator ∧̃ to the EPTVs of both tuples that are combined to obtain the resulting tuple. The tuples belonging to the result set of the query are given in Figure 4(c).
The Predict Operation

In order to illustrate the embedding of CBR facilities in a flexible querying system, the approach presented in the previous subsection is extended with an extra operation 'PREDICT.' This operation can only be meaningfully applied if the underlying CBR hypothesis "The more two database entities are similar, the more possible the similarity of associated attribute values" holds. In its simplest form, the format of this operation is as follows: a PREDICT X
(36)
where a denotes a database relation and X is an attribute of a. The result of the predict operation is a new relation that contains the same attributes as relation a. The tuples of the result set are obtained from the tuples of a by replacing any null value that occurs for the attribute X by a predicted value (if this value can be calculated). These predicted values are obtained by applying a CBR technique as described in the section titled A Flexible CBR Approach for Information Retrieval. Because a regular relational database is considered and because the data type of the attribute X does not change, only domain values of the data type of X are allowed as predicted values. Consequently, only one predicted value out of the fuzzy set ṼXt of predicted values (if not empty) can be completed in the tuple t. In the presented approach, the most possible value is chosen. This is the value x with the maximum associated membership grade in ṼXt:

μṼXt(x) = max { μṼXt(y) | y ∈ domX ∧ μṼXt(y) > 0 }   (37)
If more than one value has the maximum membership grade, then one of these values is chosen arbitrarily as approximation. Of course, because of the lost information, this is not an ideal situation. When working with a fuzzy database, the fuzzy set ṼXt could be stored as the value for X, hereby representing the predicted value as adequately as possible, without a loss of information. For each tuple in the result set for which a null value has been replaced by a predicted value x, the associated EPTV is calculated by the conjunction of the EPTV:

{(T, μṼXt(x) / max(μṼXt(x), 1 − μṼXt(x))), (F, (1 − μṼXt(x)) / max(μṼXt(x), 1 − μṼXt(x)))}   (38)
and the EPTV that was originally associated with the tuple. Again, the conjunction operator ∧̃ presented in the preliminaries is used for this purpose. For all other tuples, the associated EPTV remains unchanged. As an example, consider the following query as applied on the relation depicted in Figure 5(a): Complaint PREDICT Duration This query predicts all null values that occur in the Duration attribute of the relation Complaint. The result of the query is presented in Figure 5(b). All tuples, except the tuple with CID 'C03,' remain the same. For tuple 'C03,' the original Duration value was null. With the assumption that the fuzzy set of predicted values for 'C03,' returned by the CBR approach, is:
Figure 5. An illustration of the PREDICT operation: (a) the relation Complaint, with attributes {CID, VID, Duration, Contains}; (b) the result of Complaint PREDICT Duration
Ṽ_Duration^C03 = {(550, 0.8), (520, 0.6), (583, 0.4)}
the predicted value of Duration in 'C03' becomes 550. With:

C03[Duration] = {(T, 0.8/0.8), (F, 0.2/0.8)} = {(T, 1), (F, 0.25)}

the associated EPTV becomes:

C03[Duration] ∧̃ {(T, 1)} = {(T, 1), (F, 0.25)} ∧̃ {(T, 1)} = {(T, 1), (F, 0.25)}
A Real-World Application: The Gender Claim Database

Within a juridical context, the availability of and easy access to information regarding similar cases is useful with a view to treating complaints. Such cases can help jurists to detect potential pitfalls in time or to make assessments about future developments. The CBR approach and querying techniques presented in the previous sections can be used to predict future developments with respect to the handling of new complaints entered in a database for gender claim handling. The approach allows one to accommodate future attribute values, like the total duration of the complaint handling and the potential (intermediate) results of the actions undertaken by the jurists. Predictions are obtained by comparing new complaints with similar complaints that are stored in the database. Such a database application, called the 'gender claim database,' has been developed for the Belgian
Federal Institute of Equality of Women and Men and is meant to register, to preserve, and to process complaints about direct or indirect discrimination on the basis of gender, harassments (if these relate to the sex of the victim), and unwanted sexual behaviour, which are part of the authority of the institute. The database system is intended to support the way in which the complaints are dealt with as well as the way in which they will be reported to the authorities. A team of jurists is responsible for the complaint handling. The complaint handling system would offer jurists an important surplus value if it could support their database searches for similar cases and could help them make assessments about future developments in the handling of a newly entered complaint. For example, from similar cases, the jurist can learn more about the most likely options for that sort of complaint. Such facilities would be useful because these could help jurists to detect potential pitfalls in time and provide a means for a better exploitation of the database. The underlying idea is that the retrieved information is not intended to replace the knowledge of the jurist, but is additional to it. For that reason, interaction with the jurist is supported and encouraged, especially within the revision process.
Conclusion and Future Trends

In a CBR approach for information retrieval, four main processes can be identified: case description, case comparison, prediction, and revision. In order
to be applicable, the CBR hypothesis that “similar problems have similar solutions” must hold. In the first part of this chapter, it has been described how fuzzy set theory and its related possibility theory can be applied to efficiently deal with the imperfections that are inherent to these processes. For the sake of argumentation, a conventional relational case database has been considered. More specifically, it has been illustrated how the case description process in the case of a relational case database can be made more flexible by providing similarity ranges and weights for the considered attributes. These similarity ranges define the acceptable values for the attributes, whereas the weights denote the relative importance of the attributes within the similarity determination process. It has also been illustrated how a flexible similarity measure for the comparison of two cases can be set up in the case comparison process. This makes sense because case comparisons will seldom result in an exact similarity matching of cases, and fuzzy set theory allows modeling a gradation of similarity of cases. Furthermore, it has also been illustrated how the inevitable uncertainty that occurs when predictions are made can be handled using possibility theory and how the revision process can help to fine-tune the system. Because of the added flexibility, the resulting approach is called a flexible CBR approach. In the second part of the chapter, it has been explained how a flexible CBR approach can be used to enhance flexible database querying. Two approaches have been distinguished. In the first approach, a flexible querying system is extended with a CBR system for instance-based prediction. In the second approach, CBR is embedded in an existing flexible querying system. For the sake of illustration, such a flexible querying approach for regular relational databases has been presented. 
The approach uses a logical framework based on EPTVs and has an embedded CBR-based prediction facility that allows predicting unknown data. To illustrate the practical usefulness of the approach, a real-world application for information retrieval in a juridical database for gender-claim handling has been briefly introduced.
Future work will focus on the further enhancement and development of the presented techniques. Among others, the incorporation of text retrieval mechanisms, alternative aggregation techniques for the comparison process, more advanced techniques for value prediction, and the further (semi-)automation of the revision process will be studied. Another field of ongoing research is the generalization of the approach towards 'fuzzy' databases, that is, databases that can contain imperfect (imprecise, vague, incomplete, or uncertain) data.
References

Aamodt, A., & Plaza, E. (1994). Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 7(1), 39-59.
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Essex, UK: ACM Press/Addison-Wesley.
Bosc, P., Kraft, D., & Petry, F. E. (2005). Fuzzy sets in database and information systems: Status and opportunities. Fuzzy Sets and Systems, 153(3), 418-426.
Bosc, P., & Pivert, O. (1992). Some approaches for relational databases flexible querying. International Journal on Intelligent Information Systems, 1, 323-354.
Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17.
Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6).
Codd, E. F. (1972). Relational completeness of data base sublanguages. In R. J. Rustin (Ed.), Data base systems. Englewood Cliffs, NJ: Prentice Hall.
Codd, E. F. (1979). RM/T: Extending the relational model to capture more meaning. ACM Transactions on Database Systems, 4(4).
de Calmès, M., Dubois, D., Hüllermeier, E., Prade, H., & Sedes, F. (2003). Flexibility and fuzzy case-based evaluation in querying: An illustration in an experimental setting. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 11(1), 43-66.
de Cooman, G. (1995). Towards a possibilistic logic. In D. Ruan (Ed.), Fuzzy set theory and advanced mathematical applications (pp. 89-133). Boston: Kluwer Academic Publishers.
de Cooman, G. (1999). From possibilistic information to Kleene's strong multi-valued logics. In D. Dubois, E. P. Klement, & H. Prade (Eds.), Fuzzy sets, logics and reasoning about knowledge (pp. 315-323). Boston: Kluwer Academic Publishers.
de Tré, G. (2002). Extended possibilistic truth values. International Journal of Intelligent Systems, 17, 427-446.
de Tré, G., & de Caluwe, R. (2003). Modeling uncertainty in multimedia database systems: An extended possibilistic approach. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 11(1), 5-22.
de Tré, G., & de Caluwe, R. (2004). Towards more flexible database systems: A logical framework based on extended possibilistic truth values. In Proceedings of the 15th International Workshop on Database and Expert Systems Applications DEXA 2004 (pp. 900-904), Zaragoza, Spain.
de Tré, G., de Caluwe, R., Tourné, K., & Matthé, T. (2003). Theoretical considerations ensuing from experiments with flexible querying. In Proceedings of the 10th International Fuzzy Systems Association (IFSA) World Congress (pp. 388-391), Istanbul, Turkey.
de Tré, G., Verstraete, J., Hallez, A., Matthé, T., & de Caluwe, R. (2006). The handling of select-project-join operations in a relational framework supported by possibilistic logic. In Proceedings of the 11th International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU) (pp. 2181-2188), Paris, France.
Dubois, D., Esteva, F., Garcia, P., Godo, L., Lopez de Mantaras, R., & Prade, H. (1998). Fuzzy set modeling in case-based reasoning. International Journal on Intelligent Systems, 13, 345-373.
Dubois, D., Fargier, H., & Prade, H. (1997). Beyond min aggregation in multicriteria decision: (Ordered) weighted min, discri-min and leximin. In R. R. Yager & J. Kacprzyk (Eds.), The ordered weighted averaging operators: Theory and applications (pp. 181-192). Boston: Kluwer Academic Publishers.
Dubois, D., Hüllermeier, E., & Prade, H. (2000). Flexible control of case-based prediction in the framework of possibility theory. Lecture Notes in Artificial Intelligence, 1898, 61-73. Berlin/Heidelberg: Springer-Verlag.
Dubois, D., & Prade, H. (1988). Possibility theory. New York: Plenum Press.
Dubois, D., & Prade, H. (Eds.). (2000). Fundamentals of fuzzy sets. Dordrecht, The Netherlands: Kluwer Academic Publishers Group.
Ellman, J. (1995). An application of case based reasoning to object-oriented database retrieval. In Proceedings of the 1st UK Case Based Reasoning Workshop, Salford, UK.
Faltings, B. (1997). Probabilistic indexing for case-based prediction. Lecture Notes in Artificial Intelligence, 1266, 611-622. Berlin/Heidelberg: Springer-Verlag.
Galindo, J., Medina, J., Pons, O., & Cubero, J. (1998). A server for fuzzy SQL queries. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible querying and answering systems (pp. 164-174). Dordrecht: Kluwer Academic Publishers.
Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group Publishing.
Miyamoto, S. (2000). Fuzzy multisets and their generalizations. Lecture Notes in Computer Science, 2235, 225-236. Berlin/Heidelberg: Springer-Verlag.
Pedrycz, W., & Gomide, F. (1998). An introduction to fuzzy sets: Analysis and design. The MIT Press.
Plaza, E., Esteva, F., Garcia, P., Godo, L., & López de Màntaras, R. (1996). A logical approach to case-based reasoning using fuzzy similarity relations. Information Sciences, 106, 105-122.
Prade, H. (1982). Possibility sets, fuzzy sets and their relation to Lukasiewicz logic. In Proceedings of the 12th International Symposium on Multiple-Valued Logic (pp. 223-227).
Prade, H., & Testemale, C. (1984). Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences, 34, 115-143.
Rescher, N. (1969). Many-valued logic. New York: McGraw-Hill.
Richter, M. M. (1995). On the notion of similarity in case-based reasoning. In G. della Riccia, R. Kruse, & R. Viertl (Eds.), Mathematical and statistical methods in artificial intelligence (pp. 171-184). Heidelberg: Springer-Verlag.
Richter, M. M. (2006). Modeling uncertainty and similarity-based reasoning: Challenges. In Workshop Proceedings of the 8th European Conference on Case-Based Reasoning ECCBR 2006 (pp. 191-199), Ölüdeniz/Fethiye, Turkey.
Shimazu, H., Kitano, H., & Shibata, A. (1993). Retrieving cases from relational databases: Another stride towards corporate-wide case-base systems. In Proceedings of the 13th International Joint Conference on Artificial Intelligence IJCAI (pp. 909-915), Chambéry, France.
Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy relational model of fuzzy data. Journal of Intelligent Information Systems, 3, 7-27.
Vassiliou, Y. (1979). Null values in data base management: A denotational semantics approach. In Proceedings of the Special Interest Group on Management of Data (SIGMOD) Conference (pp. 162-169).
Yager, R. R. (1997). Case-based reasoning, fuzzy systems modeling and solution composition. In Proceedings of the Case-Based Reasoning Research and Development Second International Conference ICCBR-97 (pp. 633-643), Rhode Island, USA.
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8(3), 338-353.
Zadeh, L. A. (1975). The concept of linguistic variable and its application to approximate reasoning (parts I, II, and III). Information Sciences, 8, 199-251, 301-357; 9, 43-80.
Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1), 3-28.
Zadrozny, S., & Kacprzyk, J. (1996). FQUERY for Access: Towards human consistent querying user interface. In Proceedings of the 1996 ACM Symposium on Applied Computing (SAC) (pp. 532-536), Philadelphia, PA.
Key Terms

Case Based Reasoning: Case based reasoning (CBR) is a methodology where new problems are solved by investigating, adapting, and reusing solutions to previously solved, similar problems. Hereby, knowledge is deduced from the characteristics of a collection of past cases, rather than induced from a set of knowledge rules that are stored in a knowledge base.

Case Comparison: This CBR process is responsible for the retrieval of cases in the database that are adequately similar to the case for which some values need to be predicted. Hereby, the fuzzy preferences and/or fuzzy conditions provided in the case description process are applied.

Case Description: In this CBR process, it is identified how the cases are structured and can
be extracted from the database. Furthermore, the parameters required for case comparison purposes are set. For example, in case of a fuzzy case comparison, the fuzzy preferences and/or fuzzy conditions are specified.

Database: A collection of persistent data. In a database, data are modeled in accordance with a database model. This model defines the structure of the data, the constraints for integrity and security, and the behavior of the data.

Flexible Querying: Searching for data in a database is called querying. Modern database systems provide a query language to support querying. Relational databases are usually queried using SQL (Structured Query Language). Regular database querying can be made more user friendly by applying techniques for self-correction of syntax and semantic errors, database navigation, or "indirect" answers like summaries, conditional answers, and contextual background information for (empty) results. This is called flexible querying. A special subcategory of flexible querying techniques is based on the introduction of fuzzy preferences and/or fuzzy conditions in queries. This is sometimes called fuzzy querying.

Flexible Querying Techniques Based on CBR: CBR techniques can be used for flexible database querying purposes. More specifically, CBR techniques can be used for instance-based
prediction with which unknown data values can be approximated. Hereby, four main processes can be identified: case description, case comparison, prediction, and revision.

Prediction: Based on the data in similar cases, a prediction model is built for each of the unknown data values that must be predicted. Each prediction model represents the predicted approximation of the unknown value. These models are forwarded to the user and to the revision process.

Relational Database: A relational database is a database that is modeled in accordance with the relational database model. In the relational database model, the data are structured in relations that are represented by tables. The behavior of the data is defined in terms of the relational algebra, which originally consists of eight operators (union, intersection, difference, cross product, join, selection, projection, and division), or in terms of the relational calculus, which is of a declarative nature.

Revision: This process is activated when no similar cases are found in the case comparison or when the actual values for the attributes involved in the prediction process become available. The latter typically occurs when the case has been further processed by the users and the new data have been entered in the database. All extra information is processed. Eventually, a request to modify the parameter settings is generated and sent to the case description process.
Chapter VIII
Customizable Flexible Querying in Classical Relational Databases

Gloria Bordogna, CNR IDPA, Italy
Giuseppe Psaila, University of Bergamo, Italy
Abstract

In this chapter, we present the Soft-SQL project, whose goal is to define a rich extension of SQL aimed at effectively exploiting the flexibility offered by fuzzy set theory to solve practical issues when querying classic relational databases. The Soft-SQL language is based on previous approaches that introduced soft conditions on tuples in the classical relational database model. We retain the main features of these approaches and focus on the need to provide tools allowing users to directly specify the context-dependent semantics of soft conditions. To this end, Soft-SQL provides a command (named CREATE TERM-SET) to define the semantics of linguistic values with respect to a context represented by a linguistic variable (Zadeh, 1975); the SELECT command is extended in order to support soft predicates based on the user-defined term sets, the semantics of grouping and aggregation can be modified, and, finally, the clauses in the SELECT command can be combined effectively.
Introduction

The need to flexibly query relational databases has been widely recognized as a means to improve the effectiveness of retrieval in current systems using SQL for expressing information needs. The main inadequacy of the SQL language is caused by the crisp algebra on which it is founded, which does not support the ranking of the results with respect to their relevance to user needs. In this book, the chapter by Kacprzyk et al. provides an extensive survey on flexible querying approaches.
For many categories of users, the possibility to express tolerant conditions and to retrieve discriminated answers in decreasing order of relevance can greatly simplify users’ tasks that generally are performed through a sequence of trial and error phases. The problem of false drops when querying databases by specifying crisp selection conditions is well known. Several approaches have been proposed either based on preference specifications or on soft conditions tolerating degrees of undersatisfaction to overcome this drawback of SQL language use (Bosc & Pivert, 1992; Dubois
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
& Prade, 1997; Eduardo, Goncalves, & Tineo, 2004; Kießling, 2002, 2003; Petry, 1996; Rosado, Ribeiro, Zadrozny, & Kacprzyk, 2006; Tineo, 2000). The foundations of Kießling for preferences in databases are the basis for an intuitive valuation of search results, in which it is assumed that people naturally express their requests in terms like "I like A better than B." All these preferences can be formulated as strict partial orders. Based on this formulation, the Preference SQL language (Kießling, 2002, 2003) has been defined as an extension of SQL. Several built-in base preference types, combined with the adherence to a declarative SQL programming style, guarantee great programming productivity. Further, the Preference SQL optimizer does an efficient rewriting into standard SQL. Another approach for the specification of preferences in queries is based on soft constraints, that is, tolerant selection conditions formalized within fuzzy set theory (Zadeh, 1965). Several extensions of SQL to allow the specification of soft selection conditions in queries have been proposed. A rich taxonomy that helps in understanding the various proposals of extension of SQL by fuzzy set theory is outlined in Rosado et al. (2006). In Dubois and Prade (1997), two reasons for using fuzzy set theory (Zadeh, 1965) to make querying more flexible are discussed. First, fuzzy sets provide a better representation of the user's preferences.
One reason is that users feel much more comfortable using linguistic terms instead of precisely specified numerical constraints when expressing in a query some condition, such as when asking for some hotel "not too expensive and not too far from the beach." Furthermore, the semantics of these linguistic terms can be exactly "precisiated" (i.e., after Zadeh, 1999, defined as a function on the basic domain of a variable) by fuzzy sets (Zadeh, 1965), so that we can have a price definitely matching or definitely not matching the user's request, but also a price that matches to a certain degree. The second reason is that a direct consequence of having a matching degree is that answers can be ranked according to users' requirements. Furthermore,
the possibility to "precisiate" the semantics of the linguistic terms makes it possible to implement mechanisms that offer users a full control on the semantics of their flexible queries (Bordogna & Psaila, 2005). According to many authors (Bosc & Pivert, 1992, 1995; Kacprzyk & Zadrozny, 1995; Medina, Pons, & Vila, 1994; Petry, 1996), there are two main lines of research in the use of fuzzy set theory in the database management system (DBMS) context. The first one assumes a conventional database and, essentially, develops a flexible querying interface using fuzzy sets, possibility theory, fuzzy logic, and so forth (Bosc & Pivert, 1992, 1995; Bosc, Buckles, Petry, & Pivert, 1999; Dubois & Prade, 1997; Galindo, Medina, Cubero, & García, 2000; Goncalves & Tineo, 2003, 2005; Kacprzyk, Zadrozny, & Ziolkowski, 1989; Ribeiro & Moreira, 1999; Tahani, 1977; Takahashi, 1991, 1995; Tineo, 2000). In the chapter of this book by Urrutia et al., a review of two extensions of SQL, namely FSQL (Galindo, Urrutia, & Piattini, 2006) and SQLf (Bosc & Pivert, 1995), is presented. The second line of research uses fuzzy or possibilistic elements for developing a fuzzy database model to manage imprecise and vague data (Bosc & Prade, 1994; Umano & Fukami, 1994). Also in this case, querying constitutes an important element of the model (Baldwin, Coyne, & Martin, 1993; Bosc & Pivert, 1997a, 1997b; Buckles & Petry, 1985; Buckles, Petry, & Sachar, 1986; Galindo, Medina, & Aranda, 1999; Galindo, Medina, Pons, & Cubero, 1998; Galindo et al., 2006; Prade & Testemale, 1984, 1987; Shenoi, Melton, & Fan, 1990). For a description of a fuzzy extension of SQL that works on crisp and fuzzy relations, see Galindo et al. (2006). However, these proposals missed addressing some practical and not negligible aspects related to the effective usage of the flexible query language.
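Both points can be sketched together: a membership function "precisiates" a linguistic term, and the matching degrees it yields induce a ranking of the answers. The term 'cheap' and its breakpoints below are purely illustrative:

```python
def cheap(price, full_until=60.0, zero_from=120.0):
    # membership of "cheap": 1 up to full_until, 0 from zero_from, linear in between
    if price <= full_until:
        return 1.0
    if price >= zero_from:
        return 0.0
    return (zero_from - price) / (zero_from - full_until)

hotels = [("H1", 55.0), ("H2", 90.0), ("H3", 150.0)]
ranked = sorted(((name, cheap(price)) for name, price in hotels), key=lambda t: -t[1])
print(ranked)  # [('H1', 1.0), ('H2', 0.5), ('H3', 0.0)]
```

A crisp condition such as price < 100 would accept H2 and H1 indistinguishably and reject H3 abruptly; the graded version orders them instead.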
Mainly, they do not exploit one of the main features offered by fuzzy set modeling, that is, the possibility to “precisiate” the semantics of linguistic terms (Zadeh, 1999) used in the flexible queries, thus not making users capable of having
full control of the semantics of the soft conditions they use. In Bordogna and Psaila (2004, 2005) and Galindo et al. (2006), attempts have been made in this direction. However, these proposals do not focus on the need for specific tools to define context-dependent linguistic predicate semantics, so as to adapt the query language to the application context. Let us think about the many interpretations of the term close when used in the context of a spatial database: it can vary depending on the scale of the map (close on a map with a scale 1:1000 vs. close on a scale 1:10.000), on the entities to which it is applied (close between countries vs. close between cities), or on the database itself (close in a cadastral database vs. close in an astronomical database). This consideration is valid for many terms such as cheap, which has a different meaning when buying a ticket for a theater performance or a ticket for a cinema. It may also depend on the intention of the query: cheap for a house to buy is different with respect to cheap for a house to rent. The context of usage of a linguistic term heavily determines its meaning. The proposals defined so far usually assume that linguistic predicates are somehow defined "a priori" outside the query language (usually an extension of the classical SQL SELECT command). Even when user-defined fuzzy predicates can be specified, like in SQLf, there are no specific commands in the query language itself to customize the meaning of the terms. Further, the meaning of the fuzzy predicates is fixed once and for all and cannot be modified depending on the context of its usage. This leads to SQL extensions that are hardly useful from a practical point of view, since they do not provide direct means to the users for explicitly changing or customizing the semantics of linguistic predicates according to the user needs and indications. The proposal in this chapter is about flexible querying in conventional relational databases.
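One way to obtain such context dependence, sketched here with invented numbers, is to make the term definition a function of the context, so that the same word 'close' denotes different membership functions at different map scales:

```python
def make_close(scale):
    """Build a 'close' membership function whose tolerance depends on the map scale."""
    limit = scale / 100.0  # hypothetical rule: fully close up to scale/100 metres
    def close(distance):
        if distance <= limit:
            return 1.0
        if distance >= 2 * limit:
            return 0.0
        return (2 * limit - distance) / limit
    return close

close_at_1000 = make_close(1000)    # fully close up to 10 m
close_at_10000 = make_close(10000)  # fully close up to 100 m
print(close_at_1000(50), close_at_10000(50))  # 0.0 1.0
```

The same distance of 50 m is "not close at all" on the detailed map but "fully close" on the coarse one; a CREATE TERM-SET-style command would let the user declare such parameterizations instead of hard-coding them.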
We addressed practical issues to define an effective flexible query language, specifically:

• To extend SQL so as to allow users to define the semantics of their linguistic predicates;
• To allow a contextualization of the meaning of the defined linguistic terms, so as to automatically modify their semantics depending on the context;
• Finally, to allow the specification of flexible queries by extending the SQL language, drawing on the experience of previous proposals and in particular of SQLf (Bosc & Pivert, 1995).

These objectives are achieved by defining Soft-SQL, an extension of the SQL language for customized flexible querying in classic relational databases. Its main and distinguishing features are the following:

• Queries operate on standard relations and produce standard relations as a result. The attribute "membership degree" of a tuple, which can be used to rank the items reflecting their degree of satisfaction of the query conditions, is dealt with as any other attribute; this allows closure to be achieved;
• Specific commands are provided to define sets of linguistic values (the command named CREATE TERM-SET) and sets of linguistic quantifiers for groups of tuples and complex selection conditions (the command named CREATE LINGUISTIC QUANTIFIER), together with their semantics;
• Furthermore, mechanisms allowing the easy and transparent contextualization of the linguistic values are defined. The SELECT command is extended in order to support context-dependent soft predicates based on the user-defined term sets, to modify the semantics of grouping and aggregation based on basic and user-defined quantifiers, and, finally, to effectively combine some or all the clauses in the SELECT command to achieve a high degree of flexibility and effectiveness.
Examples of usage of these commands will be provided.
Background of the Proposed Soft-SQL

The starting point of our Soft-SQL is the approach defined for the extension of SQL in the conventional relational data model within fuzzy set theory, named SQLf, based on soft conditions on attribute values (Bosc & Pivert, 1995). In SQLf, a soft condition is expressed by means of a linguistic predicate represented by a fuzzy set. SQLf has been defined by extending the relational algebra so as to operate on fuzzy relations. The introduction of soft conditions in SQL and the relational database model is achieved by generalizing a relation r, defined as a subset of D = D1 × D2 × ... × Dn, to a fuzzy relation rf, defined as a fuzzy subset of D; that is, each tuple d of rf is associated with a membership degree µrf(d) in [0,1]. µrf(d) is interpreted as the degree of satisfaction of the linguistic predicates in the query. In order to satisfy the closure property, a regular relation can be seen as a kind of fuzzy relation with all the tuples having the same membership degree equal to 1. For their formal definition, refer to Bosc and Pivert (1995). This is the first difference with respect to our proposal of Soft-SQL, which works on regular relations. A basic block SQLf query first allows specifying a regulation mechanism over the query result in order to control the number of desired items (tuples). This can be done by specifying the maximum desired number N of tuples, a minimum threshold T that each tuple's membership degree must exceed to be included in the result, or both of these values. Further, the basic SQLf (Bosc & Pivert, 1995) query allows specifying soft conditions in the WHERE clause as follows:

select [N | T | N, T] (attributes)
from (relations)
where (fuzzy condition);
in which (fuzzy condition) can combine fuzzy and Boolean basic conditions, linked by connectors (AND, OR, or even a linguistic
quantifier such as most). The use of fuzzy quantifiers to define compound selection conditions has also been proposed in other fuzzy extensions of the SQL language (Galindo et al., 2000; Kacprzyk & Zadrozny, 1997; Kacprzyk & Ziolkowski, 1986; Tineo, 2000). Fuzzy joins are also possible, allowing multiblock queries. In most applications, the membership functions of linguistic predicates such as big and cheap are defined by trapezoidal functions. These definitions are coded in the application. Compound soft conditions can be expressed in the form of logical expressions of elementary conditions, for example, "big AND cheap" or "cheap AND close to the center," which are represented by fuzzy set operations, or by elementary conditions aggregated by a linguistic quantifier, such as "most of (cheap, close to the center)." To illustrate, consider table FLAT, which describes flats in cities and their properties:

FLAT(id: Integer, NumberOfRooms: Integer, City: String, Inhabitants: Integer, DistanceFromCenter: float, Price: float)
An example of an SQLf query on the relation FLAT is:

SELECT 5, 0.6 C.Id
FROM FLAT AS C
WHERE most of (C.Price IS cheap, NumberOfRooms IS big, C.DistanceFromCenter IS close to the center);
in which the soft conditions are expressed by “IS linguistic term.” It imposes the evaluation of the degree of satisfaction of the linguistic predicates, for example, IS cheap, by the values of the attribute C.Price of relation FLAT. The intermediate fuzzy relation resulting from the evaluation of the soft condition is projected on the attribute C.Id and the best five tuples over the threshold 0.6 are returned to the user as a result. The linguistic quantifier most of is evaluated once all the
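This calibration step (best N tuples above a threshold) can be sketched in a few lines. The helper name `calibrate` and the flat identifiers are hypothetical; this is an illustration of the SQLf regulation mechanism, not its actual implementation:

```python
def calibrate(rows, n=None, threshold=None):
    """SQLf-style result calibration (sketch): keep tuples whose membership
    degree passes the optional threshold, rank them by decreasing degree,
    and optionally keep only the best n. `rows` is a list of (tuple, degree)."""
    kept = [(t, d) for t, d in rows
            if d > 0 and (threshold is None or d >= threshold)]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if n is None else kept[:n]

# Hypothetical flat identifiers with their membership degrees:
flats = [("f1", 0.9), ("f2", 0.55), ("f3", 0.7), ("f4", 0.65)]
best = calibrate(flats, n=2, threshold=0.6)  # the two best tuples over 0.6
```

Here `SELECT 5, 0.6 ...` corresponds to `n=2` being replaced by `n=5` and `threshold=0.6`: first discard tuples below the threshold, then keep the best ones.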
single soft conditions in parentheses have been evaluated; it aggregates their degrees of satisfaction to produce an overall degree that is returned as the ranking degree of the tuple. Notice that in this fuzzy extension of SQL, linguistic values such as cheap are specified independently of the context to which they are applied. This can create problems, since the semantics of some linguistic values might depend on the context; for example, the notion of a cheap flat in Milan is not the same as in Paris or New York. Further, users cannot explicitly define the semantics of the linguistic values, since no SQLf command is defined specifically for this purpose. This is a severe limitation of SQLf that makes it inadequate from a practical point of view: as will be discussed in the next section, users may want to control and modify the semantics of linguistic predicates when querying a database. In SQLf, it is also possible to express queries which work on sets of tuples using the GROUP BY and HAVING clauses. This type of query allows expressing queries involving aggregate functions (MIN, MAX, SUM, etc.). In SQL, each partition groups tuples that have the same value on the grouping attribute(s). This functionality is retained in SQLf, where the HAVING clause can be used along with a fuzzy set condition aimed at the selection of partitions. In this respect, various conditions can be formulated, from simple conditions involving aggregate functions to more complex ones involving fuzzy quantifiers (Tineo, 2000). So, for instance, an SQLf query such as the following one can be specified: it looks for the cities with few inhabitants, such that most of the flats in the city are cheap.

SELECT C.City
FROM FLAT AS C
WHERE C.Inhabitants IS few
GROUP BY C.City
HAVING most of C.Price IS cheap;
Also in this case, the semantics of the fuzzy quantifier most of is defined "a priori" in SQLf. This query selects the small cities having most of their flats with cheap prices. One could also ask for flats in the same city having an average price of around 100000 euros, for example, by replacing the last two rows of the previous query with the following ones:

GROUP BY C.City
HAVING avg(C.Price) ≈ 100000;
The basic block SQLf query can also be nested so as to generate queries with an arbitrary level of nesting.
Main Focus of the Chapter

Distinguishing Characteristics of Soft-SQL

In this section, we present our ideas for the definition of a flexible and customizable query language for a conventional relational database management system. We incorporate flexibility in SQL by assuming the extensions of Bosc and Pivert (1995) at the basis of the definition of SQLf and by taking into account issues related to practical aspects of its usage, specifically, the need to provide users with full control over the semantics of their queries, depending on the application and the query context. This last characteristic is mandatory when considering a database in which the semantics of linguistic terms such as cheap, high, far, big, and small refer to different attributes of distinct relations. To clarify, let us consider the terms cheap and close, and the different situations in which their meanings can change.

• Different databases. The meaning of cheap can change if we change the database, because the application context for which the database has been defined is different. For example, cheap buildings has a different semantics when referring to buildings in a cadastral database and when referring to buildings in an estate agency database.
• Within the same database. Within the same database, it is possible to have different meanings for the same linguistic terms, depending on the semantics of the tuples to select. For example, cheap for flats has a different meaning than cheap for villas, although both data sets are stored in the same database, in distinct relations.
• Current selected tuples. The semantics of a linguistic term can be influenced by the currently selected tuples. If one has selected flats in a small city like Bergamo (northern Italy) and specifies a further soft selection condition close to the city center or cheap, the interpretation of the linguistic constraints is likely to be different than in the case in which one is formulating the same selection on flats located in a big and expensive city like Milan. A flat that is 4 km from the center of Milan can be considered close to the center, while 4 km from the center of Bergamo can be considered not completely close. A cheap flat in Milan is likely to be considered very expensive in Bergamo. Further, one user may have in mind to drive from the flat to the center while another may consider walking, and these subjective settings of the human mind also influence the interpretation of closeness. Consequently, close and cheap can be interpreted as relative soft conditions, whose interpretation varies depending on the scope for which the query is formulated. We represent this concept of "query scope" by a parameter that we hereafter name the zooming factor. This notion of zooming factor is intuitive in geographic information systems (GIS), where it indicates the scaling factor of a map visualized on the screen. The higher the zooming factor, the stricter the interpretation of closeness. We can generalize this concept to any linguistic term by taking care of the fact that the zooming factor parameter affects the semantics of the linguistic value so as to make it stricter as the zooming factor increases. So, if we want to modify the interpretation of cheap to reflect the fact that in a small city like Bergamo it is stricter than in a big city like Milan, we can associate Bergamo with a higher zooming factor than Milan. The idea is to derive the zooming factor automatically from the actual values of another attribute of the tuples. In the case of the example, we could compute the zooming factor by applying a function (for example, the average) to the values of the attribute inhabitants of the city, so that a small city would have a lower average population than a big city.

While the first two situations are modeled in FSQL (Galindo et al., 2006), the third one is not considered. In Soft-SQL, preferences on selection conditions are represented by soft conditions as in SQLf, but in order to support customizable context-dependent soft conditions, we designed the language having in mind the following guidelines:

• The semantics of the soft selection conditions must be formalized depending on the context.
• Soft selection conditions should be easily customizable; the language must provide some way to define the semantics of linguistic predicates.
• The closure property of the SQL SELECT command must be strictly preserved; in other words, a SELECT statement takes relations as input and generates a relation as output. No special meaning is attributed to the membership value of a tuple; it is dealt with just as any other attribute.
Soft-SQL allows the user to specify the context-dependent semantics of linguistic predicates by means of two commands: CREATE TERM-SET, for defining the semantics of linguistic values used to specify simple soft conditions, and CREATE LINGUISTIC QUANTIFIER, for defining the semantics of linguistic quantifiers used to specify compound soft conditions and also fuzzy aggregation functions. These user-defined linguistic predicates can be specified at distinct levels in Soft-SQL queries on classic relational databases: in the extended basic SQL SELECT command, in the Soft-SQL GROUP BY clause, in the extended SQL HAVING clause, and in the extended aggregate functions (such as the Soft-SQL COUNT). This way, the user has full control of the flexible query language, being able to fully customize the query; in addition, the user can use a linguistic value with distinct meanings in the same application, depending on the chosen reference attribute and query scope. In Soft-SQL, like in SQLf, soft conditions are expressed through linguistic predicates identifying fuzzy subsets of the attribute domains and are specified in the WHERE clause of the extended SQL query. Differently from SQLf, we do not produce fuzzy relations as results of queries, but ordinary relations. This way, the membership degree of a tuple is dealt with as any other attribute of a tuple. Besides, the context of the linguistic predicates is also specified in the soft condition, so that it is possible to choose the proper interpretation of the linguistic value. In the following, we first introduce the definition of the command to define linguistic values and customize their semantics. Then, we introduce the command to define linguistic quantifiers. Finally, we define the SELECT query command.
Customized Linguistic Predicates

Suppose the user wishes to query the database based on a linguistic concept, for example, "the price is cheap." The main problem that arises is: how is it possible to define the semantics of the linguistic concept "cheap" for prices? In this case, the key to flexibility is the possibility of defining linguistic concepts appropriate for the specific application context. Then, consider the case of storing data about flats for sale in the database. Suppose the user wishes to define a set of linguistic terms for price levels, such as "expensive" and "cheap." The semantics of a linguistic term might be defined by a trapezoidal function, normalized within the range [0, 1] (see Figure 1a). When defining the linguistic terms, we can consider that (in Europe and North America) price levels might be considered between 0 and 1 million euros. Soft-SQL provides commands for defining term sets and linguistic predicates. The term set named PriceLevels, following the previous considerations, can be defined as follows:

CREATE TERM-SET PriceLevels
NORMALIZED WITHIN (0, 1000000)
EVALUATE Price WITH PARAMS Price AS FLOAT,
VALUES ('expensive', (0.6, 0.7, 0.85, 1), *)
       ('cheap', (0, 0, 0.2, 0.4), *);
Figure 1. Trapezoidal membership function of the linguistic term cheap: (a) cheap is defined on the unit interval; (b) cheap is rescaled on the absolute domain [0, 1 million €].
The NORMALIZED WITHIN clause defines the normalization range; in this term set, the evaluation range is normalized between 0 and 1000000 euro; values less than 0 are treated as 0, while values greater than 1000000 are treated as 1000000. The EVALUATE clause specifies the type of the parameter that, after normalization, is subjected to the soft condition specified by the defined linguistic terms. In this case, the type of the argument of the soft condition is a floating point value named Price. Finally, the two linguistic terms “expensive” and “cheap” are defined: the name is followed by four values (in the range [0, 1]) that are the x coordinates of, respectively, the bottom left side corner, the top left side corner, the top right side corner, and the bottom right side corner of the trapezoidal function (see the trapezoidal function associated with cheap in Figure 1 and the section titled Definition of the CREATE TERM-SET Command for more details). A linguistic term is exploited to query tables by specifying a soft predicate in conditions through the SELECT command (in the WHERE and HAVING clauses). For instance, suppose the user wishes to query table Flat in order to find cheap flats. The WHERE clause might be the following: WHERE Price IS ’cheap’ IN PriceLevels
The new IS ... IN operator allows specifying soft predicates. The condition reported above says that values of the attribute Price are checked against the trapezoidal function defined for the linguistic value 'cheap' in the term set PriceLevels. Suppose a flat's price is 700000 euros: based on the definition of the linguistic term (the function is depicted in Figure 1) and on the normalization range, the degree of satisfaction is 0, so the flat is not cheap. If the price is 100000 euros, the satisfaction degree is 1, so the flat is truly cheap. If the price is 300000 euros, the satisfaction degree is 0.5, so the flat is partially cheap. However, the same linguistic terms might be exploited for a more sophisticated evaluation mechanism. For instance, suppose the user looking for flats wants to compare
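The degrees just mentioned follow from straightforward trapezoidal arithmetic. The sketch below is an illustrative reconstruction (not the chapter's implementation): it clamps a raw value to the normalization range, rescales it into [0, 1], and evaluates the trapezoid defined for cheap.

```python
def trapezoid(x, a, b, c, d):
    """Degree of membership of x in the trapezoid with corners (a, b, c, d):
    rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0

def term_degree(value, lo, hi, corners):
    """Clamp value to the normalization range [lo, hi], rescale to [0, 1],
    then evaluate the linguistic term's trapezoidal function."""
    clamped = min(max(value, lo), hi)
    return trapezoid((clamped - lo) / (hi - lo), *corners)

CHEAP = (0.0, 0.0, 0.2, 0.4)  # 'cheap' from the PriceLevels term set
```

With the (0, 1000000) normalization range this reproduces the degrees discussed in the text: 700000 → 0, 100000 → 1, 300000 → 0.5.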
the difference between the price and a base level in order to know whether the difference in price is cheap or expensive. In practice, a function f2(price, base) = price − base can be defined; the value provided by this function is normalized and then checked against the trapezoidal function associated with the linguistic values. Thus, the EVALUATE clause actually defines one or more evaluation functions, as in the following enriched definition of the term set PriceLevels:

CREATE TERM-SET PriceLevels
NORMALIZED WITHIN (0, 1000000)
EVALUATE Price WITH PARAMS Price AS FLOAT
EVALUATE (price - base) WITH PARAMS price AS FLOAT, base AS FLOAT,
VALUES ('expensive', (0.6, 0.8, 1, 1), *)
       ('cheap', (0, 0, 0.2, 0.4), *);
Two evaluation functions are defined: the first one is simple; the value of the parameter Price is evaluated as it is against the trapezoidal function. The second evaluation function is based on two parameters (price and base), and the difference between them is normalized and evaluated. When the term set is exploited in queries (within the SELECT command by means of the IS ... IN operator), the system checks the number and the type of the parameters, determining which evaluation function to apply. The "*" in the definition of the linguistic value semantics is associated with a modifier function (in the specific case of "*", a product; two other modifier functions are possible, "-" and "+", which define a left or right translation of the trapezoidal function on its domain) and is used to specify, in the basic block Soft-SQL query, that the semantics of the linguistic value must be made dependent on the context, that is, on the zooming factor (a detailed description of modifiers is in the section titled Definition of the CREATE TERM-SET Command).
For example, the following WHERE clause exploits the second evaluation function in order to obtain flats for which it is necessary to add a cheap amount of money w.r.t. the base price of 300000 euros.

WHERE (Price, 300000) IS 'cheap' IN PriceLevels
The system matches the pair (Price, 300000) with the evaluation functions defined in the term set and applies the one (if defined) compatible with the pair.
Soft-SQL Basic Queries and Term Sets

Consider now a simple, but complete, query written by means of the extended SELECT command. We want to select cheap flats in Rome.

SELECT Id, Price
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels;
The WHERE clause now specifies a compound soft condition in which a crisp predicate taking values in {0,1} is conjoined with a soft predicate taking values in [0,1]. Based on the fuzzy AND, evaluated as the minimum of the two membership degrees (the maximum, in case of disjunction), we compute the membership value of tuples; then, only tuples having a membership degree greater than 0 are selected; finally, these tuples are projected on the attributes Id and Price. Thus, the result of the query is the set of flats that are in Rome and are cheap (fully or partially). This way, the query is flexible: the user obtains not only flats with price less than or equal to 200000 euros, but also flats with prices such as 250000 or 350000 euros: they are not exactly cheap, but their price is still close to being cheap as far as the concept of a cheap flat in a city goes, and the user might perhaps find them interesting. However, the membership degree of selected tuples, which might
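Under this min/max semantics, the combination of crisp and soft predicates can be sketched as follows (illustrative only; the literal values stand in for the FLAT example):

```python
# Fuzzy connectives under the standard min/max semantics.
def fuzzy_and(*degrees):
    return min(degrees)

def fuzzy_or(*degrees):
    return max(degrees)

def crisp(condition):
    """A crisp predicate yields a degree in {0, 1}."""
    return 1.0 if condition else 0.0

# A flat in Rome whose price is cheap to degree 0.5:
degree_in_rome = fuzzy_and(crisp("Rome" == "Rome"), 0.5)
# A flat outside Rome is rejected regardless of its price degree:
degree_elsewhere = fuzzy_and(crisp("Milan" == "Rome"), 0.9)
```

A crisp predicate thus acts as a hard filter (degree 0 eliminates the tuple), while the soft predicate grades the surviving tuples.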
add useful information, is not produced by the previous query, and tuples are not ordered with respect to their membership degree as occurs in SQLf. In Soft-SQL, we followed the approach that queries generate classical relational tables. So, if the user wants to obtain the membership degree of tuples, the user should obtain this value as a classical attribute, by using a special keyword, as shown in the following query.

SELECT Id, Price, DEGREE AS D
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels;
The keyword DEGREE refers to the membership degree of tuples. Consequently, the generated table contains three attributes, that is, the flat identifier, the flat price, and the degree of satisfaction (attribute D) of the selection condition. For example, a flat with price 300000 euros has 0.5 as membership degree, meaning that the price is not exactly cheap, but still close to being considered cheap. The membership degree is an important measure and can be exploited to better select tuples: in effect, it can be used to rank tuples, denoting how much they satisfy the selection conditions. To this end, the ORDER BY clause can be exploited, as the following query does, taking, for instance, the five best results.

SELECT TOP 5 Id, Price, DEGREE AS D
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels
ORDER BY DEGREE DESC;
Observe that, w.r.t. the syntax of SQLf, the query is based on a specific keyword, the TOP keyword. When TOP n is specified, the query takes the first n sorted tuples in the result table. The TOP keyword is general and not specifically designed to deal with membership degrees. Therefore, no special parameters must be added to the query, since ORDER BY and TOP operate on crisp relations as well.
Let us now formulate a query in which we want to modify the semantics of cheap depending on the context: given that Rome is a big city, we want to dilate the definition of cheap with respect to its standard definition, so as to be able to consider as cheap also prices for flats that are commonly not considered cheap. This can be done by specifying the optional parameter ZOOMING as follows:

SELECT Id, Price, DEGREE AS D
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels ZOOMING 0.5;
Having specified a factor of ZOOMING=0.5, together with the modifier "*" in the definition of the linguistic value semantics, we indicate that the actual price must be multiplied by 0.5 before evaluating its satisfaction of the soft condition cheap. So, if in the common case a price of 300000 euros is cheap to a degree of 0.5, with ZOOMING=0.5 we can say that a price of 600000 euros is still cheap to the degree 0.5. Conversely, by specifying ZOOMING=2, we restrict the concept of cheap, so that a price of 150000 is cheap only to the degree 0.5. Finally, it is possible to select only tuples having a membership degree greater than or equal to a specified threshold by means of the DEGREE THRESHOLD subclause, as shown in the following query.

SELECT Id, Price, DEGREE AS D
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels ZOOMING 0.5
DEGREE THRESHOLD 0.8;
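With the "*" modifier, zooming simply rescales the raw value before the normalized trapezoid is evaluated. The following sketch is an illustrative reconstruction under that assumption (the "-" and "+" modifiers, which translate the trapezoid instead, are omitted):

```python
def trapezoid(x, a, b, c, d):
    # Trapezoidal membership function on the normalized domain [0, 1].
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0

def zoomed_degree(value, zoom, lo, hi, corners):
    """Apply the "*" modifier: multiply the raw value by the zooming
    factor, clamp to [lo, hi], normalize, and evaluate the trapezoid.
    zoom > 1 makes the term stricter; zoom < 1 dilates it."""
    v = min(max(value * zoom, lo), hi)
    return trapezoid((v - lo) / (hi - lo), *corners)

CHEAP = (0.0, 0.0, 0.2, 0.4)  # 'cheap' from PriceLevels
```

This reproduces the text's figures: with ZOOMING 0.5, a price of 600000 euros is cheap to degree 0.5; with ZOOMING 2, a price of 150000 euros is cheap only to degree 0.5.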
After the selection condition is evaluated (and a membership degree is associated with the tuples), only tuples having membership greater than or equal to 0.8 are selected. For reader information, FSQL (Galindo et al., 2006) allows specifying threshold degrees for single conditions or groups
of conditions. We prefer to apply the threshold degree to a tuple when the whole condition has been evaluated, in order to obtain a more natural and intuitive extension of the classical SQL.
Linguistic Quantifiers

Suppose now the user wants to select tuples by evaluating their membership degree in a more flexible way; for instance, the user might want to select flats for which most of the following three conditions C1, C2, and C3 are satisfied:

C1: the flat is cheap;
C2: the flat is big (in terms of number of rooms);
C3: the flat is close to the center.

The concept of quantified condition (most of a set of conditions are satisfied) is also available in SQLf and is based on the concept of linguistic quantifier. Thus, to improve flexibility, the user should be provided with a command to define linguistic quantifiers. We therefore introduced the CREATE LINGUISTIC QUANTIFIER command, which allows defining relative linguistic quantifiers as defined by Zadeh (1983).

CREATE LINGUISTIC QUANTIFIER most VALUES (0.45, 0.65, 1, 1);
CREATE LINGUISTIC QUANTIFIER almost_all VALUES (0.9, 1, 1, 1);
The two instructions above define two quantifiers, named most and almost_all, respectively. The four values following the VALUES keyword define, as for linguistic values in term sets, a trapezoidal membership function µquantifier normalized within the range [0, 1] that represents the semantics of the quantifier (for instance, the function µalmost_all is the membership function for the quantifier almost_all). We rely on the OWA semantics introduced by Yager (1988) for the evaluation of quantified soft conditions in the WHERE clause. This choice is motivated by the fact that in this context the quantifier is used to aggregate conditions, and thus this definition is more adequate than Zadeh's definition. To derive the weighting vector W = [w1, ..., wn] of the OWA operator, given the membership function µQ of the relative nondecreasing quantifier defined by the CREATE LINGUISTIC QUANTIFIER command and n, the number of soft conditions to aggregate, we compute the following (Yager, 1994):

wi = µQ(i/n) − µQ((i−1)/n), with i = 1, …, n

Then we apply the OWA operator defined by the weighting vector W to the degrees of satisfaction µc1(t), …, µcn(t) of the soft conditions c1, …, cn by each tuple t:

OWAQ(µc1(t), …, µcn(t)) = ∑i=1,…,n wi · bi

with bi being the i-th greatest of the set µc1(t), …, µcn(t). As in SQLf, quantifiers can be exploited in queries. The question on which we based the above example is then expressed by means of the following query:

SELECT Id, Price, DEGREE AS D
FROM FLAT
WHERE City='Rome' AND QUANTIFIED most (
    Price IS 'Cheap' IN PriceLevels ZOOMING 0.5,
    NumberOfRooms IS 'Big' IN RoomNumbers ZOOMING 2,
    DistanceFromCenter IS 'close' IN CityDistances ZOOMING 0.5)
ORDER BY DEGREE DESC;
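The weight derivation and aggregation can be sketched as follows (an illustrative reconstruction; most uses the trapezoid (0.45, 0.65, 1, 1) defined above):

```python
def trapezoid(x, a, b, c, d):
    # Trapezoidal membership function on [0, 1].
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0

def owa(quantifier, degrees):
    """Yager's OWA aggregation guided by a relative nondecreasing
    quantifier: w_i = uQ(i/n) - uQ((i-1)/n), applied to the degrees
    sorted in decreasing order."""
    n = len(degrees)
    weights = [trapezoid(i / n, *quantifier) - trapezoid((i - 1) / n, *quantifier)
               for i in range(1, n + 1)]
    ranked = sorted(degrees, reverse=True)
    return sum(w * b for w, b in zip(weights, ranked))

MOST = (0.45, 0.65, 1.0, 1.0)
```

For three conditions, µQ(1/3) = 0 and µQ(2/3) = 1, so W = [0, 1, 0]: with this definition of most, the aggregate is simply the second-highest satisfaction degree.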
The QUANTIFIED predicate evaluates the most quantifier on the three conditions listed in parentheses. Further, the QUANTIFIED predicate can be freely composed with other predicates. Observe that, since we are interested in flats in a big and expensive city such as Rome, we have modified the standard definitions of the linguistic values to fit the specific context: specifically, we have dilated the semantics of cheap for prices, made stricter the definition of big for flats, and dilated the definition of close.
Groups

Soft-SQL extends the GROUP BY clause in order to cope with membership degrees of tuples. The question is: what happens when tuples are grouped together? What is the membership degree of the overall group? Soft-SQL provides different semantics, named SAFE, OPTIMISTIC, and AVERAGE. The SAFE semantics assigns a group the minimum membership degree of its tuples; the OPTIMISTIC semantics assigns a group the maximum membership degree of its tuples; the AVERAGE semantics assigns a group the average of the membership degrees of its tuples. The different semantics give different relevance to groups. The SAFE semantics behaves as a conjunction and can be used to obtain a strict evaluation of groups, based on the worst representative. For instance, consider the following query (notice the WITH SAFE DEGREE option).

SELECT City, DEGREE AS D
FROM FLAT
WHERE Price IS 'Cheap' IN PriceLevels DEGREE THRESHOLD 0.8
GROUP BY City WITH SAFE DEGREE;
This can also be expressed in SQLf. Given a city, the minimum degree among the tuples describing flats in that city is taken as the degree of the overall group; consequently, if the degree of one city (let us denote it as city1) is 0.85 and the degree of a second city (city2) is 0.95, this means that all flats in city2 have a membership degree of at least 0.95; the user might find city2 more interesting than city1, since flats available in city2 are generally cheaper than flats available in city1. If we change the GROUP BY clause of the previous query to:

GROUP BY City WITH OPTIMISTIC DEGREE
we adopt the optimistic semantics: in this case, the membership degree of the group is the membership degree of the best representative of the group. Consider again the two sample cities city1 and city2: with the optimistic semantics we might obtain, for instance, a membership degree of 1 for city1 and of 0.98 for city2. This means that city1 has at least one fully cheap flat, while city2 does not have a fully cheap flat. Finally, the AVERAGE semantics considers the average membership degree of the tuples in a group as the membership degree of the overall group. This is very useful to evaluate the average strength of a group. Observe that this semantics corresponds to the notion of (relative) cardinality of a fuzzy set. Nevertheless, one could also exploit a linguistic quantifier such as most to compute the semantics of the GROUP BY clause based on a trade-off between a risk-taking and a risk-averse attitude, such as:

GROUP BY City WITH most DEGREE
in which most is the linguistic quantifier previously defined. The semantics of these Soft-SQL queries can also be expressed in SQLf. The difference with respect to SQLf is that, in the context of the GROUP BY, we do not evaluate the linguistic quantifier based on the OWA operator, since it is too costly, given that each group can generally contain a large number of tuples. We adopt the OWA definition of linguistic quantifiers in the context of the QUANTIFIED predicates, where the number of soft condition satisfaction degrees to be aggregated by the OWA operator is limited. In contrast, in the GROUP BY, it is more intuitive to directly use the definition of linguistic quantifiers given by Zadeh (1983) and to evaluate the trapezoidal membership function µQ associated with the quantifier Q on the numeric fuzzy cardinality of each group: the average membership degree Σ of the tuples in the group is computed (Σ is the relative cardinality of the fuzzy set); then, the overall membership degree of the group is given by µQ(Σ).
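The four group-degree semantics can be sketched in a few lines (illustrative only; the quantifier branch follows Zadeh's evaluation, i.e., µQ applied to the mean degree of the group):

```python
def trapezoid(x, a, b, c, d):
    # Trapezoidal membership function on [0, 1].
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0

def group_degree(degrees, semantics="SAFE", quantifier=None):
    """Membership degree of a group of tuples under Soft-SQL's
    SAFE / OPTIMISTIC / AVERAGE / quantifier semantics (sketch)."""
    if semantics == "SAFE":
        return min(degrees)             # worst representative
    if semantics == "OPTIMISTIC":
        return max(degrees)             # best representative
    mean = sum(degrees) / len(degrees)  # relative fuzzy cardinality
    if semantics == "AVERAGE":
        return mean
    return trapezoid(mean, *quantifier)  # Zadeh: uQ(mean)

MOST = (0.45, 0.65, 1.0, 1.0)
```

For example, a group with degrees (0.2, 0.6, 0.7) has mean 0.5, so under WITH most DEGREE its overall degree is µmost(0.5) = 0.25.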
HAVING Clause

Soft-SQL redefines the semantics of the HAVING clause along the lines of SQLf. Similarly to the WHERE clause, we might have soft predicates based on linguistic predicates (by means of the IS ... IN operator); furthermore, the membership degree of a group before group selection might not be 1. In this case, the membership degree of a group, after group selection, is the minimum value between the original membership degree and the one obtained by evaluating the having condition. In fact, the HAVING clause plays the role of a further selection of groups: after grouping, groups are further evaluated and selected based on the clause; consequently, it is intuitive to take the minimum membership degree between the original one and the one obtained by the condition (like an AND). Again, as a straightforward extension, we allow the user to specify a membership degree threshold for groups (similarly to the WHERE clause): if no minimum threshold is specified, groups with a membership degree greater than 0 are selected; otherwise, groups with a membership degree greater than or equal to the specified threshold are selected. This way, the HAVING clause is coherent with its role: it is a group selection condition. Consequently, since groups have a membership degree, both predicates based on the IS ... IN operator and on aggregate functions (see next section) can be expressed, and groups can be selected based on the specified minimum threshold for group membership degrees. The result is that the WHERE and HAVING clauses are fully orthogonal and can be freely composed to write complex queries.
Flexible Aggregate Functions

When defining Soft-SQL, we considered aggregate functions as well. What happens when aggregate functions are applied to a set of tuples with membership degrees? What is the membership degree of the aggregation? We found the answer in the different semantics introduced for groups: it is possible to choose whether to evaluate the membership degree by means of the SAFE, OPTIMISTIC, or AVERAGE semantics, or of whatever linguistic quantifier Q defined by the user. To illustrate, consider the following basic query, that selects cheap flats in Rome.

SELECT Id, Price, DEGREE AS D
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels;

If we want to count the number of retrieved flats, we may transform the query as follows by means of the COUNT aggregate function.

SELECT COUNT(*) AS C
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels;

But counted tuples may have a membership degree less than 1; thus, what is the membership degree of the overall set of tuples? By applying the SAFE semantics, the degree is the degree of the worst representative. In the specific case, a degree of 0.5 might say that the set of flats is not satisfactory, since at least one flat is not very cheap. By applying the OPTIMISTIC semantics, the degree is the degree of the best representative. In the specific case, the same set of tuples might have an optimistic degree of 1, meaning that the user is fully satisfied because there is at least one fully cheap flat in Rome. The AVERAGE semantics considers the average of the membership degrees as the degree of the aggregate set of tuples, that is, an average measure of the relevance of the set of tuples. In the specific case, the same set of tuples might have an average degree of 0.8, meaning that, on average, the found flats are quite cheap, or even fully cheap. Finally, in the most general case of the trade-off semantics defined by a user-defined quantifier Q, the degree of the counted tuples is computed as for the groups described in the previous section. In the specific case, the same set of tuples might have a degree of 0.6, meaning that Q of the found flats are quite cheap, or even fully cheap.

To denote the chosen semantics, the syntax of an aggregate function has been extended. Then, the following aggregate functions:

COUNT(* WITH SAFE DEGREE)
COUNT(* WITH OPTIMISTIC DEGREE)
COUNT(* WITH AVERAGE DEGREE)
COUNT(* WITH Q DEGREE)

obtain the number of tuples in the set of tuples with the associated SAFE, OPTIMISTIC, AVERAGE, and Q-quantified membership degree, respectively. The behavior of the other aggregate functions is straightforward; however, when aggregate functions consider specific values, only the degrees of tuples having those values are considered. For example:

COUNT(Price WITH SAFE DEGREE)
SUM(Price WITH SAFE DEGREE)
AVG(Price WITH SAFE DEGREE)

consider, for computing the overall membership degree, the degrees of tuples with a not null value for attribute Price. Furthermore, for computing the overall membership degree, the functions

MIN(Price WITH SAFE DEGREE)
MAX(Price WITH SAFE DEGREE)
consider only the degrees of the tuples having the minimum (respectively, maximum) value for attribute Price. This choice is coherent with the notion of aggregate function: because the overall set of tuples is represented by one single value that corresponds to the minimum (respectively, maximum) value, only the degrees of representative tuples are considered. This characteristic of Soft-SQL is novel and not present in previous extensions of SQL.
Table 1. Example of a relation

Id    DistanceFromCenter    DEGREE
101   2.5                   0.9
102   5.2                   0.8
103   2.5                   1.0
Examples

Suppose we have the set of tuples shown in Table 1, with membership degrees DEGREE, obtained after the application of a soft selection condition. Table 2 shows the set of aggregate functions and the returned values with membership degree µ. Observe that the degree for the MIN aggregate function is computed considering only the tuples having the minimum value, while for the COUNT function all tuples are considered. Also notice the meaning of the different degrees: the SAFE quantifier summarizes the worst satisfaction of the selection conditions, the OPTIMISTIC quantifier summarizes the best satisfaction (there is at least one tuple fully satisfying the selection conditions), and the AVERAGE quantifier shows the average behavior of the tuples w.r.t. the selection conditions. We could even specify a user-defined degree through a linguistic quantifier Q.
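The degrees in Table 2 can be reproduced mechanically from the Table 1 rows; a small Python sketch, assuming the representative-tuple rule described above for MIN:

```python
# Rows of Table 1: (Id, DistanceFromCenter, DEGREE)
rows = [(101, 2.5, 0.9), (102, 5.2, 0.8), (103, 2.5, 1.0)]

def combine(degrees, quantifier):
    """Degree of an aggregate value under the three basic quantifiers."""
    if quantifier == "SAFE":
        return min(degrees)               # worst representative
    if quantifier == "OPTIMISTIC":
        return max(degrees)               # best representative
    return sum(degrees) / len(degrees)    # AVERAGE

# COUNT(*): every selected tuple contributes its degree.
count_deg = [mu for _, _, mu in rows]
# MIN(DistanceFromCenter): only tuples holding the minimum value contribute.
m = min(dist for _, dist, _ in rows)
min_deg = [mu for _, dist, mu in rows if dist == m]

for q in ("SAFE", "AVERAGE", "OPTIMISTIC"):
    print(q, combine(min_deg, q), combine(count_deg, q))
```

With these rows, MIN yields 0.9 / 0.95 / 1.0 and COUNT yields 0.8 / 0.9 / 1.0, matching Table 2 (up to floating-point rounding).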
Examples

Suppose the user wants to know how many quite cheap flats there are in Rome (note the degree threshold 0.8, which captures the idea of a "quite cheap" flat).
SELECT COUNT(* WITH OPTIMISTIC DEGREE) AS items, DEGREE
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels ZOOMING 2
DEGREE THRESHOLD 0.8;
Since we are interested in understanding the goodness of the selected items, we specify the OPTIMISTIC quantifier: we obtain the maximum membership degree, which denotes the degree of the best selected item w.r.t. the selection condition. Suppose now we want to know the number of selected flats and the minimum distance from the center among tuples satisfying the query with a membership degree of at least 0.8.

SELECT COUNT(* WITH OPTIMISTIC DEGREE) AS items,
       MIN(DistanceFromCenter WITH SAFE DEGREE) AS MinDist,
       DEGREE WITH AVERAGE DEGREE
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels
DEGREE THRESHOLD 0.8;
W.r.t. the previous query, we added the aggregate function that computes the minimum distance from the center with the SAFE quantifier. This means that we consider the distance from the center to be an important parameter; thus, the strength of the overall set of selected tuples depends on
Table 2. Example of membership degrees for aggregate functions

Function                                          Value   µ
MIN(DistanceFromCenter WITH SAFE DEGREE)          2.5     0.9
MIN(DistanceFromCenter WITH AVERAGE DEGREE)       2.5     0.95
MIN(DistanceFromCenter WITH OPTIMISTIC DEGREE)    2.5     1.0
COUNT(* WITH SAFE DEGREE)                         3       0.8
COUNT(* WITH AVERAGE DEGREE)                      3       0.9
COUNT(* WITH OPTIMISTIC DEGREE)                   3       1.0
the minimum degree of tuples having the lowest distance. Then, we have to choose the final degree. In these situations, we can again decide which semantics to apply, that is, SAFE, OPTIMISTIC, or AVERAGE, because this situation can be seen as an aggregation as well. The chosen semantics is applied to the degrees of the aggregate functions appearing in the SELECT clause. In the example, we choose the AVERAGE degree, because we evaluate the strength of the set of selected tuples by combining both parameters. If the selected set of tuples were the one shown in the example in Table 1, the final membership degree would be 0.95, that is, the average between 1.0 (the COUNT function) and 0.9 (the MIN function). Flexible aggregation semantics can be exploited when the GROUP BY clause appears in the query as well. Consider the following query.

SELECT City
FROM FLAT
WHERE Price IS 'Cheap' IN PriceLevels
  AND DistanceFromCenter IS 'Close' IN Distances
GROUP BY City WITH SAFE DEGREE
HAVING COUNT(* WITH AVERAGE DEGREE) <= 10
DEGREE THRESHOLD 0.8;
The query selects all flats that are approximately cheap and close to the city center, independently of the city. Then, it groups the selected flats by city name. At this point, we are interested in cities in which all tuples (see the SAFE quantifier) satisfy the selection criteria, but the number of selected flats in each city must not exceed 10. We want a city to be represented by the average membership degree, and we keep cities with a membership degree of at least 0.8; this way, we obtain cities where it is possible to find a limited number of purchase proposals for flats, such that the average strength of these proposals is rather high. Notice that the membership degree of the HAVING clause is determined by the aggregate function: the COUNT function is associated with the average membership degree of tuples in a group; let us
suppose this membership degree is 0.9. Then, the group membership degree (obtained by applying the SAFE semantics after grouping), let us say 0.85, is combined with 0.9, and the new membership degree for the group is 0.85 (the minimum of the two). Thus, the group is selected (because 0.85 is greater than the specified minimum threshold). The reader can notice that the degree of groups and of tuples is directly affected by quantifiers, aggregations, and condition evaluations. In effect, we do not allow the use of the special DEGREE attribute in clauses other than the SELECT and ORDER BY clauses. But even in the SELECT clause, it cannot be used in expressions or aggregate functions, but only to generate the result table (this is why it is not possible to write MAX(DEGREE) in the HAVING and SELECT clauses). To better understand this point, and to illustrate the use of aggregate functions, consider the following query.

SELECT City,
       MAX(Price WITH SAFE DEGREE) AS P,
       MIN(Rooms WITH OPTIMISTIC DEGREE) AS R,
       DEGREE WITH AVERAGE DEGREE
FROM FLAT
WHERE Price IS 'Cheap' IN PriceLevels
  AND DistanceFromCenter IS 'Close' IN Distances
GROUP BY City WITH SAFE DEGREE
HAVING COUNT(* WITH AVERAGE DEGREE) <= 10
DEGREE THRESHOLD 0.8;
The query is obtained from the previous one by adding the two aggregate functions to the SELECT clause. Furthermore, the AVERAGE semantics is chosen for obtaining the final degree. Consider again the selected group, having 0.85 as membership degree. Suppose MAX(Price WITH SAFE DEGREE) has a membership degree of 0.95 (the minimum membership degree of tuples with the maximum value for attribute Price). Then, suppose MIN(Rooms WITH OPTIMISTIC DEGREE) has a membership degree of 0.9 (the maximum membership degree of tuples
having the minimum value for attribute Rooms). Computing the average of 0.85, 0.95, and 0.9, the new membership degree for the tuples generated in the result table is 0.9. In practice, this way it is possible to evaluate a combination of features, obtaining a membership degree (in the specific case based on the AVERAGE semantics) that considers them all together.
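The combination just illustrated is simple arithmetic over the group degree and the aggregate-function degrees; a sketch reproducing the numbers above (quantifier names as in Soft-SQL):

```python
def final_degree(group_mu, agg_mus, quantifier="AVERAGE"):
    """Combine a group's degree with the degrees of its aggregate functions."""
    degrees = [group_mu, *agg_mus]
    if quantifier == "SAFE":
        return min(degrees)
    if quantifier == "OPTIMISTIC":
        return max(degrees)
    return sum(degrees) / len(degrees)   # AVERAGE

# Group degree 0.85; MAX(Price WITH SAFE DEGREE) -> 0.95;
# MIN(Rooms WITH OPTIMISTIC DEGREE) -> 0.9; the AVERAGE of the three is 0.9.
print(final_degree(0.85, [0.95, 0.9]))
```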
Definition of the CREATE TERM-SET Command

Let us now define the CREATE TERM-SET command, that allows the definition of term sets with linguistic values and their semantics as trapezoidal functions.

CREATE TERM-SET name
[ NORMALIZED WITHIN ( min, max ) ]
( EVALUATE evaluation-function WITH PARAMS list-of-params )+
VALUES list-of-linguistic-value-definitions
• name is the name of the term set under definition.
• The VALUES clause defines the set of linguistic values within the term set, where list-of-linguistic-value-definitions is a non-empty list of linguistic-value-definition. Each linguistic-value-definition is a triple:

(linguistic-value, meaning, modifier)

• linguistic-value is a string identifying a linguistic value; for example, 'very-far'.
• meaning is a 4-tuple

(left-bottom-corner, left-top-corner, right-top-corner, right-bottom-corner)

of ordered values in the range [0,1] (it must be left-bottom-corner ≤ left-top-corner ≤ right-top-corner ≤ right-bottom-corner), and it defines the trapezoidal membership function µlv of the fuzzy set identified by linguistic-value. The domain of the membership function is normalized within the range [0,1]. For example, we can define meaning = (0.45, 0.75, 1, 1) for the linguistic value 'far'.
• modifier is the name of a modifier operator, that is, a string chosen in the predefined set {'*', '+', '-'}, and identifies a modifier operator that can be either a product or a translation operator applied to the values to be evaluated. It specifies how the values to be evaluated by the membership function µlv associated with linguistic-value must be modified by the current zooming factor x (optionally specified in the IS .. IN .. [ZOOMING x] operator) before being evaluated.
The same term set can be evaluated in the query (by the IS .. IN .. [ZOOMING x] operator) over different data types. This is allowed by the non-empty list (in the syntax denoted as “( … )+”) of EVALUATE clauses: each occurrence defines a specific evaluation function (in the syntax denoted as evaluation-function) to be computed, whose values are “compared” with the linguistic values. This way, depending on the data type of the expression to be evaluated against the linguistic value, the proper function is applied. Note that the WITH PARAMS subclause defines the list of formal parameters appearing in the expression. The possibly missing (in the syntax denoted as “[ … ]”) NORMALIZED WITHIN clause specifies how to normalize the value v obtained by evaluating the evaluation-function. If the NORMALIZED WITHIN clause is specified, the specified range [min, max] is mapped to the range [0,1], and v is mapped accordingly; if v is less than (greater than) min (max), it is always evaluated as min (max). If the NORMALIZED WITHIN clause is not present, by default it is min=0 and max=1.
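The NORMALIZED WITHIN behavior just described (clamp the evaluated value to [min, max], then rescale that range onto [0,1]) can be sketched as follows; the price range is a hypothetical example, not one from the chapter:

```python
def normalized_within(v, lo=0.0, hi=1.0):
    """Clamp v to [lo, hi], then map that range linearly onto [0, 1]."""
    v = max(lo, min(v, hi))
    return (v - lo) / (hi - lo)

# Hypothetical term set over prices, NORMALIZED WITHIN (0, 500000):
print(normalized_within(200000, 0, 500000))  # 0.4
print(normalized_within(600000, 0, 500000))  # 1.0 (clamped to max)
```

The defaults lo=0, hi=1 mirror the behavior when the NORMALIZED WITHIN clause is absent.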
Definition of the CREATE LINGUISTIC QUANTIFIER Command

Let us now define the CREATE LINGUISTIC QUANTIFIER command, that allows the definition of relative monotone nondecreasing linguistic quantifiers on the unit domain [0,1]. We do not allow absolute linguistic quantifiers (Zadeh, 1983), since the use of relative quantifiers is much more flexible and intuitive, especially in aggregate functions. Its syntax is the following:

CREATE LINGUISTIC QUANTIFIER name
VALUES (left-bottom-corner, left-top-corner, right-top-corner, right-bottom-corner)
The semantics µQ of a relative nondecreasing linguistic quantifier is defined in terms of a trapezoidal function on the domain [0,1]. Similarly to the case of linguistic values, left-bottom-corner, left-top-corner, right-top-corner, and right-bottom-corner are the coordinates on the x-axis of the corners of the trapezoid, but right-top-corner and right-bottom-corner are always set to one; in fact, the function must be monotone nondecreasing. The weighting vector WQ of the OWAQ operator associated with a linguistic quantifier Q with a monotone nondecreasing2 membership function µQ can be derived automatically, as defined previously in the section titled Linguistic Quantifiers. Then, we apply the OWA operator to the n satisfaction degrees µc1(t), …, µcn(t) of the soft conditions c1, …, cn by each tuple t. In the following sections, given the four parameters lb, lt, rt, rb defining µQ, we denote by OWAQ[lb, lt, rt, rb](.) the application of the corresponding OWA operator with the weighting vector WQ.
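The derivation of WQ from µQ referred to here is, in the standard (Yager) construction, wi = µQ(i/n) − µQ((i−1)/n); since rt = rb = 1, only the two left corners matter. A hedged Python sketch, with a hypothetical quantifier whose corners are (0.3, 0.8, 1, 1):

```python
def mu_q(x, lb, lt):
    """Monotone nondecreasing trapezoidal quantifier (rt = rb = 1)."""
    if x <= lb:
        return 0.0
    if x >= lt:
        return 1.0
    return (x - lb) / (lt - lb)

def owa(degrees, lb, lt):
    """Apply OWA_Q to n satisfaction degrees using the derived weights."""
    n = len(degrees)
    weights = [mu_q(i / n, lb, lt) - mu_q((i - 1) / n, lb, lt)
               for i in range(1, n + 1)]
    ordered = sorted(degrees, reverse=True)  # OWA reorders its arguments
    return sum(w * b for w, b in zip(weights, ordered))

print(owa([1.0, 0.8, 0.6, 0.4], 0.3, 0.8))  # weighted trade-off, ~0.66
```

The weights always sum to 1 because µQ(0) = 0 and µQ(1) = 1, so the OWA result stays within the range of the input degrees.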
Definition of the SELECT Command We now define the SELECT command and the semantics of each clause. In the rest of this section, we introduce the concepts by discussing groups of homogeneous clauses (in the following called blocks), defining their formal semantics and using examples to explain the concepts.
The FROM-WHERE Block

Let us consider the FROM-WHERE block of clauses, whose syntax is the following:

FROM source-relation
[ WHERE soft-selection-condition
  [ DEGREE THRESHOLD dtf ] ]
Let us present the clauses in detail.
FROM clause. As in the classical SELECT command, the FROM clause specifies the table on which the query is performed, which can be the result of a join operation. Since tuples come from crisp relations (our extension of the SELECT command is designed to work on the classical crisp relational model), the membership degree assigned to each tuple is 1 by default. However, the source tables might be the result of some previous soft query, having an attribute that is the degree computed by the previous query, which must be used as the membership value in the current query. To cope with this problem, we do not have to change the general syntax of the FROM clause; thus, source-relation is defined as follows:

table-spec ( join-op table-spec ON join-condition )*

In place of join-op, all types of join operators are allowed. We redefine table specifications table-spec, which become as follows:

table-spec := table [AS table-alias] [WITH expr AS DEGREE]

W.r.t. standard SQL, we allow an additional optional WITH expr AS DEGREE subclause: if not specified, the membership degree for tuples coming from the specified table is set to 1; if it is specified, for each tuple, the value assumed by the expression expr (based on attributes contained in the specified table) becomes the membership degree of the tuple. Notice that table is either a table name or a nested query. The WITH subclause is necessary because, by
hypothesis, the underlying model is the classical relational model; thus, SELECT produces classical relational tables. Consequently, if we store the result of a soft query, or if we want to use a soft query as a nested query in the FROM clause, to exploit the membership degree we need a means to specify that an expression (typically a simple attribute) must be considered as the membership degree of tuples in that table. Consider the following sketch as an example.

SELECT …
FROM (SELECT …, DEGREE AS D1 …) AS T1 WITH D1 AS DEGREE
  INNER JOIN
  (SELECT …, DEGREE AS D2 …) AS T2 WITH D2 AS DEGREE
  ON …
in which two nested queries are joined in the FROM clause; however, since both nested queries are soft queries, they generate an attribute (named D1 and D2, respectively) corresponding to the membership degrees. The WITH subclause allows the user to reuse these two attributes as membership degrees in the join operation (which takes the minimum membership degree when two tuples are joined together).
WHERE Clause. The WHERE clause presents significant changes w.r.t. the standard; in particular, soft-selection-condition is a soft condition composed of crisp and/or linguistic predicates connected by AND and OR, negated by NOT, or aggregated by a quantifier. In this way, the soft conditions are specified in the WHERE clause, like in SQLf. Crisp predicates are based on the classical comparison operators applied to classical data types. Linguistic predicates have the form:

tuple-or-numerical-expression IS linguistic-value IN term-set

where tuple is a tuple of expressions and where linguistic-value is defined in the specified term set. For example, the following predicate:
(Price, 200000) IS 'cheap' IN PriceLevels
selects a tuple in the source relation if its value of attribute Price minus the constant value 200000 satisfies the soft condition cheap, according to the meaning of 'cheap' in the term set PriceLevels. Note that (Price, 200000) is a tuple composed of the attribute Price and the constant value 200000. The numerical-expression can be directly a value or an expression. For example, the following linguistic predicate is equivalent to the previous one.

(Price-200000) IS 'cheap' IN PriceLevels
Semantics

Let us define the semantics of the FROM-WHERE block.
Membership Function of Linguistic Values. The membership function of a linguistic value lv is defined with a trapezoidal shape by a 4-tuple (lb, lt, rt, rb), where:

µlv(x) = 0                      if 0 ≤ x ≤ lb
µlv(x) = (x - lb)/(lt - lb)     if lb < x < lt
µlv(x) = 1                      if lt ≤ x ≤ rt
µlv(x) = (rb - x)/(rb - rt)     if rt < x < rb
µlv(x) = 0                      if rb ≤ x ≤ 1
When lt = rt, the membership function has a triangular shape. When lb = lt and rt = rb, the membership function has a rectangular shape, and then the associated linguistic term specifies a range condition. When lb = lt = rt = rb, the membership function is punctual, and then the associated linguistic term specifies a crisp condition.
The IS .. IN Operator. A predicate based on the IS .. IN operator has the form:

t IS lv IN TS [ ZOOMING x ]
where t is a tuple of expressions, and lv is a linguistic value in the term set TS, that is, lv ∈ TS. The following function restricts a value v to the specified evaluation range:

Within(v, [min, max]) = min   if v ≤ min
Within(v, [min, max]) = v     if min < v < max
Within(v, [min, max]) = max   if v ≥ max
The following function normalizes a value v from the evaluation range to the basic range [0,1]:

Normalize(v, [min, max]) = (v - min) / (max - min)

We are now ready to define the semantics of the IS .. IN operator. With eFunc(t), we denote the application of the evaluation-function (specified in the EVALUATE subclause of the CREATE TERM-SET command) to a tuple t, and µlv is the trapezoidal membership function for the linguistic value lv. If the ZOOMING option is not set:

µ(t IS lv IN TS)(t) = µlv(Normalize(Within(eFunc(t), [min, max]), [min, max]))

Recall that the membership values generated by crisp operators are only 0 and 1. If the ZOOMING x option is set, we have to modify the semantics of µlv by applying the modifier function modifier, specified with the linguistic value, to the values to evaluate:

µ(t IS lv IN TS ZOOMING x)(t) = µlv(Normalize(Within(modifier(eFunc(t), x), [min, max]), [min, max]))

modifier is either a product (*) or a translation operator (+ or -) and can act as a dilator or a concentrator depending on the value of x: for example, if x > 1, modifier = * acts as a concentrator, while if 0 < x < 1, it acts as a dilator.
The QUANTIFIED Operator. A predicate based on the QUANTIFIED operator has the form:
QUANTIFIED Q (list-of-soft-conditions)
where Q is a user-defined quantifier. Given the set of membership degrees µc1(t), …, µcn(t) of the conditions in list-of-soft-conditions, it is:

µ(QUANTIFIED Q (list-of-soft-conditions))(t) = OWAQ[lb,lt,rt,rb](µc1(t), …, µcn(t))

where (lb, lt, rt, rb) is the tuple defining the trapezoidal membership function for the linguistic quantifier.
The WHERE Clause. We assume that each tuple t in a source crisp relation R has a membership degree µR(t) = 1. The WHERE clause specifies a soft condition φ, which assigns t a new membership degree, for the purpose of selection. Applying associative properties, φ can be seen either as φ = sub-cond1 lop sub-cond2, where lop is a logical operator (AND, OR), or as a negated condition φ = NOT(sub-cond), where sub-cond1, sub-cond2, and sub-cond can be either composed conditions or simple predicates. We define the semantics of the three logical operators in accordance with the usual definition of the AND, OR, and NOT operators in fuzzy set theory, as the min, the max, and the complement to 1, respectively.

If φ = sub-cond1 AND sub-cond2, then µφ(t) = Min(µsub-cond1(t), µsub-cond2(t))
If φ = sub-cond1 OR sub-cond2, then µφ(t) = Max(µsub-cond1(t), µsub-cond2(t))
If φ = NOT(sub-cond), then µφ(t) = 1 − µsub-cond(t)

Then, the membership degree of t after the evaluation of φ is:
µ(t) = Min(µR(t), µφ(t))

The DEGREE THRESHOLD Subclause. Consider the set of tuples T’FW produced by the FROM clause and possibly filtered by the WHERE clause, whose membership degrees µ(t) (with t ∈ T’FW) have been computed as previously discussed. If the subclause DEGREE THRESHOLD dtf is specified, the final set of tuples TFW produced by the FROM-WHERE block is defined as follows:

TFW = { t ∈ T’FW | µ(t) ≥ dtf }

while in case the DEGREE THRESHOLD subclause is not specified, it is defined as follows:

TFW = { t ∈ T’FW | µ(t) > 0 }
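Putting the block's pieces together (linguistic predicate via the trapezoid on a normalized value, fuzzy connectives, the Min with µR(t), and the threshold filter), a hedged end-to-end sketch; the 'Cheap' corners, the normalization range, and the flat data are all illustrative assumptions:

```python
def trapezoid(x, lb, lt, rt, rb):
    if lt <= x <= rt:
        return 1.0
    if x <= lb or x >= rb:
        return 0.0
    return (x - lb) / (lt - lb) if x < lt else (rb - x) / (rb - rt)

def is_in(value, corners, lo, hi):
    """IS .. IN: clamp, normalize to [0, 1], then apply the trapezoid."""
    v = (max(lo, min(value, hi)) - lo) / (hi - lo)
    return trapezoid(v, *corners)

AND, OR, NOT = min, max, lambda a: 1.0 - a

# Hypothetical 'Cheap' term over prices normalized within (0, 500000).
CHEAP = (0.0, 0.0, 0.3, 0.6)
flats = [("f1", 100000), ("f2", 250000), ("f3", 400000)]
selected = []
for fid, price in flats:
    mu = AND(1.0, is_in(price, CHEAP, 0, 500000))  # µ(t) = Min(µR(t), µφ(t))
    if mu >= 0.5:                                  # DEGREE THRESHOLD 0.5
        selected.append((fid, mu))
print(selected)  # [('f1', 1.0)]
```

Without the threshold, the filter would keep every tuple with a degree greater than 0, mirroring the default case above.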
Flexible Aggregate Functions

The syntax of aggregate functions is as follows:

COUNT( * [ WITH quantifier-spec DEGREE ] )
COUNT( attr-name [ WITH quantifier-spec DEGREE ] )
COUNT( DISTINCT attr-name [ WITH quantifier-spec DEGREE ] )
MIN( attr-name [ WITH quantifier-spec DEGREE ] )
MAX( attr-name [ WITH quantifier-spec DEGREE ] )
AVG( attr-name [ WITH quantifier-spec DEGREE ] )
SUM( attr-name [ WITH quantifier-spec DEGREE ] )
where quantifier-spec is one of SAFE, AVERAGE, and OPTIMISTIC, or a user-defined linguistic quantifier; if the WITH option is not specified, the SAFE quantifier is adopted by default. Given a set of tuples T’ used to compute the aggregate value av by an aggregate function af, the membership degree associated with av, denoted as µ(av), is the minimum, the average, or the maximum membership degree associated with tuples in T’, depending on which basic quantifier is specified; in the case of a user-defined quantifier, the trapezoidal membership function µQ associated with the quantifier is evaluated on the average membership degree of tuples, as previously described. In particular, function
COUNT(*) operates on all selected tuples, while the other functions operate only on tuples having a not null value for the specified attribute.
Semantics

Consider the set T of tuples on which to apply an aggregate function af. If af is the COUNT(*) function, T’ = T; otherwise, an attribute attr is specified and T’ = { t ∈ T | attr is not null }. Given the aggregate value av = af(T’) generated by the aggregate function af, its membership degree is:

µ(av) = Min t’∈T’ { µ(t’) }   if the specified quantifier is SAFE;
µ(av) = Avg t’∈T’ { µ(t’) }   if the specified quantifier is AVERAGE;
µ(av) = Max t’∈T’ { µ(t’) }   if the specified quantifier is OPTIMISTIC.

If the specified quantifier is a user-defined quantifier Q, µ(av) = µQ( Avg t’∈T’ { µ(t’) } ).
The SELECT-ORDER BY Block

Consider now the SELECT and ORDER BY clauses. Their syntax is the following:

SELECT [ TOP n ] result-schema [ WITH quantifier-name DEGREE ]
FROM-WHERE-Block
[ ORDER BY list-of-ordering-features ]
SELECT Clause. As in the standard SQL SELECT command, the attribute list appearing in the SELECT clause (i.e., result-schema) defines the schema for the table generated by the SELECT statement; here, we extend SQL by allowing the use of the special keyword DEGREE as an attribute name, whose value is the membership degree of tuples (if the GROUP BY clause is not present) or groups (if the GROUP BY clause is present). This special attribute is motivated by the fact that the resulting table is a classical relational table, with no membership degree associated with tuples. If the user wishes to know the relevance of tuples w.r.t. the specified selections, this attribute might be used to add a column with the membership degree to the output table. The extended semantics for aggregate functions (see the section titled Flexible Aggregate Functions) that can be exploited in the result-schema requires the introduction of a mechanism to choose the final membership degree of tuples in the result. For this reason, the optional subclause WITH quantifier-name DEGREE is introduced for the SELECT clause as well. It allows the specification of a quantifier (SAFE, AVERAGE, OPTIMISTIC, or user-defined); by means of this subclause, it is possible to choose the degree of a tuple in the presence of aggregate functions. In the case of a user-defined quantifier quantifier-name = Q, the Zadeh evaluation will be applied, which means that the AVERAGE of the degrees is computed first, and then the value µQ(AVERAGE) of the trapezoidal function defined by (lb, lt, rt, rb).
ORDER BY Clause. As in the standard SELECT command, the ORDER BY clause sorts the tuples in the result table; we allow the user to specify the DEGREE special attribute as a sort key. The TOP n subclause in the SELECT clause inserts only the first n sorted tuples into the result table.
Semantics

We define the semantics of the SELECT clause, as far as the generation of the schema for the result table is concerned. Consider at first the case in which the GROUP BY-HAVING block is not specified. The clause operates on the set of tuples TFW.

• If no aggregate functions are specified in the SELECT clause, the membership degree of each tuple t ∈ TFW is µ(t) (and the reserved attribute DEGREE assumes this value).
• If aggregate functions are defined in the SELECT clause, no attributes are allowed in the clause (as per the usual SQL constraint); in this case, the membership degree depends on the quantifier specified for the clause (if it is not specified, the SAFE quantifier is applied by default). Thus, one single tuple summarizing the entire set of tuples is generated, and its membership degree is defined as follows.

With af i, we denote the i-th aggregate function in the SELECT clause; with avi = af i(TFW), we denote the value returned by the i-th aggregate function; with µ(avi), we denote the membership value obtained by the i-th aggregate function.

If the quantifier is SAFE, µ(TFW) = Min i=1…n { µ(avi) } ∀ avi = af i(TFW), with af i in the clause
If the quantifier is AVERAGE, µ(TFW) = Avg i=1…n { µ(avi) } ∀ avi = af i(TFW), with af i in the clause
If the quantifier is OPTIMISTIC, µ(TFW) = Max i=1…n { µ(avi) } ∀ avi = af i(TFW), with af i in the clause
If the quantifier Q is defined by the user with (lb, lt, rt, rb), µ(TFW) = µQ( Avg i=1…n { µ(avi) } ) ∀ avi = af i(TFW), with af i in the clause

where µQ denotes the function specified by the quadruple (lb, lt, rt, rb). If the GROUP BY-HAVING block is specified, the semantics of the SELECT clause is slightly different.
The GROUP BY-HAVING Block

Consider now the GROUP BY-HAVING block of clauses. Its syntax is the following:

[ GROUP BY list-of-grouping-attributes
  [ WITH quantifier-spec DEGREE ]
  [ HAVING soft-group-selection-condition ]
  [ DEGREE THRESHOLD dtg ] ]
GROUP BY Clause. The GROUP BY clause behaves similarly to standard SQL, but each group has a membership degree as well, which is obtained by applying a quantifier (basic or user-defined) by means of the optional subclause WITH quantifier-spec DEGREE. If this subclause is not specified, the default SAFE quantifier is applied, which computes the minimum of the tuples’ membership degrees as the membership degree for the group. The subclause specifying the application of a quantifier is:

WITH quantifier-spec DEGREE

where quantifier-spec specifies the quantifier to apply, that is, either SAFE, AVERAGE, or OPTIMISTIC. For example, if the membership degree of a group is to be computed in an optimistic way (i.e., as the maximum of the tuples’ membership degrees), it is specified as:

GROUP BY attr WITH OPTIMISTIC DEGREE
In the HAVING clause, soft-group-selection-condition allows predicates over aggregate functions and grouping attributes, as well as the specification of soft conditions by means of the IS .. IN operator. Notice that the semantics of aggregate functions has been extended as well, in order to cope with membership degrees. The optional clause DEGREE THRESHOLD dtg that follows the HAVING clause allows the user to specify a filtering threshold over the membership degree of each group: if present, only groups with membership degree greater than or equal to dtg are selected; otherwise, only groups with membership degree greater than 0 are selected. dtg is greater than 0 and less than 1.
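This grouping machinery can be sketched as: compute a group degree with the chosen quantifier, take the Min with the HAVING condition degree, then apply the group threshold. All degrees below are illustrative:

```python
def group_membership(degrees, quantifier="SAFE"):
    """Membership degree of a group, from its tuples' degrees."""
    if quantifier == "SAFE":
        return min(degrees)
    if quantifier == "OPTIMISTIC":
        return max(degrees)
    return sum(degrees) / len(degrees)     # AVERAGE

def keep_group(tuple_degrees, having_mu, quantifier="SAFE", dtg=None):
    """Return the group's final degree, or None if it is filtered out."""
    mu = min(group_membership(tuple_degrees, quantifier), having_mu)
    ok = mu > 0 if dtg is None else mu >= dtg
    return mu if ok else None

# A group whose tuples have degrees 0.9, 0.85, 1.0; HAVING condition degree 0.9.
print(keep_group([0.9, 0.85, 1.0], 0.9, "SAFE", dtg=0.8))  # 0.85 -> group kept
```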
The Complete SELECT Command

The complete syntax for the command is then the following:

SELECT [ TOP n ] result-schema [ WITH quantifier-spec DEGREE ]
FROM source-relation
[ WHERE soft-selection-condition ]
[ DEGREE THRESHOLD dtf ]
[ GROUP BY list-of-grouping-attributes
  [ WITH quantifier-spec DEGREE ]
  [ HAVING soft-group-selection-condition ]
  [ DEGREE THRESHOLD dtg ] ]
[ ORDER BY list-of-ordering-features ]
Semantics

We can now define the semantics for the clauses in the GROUP BY-HAVING block.
The GROUP BY Clause. Consider a group of tuples g (grouped together based on the values of the grouping attributes) and a function GroupMembership(g, quantifier) that computes the membership degree for the group, depending on the specified quantifier. The membership degree of the group is µ(g) = GroupMembership(g, quantifier). For the predefined quantifiers:

GroupMembership(g, SAFE) = Min(µ(t)), ∀t∈g
GroupMembership(g, AVERAGE) = Avg(µ(t)), ∀t∈g
GroupMembership(g, OPTIMISTIC) = Max(µ(t)), ∀t∈g

If a user-defined linguistic quantifier Q is specified, the quantifier has associated the tuple (left-bottom-corner, left-top-corner, right-top-corner, right-bottom-corner) that defines the trapezoidal function µQ. In this case, it is GroupMembership(g, Q) = µQ( Avg t’∈g { µ(t’) } ).
The HAVING Clause. The HAVING clause can in turn be a soft condition based on the IS .. IN operator, applied only to grouping attributes. Thus, it is evaluated by the computation of a membership degree. Similarly to what was defined for the WHERE clause, if we denote the group selection condition by φ, the membership degree for group g is:
Customizable Flexible Querying
µ’(g) = Min(µ(g), µφ(g))

We adopted the Min semantics because the group selection condition is a further selection applied to the group, and it can be seen as a conjunction with the previous evaluations that gave rise to the group membership degree. The HAVING clause allows the specification of aggregate functions in comparison expressions. Thus, if we denote a comparison expression with aggregate functions as AggrCompExpr, its membership degree is:

µ(AggrCompExpr) = Min i=1…n { µ(avi) }, where avi = afi(g) and afi is the i-th aggregate function appearing in AggrCompExpr

Again, we used the Min semantics because, in a comparison expression containing two aggregate functions, it is natural for the one with the lower membership degree to determine the membership degree of the comparison. In case an aggregate function is specified in the expression on which the IS .. IN operator is evaluated, the membership degree obtained by the IS .. IN operator is considered. As an example, consider the condition:

MAX(Price) IS ‘cheap’ IN PriceLevels

That is not allowed in the WHERE clause, because an aggregate function is used. In the HAVING clause it is allowed, but both the aggregate function and the IS .. IN operator yield a membership degree. We choose to consider the membership degree returned by the IS .. IN operator as the overall membership degree of the expression.

The DEGREE THRESHOLD Subclause. Consider the set of groups GGH produced by the GROUP BY clause and possibly filtered by the HAVING clause, whose membership degrees µ(g) (with g ∈ GGH) have been computed as previously discussed. If the subclause DEGREE THRESHOLD dtg is specified, the final set of groups G’GH produced by the GROUP BY-HAVING block is defined as follows:

G’GH = { g ∈ GGH | µ(g) ≥ dtg }

while in case the DEGREE THRESHOLD subclause is not specified, it is defined as follows:

G’GH = { g ∈ GGH | µ(g) > 0 }

The SELECT Clause. The semantics of the SELECT clause changes when the GROUP BY-HAVING block is specified: it generates a tuple for each group g ∈ G’GH. If no aggregate functions are specified, the membership degree is µ(g), as previously defined. If aggregate functions are specified, the semantics depends on the quantifier specified for the overall clause (if not specified, the SAFE quantifier is assumed by default). Let µ0 = µ(g) and, for 1 ≤ i ≤ n, let µi = µ(avi) with avi = afi(g), where afi is the i-th aggregate function in the SELECT clause. Then the degree of the generated tuple is Min(µi), 0 ≤ i ≤ n, if the quantifier is SAFE; Avg(µi) if the quantifier is AVERAGE; and Max(µi) if the quantifier is OPTIMISTIC.

Conclusion

In this chapter, we presented the current results of the Soft-SQL project, whose goal is to define and implement a flexible query language as an extension of classical SQL within fuzzy set theory. The work takes ideas from previous well-known approaches such as SQLf (Bosc & Pivert, 1995) and is motivated by practical issues, mainly the intent of defining a user-customizable, context-dependent, and fully controllable query language, exploiting features of classic SQL as far as possible in order to allow the expression of flexible queries on classical relational databases. The proposal introduces several novel concepts. First, it works on regular relations, and no special meaning is attributed to the membership degree. The user must explicitly use the SQL ORDER BY clause to rank the tuples of a relation with respect
to their membership degree attribute values. In this way, Soft-SQL is really an extension of SQL: it completely subsumes it, satisfying the closure property. By means of the notion of user-defined term sets, and through the new command named CREATE TERM-SET, the user can define and customize the semantics of sets of linguistic values with respect to a given context. Thus, linguistic terms used to express soft conditions can assume different semantics according to the attribute to which they are applied, as occurs when using terms in natural language. To this end, the SELECT command has been redefined, in order to let the user specify flexible queries based on context-dependent soft selection conditions. Furthermore, a rich set of options has been introduced in the SELECT command to allow the precise definition of the query semantics and to adapt it to the specific context of the query, in order to obtain results that really meet users’ needs. In this way, we achieve a level of flexibility of the language not reached by previous extensions of SQL based on fuzzy sets. Following the same approach, the user is allowed to define linguistic quantifiers: the new command CREATE LINGUISTIC QUANTIFIER allows specifying and customizing the semantics of newly created relative linguistic quantifiers, which can be exploited both in the WHERE clause and in the GROUP BY clause.
References

Baldwin, J. F., Coyne, M. R., & Martin, T. P. (1993). Querying a database with fuzzy attribute values by iterative updating of the selection criteria. In International Joint Conference on Artificial Intelligence (IJCAI’93).

Bosc, P., Buckles, B., Petry, F. E., & Pivert, O. (1999). Fuzzy databases. In J. C. Bezdek, D. Dubois, & H. Prade (Eds.), Fuzzy sets in approximate reasoning and information systems: The handbook of fuzzy set series (pp. 404-468). Kluwer Academic Publishers.
Bosc, P., & Pivert, O. (1992). Fuzzy querying in conventional databases. In L. A. Zadeh & J. Kacprzyk (Eds.), Fuzzy logic for the management of uncertainty. John Wiley & Sons.

Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17.

Bosc, P., & Pivert, O. (1997a). Fuzzy queries against regular and fuzzy databases. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems (pp. 187-208). Kluwer Academic Publishers.

Bosc, P., & Pivert, O. (1997b). On representation-based querying of databases containing ill-known values. In Proceedings of the International Symposium on Methodologies for Intelligent Systems (ISMIS’97) (pp. 477-486).

Bosc, P., & Prade, H. (1994). An introduction to the fuzzy set and possibility theory-based treatment of flexible queries and uncertain and imprecise databases. In A. Motro & P. Smets (Eds.), Uncertainty management in information systems: From needs to solutions. Kluwer Academic Publishers.

Bordogna, G., & Psaila, G. (2004, June 24-26). Fuzzy spatial SQL. In Proceedings of Flexible Querying Answering Systems (FQAS04) (LNAI 3055), Lyon, France. Springer-Verlag.

Bordogna, G., & Psaila, G. (2005, March 15-16). Extending SQL with customizable soft selection conditions. In Proceedings of the ACM-SAC Track on Information Access, Santa Fe, NM.

Buckles, B. P., & Petry, F. E. (1985). Query languages for fuzzy databases. In J. Kacprzyk & R. R. Yager (Eds.), Management decision support systems using fuzzy sets and possibility theory (pp. 241-251). TUV Rheinland: Verlag.

Buckles, B. P., Petry, F. E., & Sachar, H. S. (1986). Design of similarity-based relational databases. In H. Prade & C. V. Negoita (Eds.), Fuzzy logic in knowledge engineering (pp. 3-7). TUV Rheinland: Verlag.
Dubois, D., & Prade, H. (1997). Using fuzzy sets in flexible querying: Why and how? In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems (pp. 45-60). Kluwer Academic Publishers.

Eduardo, J., Goncalves, M., & Tineo, L. (2004, September 27-October 1). A fuzzy querying system based on SQLf2 and SQLf3. In Proceedings of the XXX Conferencia Latinoamericana de Informática (CLEI 2004), Arequipa, Peru.

Galindo, J., Medina, J. M., & Aranda, G. M. C. (1999). Querying fuzzy relational databases through fuzzy domain calculus. International Journal of Intelligent Systems, 14, 375-411.

Galindo, J., Medina, J. M., Cubero, J. C., & García, M. T. (2000). Fuzzy quantifiers in fuzzy domain calculus. In Proceedings of the 8th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU’2000) (pp. 1697-1702), Spain.

Galindo, J., Medina, J. M., Pons, O., & Cubero, J. C. (1998). A server for fuzzy SQL queries. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems. Springer-Verlag.

Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Idea Group Publishing.

Goncalves, M., & Tineo, L. (2003, March 1-4). Derivation principle in SQLf2 algebra operators. In Proceedings of the 1st International Conference on Fuzzy Information Processing Theories and Application (FIP-2003), Beijing, China.

Goncalves, M., & Tineo, L. (2005, May 22-25). Derivation principle in advanced fuzzy queries. In Proceedings of the 14th IEEE International Conference on Fuzzy Systems (Fuzz-IEEE 2005), Reno, NV.

Kacprzyk, J., & Zadrozny, S. (1995). FQUERY for Access: Fuzzy querying for a Windows-based DBMS.
In P. Bosc & J. Kacprzyk (Eds.), Fuzziness in database management systems. Physica-Verlag.

Kacprzyk, J., & Zadrozny, S. (1997). Implementation of OWA operators in fuzzy querying for Microsoft Access. In R. R. Yager & J. Kacprzyk (Eds.), The ordered weighted averaging operators: Theory and applications (pp. 293-306). Boston: Kluwer.

Kacprzyk, J., Zadrozny, S., & Ziolkowski, A. (1989). FQUERY III+: A “human-consistent” database querying system based on fuzzy logic with linguistic quantifiers. Information Systems, 6, 443-453.

Kacprzyk, J., & Ziolkowski, A. (1986). Database queries with fuzzy linguistic quantifiers. IEEE Transactions on Systems, Man and Cybernetics, 16, 474-479.

Kießling, W. (2002). Foundations of preferences in database systems. In Proceedings of the 28th International Conference on Very Large Databases.

Kießling, W. (2003). Preference queries with SV-semantics. In Proceedings of the 11th International Conference on Management of Data (COMAD 2005).

Medina, J. M., Pons, O., & Vila, M. A. (1994). GEFRED: A generalized model of fuzzy relational databases. Information Sciences, 76, 87-109.

Petry, F. E. (1996). Fuzzy databases. Kluwer Academic Publisher.

Prade, H., & Testemale, C. (1984). Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences, 34, 115-143.

Prade, H., & Testemale, C. (1987). Representation of soft constraints and fuzzy attribute values by means of possibility distributions in databases. In J. C. Bezdek (Ed.), Analysis of fuzzy information (vol. II, pp. 213-229). CRC Press.

Ribeiro, R. A., & Moreira, A. M. (1999). Intelligent query model for business characteristics. In
Proceedings of the IEEE/WSES/IMACS CSCC’99 Conference.

Rosado, A., Ribeiro, R. A., Zadrozny, S., & Kacprzyk, J. (2006). Flexible query languages for relational databases: An overview. In G. Bordogna & G. Psaila (Eds.), Flexible databases supporting imprecision and uncertainty. Springer-Verlag.

Shenoi, S., Melton, A., & Fan, L. T. (1990). An equivalence classes model of fuzzy relational databases. Fuzzy Sets and Systems, 38, 153-170.

Tahani, V. (1977). A conceptual framework for fuzzy query processing: A step toward very intelligent database systems. Information Processing and Management, 13, 289-303.

Takahashi, Y. (1991). A fuzzy query language for relational databases. IEEE Transactions on Systems, Man and Cybernetics, 21, 1576-1579.

Tineo, L. (2000). Extending RDBMS for allowing fuzzy quantified queries. In M. Kung (Ed.), DEXA Proceedings (LNCS 1873, pp. 407-416). Springer-Verlag.

Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy relational model of fuzzy data. Journal of Intelligent Information Systems, 3, 7-27.

Yager, R. R. (1988). On ordered weighted averaging aggregation operators in multicriteria decision making. IEEE Transactions on Systems, Man and Cybernetics, 18, 183-190.

Yager, R. R. (1994). Interpreting linguistically quantified propositions. International Journal of Intelligent Systems, 9, 541-569.

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353.

Zadeh, L. A. (1975). The concept of a linguistic variable and its application to approximate reasoning (I-II). Information Sciences, 8, 199-249, 301-357.

Zadeh, L. A. (1983). A computational approach to fuzzy quantifiers in natural languages. Computers & Mathematics with Applications, 9, 149-184.

Zadeh, L. A. (1999). From computing with numbers to computing with words: From manipulation of measurements to manipulation of perceptions. IEEE Transactions on Circuits and Systems, 45(1), 105-119.

Key Terms
Aggregate Function: A function of the SQL language that works on sets of tuples rather than on single tuples and returns a single value as the result of its evaluation.

Flexible Query: A query allowing the specification of some kind of preferences on selection conditions and/or priorities among conditions. Within the fuzzy context, flexible queries are also named fuzzy queries: preferences in fuzzy queries are defined by soft conditions expressed by linguistic predicates such as young, while priorities among conditions are expressed by numeric values expressing the degrees of priority.

Fulfillment Degree: A value in [0,1] that expresses the satisfaction degree of a tuple of a relation subjected to a flexible query in a relational database. When it is zero, the tuple does not satisfy the flexible query at all; when it is one, the tuple fully satisfies the query. Intermediate values in (0,1) indicate partial satisfaction of the query by the tuple. The fulfillment degree is also named the membership degree of the tuple in SQLf queries.

Linguistic Quantifier: Linguistic quantifiers extend the set of quantifiers of classical logic. They can be either crisp (such as all, at least 1, at least k, half) or fuzzy quantifiers (such as most, several, some, approximately k). Formally, Zadeh (1983) first defined fuzzy quantifiers as fuzzy subsets and identified two types of quantifiers: absolute and relative. Absolute quantifiers, such as about 7, almost 6, and so forth, are defined as fuzzy sets with membership
function on a subset of positive integers. Relative quantifiers are defined as fuzzy sets with membership function defined on [0,1].

OWA Operator: Ordered weighted averaging operators, defined by Yager (1988), are a family of mean-like operators that allow aggregations in between the two extremes of AND and OR, corresponding to the minimum and the maximum of the operands, respectively.

Soft Condition: A tolerant selection condition admitting degrees of satisfaction, defined by a fuzzy set on the domain of a linguistic variable, such as Age, and specified by linguistic terms, such as young, old, and so forth.
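The OWA definition above can be made concrete with a small sketch (our illustration; the operator applies the weights to the operands re-ordered decreasingly, as in Yager's definition):

```python
# Sketch of Yager's OWA operator: the weight vector alone moves the result
# between the OR extreme (Max, weights (1,0,...,0)) and the AND extreme
# (Min, weights (0,...,0,1)).

def owa(values, weights):
    assert len(values) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9   # weights must sum to 1
    ordered = sorted(values, reverse=True)  # re-order operands decreasingly
    return sum(w * v for w, v in zip(weights, ordered))

degrees = [0.6, 0.9, 0.3]
owa(degrees, [1, 0, 0])          # = Max of the operands
owa(degrees, [0, 0, 1])          # = Min of the operands
owa(degrees, [1/3, 1/3, 1/3])    # = arithmetic mean
```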
Soft-SQL: Indicates the fuzzy extension of SQL defined in this chapter.

SQL: Structured Query Language used in relational databases.

SQLf: Indicates a fuzzy extension of the SQL language (Bosc & Pivert, 1995). Another extension is named FSQL (Galindo et al., 2006).

Term Set: Name of the set of values that a linguistic variable can assume.
Chapter IX
Qualifying Objects in Classical Relational Database Querying Cornelia Tudorie University Dunarea de Jos, Galati, Romania
Abstract

The topic presented in this chapter refers to qualifying objects in some kinds of vague queries sent to relational databases. We want to compute a fulfillment degree in order to measure the quality of objects when searching for them in databases. After a discussion of various kinds of object linguistic qualification, with different kinds of fuzzy conditions in a fuzzy query, a new particular situation is proposed for inclusion in this subject: the relative object qualification as a query selection criterion, that is, queries with two conditions in which the first one depends on the results of the second one. It is another way to express the user’s preferences in a flexible query. In connection with this, a new fuzzy aggregation operator, AMONG, is defined. We also propose an algorithm to evaluate this kind of query, together with some definitions that make it applicable and efficient (dynamic modeling of the linguistic values and a unified model of the context). We demonstrate these ideas with software already implemented in our lab.
Introduction

Database querying by various selection criteria often confronts a major limitation: the difficulty of formulating and expressing precise criteria for locating the information. This happens because people do not always think and speak in precise terms, or they do not have details on the data range. The research community has recently proposed a new way to query databases, more expressive and flexible than the classical one. It relies on vague queries, for example: “retrieve the well paid persons who live not too far from the office”, of course formulated in an adequate query language. The main reason to use the vague predicates well paid and not too far is to express the user’s preferences more flexibly and, at the same time, to rank the selected tuples by a degree of criteria satisfaction. A precise criterion, like “salary > 500 and distance home-office < 200”, may return an empty list, even if there are many persons having attribute values very close to the specified ones. Equally, the same precise criterion may return a complete list of all persons, without any helpful ordering. So, it would be useful to provide intelligent interfaces to databases, able to interpret and evaluate imprecise criteria in queries.
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Some important advantages resulting from including vague criteria in a database query may be:

• Easy to express queries
• The possibility to classify database objects by selecting them based on a linguistic qualification
• The possibility to refine the result by assigning to each tuple the corresponding fulfillment degree (the degree of criteria satisfaction); in other words, to provide a ranked answer according to the user’s preferences
Under this circumstance, when vague queries are accepted, fuzzy set membership functions are convenient tools for modeling the user’s preferences in many aspects. Fuzzy set theory (Bouchon-Meunier, 1995; Dubois, Ostasiewicz, & Prade, 1999; Yager & Zadeh, 1992; Zadeh, 1965) is accepted as one of the most adequate formal frameworks to model and to manage vague expressions. Two research areas are important for fuzzy theory applied in the database field: fuzzy querying of regular databases and storing fuzzy information in databases. There are many scientific works regarding database fuzzy querying: general reference books (e.g., Galindo, Urrutia, & Piattini, 2006), but also many articles in journals and communications at conferences. Some of them propose fuzzy extensions of the standard query language for relational databases (SQL), able to interpret and evaluate fuzzy selection criteria; others propose intelligent interfaces for fuzzy querying of classical databases. The most important of those include:

• SQLf (Bosc & Pivert, 1995; Goncalves & Tineo, 2001a, 2001b; Projet BADINS, 1995, 1997) and FSQL (Galindo et al., 2006), extensions of the SQL language allowing flexible querying.
• FQUERY (Kacprzyk & Zadrozny, 1995, 2001) and FuzzyBase (Gazzotti, Piancastelli, Sartori, & Beneventano, 1995), fuzzy querying engines for relational databases.
In this book, the reader can find a chapter by Urrutia, Tineo, and González studying the SQLf and FSQL languages. There is also another chapter, written by Kacprzyk, Zadrożny, de Tré, and de Caluwe, that includes a review of flexible querying. Other works have developed new data models able to take into account imperfect information. Fundamental contributions have been made by Buckles and Petry (1982); Medina, Pons, and Vila (1994); and Prade and Testemale (1984). Galindo et al. (2006) also define a running fuzzy database, which stores imperfect data represented by fuzzy possibilistic distributions, fuzzy degrees, and so forth. From the beginning, it is important to remark that this chapter deals with relational database fuzzy querying (fuzzy queries on crisp data), not fuzzy database querying (queries on fuzzy data). The first presented items have already been discussed in Project BADINS (1995, 1997); Bosc and Pivert (1992); Bosc and Prade (1997); Dubois and Prade (1996); Kacprzyk and Zadrozny (2001); and many others, but we consider it useful to rediscuss them in order to propose an original classification of the various kinds of object linguistic qualifications. In this context, the relative object qualification will be proposed as a new kind of selection criterion and will be included in this classification. Some practical examples inspired us to develop this study. Let us compare the queries:

Retrieve the cars having the speed greater than 240
Retrieve the cars having high speed
Retrieve the inexpensive and high speed cars
Retrieve the inexpensive cars among the high speed ones

They are increasingly more complex: they start with a classical crisp query and move to more complex queries, including vague terms in the selection
criteria. They correspond to different kinds of object linguistic qualifications, to different ways to model their semantics in fuzzy set style (i.e., fuzzy models), and to different ways to compute the fulfillment degree of the selection criterion. All these will be discussed in the following sections. A great part of the chapter (the section titled Relative Object Qualification) is devoted to presenting a new kind of fuzzy query with two conditions, in which the first one depends on the results of the second one. In connection with this, a new fuzzy aggregation operator, AMONG, is defined. Some variants, particular cases of the queries based on relative qualification, are analyzed. The section titled Dynamic Modeling of the Linguistic Values proposes a procedure to define the linguistic terms by automatically extracting their fuzzy models from the actual content of the database. This procedure is generally useful, but it is mandatory in the relative qualification case. In order to make the query evaluation process more efficient, we propose, in the section titled The Unified Model of the Context, a solution to incorporate the knowledge base (containing the fuzzy models of the linguistic terms, or at least their metadescriptions) into the target database. Both kinds of pieces of knowledge are taken into discussion: effective models of the linguistic values as static definitions, but also metadescriptions of the linguistic values for a dynamic modeling process. Finally, we present several laboratory implementations, the conclusions, and future trends.
Absolute Object Qualification

Querying a relational database in a classical system means selecting data (table rows) satisfying Boolean criteria. For example, the following crisp query is sent to a database including Table 1:

Retrieve the cars having the speed greater than 240
The answer consists of a table, containing the database rows which satisfy the Boolean formula. So, the criterion max speed>240 is evaluated and the answer is in Table 2. The classical query searches the database objects having a certain property, expressed by a Boolean predicate: if “B Coupe” is selected, that means it has the property to have the speed superior to 240. The fuzzy predicate is an affirmation that may be more or less true, depending on the argument
Table 1. A relational database table (car)

Name      Max Speed   Price   …
AA        236         46000
AA4       221         28450
B3        226         31562
B7        243         57200
B Coupe   250         39000
C 300M    230         32000
IO        145         24000
LRD       130         28000
MBS       240         69154
MC        190         18200
M L200    145         19095
NV        132         15883
OA        186         16042
OCS       120         26259
OF        192         43615
OV        208         20669
OZ        178         18364
P 206     170         10466
P 607     222         31268
P 806     177         20633
P 911 C   280         65000
RC        186         12138
Table 2. The cars having speed greater than 240

Name      Max Speed   Price   …
P 911 C   280         65000
B Coupe   250         39000
B7        243         57200
value: for example, Max Speed (“B Coupe”, high). It is an extension of the classical logical predicate, which can be either definitely true or definitely false. The truth-value of the fuzzy predicate may be expressed as a number in [0,1], 1 standing for absolutely true and 0 for absolutely false. In a context related to query selection criteria, a fuzzy predicate is useful to model a gradual property: if “B Coupe” is selected, that means it has the property of having a high maximum speed. Moreover, accepting a certain meaning of the term “high” (for example, a fuzzy set as semantic model), the fuzzy query evaluation consists in computing a corresponding fulfillment degree of the “high speed” property. Including a gradual property in a database vague query, like “x are Р”, gives a qualification to the objects; that means selecting a number of objects (x) from the database that satisfy, to a certain degree, the gradual property (Р). More examples of fuzzy selection criteria like “x Р” may be: high speed cars, inexpensive cars, expensive cars, good students, big salary, young people, and so forth. When a query selection criterion is expressed by a gradual property, a fulfillment degree is computed for each tuple, starting from the definition of the fuzzy predicate. This is equal to the value of the membership function corresponding to the attribute value in the current tuple.

Definition 1. Let R[A1, A2, …, An] be a table of a relational database, that is, a set of tuples t:

R ⊂ { t | t ∈ D1 × D2 × … × Dn }

where Di are the domains of the attributes Ai, accepted (within this chapter) as intervals [ai, bi].
Then the fulfillment degree of a vague criterion referring to an attribute A with domain D = [a,b], or, in other words, the fulfillment degree of a gradual property Р referring to an attribute A, is defined by the membership function of the fuzzy predicate:

μР : D → [0,1] or μР : [a,b] → [0,1], v ↦ μР(v)

The fulfillment degree of the gradual property Р, associated with the attribute A, may be considered a characteristic of each tuple, so that the fulfillment degree may also be defined on the table R:

μР : R → [0,1], t ↦ μР(t) = μР(t.A)

where t.A is the crisp value of the attribute A for the tuple t, t.A ∈ [a, b].

Definition 2. If a crisp query on a database table R, based on a condition P (Boolean predicate) referring to an attribute A, is the application:

Q : 2^(D1 × D2 × … × Dn) → 2^(D1 × D2 × … × Dn), R ↦ { t ∈ R | P(t) }

then a vague query based on a gradual property Р associated with the attribute A is the application:

QР : 2^(D1 × D2 × … × Dn) → 2^(D1 × D2 × … × Dn × (0,1]), R ↦ { (t, μР(t)) | t ∈ R ∧ μР(t) > 0 }

where μР(t) = μР(t.A).

Usually, several such gradual properties, expressed by linguistic labels, can be linked to the same database attribute. They are named linguistic values, and the set of these labels may be the definition set of a linguistic variable. The definition of linguistic variable can be found in Zadeh (1975) and in another chapter of this book, written by Xexeo. The set of linguistic values makes up the linguistic domain of the database attribute (more details in Tudorie & Dumitriu, 2004). In order to
evaluate any vague query sent to a database, it is necessary to define both the crisp domain and the linguistic one for each attribute frequently used in searching operations. For example, [120, 280] is the crisp domain, and {low, medium, high} is the linguistic domain for the Max Speed attribute of the car table. Each linguistic value can be considered a gradual property and modeled as a fuzzy predicate defined on the attribute crisp domain as referential set (like in Figure 1). For example, according to the definitions in Figure 1, the answer to the vague query

Retrieve the cars having high speed

applied to the car table is in Table 3. In other words, we found the cars with the property of having high speed. The fulfillment degree (μ) expresses the intensity of the property: between the 0 degree (i.e., a not high speed car) and the 1 degree (i.e., an absolutely high speed car).
The AND Operator: Multiqualification

In the classical (precise) context, a compound selection criterion is a Boolean expression containing comparisons and logical operators. In a vague context, the operators AND, OR, and NOT are extended to fuzzy aggregation connectives. They are able to compute a global fulfillment degree for each database tuple, starting with the fulfillment

Figure 1. Linguistic values defined on the Max Speed and Price attribute domains
[Figure 1 not reproduced. It shows trapezoidal membership functions for low, medium, and high over the Max Speed domain, with breakpoints at 120, 160, 180, 200, 220, 240, and 280, and for inexpensive, medium, and expensive over the Price domain, with breakpoints at 10466, 25138, 32474, 39810, 47146, 54482, and 69154.]
degrees of each fuzzy condition and observing certain models for the fuzzy connectives. Usually, the Min and Max functions stand for the fuzzy conjunctive and disjunctive connectives, and the complement stands for the fuzzy negation connective, but there are many other proposals in the literature for defining aggregation connectives (Grabisch, Orlovski, & Yager, 1998; Yager, 1991). Let us take for example a query based on a complex fuzzy selection criterion applied to the car table:

Retrieve the inexpensive and high speed cars.

The evaluation of this query, according to the definitions in Figure 1 and to the content of the car table (Table 1), generates the answer in Table 4. For each table row, the fulfillment degree of each linguistic value is computed, and the arithmetical min function is used to implement the fuzzy conjunction between them. The answer produces the table rows (cars) having a significant global fulfillment degree.

Definition 3. The fuzzy model of the conjunction AND(Р, S) of two gradual properties Р and S, associated with two attributes A1 and A2, is defined by the mapping:

µР AND S : D1 × D2 → [0,1] or µР AND S : [a1,b1] × [a2,b2] → [0,1], (v1,v2) ↦ min(μР(v1), μS(v2))

The same fulfillment degree defined on a database table R is:

µР AND S : R → [0,1], t ↦ min(μР(t), μS(t)) = min(μР(t.A1), μS(t.A2))

where t is a tuple and µР and µS are the membership functions defining the two gradual properties. Any conjunctive operator is a triangular norm (or t-norm), as defined in fuzzy set theory (Yager, 1991), with the following properties:
Qualifying Objects in Classical Relational Database Querying
Table 3. The “high speed cars” table
fuzzy selection criterion. The objects selected by the query are defined by a double qualification: to be “inexpensive” and at the same time to have “high speed.” It is important to remark that the two qualifications are independent of each other, and they have the same significance for the user’s preferences.
Max Speed
Price
P 911 C
280
65000
1
B Coupe
250
39000
1
B7
243
57200
1
MBS
240
69154
1
AA
236
46000
0.80
C 300M
230
32000
0.50
B3
226
31562
0.30
P 607
222
31268
0.10
AA4
221
28450
0.05
Name
…
...
µ
Table 4. The “high speed and inexpensive cars” table Max Speed
Price
µ high
µ inexpensive
µ
B3
226
31562
0.3
0.12
0.12
Name
P 607
222
31268
0.1
0.16
0.1
C 300M
230
32000
0.5
0.06
0.06
AA4
221
28450
0.05
0.54
0.05
1. ������������� commutativity: AND(Р, S) = AND(S, Р ) 2. ������������� associativity: AND(Р, AND(S, T )) = AND(AND(Р, S), T ) 3. �������� monotony: AND(Р, S) ≤ AND(Р’, S’) if Р ≤ Р’ and S ≤ S’ 4. ���� unit element: AND(Р , 1) = Р It is obvious that min operator is a t-norm. A particular list of various t-norm functions is presented in Dubois and Prade (1996) and Galindo ����������� et al. (2006, p. 20��������������������������������� ). When using these functions as AND connective to database querying, they are modeling different linguistic expressions and, of course, different logical meanings of the selection criterion. The queries like the above-mentioned one include two (or more) gradual properties in the
Definition 4. The vague query based on a double qualification is an application:

QP,S : 2^(D1 × D2 × … × Dn) → 2^(D1 × D2 × … × Dn × (0,1])
R ↦ { (t, min( μP(t), μS(t) )) | t ∈ R ∧ μP(t) > 0 ∧ μS(t) > 0 }

Similarly, a multiqualification (multiple conjunctions) can be expressed as a criterion in database vague queries.
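Definitions 3 and 4 can be sketched in a few lines of Python. The trapezoidal definitions of high and inexpensive below are assumptions standing in for the chapter's Figure 1 (a uniform covering of the speed domain, taken here as [120, 280], and of the price domain [10466, 69154]); with them, the min conjunction reproduces the degrees of Table 4.

```python
# Hypothetical re-creation of Definitions 3/4: min as the fuzzy AND over a table.
# The domain limits and trapezoid shapes are assumptions mimicking Figure 1.

def mu_l1(v, lo, hi):
    """'Low' value of a uniform covering: alpha=(hi-lo)/8, beta=2*alpha."""
    a = (hi - lo) / 8.0
    b = 2 * a
    return max(0.0, min(1.0, 1.0 - (v - (lo + b)) / a))

def mu_l3(v, lo, hi):
    """'High' value of a uniform covering: ramps up on [lo+5a, lo+6a]."""
    a = (hi - lo) / 8.0
    return max(0.0, min(1.0, (v - (lo + 5 * a)) / a))

def mu_high(speed):        return mu_l3(speed, 120, 280)
def mu_inexpensive(price): return mu_l1(price, 10466, 69154)

def and_query(rows):
    """Definition 4: keep tuples with both degrees > 0, paired with the min t-norm."""
    result = [(name, min(mu_high(s), mu_inexpensive(p)))
              for name, s, p in rows
              if mu_high(s) > 0 and mu_inexpensive(p) > 0]
    return sorted(result, key=lambda pair: -pair[1])
```

For example, `and_query([("B3", 226, 31562), ("P 911 C", 280, 65000)])` keeps only B3, with degree min(0.30, ≈0.12) ≈ 0.12 as in Table 4; P 911 C is filtered out because its inexpensive degree is 0.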
Relative Object Qualification

People use an enormous number of expressions in their common language for requesting information. This fact motivated us to search for the most accurate model for as many queries as possible, so that the computational treatment and the response may be as adequate as possible. There are in the literature several approaches to modeling user preferences, which solve different situations, such as accepting tolerance, accepting different weights of importance for the requirements in a selection criterion, accepting conditional requirements, and so forth (e.g., Dubois & Prade, 1996). Our study has found a new class of problems, which require partitioning a limited subset of an attribute domain instead of the whole domain, in situations where dynamic modeling of the linguistic values is necessary. This is the case when the selection criteria are not independent but combined in a way that expresses a user’s preference. Two gradual properties are combined in a complex selection criterion such that the second one is applied to a subset of database rows already selected by the first one. We assume
that the second gradual property is expressed by a linguistic value of a database attribute, that is, a label from the attribute’s linguistic domain. In this case, modeling the linguistic domain of the second attribute requires taking into account not the whole crisp attribute domain, but a limited subset, characteristic of the database rows selected by the first criterion.
The AMONG Operator: Relative Qualification to Other Gradual Property

Let us consider as an example the following query, based on a complex fuzzy selection criterion addressed to the car table: Retrieve the inexpensive cars among the high speed ones. The query evaluation procedure observes the following steps:

1. The selection criterion high speed cars is evaluated, taking into account the definition in Figure 1; an intermediate result is obtained, containing the rows where the condition µhigh(t) > 0 is satisfied (Table 3).
2. The underlying interval containing the prices of the selected cars forms the Price subdomain [28450, 69154]; this one is considered from now on, instead of [10466, 69154].
3. The linguistic value set {inexpensive, medium, expensive} is scaled to fit this subdomain (Figure 2; to mark the difference, the new definitions are labeled in capital letters).
4. The selection criterion inexpensive cars is evaluated taking into account the definition in Figure 2. The fulfillment degree μINEXPENSIVE is computed for each row of the intermediate result from step 1.
5. The global fulfillment degree (μ) results for each tuple; the tuples with μ(t) > 0 are selected (the shaded rows in Table 5).

Figure 2. Linguistic values {INEXPENSIVE, MEDIUM, EXPENSIVE} defined on the Price subdomain [28450, 69154].
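The steps above can be sketched in Python (a hypothetical rendering: the rows are the Table 3 result, and the rescaled trapezoid assumes the uniform-covering scheme used throughout this chapter):

```python
# Steps 2-5 of the AMONG evaluation, starting from the Table 3 intermediate result.

high_speed_cars = [           # (name, price, mu_high)
    ("P 911 C", 65000, 1.0), ("B Coupe", 39000, 1.0), ("B7", 57200, 1.0),
    ("MBS", 69154, 1.0), ("AA", 46000, 0.80), ("C 300M", 32000, 0.50),
    ("B3", 31562, 0.30), ("P 607", 31268, 0.10), ("AA4", 28450, 0.05),
]

# Step 2: the Price subdomain spanned by the pre-selected rows.
lo = min(p for _, p, _ in high_speed_cars)        # 28450
hi = max(p for _, p, _ in high_speed_cars)        # 69154

# Step 3: INEXPENSIVE is the "low" trapezoid rescaled onto [lo, hi].
def mu_INEXPENSIVE(price):
    a = (hi - lo) / 8.0
    b = 2 * a
    return max(0.0, min(1.0, 1.0 - (price - (lo + b)) / a))

# Steps 4-5: global degree = min(mu_high, mu_INEXPENSIVE), kept when positive.
answer = {name: min(h, mu_INEXPENSIVE(p))
          for name, p, h in high_speed_cars if min(h, mu_INEXPENSIVE(p)) > 0}
```

Under these assumptions, `answer` reproduces the shaded rows of Table 5: C 300M gets 0.50, B Coupe about 0.92, and P 911 C drops out because its INEXPENSIVE degree is 0.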
At this point, a new fuzzy aggregation operator can be defined in order to model the relative qualification in queries like “P AMONG S.”

Definition 5. The fuzzy model of the relative conjunction, AMONG(P, S), of two gradual properties, P and S, associated with two attributes, A1 and A2, is defined by the mapping:

µP AMONG S : D1 × D2 → [0,1], or µP AMONG S : [a1,b1] × [a2,b2] → [0,1],
(v1, v2) ↦ min( μP/S(v1), μS(v2) )

The same fulfillment degree, defined on a database table R, is:

µP AMONG S : R → [0,1], t ↦ min( μP/S(t), μS(t) ) = min( μP/S(t.A1), μS(t.A2) )

where t is a tuple, µS is the membership function defining the S gradual property, and μP/S is the fulfillment degree of the first criterion (P) relative to the second one (S). In Table 5, μINEXPENSIVE stands for μP/S, that is, μinexpensive/high, and µ stands for the global selection criterion computed as µAMONG. The membership function μP/S is a transformation of the initial membership function µP, obtained by translation and compression, as follows. After the first selection, based on the property S associated with the attribute A2, the initial domain [a,b] of the attribute A1 becomes more limited, that is, the interval [a′,b′] (Figure 3).
Table 5. The “inexpensive cars among the high speed ones” table

| Name | Max Speed | Price | µ high | µ INEXPENSIVE | µ |
|------|-----------|-------|--------|---------------|---|
| P 911 C | 280 | 65000 | 1 | 0.00 | 0.00 |
| B Coupe | 250 | 39000 | 1 | 0.92 | 0.92 |
| B7 | 243 | 57200 | 1 | 0.00 | 0.00 |
| MBS | 240 | 69154 | 1 | 0.00 | 0.00 |
| AA | 236 | 46000 | 0.80 | 0.00 | 0.00 |
| C 300M | 230 | 32000 | 0.50 | 1 | 0.50 |
| B3 | 226 | 31562 | 0.30 | 1 | 0.30 |
| P 607 | 222 | 31268 | 0.10 | 1 | 0.10 |
| AA4 | 221 | 28450 | 0.05 | 1 | 0.05 |
Figure 3. Restriction of the attribute domain for a relative qualification: µP on [a, b] is transformed into µP/S on [a′, b′].

μP/S(v) = μP( a + ((b − a)/(b′ − a′)) · (v − a′) )

where f is the transformation f : [a′, b′] → [a, b],

f(x) = a + ((b − a)/(b′ − a′)) · (x − a′)   (1)

Thus, if μP : [a, b] → [0,1], then μP/S : [a′, b′] → [0,1], so that μP/S = μP ∘ f   (2)

For the particular cases:
a. identical transformation: a′ = a, b′ = b ⇒ f(x) = x
b. interval limits: x = a′ ⇒ f(a′) = a; x = b′ ⇒ f(b′) = b

Note. The expression of the membership function μP/S is defined based only on the original membership function µP and does not depend on the algorithm for modeling it. Consequently, the defining method of the linguistic values is preserved by the transformation f.
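Formulae (1) and (2) translate directly into a higher-order helper (a sketch; mu_P here is any membership function on [a, b], and a2/b2 play the role of a′/b′):

```python
# mu_{P/S} = mu_P o f, where f linearly stretches [a2, b2] back onto [a, b].

def relative_membership(mu_P, a, b, a2, b2):
    def f(x):                                    # formula (1)
        return a + (b - a) / (b2 - a2) * (x - a2)
    return lambda v: mu_P(f(v))                  # formula (2)
```

The particular cases above hold by construction: f(a2) = a, f(b2) = b, and when a2 = a, b2 = b the transformation is the identity.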
Definition 6. The algebraic model of the AMONG operator is:

µP AMONG S : R → [0,1]
µP AMONG S (t) = min( μP( a1 + ((b1 − a1)/(b1′ − a1′)) · (t.A1 − a1′) ), μS(t.A2) )   (3)

where [a1′, b1′] ⊆ [a1, b1] is the sub-interval of the A1 attribute values corresponding to the table QS(R) (obtained
by the first selection, on the attribute A2, using property S). The new operator stands for the model of a certain fuzzy conjunctive aggregation, but it cannot be considered a triangular norm. Regarding the properties of a triangular norm, one can remark:

i. Commutativity is not satisfied by the AMONG operator: AMONG(P, S) ≠ AMONG(S, P), because μP/S(t) ≠ μS/P(t) and μS(t) ≠ μP(t) ⇒ min( μP/S(t), μS(t) ) ≠ min( μS/P(t), μP(t) ), ∀t, and because, semantically, such queries cannot be compared (see remark ii below).

ii. Associativity is satisfied by the AMONG operator (see Exhibit A), and because, semantically, such queries reflect the same idea. For example: Retrieve the (inexpensive cars among the high speed ones) selected from the low fuel consumption ones. Retrieve the inexpensive cars selected from (the high speed cars among the low fuel consumption ones).

iii. Monotonicity is not satisfied by the AMONG operator: ¬( AMONG(P, S) ≤ AMONG(P′, S′) if P ≤ P′ and S ≤ S′ ), because, although P ≤ P′ (as fuzzy models) ⇒ μP(t) ≤ μP′(t), ∀t ⇒ μP/S(t) ≤ μP′/S(t), ∀t, ∀S, and S ≤ S′ (as fuzzy models) ⇒ μS(t) ≤ μS′(t), ∀t ⇒ [a1′, b1′]S ⊆ [a1′, b1′]S′, the comparison μP/S(t) ≤ μP/S′(t), ∀t is not always true, even if [a1′, b1′]S ⊆ [a1′, b1′]S′. We denote by [a1′, b1′]S and [a1′, b1′]S′ the sub-intervals of the A1 attribute values from the tables QS(R) and QS′(R) (obtained by the first selections, on the attribute A2, using property S and property S′).

iv. The unit element is satisfied by the AMONG operator: AMONG(P, 1) = P, because min( μP/1(t), 1 ) = min( μP(t), 1 ) = μP(t), ∀t.

Queries like the one above include two gradual properties in the fuzzy selection criterion, but in
Exhibit A. AMONG(P, AMONG(S, T)) = AMONG(AMONG(P, S), T), because
min( μP/(S/T)(t), min( μS/T(t), μT(t) ) ) = min( min( μ(P/S)/T(t), μS/T(t) ), μT(t) ) = min( μP/S/T(t), μS/T(t), μT(t) ), ∀t

Exhibit B. QP/S : 2^(D1 × D2 × … × Dn) → 2^(D1 × D2 × … × Dn × (0,1])
R ↦ { (t, min( μP/S(t), μS(t) )) | t ∈ R ∧ μP/S(t) > 0 ∧ μS(t) > 0 }
a special relationship: the second one refers to objects selected by the first one. That means the objects are selected by a qualification relative to another gradual property.

Definition 7. The vague query based on a qualification relative to another gradual property is an application (see Exhibit B).

Some remarks are interesting and very important:

i. A quite different query expression: Retrieve the most inexpensive cars among the high speed ones can be assimilated with the previous one, so it can be submitted to the same evaluation procedure. The most inexpensive criterion is not equivalent to the relational aggregation MIN operation on the
whole car table (Table 1), but it corresponds to a fuzzy selection on a fuzzy table. Moreover, this query expression may be even more suggestive for the database user and semantically adequate to the response in Table 5.

ii. The new aggregation operator, AMONG, is not commutative: the inversion of the two criteria leads to a different query answer. Actually, when thinking of the semantics of the operation, the AMONG operator models exactly the importance level of the criteria, according to the user’s preference. For example, let us compare the queries: Retrieve the inexpensive cars among the high speed ones. (µ = min(µinexpensive/high, µhigh)) and
Table 6. The “high speed cars among the inexpensive ones” table

| Name | Max Speed | Price | µ inexpensive | µ HIGH | µ |
|------|-----------|-------|---------------|--------|---|
| IO | 145 | 24000 | 1 | 0.00 | 0.00 |
| MC | 190 | 18200 | 1 | 0.09 | 0.09 |
| M L200 | 145 | 19095 | 1 | 0.00 | 0.00 |
| NV | 132 | 15883 | 1 | 0.00 | 0.00 |
| OA | 186 | 16042 | 1 | 0.00 | 0.00 |
| OV | 208 | 20669 | 1 | 1 | 1 |
| OZ | 178 | 18364 | 1 | 0.00 | 0.00 |
| P 206 | 170 | 10466 | 1 | 0.00 | 0.00 |
| P 806 | 177 | 20633 | 1 | 0.00 | 0.00 |
| RC | 186 | 12138 | 1 | 0.00 | 0.00 |
| OCS | 120 | 26259 | 0.85 | 0.00 | 0.00 |
| LRD | 130 | 28000 | 0.61 | 0.00 | 0.00 |
| AA 4 | 221 | 28450 | 0.54 | 1 | 0.54 |
| P 607 | 222 | 31268 | 0.16 | 1 | 0.16 |
| B3 | 226 | 31562 | 0.13 | 1 | 0.13 |
| C 300M | 230 | 32000 | 0.07 | 1 | 0.07 |
Retrieve the high speed cars among the inexpensive ones. (µ = min(µhigh/inexpensive, µinexpensive))

The difference is evident when comparing Tables 5 and 6 (the finally selected rows are shaded).

iii. When comparing Tables 4 and 5, one can observe the difference between the conjunctive criterion “P AND S” (multiqualification) and the new kind of selection criterion “P AMONG S” (relative qualification). Semantically, the AND operator combines in one selection two independent criteria, having the same importance (priority) for the user’s preferences. On the contrary, the AMONG operator has to evaluate the second criterion prior to the first one. This is a supplementary argument for the non-commutativity of the AMONG operator.

iv. After a practical study of the use of relative qualification, we observed:
a. Generally, a query combining the two properties searches for quite disjoint object categories. The answer of an AND conjunction is sometimes empty. On the contrary, the AMONG operator evaluates the second selection on a non-empty set of objects by adapting the model of the gradual property to the already selected objects. Obviously, the answer will be non-empty and adequate to the user’s expectation.
b. The AMONG operator does not give spectacular answers when the two properties refer to approximately the same objects. For example, the query:
Retrieve the expensive cars among the high speed ones. (the shaded rows in Table 7)
compared to the query
Retrieve the expensive and high speed cars. (the shaded rows in Table 8)
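The contrast between these two answers can be reproduced with a small sketch (assumed uniform-covering trapezoids; the speed domain [120, 280] and price domain [10466, 69154] are read off the car table):

```python
# AND vs AMONG for "expensive" / "high speed". Under AMONG, "expensive" is
# re-modeled on the price subdomain of the high-speed cars before taking min.

def mu_l3(v, lo, hi):          # the "high"-side trapezoid of the uniform covering
    a = (hi - lo) / 8.0
    return max(0.0, min(1.0, (v - (lo + 5 * a)) / a))

cars = [("P 911 C", 280, 65000), ("B Coupe", 250, 39000), ("B7", 243, 57200),
        ("MBS", 240, 69154), ("AA", 236, 46000), ("C 300M", 230, 32000),
        ("B3", 226, 31562), ("P 607", 222, 31268), ("AA4", 221, 28450)]

def mu_high(s):      return mu_l3(s, 120, 280)
def mu_expensive(p): return mu_l3(p, 10466, 69154)

# "expensive AND high speed": both properties on their full domains.
and_answer = {n: min(mu_high(s), mu_expensive(p)) for n, s, p in cars}

# "expensive AMONG high speed": rescale "expensive" on the pre-selected rows.
pre = [(n, s, p) for n, s, p in cars if mu_high(s) > 0]
lo = min(p for _, _, p in pre)
hi = max(p for _, _, p in pre)
among_answer = {n: min(mu_l3(p, lo, hi), mu_high(s)) for n, s, p in pre}
```

B7 illustrates the difference: it is fully “expensive” on the whole domain (AND degree 1), but only about 0.65 expensive among the high-speed cars, matching Table 7 versus Table 8.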
v. Therefore, the most typical situation where this evaluation method is applicable is when the two criteria are in a special semantic relationship: the first selection brings a hard limitation of the class of objects, so that dynamic modeling of the linguistic values for the second selection becomes useful. The above procedure is not difficult to implement if we consider it a sequence of several operations. But an original and ef-
Table 7. The “expensive cars among the high speed ones” table

| Name | Max Speed | Price | µ high | µ expensive/high | µ |
|------|-----------|-------|--------|------------------|---|
| P 911 C | 280 | 65000 | 1 | 1 | 1 |
| B Coupe | 250 | 39000 | 1 | 0.00 | 0.00 |
| B7 | 243 | 57200 | 1 | 0.65 | 0.65 |
| MBS | 240 | 69154 | 1 | 1 | 1 |
| AA | 236 | 46000 | 0.80 | 0.00 | 0.00 |
| C 300M | 230 | 32000 | 0.50 | 0.00 | 0.00 |
| B3 | 226 | 31562 | 0.30 | 0.00 | 0.00 |
| P 607 | 222 | 31268 | 0.10 | 0.00 | 0.00 |
| AA 4 | 221 | 28450 | 0.05 | 0.00 | 0.00 |
Table 8. The “expensive and high speed cars” table

| Name | Max Speed | Price | µ high | µ expensive | µ |
|------|-----------|-------|--------|-------------|---|
| P 911 C | 280 | 65000 | 1 | 1 | 1 |
| B Coupe | 250 | 39000 | 1 | 0.00 | 0.00 |
| B7 | 243 | 57200 | 1 | 1 | 1 |
| MBS | 240 | 69154 | 1 | 1 | 1 |
| AA | 236 | 46000 | 0.80 | 0.00 | 0.00 |
| C 300M | 230 | 32000 | 0.50 | 0.00 | 0.00 |
| B3 | 226 | 31562 | 0.30 | 0.00 | 0.00 |
| P 607 | 222 | 31268 | 0.10 | 0.00 | 0.00 |
| AA 4 | 221 | 28450 | 0.05 | 0.00 | 0.00 |
ficient method to evaluate this kind of query is proposed in the section titled The Unified Model of the Context, where the knowledge base (the fuzzy definitions of the vague linguistic terms) is incorporated in the database.

vi. The fuzzy predicates used to evaluate the criterion at the first step of the procedure can be defined in advance, but this is not mandatory. On the contrary, at step 3, an algorithm has to be used in order to automatically obtain the adapted definitions. Various algorithms for dynamically defining linguistic values of database attributes are proposed in the section titled Dynamic Modeling of the Linguistic Values.

vii. Similar procedures can be used to evaluate more complex queries including relative qualification, for example:
How many inexpensive cars are among the high speed ones?

We need to mention that the aggregate computation on groups (how many) is not the subject of the present chapter (see, e.g., Blanco, Delgado, Martin-Bautista, Sánchez, & Vila, 2002; Delgado, Sánchez, & Vila, 2000; Rundensteiner & Bic, 1991), but only the fuzzy aggregation implementing the relative qualification (the AMONG operator).

Relative Qualification to Other Crisp Attribute

At least one more situation is relatively frequent: the linguistic values must be dynamically defined on an attribute subdomain obtained after a crisp selection. This concerns a complex selection criterion that includes a gradual property referring to the database rows already selected by a crisp value. Let us imagine a table (Table 9) containing all the sales transactions of a national company. The following query must take into account that, generally, the amount of sales differs between cities (from the biggest to the smallest):

Retrieve the clients in Galati which get large quantities of our product

The selection criterion “large quantity” has a different meaning for different cities. The query evaluation procedure follows the same steps as in the previous section:
Table 9. The transactions in sales table

| Client | ... | Quantity | City |
|--------|-----|----------|------|
| AA | | 70 | Galati |
| AA4 | | 21 | Tecuci |
| B3 | | 67 | Galati |
| B7 | | 200 | Bucharest |
| BC | | 30 | Galati |
| CM | | 230 | Bucharest |
| IO | | 145 | Bucharest |
| LRD | | 130 | Galati |
| MBS | | 24 | Tecuci |
| MC | | 90 | Tecuci |
| ML | | 145 | Bucharest |
| NV | | 132 | Galati |
| OA | | 86 | Tecuci |
| OCS | | 120 | Galati |
| OF | | 102 | Galati |
| OV | | 8 | Tecuci |
| OZ | | 17 | Tecuci |
| P2 | | 166 | Galati |
| P6 | | 222 | Bucharest |
| P8 | | 177 | Bucharest |
| P9C | | 28 | Tecuci |
| RC | | 186 | Bucharest |
1. The crisp selection criterion city = ‘Galati’ is classically evaluated and an intermediate result is obtained (Table 10).
2. The interval containing the quantities of the selected sales forms the quantity subdomain [30, 166], instead of [8, 230].
3. The linguistic value set {small, medium, large} is defined on the new subdomain (Figure 4).
4. The fuzzy selection criterion large quantity is evaluated according to the new definitions, and the fulfillment degree results for each tuple (Table 11).

Table 10. The transactions in Galati city table

| Client | ... | Quantity | City |
|--------|-----|----------|------|
| AA | | 70 | Galati |
| B3 | | 67 | Galati |
| BC | | 30 | Galati |
| LRD | | 130 | Galati |
| NV | | 132 | Galati |
| OCS | | 120 | Galati |
| OF | | 102 | Galati |
| P2 | | 166 | Galati |

One can remark that a large quantity at Galati (for example, 130) is less than the minimum at Bucharest (i.e., 145). This is the reason the definitions of the linguistic values must be adapted to the context; that is, the qualification (large quantity) is relative to the other, crisp attribute (city). The presented situation is a special case of the previous one. The evaluation procedure is the same, using the same AMONG operator; the simplification consists in the character of the second property (S), which is a crisp and not a gradual property, and for which the classical selection operation is enough. In this case, the simplified algebraic model of the AMONG operator becomes:

µP AMONG S : R → [0,1], t ↦ μP/S(t.A1)
µP AMONG S (t) = μP( a1 + ((b1 − a1)/(b1′ − a1′)) · (t.A1 − a1′) )   (4)

where [a1′, b1′] ⊆ [a1, b1] is the sub-interval of the attribute A1 values in the table QS(R) (obtained by the first selection, on the attribute A2, using property S). One can observe that, this time, the property S contributes to the criteria satisfaction degree only by limiting the domain of the property P: [a1, b1] becomes [a1′, b1′].
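A minimal sketch of this simplified case (the large trapezoid again assumes the uniform covering of formulae (5)-(6); the data is the Galati subset of Table 9, plus a few other rows):

```python
# Crisp first selection (city = 'Galati'), then "large" defined on the narrowed
# quantity subdomain: the simplified AMONG of formula (4).

sales = [("AA", 70, "Galati"), ("B3", 67, "Galati"), ("BC", 30, "Galati"),
         ("LRD", 130, "Galati"), ("NV", 132, "Galati"), ("OCS", 120, "Galati"),
         ("OF", 102, "Galati"), ("P2", 166, "Galati"),
         ("B7", 200, "Bucharest"), ("CM", 230, "Bucharest"), ("OV", 8, "Tecuci")]

def mu_large(v, lo, hi):
    a = (hi - lo) / 8.0
    return max(0.0, min(1.0, (v - (lo + 5 * a)) / a))

galati = [(c, q) for c, q, city in sales if city == "Galati"]   # step 1 (crisp)
lo = min(q for _, q in galati)                                  # step 2: [30, 166]
hi = max(q for _, q in galati)
answer = {c: mu_large(q, lo, hi) for c, q in galati if mu_large(q, lo, hi) > 0}
```

Under these assumptions, `answer` matches Table 11: P2 and NV get 1, LRD about 0.88, OCS about 0.29, so a quantity of 130, merely average nationwide, counts as large in Galati.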
Figure 4. Linguistic values {SMALL, MEDIUM, LARGE} defined on the quantity subdomain [30, 166] (axis marks at 30, 64, 81, 108, 115, 132, and 166).
Table 11. Transactions of large quantities in Galati city

| Client | ... | Quantity | City | µ |
|--------|-----|----------|------|---|
| P2 | | 166 | Galati | 1 |
| NV | | 132 | Galati | 1 |
| LRD | | 130 | Galati | 0.88 |
| OCS | | 120 | Galati | 0.29 |

Note. If the above query is interpreted as a multiqualification and not as a relative qualification, the answer will be Table 12, absolutely different from Table 11. Therefore, a more suggestive formulation of the query would be:

Retrieve the clients which get large quantities of our product among the clients in Galati

Table 12. Transactions of large quantities AND in Galati city

| Client | ... | Quantity | City |
|--------|-----|----------|------|
| P2 | | 166 | Galati |

Relative Qualification to Group on Other Attribute

Let us start with an example. The queries in the previous paragraph assume that all sales refer to the same product (“our product”), so the quantities can be compared. But let us now consider that the sales of different products are stored in the database (Table 13). In this case, the query

Retrieve the clients which get large quantities of soap

expresses a qualification relative to a crisp attribute and can be evaluated as above. But if the query is:

Retrieve the clients which get large quantities

then the values of the quantity attribute for different products cannot be compared. This example suggests evaluating the large quantity criterion one product at a time, that is:

Retrieve the clients which get large quantities of some product

For each product, the linguistic value is defined on the interval of the quantity values existing in the database for that product only. According to the definitions in Figure 5, the answer is in Table 14. One can remark the “higher weight” of the 11 vacuum cleaners compared to the 162 envelopes.
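The per-product evaluation can be sketched as follows (the exact degrees depend on the definitions in Figure 5, so only the sub-intervals and the relative ordering are checked here; the trapezoid is an assumed uniform covering):

```python
# Group-relative qualification: "large" is re-defined on the quantity
# sub-interval of each product before any degree is computed.

sales = [("AA", 70, "soap"), ("AA4", 11, "vacuum cleaner"), ("B3", 6, "soap"),
         ("BC", 30, "soap"), ("CM", 230, "envelope"), ("IO", 145, "envelope"),
         ("MBS", 4, "vacuum cleaner"), ("ML", 162, "envelope"), ("NV", 10, "soap"),
         ("OA", 14, "vacuum cleaner"), ("OCS", 2, "soap"), ("OF", 102, "soap"),
         ("OV", 1, "vacuum cleaner"), ("OZ", 1, "vacuum cleaner"),
         ("P6", 200, "envelope"), ("P8", 70, "envelope"),
         ("P9C", 2, "vacuum cleaner"), ("RC", 18, "envelope")]

def mu_large(v, lo, hi):
    a = (hi - lo) / 8.0
    return max(0.0, min(1.0, (v - (lo + 5 * a)) / a))

# Per-product quantity sub-intervals, discovered from the data itself.
subdomains = {}
for _, q, prod in sales:
    lo, hi = subdomains.get(prod, (q, q))
    subdomains[prod] = (min(lo, q), max(hi, q))

answer = {client: mu_large(q, *subdomains[prod]) for client, q, prod in sales}
```

The “higher weight” remark holds: 11 vacuum cleaners score a full 1.0 on the [1, 14] sub-interval, above the 162 envelopes on [18, 230].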
Dynamic Modeling of the Linguistic Values

Our study has found a new class of queries, where two fuzzy criteria are combined in a complex selection criterion such that the second fuzzy criterion is applied to a subset of database rows already selected by the first one. We assume that the second fuzzy criterion is expressed by a linguistic value of a database attribute, which is a gradual property; not an absolute property, but a relative one. In this case, modeling the linguistic domain of the second attribute requires taking into account not the whole crisp attribute domain, but a limited subset, characteristic of the database rows selected by the first criterion. Actually, the main problem of relative qualification is how to dynamically define the linguistic values on the subdomains (step 3 of the above algorithm), depending on an instant context?
Table 13. Transactions of various product sales

| Client | ... | Quantity | Product | ... |
|--------|-----|----------|---------|-----|
| AA | | 70 | soap | |
| AA4 | | 11 | vacuum cleaner | |
| B3 | | 6 | soap | |
| BC | | 30 | soap | |
| CM | | 230 | envelope | |
| IO | | 145 | envelope | |
| MBS | | 4 | vacuum cleaner | |
| ML | | 162 | envelope | |
| NV | | 10 | soap | |
| OA | | 14 | vacuum cleaner | |
| OCS | | 2 | soap | |
| OF | | 102 | soap | |
| OV | | 1 | vacuum cleaner | |
| OZ | | 1 | vacuum cleaner | |
| P6 | | 200 | envelope | |
| P8 | | 70 | envelope | |
| P9C | | 2 | vacuum cleaner | |
| RC | | 18 | envelope | |
Some procedures for automatically discovering the definitions of the linguistic values can be implemented, with a great advantage: details regarding the effective attribute domain limits, or the distribution of the values, can be easily obtained thanks to the direct connection to the database (more details in Tudorie, 2004; Tudorie & Dumitriu, 2004). Two examples of algorithms are presented in the following. We assume that there are usually three linguistic values, modeled as trapezoidal membership functions.

The first algorithm (by uniform domain covering). In most applications, the defined set of linguistic values covers the referential domain almost uniformly (Figure 6).
• Obtaining the definitions of the three linguistic values l1, l2, and l3 on a database attribute starts from the predefined values α and β and from the attribute crisp domain limits, I and S; the latter come from the database content. For example:

α = (1/8)(S − I) and β = 2α = (1/4)(S − I)   (5)

Table 14. Transactions of large quantities sales

| Client | ... | Quantity | Product | ... | µ |
|--------|-----|----------|---------|-----|---|
| CM | | 230 | envelope | | 1 |
| P6 | | 200 | envelope | | 1 |
| OF | | 102 | soap | | 1 |
| OA | | 14 | vacuum cleaner | | 1 |
| AA4 | | 11 | vacuum cleaner | | 1 |
| ML | | 162 | envelope | | 0.53 |
| AA | | 70 | soap | | 0.46 |
Figure 5. Linguistic values {small, medium, large} defined on subdomains of the quantity attribute for each product (separate partitions for soap, vacuum cleaner, and envelope).
Figure 6. A set of linguistic values l1, l2, l3 uniformly distributed on an attribute domain [I, S].

The membership functions for l1, l2, and l3 are:

µl1(v) = 1, for I ≤ v ≤ I + β
         1 − (v − (I + β))/α, for I + β ≤ v ≤ I + β + α
         0, for v ≥ I + β + α

µl2(v) = 0, for I ≤ v ≤ I + β
         1 − (I + β + α − v)/α, for I + β ≤ v ≤ I + β + α
         1, for I + β + α ≤ v ≤ I + 2β + α
         1 − (v − (I + 2β + α))/α, for I + 2β + α ≤ v ≤ I + 2β + 2α
         0, for v ≥ I + 2β + 2α

µl3(v) = 0, for I ≤ v ≤ I + 2β + α
         1 − (I + 2β + 2α − v)/α, for I + 2β + α ≤ v ≤ I + 2β + 2α
         1, for v ≥ I + 2β + 2α
   (6)
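Equation (6), with the α and β of (5), can be coded directly; a convenient sanity check is that the three values form a partition (they sum to 1 everywhere on [I, S]). A sketch:

```python
# The uniform-covering trapezoids l1, l2, l3 of equation (6), parameterized by
# the crisp domain limits I and S taken from the database content.

def linguistic_values(I, S):
    a = (S - I) / 8.0          # alpha, formula (5)
    b = 2 * a                  # beta
    def l1(v):
        return max(0.0, min(1.0, 1.0 - (v - (I + b)) / a))
    def l3(v):
        return max(0.0, min(1.0, (v - (I + 2 * b + a)) / a))
    def l2(v):                 # the middle trapezium is exactly what is left over
        return 1.0 - l1(v) - l3(v)
    return l1, l2, l3
```

For the Galati quantity subdomain [30, 166], l1 is flat up to 64, l2 peaks between 81 and 115, and l3 is flat from 132 on, matching the breakpoints of Figure 4.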
where v = t.A is a value in the domain D = [I, S] of an attribute A of a table R. Based on this idea, a software interface, the FuzzyKAA system, presented in Tudorie (2006a, 2006b), is able to assist the user in defining linguistic values in a database context. Starting from a uniform partitioning of the attribute domain, the user can adjust the shapes either by changing the numerical coordinates of graphical points or by directly manipulating them.

•
The second algorithm (statistical-mean-based algorithm) takes into account the real distribution of the attribute values in the database. The idea is to center the middle trapezium on the statistical mean of the attribute values (M). The other membership functions are
Figure 7. Linguistic values l1, l2, l3 defined on the basis of the statistical mean M of the attribute values within [I, S].
distributed on the left and on the right over the rest of the interval (Figure 7). In this case, the basic data used to determine the fuzzy models are the attribute crisp domain limits (I and S), together with the statistical mean value within the [I, S] interval:

M = ( Σ_{i=1..n} t_i.A ) / n,

where n is the cardinality of the relation R and t_i ∈ R is a tuple. The values of α, β, and α′ are based on I, S, and M; they can be:

α = (1/4) · min(M − I, S − M)
β = 2α = (1/2) · min(M − I, S − M)
α′ = (S − I) − (7/4) · min(M − I, S − M)

If 0 < α < (1/8)(S − I) and 0 < β < (1/4)(S − I), then (1/8)(S − I) < α′ < (S − I).

The formulae of the membership functions for the linguistic values l1, l2, and l3 depend on the asymmetry of the statistical mean value within the [I, S] interval. They are obtained in a similar way to the above algorithm. The fuzzy models obtained by this method seem to be closer to the meaning accepted by the user’s mind. It is important to remark that the same online method to model the linguistic domain of a database attribute can be used at any time, instead of an
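The parameters of this second algorithm can be sketched as follows (a hypothetical helper; a, b, a2 stand for α, β, α′):

```python
# Statistical-mean-based parameters: the middle trapezium is centered on M,
# and a2 (alpha') absorbs the remainder of the interval.

def mean_based_params(values):
    I, S = min(values), max(values)
    M = sum(values) / len(values)
    m = min(M - I, S - M)
    a = m / 4.0                       # alpha
    b = 2.0 * a                       # beta = m / 2
    a2 = (S - I) - 7.0 / 4.0 * m      # alpha'
    return I, S, M, a, b, a2
```

The stated bound follows directly: since a2 = (S − I) − 7α, whenever 0 < α < (S − I)/8 we get (S − I)/8 < a2 < S − I.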
off-line process of knowledge acquisition from a human expert.
The Unified Model of the Context

Generally, an intelligent interface for flexible database querying is an extra layer using its own data (a knowledge base) containing the fuzzy model of the linguistic terms included in vague queries. Such an interface must be conceived as generally as possible, that is, able to connect to any database, assuming that the corresponding knowledge base is already available (Figure 8). FSQL (Galindo et al., 2006), for example, uses an FMB (Fuzzy Metaknowledge Base) with the definitions of labels, quantifiers, and more information about the fuzzy capabilities. Fuzzy query evaluation is made possible by building an equivalent crisp query. The knowledge (the fuzzy model of the linguistic terms) is used first for building the SQL query and afterwards for computing the fulfillment degree of each tuple. The context is defined in this case as the pair formed by the database and its corresponding knowledge base. The functionality and utility of such an intelligent interface have been proved in practice by several software systems presented in the next section. One of the most important points of an interface to databases is performance, more specifically, the response time in query evaluation. In order to obtain good performance, an efficient solution needs to model the context in a uniform way, as a single database incorporating the fuzzy model of the linguistic terms, or their description, in the target database. So, a unified model of the context is proposed in the following. There are two possibilities to model a unified context:

•
Static Context: including in the database the static definitions of the linguistic terms (their fuzzy models), established a priori, before the querying process.
Figure 8. The flexible interface integrated in the querying system: a database server hosts Database 1 … Database n, each paired with its fuzzy knowledge base (KB 1 … KB n), accessed through the flexible interface for database querying.
• Dynamic Context: including in the database only the data necessary to dynamically define the linguistic terms at the moment of (or during) the querying process.

In the first case, only absolute qualification or multiqualification queries can be processed. On the contrary, in order to evaluate relative qualification queries, the second model must be adopted; in other words, the fulfillment degree is dynamically computed by taking into account the subdomains of the attributes.

Static Model of the Context

The fuzzy model of the linguistic terms can be described by various methods. Some complex graphical interfaces were developed in the Computer Science Department of the “Dunarea de Jos” University and are presented in Tudorie (2006a, 2006b) and Tudorie, Neacsu, and Manolache (2005) and in the next section of this chapter. Usually, the shape of the membership function of a fuzzy set is trapezoidal. However, we chose a more general model, a polygonal shape (Figure 9). In this case, the knowledge base can be modeled as a set of tables, which can be incorporated in the database. One possible unified model of the context is presented in Figure 10. Table1, Table2, …, Tablen are the tables of the target database; Terms and Points contain the description of the linguistic value shapes.

Dynamic Model of the Context

The section titled Relative Object Qualification has presented certain types of queries that require dynamically defining the linguistic values by partitioning an attribute subdomain already obtained by a previous selection. Moreover, if we accept that the dynamic definition is suitable to the user’s perception, then even the initial acquisition of the knowledge can be realized dynamically, with minimal involvement of the user (knowledge engineer). In all these situations, the above model is no longer functional; this time, the complexity is transferred to expressions evaluated at querying time, which stand for the linguistic values model. At querying time, only the labels of the linguistic domain need to be known. Therefore, the Points table no longer exists in the context model; only the Terms table does. According to the proposed model of the context, vague query evaluation consists in building a single crisp SQL query that provides the searched database objects and, at the same time, the degree of criteria satisfaction for each of them. Various situations will be analyzed for each of the two proposed models of the context, thus:
Figure 9. A possible model of the linguistic domain of a database attribute A: each linguistic value is a polygonal membership function described by a list of points (pi11, pf11, pf12, …, pi41, pf41, pf42).

Figure 10. The unified context model, based on static definitions of the linguistic values.
• an absolute qualification criterion can be evaluated in a static context or in a dynamic context;
• a multiqualification criterion can also be evaluated in both kinds of context;
• on the contrary, a relative qualification can be evaluated only according to the dynamic context model.

In the following, we assume three linguistic values, represented by trapezoidal membership functions on the database attribute domain. For the dynamic definitions, we adopt the first algorithm proposed in the section titled Dynamic Modeling of the Linguistic Values. Any generalization is possible.

Evaluation of an Absolute Qualification Criterion

Let us consider a pseudo-SQL query generated from the user’s interface (any kind of interface style: natural language, graphical, command language, etc.). The general form is:

SELECT * FROM table WHERE attribute = #'term'

where the symbol #, preceding a linguistic term, denotes a gradual property. The crisp SQL query corresponding to the user’s request, observing the context model based on static definitions, can be as shown in Equation 7. The EXPRESSION has to be replaced with the algebraic model of the fulfillment degree, taking into account the context (data and knowledge) in Figure 10 (see Equation 8). The crisp SQL query corresponding to the user’s request, observing the context model based on dynamic definitions, can be as shown in Equation 9. The EXPRESSION has to be replaced with the algebraic model of the fulfillment degree, taking into account the context (data and knowledge) and formulae (5)–(6). For example, the expression for the first linguistic value is:

µl1(v) = σ(v − I) − ((v − I − β)/α) · σ(v − I − β) − σ(v − I − β − α) · (1 − (v − I − β)/α)   (10)
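Formula (10) can be verified numerically against the piecewise definition of µl1 (a sketch; the exact step function σ is used here, whereas the sign-based approximation (13) differs only at the breakpoints, where sign(0) = 0 yields 1/2):

```python
# Closed form (10) vs the piecewise trapezoid l1: identical away from breakpoints.

def sigma(v):
    return 1.0 if v > 0 else 0.0      # exact step function

def l1_closed(v, I, a, b):            # formula (10); a = alpha, b = beta
    r = (v - I - b) / a
    return sigma(v - I) - r * sigma(v - I - b) - sigma(v - I - b - a) * (1.0 - r)

def l1_piecewise(v, I, a, b):
    if v <= I + b:
        return 1.0
    if v >= I + b + a:
        return 0.0
    return 1.0 - (v - I - b) / a
```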
Equation 7. SELECT r.*, EXPRESSION AS ”degree”
FROM table r, terms t, points p
WHERE t.table=’table‘ AND t.attribute=’attribute’ AND t.label=’term’ AND
t.ID_t = p.ID_t AND r.attribute >= p.pi AND r.attribute <= p.pf AND degree > 0 ORDER BY degree DESC;
Equation 8.
SELECT r.*, p.gi + (r.attribute - p.pi) / (p.pf - p.pi) * (p.gf - p.gi) AS "degree"
FROM table r, terms t, points p
WHERE t.table='table' AND t.attribute='attribute' AND t.label='term' AND
t.ID_t = p.ID_t AND r.attribute >= p.pi AND r.attribute <= p.pf AND
degree > 0
ORDER BY degree DESC;
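The degree computed in Equation 8 is simply a linear interpolation between the two stored points (pi, gi) and (pf, gf) of a membership-function segment. As a rough illustration outside SQL (the function name and the sample segment below are ours, not the chapter's):

```python
def static_degree(v, pi, pf, gi, gf):
    """Linear interpolation of the fulfillment degree between the
    stored points (pi, gi) and (pf, gf), as in Equation 8."""
    if not (pi <= v <= pf):
        return 0.0  # outside the stored segment, as filtered by the WHERE clause
    return gi + (v - pi) / (pf - pi) * (gf - gi)

# Example: a segment rising from degree 0 at value 10 to degree 1 at value 20
print(static_degree(15, 10, 20, 0.0, 1.0))  # 0.5
```

Each trapezoid is stored as several such segments in the points table, so the join on t.ID_t = p.ID_t picks the right segment for each row.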
Equation 9.
SELECT r.*, EXPRESSION AS "degree"
FROM table r, terms t
WHERE t.table='table' AND t.attribute='attribute' AND t.label='term' AND
degree > 0
ORDER BY degree DESC;
where v = r.attribute, and I and S are the limits of the attribute domain:

I = (SELECT MIN(r.attribute) FROM table r)
S = (SELECT MAX(r.attribute) FROM table r)

α and β are:

α = (1/8)·(S − I) = 1/8*(SELECT MAX(r.attribute) - MIN(r.attribute) FROM table r)   (11)

β = 2α = (1/4)·(S − I) = 1/4*(SELECT MAX(r.attribute) - MIN(r.attribute) FROM table r)   (12)

σ is the step function:

σ(v) = 0 if v ≤ 0; 1 if v > 0

and can be approximated by the sign function:

sign(v) = −1 if v < 0; 0 if v = 0; 1 if v > 0

so that:

σ(v) = (1/2)·(1 + sign(v))   (13)
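Under the dynamic model, formula (10) can thus be computed directly from the domain limits. The following sketch (function names and the sample domain are ours) implements σ through the sign approximation of formula (13) and evaluates the first linguistic value:

```python
def sign(v):
    return -1 if v < 0 else (0 if v == 0 else 1)

def step(v):
    # sigma(v) approximated as (1 + sign(v)) / 2, as in formula (13)
    return (1 + sign(v)) / 2

def mu_l1(v, I, S):
    # First linguistic value under dynamic definitions, formula (10):
    # alpha = (S - I)/8, beta = 2*alpha
    a = (S - I) / 8
    b = 2 * a
    return (step(v - I)
            - (v - I - b) / a * step(v - I - b)
            - step(v - I - b - a) * (1 - (v - I - b) / a))

# Domain [0, 8] gives alpha = 1, beta = 2: full membership up to v = 2,
# then a linear descent reaching 0 at v = 3
print(mu_l1(1.0, 0, 8))   # 1.0
print(mu_l1(2.5, 0, 8))   # 0.5
print(mu_l1(4.0, 0, 8))   # 0.0
```

Note that the sign approximation yields σ(0) = 1/2 rather than 0, which is the price of expressing the step function with Oracle's SIGN.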
By expressing all three membership functions (μl1, μl2, and μl3), the SQL command is as shown in Equation 14.
Evaluation of a Multiqualification Criterion: The Model of the AND Operator

In order to evaluate such a query, the expression of "degree" will be:
Equation 14.
SELECT r.*, DECODE(t.order,
1, 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)))
   - (r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 1/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
     / (1/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
     * 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 1/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
   - 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 3/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
     * (1 - (r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 1/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
       / (1/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))),
2, (r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 1/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
     / (1/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
     * 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 1/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
   + 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 3/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
     * (1 - (r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 3/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
       / (1/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
   - 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 5/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
     * (r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 5/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
       / (1/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
   - 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 3/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
     * (1 - (r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 5/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
       / (1/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))),
3, (r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 5/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
     / (1/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
     * (1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 5/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
      - 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 3/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))))
   + 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 3/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
   - 1/2*(1+SIGN(r.attribute - (SELECT MAX(r.attribute) FROM table r)))
) AS "degree"
FROM table r, terms t
WHERE t.table='table' AND t.attribute='attribute' AND t.label='term' AND
degree > 0
ORDER BY degree DESC;

DECODE is a selection-like function (in Oracle).
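Since DECODE may be unfamiliar outside Oracle, here is a minimal Python analogue of its selection behavior (an illustrative sketch only; Oracle's DECODE has additional semantics, such as treating two NULLs as equal):

```python
def decode(expr, *args):
    """Minimal analogue of Oracle's DECODE: compare expr to each search
    value in turn and return the matching result; an odd trailing
    argument acts as the default."""
    pairs, default = args, None
    if len(args) % 2 == 1:
        pairs, default = args[:-1], args[-1]
    for search, result in zip(pairs[0::2], pairs[1::2]):
        if expr == search:
            return result
    return default

# Selecting a membership expression by the order of the linguistic term:
print(decode(2, 1, "mu_l1", 2, "mu_l2", 3, "mu_l3"))  # mu_l2
```

In Equation 14, the search values 1, 2, 3 are the ranks of the three linguistic values (t.order), and the results are the three membership expressions.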
Equation 15.
SELECT * FROM table WHERE attribute1 = # 'term1' AND attribute2 = # 'term2'
the SQL query according to a static context is:

SELECT r.*, LEAST(EXPRESSION1, EXPRESSION2) AS "degree"
FROM table r, terms t1, terms t2, points p1, points p2
WHERE t1.table='table' AND t1.attribute='attribute1' AND t1.label='term1' AND
t1.ID_t = p1.ID_t AND r.attribute1 >= p1.pi AND r.attribute1 <= p1.pf AND
t2.table='table' AND t2.attribute='attribute2' AND t2.label='term2' AND
t2.ID_t = p2.ID_t AND r.attribute2 >= p2.pi AND r.attribute2 <= p2.pf AND
degree > 0
ORDER BY degree DESC;
Equation 16.
SELECT r.*, LEAST(EXPRESSION1, EXPRESSION2) AS "degree"
FROM table r, terms t1, terms t2
WHERE t1.table='table' AND t1.attribute='attribute1' AND t1.label='term1' AND
t2.table='table' AND t2.attribute='attribute2' AND t2.label='term2' AND
degree > 0
ORDER BY degree DESC;
LEAST(EXPRESSION1, EXPRESSION2)

where EXPRESSION1 and EXPRESSION2 are the satisfaction degrees of the two criteria, and LEAST corresponds to the mathematical min function (in Oracle). Thus, for a query like Equation 15, the SQL query according to a dynamic context is as shown in Equation 16. These command lines can be considered the algorithmic model of the AND conjunction operator in a database fuzzy querying context.
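The min-based conjunction can be sketched outside SQL as follows (the row names and degrees below are invented for illustration):

```python
def and_degree(expr1, expr2):
    # Conjunction of two fuzzy criteria modeled by the minimum,
    # as LEAST(EXPRESSION1, EXPRESSION2) does in the SQL commands above
    return min(expr1, expr2)

# A row satisfying 'term1' to degree 0.8 and 'term2' to degree 0.3:
print(and_degree(0.8, 0.3))  # 0.3

# Rows are kept only when degree > 0 and returned best first,
# mirroring the WHERE degree > 0 ... ORDER BY degree DESC clauses:
rows = [("r1", 0.8, 0.3), ("r2", 0.6, 0.9), ("r3", 0.0, 1.0)]
answers = sorted(
    ((name, and_degree(d1, d2)) for name, d1, d2 in rows
     if and_degree(d1, d2) > 0),
    key=lambda x: -x[1])
print(answers)  # [('r2', 0.6), ('r1', 0.3)]
```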
Evaluation of a Relative Qualification Criterion: The Model of the AMONG Operator

This time, only the dynamic context can be considered, because the second criterion is always evaluated by taking into account the selection already obtained after the first criterion, and the linguistic values are always dynamically modeled on subdomains of the attribute. Thus, for a query like

SELECT * FROM table WHERE attribute1 = # 'term1' AMONG attribute2 = # 'term2'
the SQL query according to a dynamic context is as shown in Equation 17. The expression EXPRESSION2 corresponds to the first criterion and observes the above model, represented in formulae (10)-(13). But EXPRESSION1, which corresponds to the criterion evaluated second, refers to the attribute subdomain obtained by the first selection. So, in EXPRESSION1, the table will be replaced by the answer table Q, and the new values of the parameters will be:
Equation 17.
SELECT r.*, LEAST(EXPRESSION1, Q.firstdegree) AS "degree"
FROM ( SELECT r.*, EXPRESSION2 AS "firstdegree"
       FROM table r, terms t
       WHERE t.table='table' AND t.attribute='attribute2' AND t.label='term2' AND
       firstdegree > 0 ) AS Q,
     table r, terms t
WHERE t.table='table' AND t.attribute='attribute1' AND t.label='term1' AND
degree > 0
ORDER BY degree DESC;
Figure 11. FuzzyQE interface for linguistic domain defining and database complex querying
I = (SELECT MIN(Q.attribute1) FROM Q)
S = (SELECT MAX(Q.attribute1) FROM Q)
α = 1/8*(SELECT MAX(Q.attribute1)-MIN(Q.attribute1) FROM Q)
β = 1/4*(SELECT MAX(Q.attribute1)-MIN(Q.attribute1) FROM Q)
(18)
Equation 17 can be considered the algorithmic model of the AMONG operator for the relative qualification in a database fuzzy querying context.
Retrieve the inexpensive cars among the high speed ones
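The three steps of the AMONG evaluation (first selection, redefinition of the second term on the resulting subdomain as in Equation 18, then min-combination) can be sketched for this car query. The helper names, the simple ramp-shaped terms, and the car data below are illustrative assumptions, not the chapter's exact trapezoidal definitions:

```python
def rising(I, S):
    # A linguistic value increasing over the domain [I, S] (e.g., 'high speed')
    return lambda v: max(0.0, min(1.0, (v - I) / (S - I)))

def falling(I, S):
    # A linguistic value decreasing over the domain [I, S] (e.g., 'inexpensive')
    return lambda v: max(0.0, min(1.0, (S - v) / (S - I)))

# "Retrieve the inexpensive cars among the high speed ones"
cars = [{"id": "c1", "speed": 220, "price": 30000},
        {"id": "c2", "speed": 200, "price": 18000},
        {"id": "c3", "speed": 120, "price": 9000}]

# Step 1: evaluate 'high speed' on the whole speed domain (inner query of Eq. 17)
speeds = [c["speed"] for c in cars]
mu_high = rising(min(speeds), max(speeds))
q = [(c, mu_high(c["speed"])) for c in cars if mu_high(c["speed"]) > 0]

# Step 2: redefine 'inexpensive' on the price subdomain of the selected cars (Eq. 18)
prices = [c["price"] for c, _ in q]
mu_cheap = falling(min(prices), max(prices))

# Step 3: combine with LEAST/min, keep degrees > 0, rank best first
answers = sorted(((c["id"], min(mu_cheap(c["price"]), d))
                  for c, d in q if min(mu_cheap(c["price"]), d) > 0),
                 key=lambda x: -x[1])
print(answers)  # [('c2', 0.8)]
```

Note how c1, although the fastest car, drops out: on the subdomain of high-speed cars it is the most expensive one, so its 'inexpensive' degree is 0.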
Laboratory Software Tools for Database Flexible Querying

Several software tools have been developed for scientific purposes, to be used for studying flexible queries (Tudorie, 2006a, 2006b; Tudorie et al., 2005). These multifunctional systems enable the analysis of many aspects. The most important is the opportunity of studying the relative qualification phenomenon and of validating the proposed algorithms for fuzzy query evaluation. All these systems allow acquiring fuzzy models of the linguistic terms through graphical interfaces. Some of them are also able to interpret natural language (Romanian) queries. It is important to
remark that these systems are general enough to allow connection, at any time, to any database and its associated knowledge base, whether newly created or already in use. Here are three examples:
Interface for Fuzzy Knowledge Acquisition and Database Fuzzy Querying (FuzzyQE System)

Goal: This software tool connects the user to any database and assists the user in defining linguistic values and fuzzy queries in that database context. The system proposes a uniform partitioning of the attribute domain; the definitions implicitly obtained can then be adjusted, either by changing the numerical coordinates of graphical points or by directly manipulating them. Simple queries (absolute object qualification) against existing definitions, but also complex queries (relative object qualification), are evaluated by implementing the AMONG operator (Figure 11).
Multi-User System for Linguistic Values Modeling and Database Fuzzy Querying (MultiDef System)

Goal: This software tool is able to connect several users (e.g., knowledge engineers) to the same database; each of them has the possibility to describe each linguistic value of the database attributes. A defining process starts with an initial implicit model; the user may modify it according to his own semantics for the current linguistic term. An administrator, having his own interface, monitors and manages all this activity; at any moment he has a complete view of all the membership functions drawn by the users for the same linguistic terms on the same attribute domain (Figure 12). Many types of queries can be evaluated.
Flexible Interface for Linguistic Values Modeling and Database Fuzzy Querying (CALIF System)

Goal: This software tool enables the connection to any database via a graphical interface, and the modeling of linguistic values (as fuzzy sets) by a choice of algorithms. Three main types of queries (simple, conjunction, and relative) can be evaluated. The interface is very flexible, providing many ways of adjusting parameters, definitions, and options (Figure 13).
Conclusion

This chapter formulates a number of new problems, not very complicated, but referring to quite frequent situations, which had not been discussed so far. The main aim of the chapter, and its originality, is the introduction of the concept of relative qualification of objects in the context of relational database querying. In this setting, some important new problems, strongly related to relative qualification, were developed (dynamic modeling of the linguistic values and the unified context based on dynamic definitions). Moreover, in an extended framework, this chapter discussed the problems of linguistic qualification of objects and of context modeling in all their aspects. More precisely, the complex criteria we studied (relative qualification) include two vague conditions in a special relationship: the first gradual property, expressed by a linguistic qualifier, is interpreted and evaluated relative to the second one; accordingly, the fulfillment degree is computed in a particular way. The main idea of the evaluation procedure is to dynamically define sets of linguistic values on limited attribute domains, determined by previous fuzzy selections. This is the reason why it is not useful to create the knowledge base with the fuzzy definitions a priori, but rather to define the vague terms included in queries each time they are needed. One more reason is the great advantage of connecting directly to the database: details regarding effective attribute domain limits, or distributions of the values, can be easily obtained. With this in mind, we developed the problem of dynamic modeling of linguistic values. Methods for automatic extraction of the linguistic values' definitions from the actual database attribute values and solutions for
Figure 12. MultiDef interface for linguistic domain defining and database complex querying
Figure 13. CALIF interface for linguistic domain defining and database complex querying
uniformly modeling the context (database and knowledge base) were proposed. The theoretical contribution consists in the new fuzzy aggregation operator AMONG, defined in this chapter; it stands for the model of the relative selection criterion in database fuzzy queries. A detailed discussion of its semantics and properties, along with other remarks, is presented in the chapter. Some implementations that validate all these ideas were also briefly presented. They were developed and are running in the laboratory of our department. Future work will explore the implications of the newly proposed kind of query in real fields like business intelligence, OLAP, or data mining, but also other application fields of the new connective AMONG, like fuzzy control, or fuzzy databases, where fuzzy values are involved.
Acknowledgment The author is thankful for all the remarks made by the reviewers, particularly the editor. Many thanks also to the “English reviewer.”
References

Blanco, I., Delgado, M., Martín-Bautista, M. J., Sánchez, D., & Vila, M. A. (2002). Quantifier guided aggregation of fuzzy criteria with associated importances. In T. Calvo, R. Mesiar, & G. Mayor (Eds.), Aggregation operators: New trends and applications (Studies in Fuzziness and Soft Computing Series 97, pp. 272-290). Physica-Verlag.

Bosc, P., & Pivert, O. (1992). Fuzzy querying in conventional databases. In L. A. Zadeh & J. Kacprzyk (Eds.), Fuzzy logic for the management of uncertainty (pp. 645-671). New York: Wiley.

Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17.

Bosc, P., & Prade, H. (1997). An introduction to fuzzy set and possibility theory-based approaches to the treatment of uncertainty and imprecision in data base management systems. In A. Motro & P. Smets (Eds.), Uncertainty management in information systems: From needs to solutions (pp. 285-324). Kluwer Academic Publishers.

Bouchon-Meunier, B. (1995). La logique floue et ses applications. Paris: Addison-Wesley.

Buckles, B. P., & Petry, F. E. (1982). A fuzzy representation of data for relational databases. Fuzzy Sets and Systems, 7(3), 213-226.

Delgado, M., Sanchez, D., & Vila, M. A. (2000). Fuzzy cardinality based evaluation of quantified sentences. International Journal of Approximate Reasoning, 23, 23-66.

Dubois, D., Ostasiewicz, W., & Prade, H. (1999). Fuzzy sets: History and basic notions (Tech. Rep. No. IRIT/99-27 R). Toulouse, France: Institut de Recherche en Informatique.

Dubois, D., & Prade, H. (1996). Using fuzzy sets in flexible querying: Why and how? In H. Christiansen, H. L. Larsen, & T. Andreasen (Eds.), Workshop on flexible query-answering systems (pp. 89-103), Roskilde, Denmark.

Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group Publishing.

Gazzotti, D., Piancastelli, L., Sartori, C., & Beneventano, D. (1995). FuzzyBase: A fuzzy logic aid for relational database queries. Paper presented at the 6th International Conference on Database and Expert Systems Applications, DEXA'95 (pp. 385-394), London, UK.

Goncalves, M., & Tineo, L. (2001a). SQLf flexible querying language extension by means of the norm SQL2. Paper presented at the 10th IEEE International Conference on Fuzzy Systems, Fuzz-IEEE 2001 (Vol. 1), Melbourne, Australia.

Goncalves, M., & Tineo, L. (2001b). SQLf3: An extension of SQLf with SQL3 features. Paper presented at the 10th IEEE International Conference on Fuzzy Systems, Fuzz-IEEE 2001 (Vol. 1), Melbourne, Australia.

Grabisch, M., Orlovski, S. A., & Yager, R. R. (1998). Fuzzy aggregation of numerical preferences. In R. Slowinski (Ed.), Fuzzy sets in decision analysis, operations research and statistics (pp. 31-68). Boston: Kluwer Academic Publishers.

Kacprzyk, J., & Zadrozny, S. (1995). FQUERY for ACCESS: Fuzzy querying for a Windows-based DBMS. In P. Bosc & J. Kacprzyk (Eds.), Fuzziness in database management systems (pp. 415-433). Heidelberg: Physica-Verlag.

Kacprzyk, J., & Zadrozny, S. (2001). Computing with words in intelligent database querying: Standalone and Internet-based applications. Information Sciences, 134, 71-109.

Medina, J. M., Pons, O., & Vila, M. A. (1994). GEFRED: A generalized model for fuzzy relational databases. Information Sciences, 76, 87-109.

Prade, H., & Testemale, C. (1984). Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences, 34, 115-143.
Projet BADINS: Bases de données multimédia et interrogation souple. (1995). Rennes, France: Institut de recherche en informatique et systèmes aléatoires.

Projet BADINS: Bases de données multimédia et interrogation souple. (1997). Rennes, France: Institut de recherche en informatique et systèmes aléatoires.

Rundensteiner, E., & Bic, L. (1991). Evaluating aggregates in possibilistic relational databases. Data & Knowledge Engineering, 7, 239-267.

Tudorie, C. (2004). Linguistic values on attribute subdomains in vague database querying. Journal on Transactions on Systems, 3(2), 646-650.

Tudorie, C. (2006a). Contributions to interfaces for database flexible querying. Doctoral thesis, University "Dunărea de Jos," Galaţi, Romania.

Tudorie, C. (2006b). Laboratory software tools for database flexible querying. Paper presented at the 2006 International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems, IPMU'06 (pp. 112-115).

Tudorie, C., & Dumitriu, L. (2004). How are the attribute linguistic domains involved in database fuzzy queries evaluation. Scientific Bulletin of "Politehnica" University of Timisoara, 49(63), 61-64.

Tudorie, C., Neacsu, C., & Manolache, I. (2005). Fuzzy queries in Romanian language: An intelligent interface. Annals of "Dunarea de Jos," III, 45-53.

Yager, R. R. (1991). Connectives and quantifiers in fuzzy sets. Fuzzy Sets and Systems, 40(1), 39-75. Elsevier Science.
Zadeh, L. A. (1975). The concept of linguistic variable and its application to approximate reasoning (parts I, II, and III). Information Sciences, 8, 199-251, 301-357; 9, 43-80.
Yager, R. R., & Zadeh, L. A. (Eds.). (1992). Introduction to fuzzy logic applications in intelligent systems. Kluwer Academic Publishers.

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353.

Key Terms

Absolute Object Qualification: Including one gradual property (simple qualification) or a conjunction of gradual properties (multiqualification) in a query selection criterion. The fulfillment degree is computed according to the fuzzy sets describing the meaning of the linguistic qualifiers. A conjunctive aggregation operator is used to model the connective AND (in the case of multiqualification).

AMONG Operator: Fuzzy aggregation operator used for the evaluation of a query selection criterion based on a relative qualification. The algebraic model of the AMONG operator is:

μP AMONG S : R → [0,1]
μP AMONG S (t) = min( μP( a1 + ((b1 − a1)/(b1' − a1')) · (t.A1 − a1') ), μS(t.A2) )

where R is a relation, A1 and A2 are two attributes of the R relation, A1 is defined on the interval [a1, b1], t is a tuple, P and S are two gradual properties corresponding to the attributes A1 and A2, μP and μS are the membership functions defining the P and S gradual properties, and [a1', b1'] ⊆ [a1, b1] is the sub-interval of A1 corresponding to the table QS(R) (obtained by the first selection, on the attribute A2, using property S).

Context for Fuzzy Querying Interface: The pair: the target database and the knowledge base (containing the fuzzy model of the linguistic terms) corresponding to it.

Dynamic Model of the Context: Including in the database only the data necessary to dynamically define the linguistic terms, at the moment of (or during) the querying process.

Dynamic Model of the Linguistic Value: Automatic discovery of the linguistic values' definitions from the actual content of the database. Appropriate algorithms can be implemented, based on a great advantage: by directly connecting to the database, one can easily obtain details regarding the effective attribute domain limits, or the distributions of the values. This procedure is generally useful, instead of an off-line process of knowledge acquisition from a human expert; it is mandatory in the relative qualification case.

Multiqualification: Including several gradual properties in a query selection criterion. They are independent of each other and have the same significance for the user's preferences. The fulfillment degree is computed according to the fuzzy sets describing the meaning of the linguistic qualifiers and to the conjunctive aggregation operator, as a model of the connective AND.

Relative Object Qualification: Two gradual properties, as fuzzy conditions, are combined in a complex selection criterion, such that one of them is applied on a subset of database rows already selected by the other one; a dynamic definition of the linguistic value corresponding to the secondly evaluated condition is needed. The fulfillment degree is computed according to the fuzzy sets describing the meaning of the linguistic qualifiers and to the AMONG aggregation operator.

Static Model of the Context: Including in the database the static definitions of the linguistic terms (their fuzzy models), established a priori, before the querying process.

Unified Model of the Context: Modeling the context in a uniform style, as a single database, i.e., incorporating the fuzzy model of the linguistic terms, or their description, in the target database.
Chapter X
Evaluation of Quantified Statements Using Gradual Numbers

Ludovic Liétard, IRISA/IUT & IRISA/ENSSAT, France
Daniel Rocacher, IRISA/IUT & IRISA/ENSSAT, France
Abstract

This chapter is devoted to the evaluation of quantified statements, which can be found in many applications such as decision making, expert systems, or flexible querying of relational databases using fuzzy set theory. Its contribution is to introduce the main techniques to evaluate such statements and to propose a new theoretical background for the evaluation of quantified statements of type "Q X are A" and "Q B X are A." In this context, quantified statements are interpreted using an arithmetic on gradual numbers from ℕf, ℤf, and ℚf. It is shown that the context of fuzzy numbers provides a framework which unifies previous approaches and can be the basis for the definition of new approaches.
Introduction

Linguistic quantifiers are quantifiers defined by linguistic expressions like "around 5" or "most of," and many types of linguistic quantifiers can be found in the literature (Diaz-Hermida, Bugarin, & Barro, 2003; Glockner, 1997, 2004a, 2004b; Losada, Díaz-Hermida, & Bugarín, 2006) (such as semi-fuzzy quantifiers, which allow modeling expressions like "there are twice as many men as women"). We limit this presentation to the original linguistic quantifiers defined by Zadeh (1983) and
the two types of quantified statements he proposes. Such linguistic quantifiers allow an intermediate attitude between the conjunction (expressed by the universal quantifier ∀) and the disjunction (expressed by the existential quantifier ∃). Two types of quantified statements can be distinguished. A statement of the first type is denoted “Q X are A” where Q is a linguistic quantifier, X is a crisp set and A is a fuzzy predicate. Such a statement means that “Q elements belonging to X satisfy A.” An example is provided by “most of employees are well-paid” where Q is most of and X is a set
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
of employees, whereas A is the condition to be well-paid. In this first type of quantified statement, the referential (denoted by X) for the linguistic quantifier is a crisp set (a set of employees in the example). A second type of quantified statement can be defined where the linguistic quantifier applies to a fuzzy referential. This is the case of the statement "most of young employees are well-paid," since most of applies to the fuzzy referential made of young employees. This statement means that most of the elements from this fuzzy referential (most of the young employees) can be considered well-paid. Such a quantified statement is written "Q B X are A" where A and B are two fuzzy predicates (when referring to the previous example, Q is most of, X is a set of employees, B is to be young, while A is to be well-paid). Linguistic quantifiers can be used in many fields, and we briefly recall their use in multicriteria decision making, expert systems, linguistic summaries of data, and flexible querying of relational databases; some minor applications of linguistic quantifiers, as in machine learning (Kacprzyck & Iwanski, 1992) or neural networks (Yager, 1992), are not dealt with. Multicriteria decision making consists mainly in finding optimal solutions to a problem defined by objectives and constraints. A solution must fulfill all objectives and must satisfy all constraints. The use of linguistic quantifiers in decision making (Fan & Chen, 2005; Kacprzyck, 1991; Malczewski & Rinner, 2005; Yager, 1983a) aims at retrieving solutions fulfilling Q objectives with respect to Q' constraints, where Q and Q' are either a linguistic quantifier or the universal quantifier. A typical formulation is then "find the solution where almost all objectives are achieved and where all constraints are satisfied." The use of linguistic quantifiers in expert systems concerns mainly the expression and handling of logical propositions.
An example is provided by logical statements accepting exceptions. A typical statement accepting exceptions is the proposition "all Swedes are tall," which can be turned into "almost all Swedes are tall," involving the linguistic quantifier "almost all." Many inferences involving
quantified statements are possible (Dubois, Godo, De Mantaras, & Prade, 1993; Dubois & Prade, 1988a; Laurent, Marsala, & Bouchon-Meunier, 2003; Loureiro Ralha & Ghedini Ralha, 2004; Mizumoto, Fukami, & Tanaka, 1979; Sanchez, 1988). It is possible to consider the following one, set in the probabilistic framework: if I know that "Karl is a Swede" and that "almost all Swedes are tall," it is then possible to infer that the event "Karl is tall" is probable. The challenge is then to compute the degree of probability (which may be imprecise) attached to the event "Karl is tall." Data summarization (Kacprzyck, Yager, & Zadrozny, 2006; Sicilia, Díaz, Aedo, & García, 2002) is another field where linguistic quantifiers can be helpful. Yager (1982) defines summaries expressed by expressions involving linguistic quantifiers (the summary of a database could be "almost the half of young employees are well-paid"). The SummarySQL language (Rasmussen & Yager, 1997) has been proposed to define and evaluate linguistic summaries of data defined by quantified statements. As an example, it is possible to use this language to determine the validity (represented by a degree), on a given database, of the linguistic summary "almost the half of young employees are well-paid." Flexible querying of relational databases aims at expressing preferences in queries instead of Boolean requirements, as is the case for regular (or crisp) querying. Consequently, a flexible query returns a set of discriminated answers to the user (from the best answers to the less preferred). Many approaches to defining flexible queries have been proposed, and it has been shown that the fuzzy-set-based approach is the most general (Bosc & Pivert, 1992).
Extensions of the SQL language, namely SQLf (Bosc & Pivert, 1995) and FSQL (Galindo, 2005, 2007; Galindo, Medina, Pons, & Cubero, 1998; Galindo, Urrutia, & Piattini, 2006), have been proposed to define sophisticated flexible queries calling on fuzzy sets (in this book, the reader can find a chapter by Urrutia, Tineo, and Gonzalez including a comparison between FSQL and SQLf). In this context, predicates are defined by fuzzy sets and are called fuzzy predicates, and they can be combined using various operators
such as generalized conjunctions and generalized disjunctions (respectively expressed by t-norms and t-conorms) or using more sophisticated operators such as averages. A fuzzy predicate can also be defined by a quantified statement, as in the query "retrieve the firms where most of employees are well-paid." After query evaluation, each firm is associated with a degree in [0,1] expressing its satisfaction with respect to the quantified statement of the first type: "most of employees are well-paid." The higher this degree, the better an answer the firm is. To evaluate a quantified statement is to determine the extent to which it is true. This chapter proposes a new theoretical framework to evaluate quantified statements of type "Q X are A" and "Q B X are A." The propositions are based on the handling of gradual integers (from ℕf and ℤf) (Rocacher & Bosc, 2003a, 2003b) and gradual rational numbers (from ℚf), as defined in Rocacher and Bosc (2003c, 2005). These specific numbers express well-known but gradual quantities and differ from usual fuzzy numbers, which define imprecise (ill-known) numbers. The section titled Linguistic Quantifiers and Quantified Statements introduces the definition of quantified statements, while the section titled Previous Proposals for the Interpretation of Quantified Statements is a brief overview of the propositions made for the evaluation of quantified statements. Gradual numbers are introduced in the section titled Gradual Numbers and Gradual Truth Value, and the section titled Interpretation of Quantified Statements Using Gradual Numbers proposes to evaluate quantified statements using gradual numbers. In the following, we denote by A(X) the fuzzy set made of elements from a crisp set X which satisfy a fuzzy predicate A (A(X) being defined by X ∩ A).
Linguistic Quantifiers and Quantified Statements

First-order logic involves two quantifiers, the universal quantifier (∀) and the existential one (∃), which are too limited to model all natural language quantified sentences. For this reason, fuzzy quantifiers (Zadeh, 1983) have been introduced to represent linguistic expressions (many of, at least 3, etc.) and to refer to gradual quantities. It is possible to distinguish between absolute quantifiers (which refer to an absolute number such as about 3, at least 2, etc.) and relative quantifiers (which refer to a proportion such as about half, at least a quarter, etc.). An absolute (resp. relative) quantifier Q in the statement "Q X are A" means that the number (resp. proportion) of elements satisfying condition A is compatible with Q. A linguistic quantifier can be increasing (resp. decreasing) (Yager, 1988), which means that an increase in the satisfaction of condition A cannot decrease (resp. increase) the truth value of the statement "Q X are A." At least 3 and almost all (resp. at most 2, at most half) are examples of increasing (resp. decreasing) quantifiers. A quantifier is monotonic when it is either increasing or decreasing; it is also possible to point out unimodal quantifiers which refer to a quantity, such as about half, about 4, and so forth. The representation of an absolute quantifier is a fuzzy subset of the real line, while a relative quantifier is defined by a fuzzy subset of the unit interval [0,1]. In both cases, the membership degree µQ(j) represents the truth value of the statement "Q X are A" when j elements in X completely satisfy A, whereas A is fully unsatisfied by the others (j being a number or a proportion). In other words, the definition of a linguistic quantifier provides the evaluation of "Q X are A" in the case of a Boolean predicate. Consequently, the representation of an increasing (resp. decreasing) linguistic quantifier is an increasing (resp. decreasing) function µQ such that µQ(0) = 0 (resp. µQ(0) = 1) and ∃ k such that µQ(k) = 1 (resp. ∃ k such that µQ(k) = 0).

Example. Figure 1 describes the increasing relative linguistic quantifier almost all.
It is worth mentioning that, in case of an absolute quantifier, a quantified statement of type “Q B X are A” reverts to the quantified statement of the other
Evaluation of Quantified Statements Using Gradual Numbers
Figure 1. A representation for the quantifier almost all (µalmost all(p) plotted against the proportion p: 0 up to p = 0.7, reaching 1 at p = 0.9)
type: "Q X are (A and B)." As an example, "at least 3 young employees are well-paid" is equivalent to "at least 3 employees are (young and well-paid)." As a consequence, when dealing with quantified statements of type "Q B X are A," this chapter considers only relative quantifiers.
Previous Proposals for the Interpretation of Quantified Statements

In this section, the main propositions suggested to determine the truth value of quantified statements are briefly overviewed. An in-depth study of quantified statement interpretations can be found in Liu and Kerre (1998a, 1998b); Delgado, Sanchez, and Vila (2000); Barro, Bugarin, Cariñena, and Diaz-Hermida (2003); or Diaz-Hermida, Bugarin, Cariñena, and Barro (2004). The subsection titled Quantified Statements of Type "Q X are A" is devoted to the evaluation of quantified statements of type "Q X are A," whereas the subsection titled Quantified Statements of Type "Q B X are A" is devoted to the evaluation of quantified statements of type "Q B X are A." A short conclusion about these proposals is provided in the subsection titled About the Proposed Approaches to Evaluate Quantified Statements.
Quantified Statements of Type “Q X are A”
Relative quantifiers are assumed hereafter; the adaptation to absolute quantifiers requires only the change of the quantity µQ(i/n) into µQ(i), with n the cardinality of the set X involved in the quantified statement. In the particular case of a Boolean predicate A, the evaluation of "Q X are A" is given by µQ(c), where c is the number of elements satisfying A. Some approaches (interpretations based on a precise or an imprecise cardinality) extend this definition to a fuzzy predicate A, assuming that the cardinality of a fuzzy set can be computed. Other approaches (using an OWA operator or a Sugeno fuzzy integral) are based on a relaxation principle which implies neglecting some elements. As an example, the interpretation of "almost all employees are young" means that some of the oldest employees can be (more or less) neglected before assessing the extent to which the remaining employees are young.

Interpretation Based on a Precise Cardinality

Zadeh (1983) suggests computing the precise cardinality of fuzzy set A (called sigma-count and denoted ΣCount(A)). The sigma-count is defined as the sum of the membership degrees, and the degree of truth of "Q X are A" is then µQ(ΣCount(A)/n), with n the cardinality of set X. The definition of ΣCount(A) implies that a large number of small µA(x) values has the same effect on the result as a small number of large µA(x) values. As a consequence, many drawbacks can be found, such as the one shown by the next example.

Example. Set X = {x1, x2, ..., x10} is such that ∀i, µA(xi) = 0.1. In this case, the result for "∃ X are A" is expected to be 0.1 (or at least extremely low). The existential quantifier ∃ is defined by µ∃(0) = 0 and ∀i > 0, µ∃(i) = 1, and the absolute quantified statement is evaluated by µ∃(ΣCount(A)). Computations give ΣCount(A) = 1, which implies that the expression "∃ X are A" is entirely true (µ∃(1) = 1). This result is very far from the expected one. ♦
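As a sketch, the sigma-count computation and the counter-example above can be reproduced in a few lines of Python (the function names are ours; the quantifier ∃ follows the definition given in the example):

```python
def sigma_count(degrees):
    """Zadeh's precise cardinality: the sum of the membership degrees."""
    return sum(degrees)

def evaluate_absolute(mu_q, degrees):
    """Evaluate 'Q X are A' for an absolute quantifier given as a function mu_q."""
    return mu_q(sigma_count(degrees))

# The existential quantifier: mu(0) = 0 and mu(i) = 1 for i > 0.
def mu_exists(c):
    return 0.0 if c <= 0 else 1.0

# Ten elements, each satisfying A only at degree 0.1.
degrees = [0.1] * 10
print(evaluate_absolute(mu_exists, degrees))  # 1.0, although about 0.1 was expected
```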
Interpretation Based on an Imprecise Cardinality

The method proposed in Prade (1990) involves two steps. The first one computes the imprecise cardinality πc of the set made of the elements from X which satisfy A (it is a fuzzy number represented by a possibility distribution over integers). Then, quantifier Q is considered a vague predicate serving as a basis for a matching with πc. The result is a couple of degrees, the possibility and the necessity of the fuzzy event "πc is compatible with Q." The imprecise cardinality of the set F of elements from X which satisfy A is given by the following possibility distribution (Dubois & Prade, 1985; Prade, 1990). Let k be the number of elements of F whose degree is 1 (k may equal 0):

πc(k) = 1,
∀i < k, πc(i) = 0,
∀j > k, πc(j) is the j-th largest value µF(x).
In the particular case where F is a usual set, πc describes a precise value (πc(k) = 1 and πc(i) = 0 ∀i ≠ k) which is the usual cardinality of this set. The possibility Π(Q ; πc) and the necessity N(Q ; πc) of the fuzzy event "πc is compatible with Q" are (Dubois, Prade, & Testemale, 1988b):

Π(Q ; πc) = max_{1 ≤ i ≤ n} min(µQ(i/n), πc(i))

and

N(Q ; πc) = min_{1 ≤ i ≤ n} max(µQ(i/n), 1 − πc(i)).
Figure 2. A representation for the quantifier almost all (0 up to p = 0.7, 0.25 at p = 0.8, 1 from p = 0.9)
Example. Let Q be the increasing relative quantifier almost all defined in Figure 2 and X = {x1, x2, x3, x4, x5, x6 , x7, x8 , x9, x10} with µA(x1) = µA(x2) = ... = µA(x7) = 1, µA(x8) = 0.9, µA(x9) = 0.7, µA(x10) = 0. We have: πc(7) = 1, πc(8) = 0.9 and πc(9) = 0.7, with: µalmost all(1/10) = ... = µalmost all(7/10) = 0, µalmost all(8/10) = 0.25, µalmost all(9/10) = µalmost all(1) = 1. The interpretation of “almost all X are A” leads to Exhibit A.
Interpretation by the OWA Operator

We assume that X = {x1, ..., xn} and µA(x1) ≥ µA(x2) ≥ ... ≥ µA(xn). The interpretation of "Q X are A" (Q being increasing) by an ordered weighted average (OWA operator) is given by (Yager, 1988):

∑_{i=1}^{n} (wi × µA(xi)),

where wi = µQ(i/n) − µQ((i−1)/n). Each weight wi represents the increase of satisfaction when comparing a situation where (i−1) elements are entirely A with a situation where i elements are entirely A (and the others are not at all A). This operator conveys a semantics of relaxation since the smaller wi, the more neglected µA(xi). An extension of the use of the OWA operator to decreasing quantifiers has been proposed by Yager (1993) and Bosc and Liétard (1993). The extension is based on the equivalence:

"Q X are A" ⇔ "Q′ X are ¬A,"

where Q′ is the antonym of the decreasing quantifier Q (Q′ is then an increasing quantifier given
Exhibit A.

Π(Q ; πc) = max(min(µalmost all(7/10), πc(7)), min(µalmost all(8/10), πc(8)), min(µalmost all(9/10), πc(9)))
= max(min(0, 1), min(0.25, 0.9), min(1, 0.7)) = 0.7,

N(Q ; πc) = min(max(µalmost all(7/10), 1 − πc(7)), max(µalmost all(8/10), 1 − πc(8)), max(µalmost all(9/10), 1 − πc(9)))
= min(max(0, 0), max(0.25, 0.1), max(1, 0.3)) = 0 ♦

(The terms for the other values of i do not affect the result: they contribute 0 to the maximum and 1 to the minimum.)
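The two steps (building πc, then matching it against Q) can be sketched in Python; the piecewise-linear shape used below for almost all (0 at 0.7, 0.25 at 0.8, 1 at 0.9) is only one plausible reading of Figure 2:

```python
def imprecise_cardinality(degrees):
    """Possibility distribution pi_c over candidate cardinalities (Prade, 1990)."""
    d = sorted(degrees, reverse=True)
    k = sum(1 for v in d if v == 1.0)            # number of fully satisfying elements
    pi = {i: 0.0 for i in range(len(d) + 1)}
    pi[k] = 1.0
    for j in range(k + 1, len(d) + 1):
        pi[j] = d[j - 1]                         # j-th largest membership degree
    return pi

def possibility_necessity(mu_q, degrees):
    """Pi(Q; pi_c) and N(Q; pi_c) for a relative quantifier mu_q on [0,1]."""
    n = len(degrees)
    pi = imprecise_cardinality(degrees)
    Pi = max(min(mu_q(i / n), pi[i]) for i in range(1, n + 1))
    N = min(max(mu_q(i / n), 1 - pi[i]) for i in range(1, n + 1))
    return Pi, N

def almost_all(p):
    # An assumed piecewise-linear reading of Figure 2: 0 at 0.7, 0.25 at 0.8, 1 at 0.9.
    if p <= 0.7: return 0.0
    if p <= 0.8: return (p - 0.7) * 2.5
    if p <= 0.9: return 0.25 + (p - 0.8) * 7.5
    return 1.0

degrees = [1, 1, 1, 1, 1, 1, 1, 0.9, 0.7, 0]
print(possibility_necessity(almost_all, degrees))  # (0.7, 0.0)
```

The result reproduces Exhibit A: a possibility of 0.7 and a necessity of 0.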
by ∀p ∈ [0,1], µQ′(p) = µQ(1−p)). It is then possible to use the initial proposition to interpret "Q′ X are ¬A." In addition, when Q is not monotonic, this approach leads to the GD method introduced in the section titled The Probabilistic Approach (GD Method).
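The OWA evaluation itself is a simple weighted sum once the degrees are sorted; the following sketch (with the same assumed shape for almost all) applies it to the data of the previous example:

```python
def owa_increasing(mu_q, degrees):
    """Yager's OWA evaluation of 'Q X are A' for an increasing relative quantifier."""
    d = sorted(degrees, reverse=True)              # mu_A(x1) >= ... >= mu_A(xn)
    n = len(d)
    w = [mu_q(i / n) - mu_q((i - 1) / n) for i in range(1, n + 1)]
    return sum(wi * di for wi, di in zip(w, d))

def almost_all(p):
    # One plausible piecewise-linear reading of Figure 2 (an assumption):
    # 0 up to 0.7, 0.25 at 0.8, 1 from 0.9 on.
    if p <= 0.7: return 0.0
    if p <= 0.8: return (p - 0.7) * 2.5
    if p <= 0.9: return 0.25 + (p - 0.8) * 7.5
    return 1.0

degrees = [1, 1, 1, 1, 1, 1, 1, 0.9, 0.7, 0]
print(round(owa_increasing(almost_all, degrees), 3))  # 0.75
```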
The Probabilistic Approach (GD Method)

This method (Delgado et al., 2000) is based on the following imprecise cardinality of the fuzzy set A(X): ∀k ∈ {0, 1, 2, ..., n}, p(k) = bk − bk+1, where n is the cardinality of set X and bk is the k-th largest degree of membership of an element in the fuzzy set A(X) (with b0 = 1 and bn+1 = 0). A value p(k) can be interpreted as the probability that set A(X) contains k elements. The evaluation of a "Q X are A" statement with an absolute quantifier is:

∑_{k=0}^{n} p(k) × µQ(k).

When Q is relative, the evaluation becomes:

∑_{k=0}^{n} p(k) × µQ(k/n).
This interpretation is clearly the average value of the different values taken by the linguistic quantifier.
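A minimal sketch of the GD evaluation (function names are ours):

```python
def gd_evaluation(mu_q, degrees, relative=True):
    """GD method: the expected value of Q over the probabilistic cardinality p(k)."""
    n = len(degrees)
    b = [1.0] + sorted(degrees, reverse=True) + [0.0]   # b_0 = 1, b_{n+1} = 0
    p = [b[k] - b[k + 1] for k in range(n + 1)]         # p(k) = b_k - b_{k+1}
    return sum(p[k] * (mu_q(k / n) if relative else mu_q(k)) for k in range(n + 1))

def mu_exists(c):
    return 0.0 if c <= 0 else 1.0

# The sigma-count counter-example now behaves as expected:
print(gd_evaluation(mu_exists, [0.1] * 10, relative=False))  # 0.1
```

On the sigma-count counter-example it returns 0.1, the intuitively expected value.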
The ZS Method

The ZS method proposed in Delgado et al. (2000) considers the following fuzzy cardinality π of the fuzzy set A(X) of elements which satisfy predicate A: π(k) = 0 if there is no level cut α such that |A(X)α| = k; otherwise π(k) = sup{α such that |A(X)α| = k}. This fuzzy cardinality can be interpreted as a possibility distribution. The interpretation δ of the quantified statement "Q X are A" is the compatibility of the fuzzy quantifier Q with that fuzzy cardinality:

δ = max_{1 ≤ k ≤ n} min(µQ(k), π(k)),

where n is the cardinality of set X. This evaluation clearly provides the possibility of the event "the cardinality satisfies Q" (as in the interpretation based on an imprecise cardinality introduced above). In addition, it is a generalization (Delgado et al., 2000) of the Sugeno fuzzy integral approach: when Q is increasing, the ZS and the Sugeno integral methods lead to the same result.
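A sketch of the ZS computation; the triangular shape assumed for the absolute quantifier about 3 is hypothetical:

```python
def zs_evaluation(mu_q, degrees):
    """ZS method: compatibility of Q with the alpha-cut-based fuzzy cardinality pi."""
    levels = sorted({d for d in degrees if d > 0}, reverse=True)
    pi = {}
    for alpha in levels:
        k = sum(1 for d in degrees if d >= alpha)   # |A(X)_alpha|
        pi[k] = max(pi.get(k, 0.0), alpha)          # sup of the levels giving cardinality k
    return max(min(mu_q(k), p) for k, p in pi.items())

def about_3(c):
    # A triangular absolute quantifier (assumed shape): 1 at 3, 0 at 1 and 5.
    return max(0.0, 1 - abs(c - 3) / 2)

print(zs_evaluation(about_3, [1, 1, 0.8, 0.6]))  # 0.8
```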
Interpretation Based on a Sugeno Fuzzy Integral

The interpretation of "Q X are A" (Q being increasing) by a Sugeno fuzzy integral (Bosc & Liétard, 1994a, 1994b; Ying, 2006) is given by:
δ = max_{1 ≤ i ≤ n} min(µQ(i/n), µA(xi)),

where µA(x1) ≥ µA(x2) ≥ ... ≥ µA(xn). Due to the properties of the Sugeno fuzzy integral, δ states the existence of a subset C of X such that:

• each element in C is A with some concrete degree,
• subset C is in agreement with the linguistic quantifier Q.

Since Q is increasing, the more these two aspects are met, the higher the truth value for "Q X are A." As an example, "almost all employees are young" is evaluated by the existence of a subset of young employees which gathers almost all the employees. More precisely, δ can also be defined by:

δ = max_{C ∈ P(X)} min(p1(C), p2(C)),

where P(X) denotes the powerset of X, p1(C) is defined by min_{x ∈ C} µA(x), and p2(C) is given by µQ(|C|/n) with n the cardinality of set X. In addition, it can be demonstrated (Dubois et al., 1988b) that this interpretation can also be given by a weighted conjunction:

δ = min_{1 ≤ i ≤ n} max(1 − wi, µA(xi)),

where wi = 1 − µQ((i−1)/n) is the importance given to degree µA(xi). Here again, the smaller wi, the more neglected µA(xi). This Sugeno fuzzy integral based evaluation is a particular case of a proposition (Bosc & Liétard, 2005) made in a more general framework to evaluate the extent to which an aggregate (computed on a fuzzy set; the cardinality in the case of a quantified statement) is confronted with a fuzzy predicate (a linguistic quantifier). So, it can easily be extended to any kind of linguistic quantifier.
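A sketch of the Sugeno-integral evaluation under the same assumed shape for almost all; on the data of the imprecise-cardinality example it returns 0.7, the same value as the possibility degree, which illustrates the generalization result mentioned for the ZS method:

```python
def sugeno_evaluation(mu_q, degrees):
    """Sugeno-integral evaluation of 'Q X are A' (Q increasing and relative)."""
    d = sorted(degrees, reverse=True)
    n = len(d)
    return max(min(mu_q(i / n), d[i - 1]) for i in range(1, n + 1))

def almost_all(p):
    # Assumed piecewise-linear reading of Figure 2: 0 at 0.7, 0.25 at 0.8, 1 at 0.9.
    if p <= 0.7: return 0.0
    if p <= 0.8: return (p - 0.7) * 2.5
    if p <= 0.9: return 0.25 + (p - 0.8) * 7.5
    return 1.0

degrees = [1, 1, 1, 1, 1, 1, 1, 0.9, 0.7, 0]
print(sugeno_evaluation(almost_all, degrees))  # 0.7
```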
Quantified Statements of Type "Q B X are A"

This section presents the previous propositions for the interpretation of fuzzy quantified statements of type "Q B X are A." Here again, a relative quantifier Q is considered.
Interpretation with an OWA Operator

Yager (1988) suggests interpreting the expression "Q B X are A" by an ordered weighted averaging. Let X = {x1, ..., xn} with µB(x1) ≤ µB(x2) ≤ ... ≤ µB(xn) and:

∑_{i=1}^{n} µB(xi) = d.

The weights of the average are defined by wi = µQ(Si) − µQ(Si−1), with Si = (∑_{j=1}^{i} µB(xj))/d and S0 = 0. This operator aggregates the values of the implication µB(x) →K-D µA(x), where →K-D denotes the Kleene-Dienes implication (a →K-D b = max(1 − a, b)). If the implication values ci are sorted in decreasing order c1 ≥ c2 ≥ ... ≥ cn, the interpretation of "Q B X are A" is:

∑_{i=1}^{n} (ci × wi).
This calculus uses an OWA operator to aggregate implication values. As an example, the truth value obtained for “most of young employees are well-paid” is that of “for most of the employees, to be young implies to be well-paid.” The obtained result is far from the original meaning of the quantified statement.
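The weight and implication computations can be sketched as follows; the piecewise-linear shape for almost all interpolates the values µ(0) = 0, µ(1/3) = 0.2, µ(2/3) = 0.8, µ(1) = 1 used later in the text and is otherwise an assumption:

```python
def owa_qbx(mu_q, mu_b, mu_a):
    """Yager's OWA reading of 'Q B X are A' via Kleene-Dienes implication values."""
    n = len(mu_b)
    pairs = sorted(zip(mu_b, mu_a))                 # mu_B(x1) <= ... <= mu_B(xn)
    d = sum(b for b, _ in pairs)
    s = [0.0]
    for b, _ in pairs:
        s.append(s[-1] + b / d)                     # S_i = (sum_{j<=i} mu_B(xj)) / d
    w = [mu_q(s[i]) - mu_q(s[i - 1]) for i in range(1, n + 1)]
    impl = sorted((max(1 - b, a) for b, a in pairs), reverse=True)   # c1 >= ... >= cn
    return sum(wi * ci for wi, ci in zip(w, impl))

def almost_all(p):
    # Linear through (0,0), (1/3,0.2), (2/3,0.8), (1,1) -- an assumed shape.
    if p <= 1 / 3: return 0.6 * p
    if p <= 2 / 3: return 0.2 + 1.8 * (p - 1 / 3)
    return 0.8 + 0.6 * (p - 2 / 3)

# Table 1 data: B = [1, 1, 1], A = [1, 0, 0].
print(round(owa_qbx(almost_all, [1, 1, 1], [1, 0, 0]), 3))  # 0.2
```

On the data of Table 1 (B crisp, one A element out of three) the OWA reading yields 0.2.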
Interpretation by Decomposition

The interpretation by decomposition described in Yager (1983, 1984) is limited to increasing quantifiers. The proposition "Q B X are A" is true if an ordinary subset C of X satisfies the conditions p1 and p2 given hereafter:

p1: there are Q elements B in C,
p2: each element x of C satisfies the implication: (x is B) → (x is A).

The truth value of the proposition "Q B X are A" is then defined by:

sup_{C ∈ P(X)} min(p1(C), p2(C)),

where p1(C) (resp. p2(C)) denotes the degree of satisfaction of C with respect to the condition p1 (resp. p2). The value p1(C) is defined by µQ(h), where h is the proportion of B elements in set C. Yager suggests the following definition of h (using ΣCounts):

h = (∑_{x∈C} µB(x)) / (∑_{x∈X} µB(x)).

Table 1. Satisfaction degrees with respect to B and A

      x1   x2   x3
B     1    1    1
A     1    0    0
The value of p2(C) is ∧_{x ∈ C} (µB(x) → µA(x)), where ∧ is any triangular norm and → a fuzzy implication. This interpretation leads to evaluating the quantified statement by an aggregation of implication values µB(x) → µA(x). Similarly to the OWA-based interpretation of "Q B X are A," this interpretation is far from the original meaning of "Q B X are A."
Proposition of Vila, Cubero, Medina, and Pons

According to this proposition (Vila et al., 1997), the degree of truth for "Q B X are A" is defined by:

δ = α × max_{x∈X} min(µA(x), µB(x)) + (1 − α) × min_{x∈X} max(µA(x), 1 − µB(x)),

where α is a degree of orness (Yager & Kacprzyck, 1997) computed from the linguistic quantifier:

α = ∑_{i=1}^{n} ((n − i)/(n − 1)) × (µQ(i/n) − µQ((i − 1)/n)).
The interpretation of "Q B X are A" is a degree set between the truth value of "∃ B X are A" (given by max_{x∈X} min(µA(x), µB(x))) and that of "∀ B X are A" (given by min_{x∈X} max(µA(x), 1 − µB(x))). The closer α is to one, the more "Q B X are A" is interpreted as "∃ B X are A."

Example. Let us consider X = {x1, x2, x3} where the satisfaction degrees with respect to predicates B and A are given by Table 1. The value of α is given by:

α = 1 × (µalmost all(1/3) − µalmost all(0)) + 1/2 × (µalmost all(2/3) − µalmost all(1/3)) + 0 × (µalmost all(1) − µalmost all(2/3)).

The linguistic quantifier almost all is such that µalmost all(0) = 0, µalmost all(1/3) = 0.2, µalmost all(2/3) = 0.8 and µalmost all(1) = 1, and we get:

α = 1 × (0.2 − 0) + 1/2 × (0.8 − 0.2) + 0 × (1 − 0.8) = 0.2 + 0.3 = 0.5.

The final result is then:

δ = α × max_{x∈X} min(µA(x), µB(x)) + (1 − α) × min_{x∈X} max(µA(x), 1 − µB(x)) = 0.5 × 1 + (1 − 0.5) × 0 = 0.5.
As a consequence, “almost all B X are A” is true at degree 0.5 which is far from the expected result (since the proportion of A elements among the B elements is 1/3 and µalmost all(1/3) = 0.2). ♦
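The whole example can be checked with a short sketch (the quantifier linearly interpolates the four values given above):

```python
def vila_evaluation(mu_q, mu_b, mu_a):
    """Vila et al.: an orness-weighted mix of the existential and universal readings."""
    n = len(mu_b)
    orness = sum(((n - i) / (n - 1)) * (mu_q(i / n) - mu_q((i - 1) / n))
                 for i in range(1, n + 1))
    exist = max(min(a, b) for a, b in zip(mu_a, mu_b))
    univ = min(max(a, 1 - b) for a, b in zip(mu_a, mu_b))
    return orness * exist + (1 - orness) * univ

def almost_all(p):
    # Linear through the values given in the example: (0,0), (1/3,0.2), (2/3,0.8), (1,1).
    if p <= 1 / 3: return 0.6 * p
    if p <= 2 / 3: return 0.2 + 1.8 * (p - 1 / 3)
    return 0.8 + 0.6 * (p - 2 / 3)

print(round(vila_evaluation(almost_all, [1, 1, 1], [1, 0, 0]), 3))  # 0.5
```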
The GD Method for "Q B X are A" Statements

Delgado et al. (2000) propose a probabilistic view of the proportion of A elements among the B elements. Computations are related to the two fuzzy sets B(X) and A(X) ∩ B(X); when these sets are not normal, they should first be normalized (using any technique). Let S = {α1, α2, …, αm} be the set of the different satisfaction degrees of elements from X with respect to the fuzzy conditions B and A ∩ B (it is considered that 1 = α1 > α2 > … > αm > αm+1 = 0), and let P be the set of the different proportions provided by the α-cuts:

P = { |A(X)α ∩ B(X)α| / |B(X)α| such that α is in S }.

If we denote by P⁻¹(c) the set of levels from S having c as relative cardinality (c being in P):

P⁻¹(c) = {αi from S such that |A(X)αi ∩ B(X)αi| / |B(X)αi| = c},

the probability p(c) for a proportion c (in [0,1]) to represent |A(X) ∩ B(X)| / |B(X)| is defined by:

p(c) = ∑_{αi in P⁻¹(c)} (αi − αi+1).

The evaluation of "Q B X are A" is then:

∑_{c in P} p(c) × µQ(c).

About the Proposed Approaches to Evaluate Quantified Statements

Some properties to be verified by any technique to evaluate quantified statements of type "Q X are A" and "Q B X are A" have been proposed in the literature (Blanco, Delgado, Martín-Bautista, Sánchez, & Vila, 2002; Delgado et al., 2000), and it is possible to situate the different propositions with respect to these properties. At first, these properties are introduced, and then the evaluations of quantified statements are discussed. Concerning "Q X are A" statements, the following properties can be considered:
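A sketch of the GD method for "Q B X are A" (min is used as the t-norm, and B(X) is assumed normalized):

```python
def gd_qbx(mu_q, mu_b, mu_a):
    """GD method for 'Q B X are A' (min as the t-norm; assumes B(X) is normalized)."""
    inter = [min(a, b) for a, b in zip(mu_a, mu_b)]           # (A ∩ B)(X)
    levels = sorted({1.0} | {d for d in mu_b + inter if d > 0}, reverse=True)
    probs = {}
    for i, alpha in enumerate(levels):
        nxt = levels[i + 1] if i + 1 < len(levels) else 0.0
        b_card = sum(1 for d in mu_b if d >= alpha)           # |B(X)_alpha|
        c = sum(1 for d in inter if d >= alpha) / b_card      # proportion at this cut
        probs[c] = probs.get(c, 0.0) + (alpha - nxt)          # p(c) = sum (alpha_i - alpha_i+1)
    return sum(p * mu_q(c) for c, p in probs.items())

def almost_all(p):
    # Linear through (0,0), (1/3,0.2), (2/3,0.8), (1,1), as in the Vila example.
    if p <= 1 / 3: return 0.6 * p
    if p <= 2 / 3: return 0.2 + 1.8 * (p - 1 / 3)
    return 0.8 + 0.6 * (p - 2 / 3)

# Table 1 data again: the only alpha-cut proportion is 1/3.
print(round(gd_qbx(almost_all, [1, 1, 1], [1, 0, 0]), 3))  # 0.2
```

On the data of Table 1 it returns µalmost all(1/3) = 0.2, the value the text presents as the expected one, in contrast with the 0.5 obtained by the Vila et al. proposition.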
Property 1. If predicate A is crisp, the evaluation must deliver µQ(|A(X)|) in case of an absolute quantifier and µQ(|A(X)|/n) in case of a relative quantifier (where A(X) is the crisp set made of the elements from X which satisfy A, and n is the cardinality of the crisp set X).

Property 2. The evaluation is coherent with the universal and existential quantifiers. It means the evaluation of "Q X are A" is ∨_{x∈X} µA(x) when Q is ∃ and ∧_{x∈X} µA(x) when Q is ∀ (∨ and ∧ being respectively a co-norm and a norm).

Property 3. The evaluation is coherent with quantifier inclusion. Given two quantifiers Q and Q′ such that Q ⊆ Q′ (∀x, µQ(x) ≤ µQ′(x)), the evaluation of "Q X are A" cannot be larger than that of "Q′ X are A."

Concerning the "Q B X are A" statements, it is possible to recall:

Property 4. If A and B are crisp and Q is relative, the evaluation must deliver µQ(|A(X) ∩ B(X)|/|B(X)|), where A(X) (resp. B(X)) is the set made of the elements from X which satisfy A (resp. B).
Property 5. When B is a Boolean predicate, the evaluation of "Q B X are A" is similar to that of "Q B(X) are A," where B(X) is the (crisp) set made of elements from X which satisfy B.

Property 6. If the set of elements which are B is included in the set of A elements, Q is relative, and B is normalized, then the evaluation of "Q B X are A" is µQ(1) (since 100% of the B elements are A due to the inclusion).

Property 7. If A(X) ∩ B(X) = ∅ (where A(X) (resp. B(X)) is the set made of elements from X which satisfy A (resp. B)), then the evaluation must return the value µQ(0).

When considering the evaluation of "Q X are A" statements, the approaches based on cardinalities deliver a result which can be difficult to interpret. In the case of a precise cardinality, the main drawback is that a large number of elements with small membership degrees may have the same effect on the result as a small number of elements with large membership degrees. As a consequence, property 2 cannot be satisfied (this behavior is demonstrated in Delgado et al., 2000); on the other hand, as shown in Delgado et al. (2000), properties 1 and 3 are satisfied. In the case of an imprecise cardinality, the result of the interpretation is imprecise since it takes the form of two indices: a degree of possibility and a degree of necessity. This imprecision tied to the result is difficult to justify because the computations take as input a precise quantifier and precise degrees of satisfaction, so why should they deliver an imprecise result? In contrast, the approaches to evaluate "Q X are A" using a relaxation mechanism provide a result with a clear meaning which is easy to interpret. These approaches (including the ZS technique) satisfy properties 1, 2, and 3 (Delgado et al., 2000). When considering the evaluation of "Q B X are A" statements, the approaches based on the OWA operator and on a decomposition technique modify the meaning of the quantified statement since "Q B X are A" is interpreted as "for
Q elements in X, to satisfy B implies to satisfy A." These two approaches satisfy properties 4 and 5, while properties 6 and 7 are not fulfilled (Delgado et al., 2000). The approach proposed by Vila et al. (1997) interprets the quantified statement as a compromise between "∃ B X are A" and "∀ B X are A." As a consequence, it may lead to a result which does not fit the quantifier's definition, and none of the properties introduced in this section can be satisfied (Delgado et al., 2000). The GD method satisfies all of the properties (4, 5, 6, and 7). The next sections show that the framework of gradual numbers offers powerful tools to evaluate quantified statements. This context allows unifying the previous propositions made to evaluate quantified statements of type "Q X are A" (those based on a relaxation mechanism). In addition, gradual numbers offer new techniques to evaluate "Q B X are A" statements.
Gradual Numbers and Gradual Truth Value

It has been shown (Rocacher, 2003) that dealing with both quantification and preferences defined by fuzzy sets leads to defining gradual natural integers (elements of ℕf) corresponding to fuzzy cardinalities. Then, ℕf has been extended to ℤf (the set of gradual relative integers) and ℚf (the set of gradual rationals) in order to deal with queries based on difference or division operations (Rocacher & Bosc, 2005). These new frameworks provide arithmetic foundations where the difference or ratio between gradual quantities can be evaluated. As a consequence, gradual numbers are essential, in particular, for dealing with flexible queries using absolute or relative fuzzy quantifiers. This is the reason why this section briefly introduces ℕf, the set of gradual integers, and its extensions ℤf and ℚf. Then, it is shown that applying a fuzzy predicate to a gradual number provides a specific truth value which is also gradual.
Gradual Natural Integers

The fuzzy cardinality |F| of a fuzzy set F, as proposed by Zadeh (1983), is a fuzzy set on ℕ, called FGCount(F), defined by:

∀ n ∈ ℕ, µ|F|(n) = sup{α | |Fα| ≥ n},

where Fα denotes an α-cut of the fuzzy set F. The degree α associated with a number n in the fuzzy cardinality |F| is interpreted as the extent to which F has at least n elements. It is a normalized fuzzy set of integers and the associated characteristic function is nonincreasing.

Example. The fuzzy cardinality of the fuzzy set F = {1/x1, 1/x2, 0.8/x3, 0.6/x4} is |F| = {1/0, 1/1, 1/2, 0.8/3, 0.6/4}. The amount of data in F is completely and exactly described by {1/0, 1/1, 1/2, 0.8/3, 0.6/4}. Degree 0.8 is the extent to which F contains at least three elements. ♦

It is very important to notice that we do not interpret a fuzzy cardinality as a fuzzy number based on a possibility distribution (which has a disjunctive interpretation). In fact, the knowledge of the cardinalities of all the different α-cuts of a fuzzy set F provides an exact characterization of the number of elements belonging to F. Consequently, |F| must be viewed as a conjunctive fuzzy set of integers. As a matter of fact, the considered fuzzy set F represents a perfectly known collection of data (without uncertainty), so its cardinality |F| is also perfectly known. We think that it is more convenient to qualify such a cardinality as a "gradual" number rather than a "fuzzy" number. Other fuzzy cardinalities based on the definition of FGCounts, such as FLCounts or FECounts, have been defined by Zadeh (1983) or Wygralak (1999). Dubois and Prade (1985) and Delgado et al. (2000) have adopted a possibilistic point of view where a fuzzy cardinality is interpreted as a possibility distribution over α-cuts corresponding to a fuzzy number (Dubois & Prade, 1987). The rest of this chapter is based on such a fuzzy cardinality defined as the FGCount,
and the set of all fuzzy cardinalities is called ℕf (the set of gradual natural integers). The α-cut xα of a gradual natural integer x is an integer defined as the highest integer value appearing in the description of x associated with a degree at least equal to α. In other words, it is the largest integer appearing in the α-level cut of its representation:

xα = max{c ∈ ℕ | µx(c) ≥ α}.

When x describes the FGCount of a fuzzy set A, the following equality holds: xα = |Aα|. This approach is along the line presented by Dubois and Prade (2005), where they introduce the concept of a fuzzy element e in a set S, defined as an assignment function ae from a complete lattice L − {0} to S. Following this view, a gradual natural integer x belonging to ℕf can be defined by an assignment function ax from ]0, 1] to ℕ such that:

∀ α ∈ ]0, 1], ax(α) = xα.

If x is identified with the fuzzy cardinality |F| of a fuzzy set F, then ax(α) is the cardinality of the α-level cut of F.

Example. |F| = {1/0, 1/1, 1/2, 0.8/3, 0.6/4} is a gradual natural integer defined by an assignment function a|F| graphically represented by Figure 3. As an example, a|F|(0.7) = |F0.7| = 3. ♦

Figure 3. The assignment function of a fuzzy cardinality (a step function from ]0, 1] to ℕ: a|F|(α) = 4 for α ∈ ]0, 0.6], 3 for α ∈ ]0.6, 0.8], 2 for α ∈ ]0.8, 1])

Any operation # between two natural integers can then be extended to gradual natural integers x and y (Rocacher & Bosc, 2005) by defining the corresponding assignment function ax#y as follows:

∀ α ∈ ]0, 1], ax#y(α) = ax(α) # ay(α) = xα # yα.

Due to the specific characterization of gradual integers, it can easily be shown that ℕf is a semiring. The addition and product operations satisfy the following properties: (ℕf, +) is a commutative monoïd (+ is closed and associative) with the neutral element {1/0}; (ℕf, ×) is a monoïd with the neutral element {1/0, 1/1}; the product is distributive over the addition.

Gradual Relative Integers

In ℕf the difference between two gradual natural integers may not be defined. As a consequence, ℕf has to be extended to ℤf in order to build up a group structure. The set of gradual relative integers ℤf is defined by the quotient set (ℕf × ℕf) / ℛ of all equivalence classes on (ℕf × ℕf) with regard to the equivalence relation ℛ characterized by:

∀ (x+, x−) ∈ ℕf × ℕf, ∀ (y+, y−) ∈ ℕf × ℕf, (x+, x−) ℛ (y+, y−) iff x+ + y− = x− + y+.

The α-cut of a fuzzy relative integer (x+, x−) is defined as the relative integer (x+α − x−α). Any fuzzy relative integer x has a unique canonical representative xc which can be obtained by enumerating the values of its different α-cuts on ℤ:

xc = ∑ αi / (x+αi − x−αi),

where the αi correspond to the different degrees appearing in the representations of x+ and x−. Each value xα can be computed from the canonical representation: xα is the value associated in xc with β, the smallest degree αi larger than or equal to α. The assignment function ax of x is a function from ]0, 1] to ℤ such that:

∀ α ∈ ]0, 1], ax(α) = x+α − x−α = xα.

Example. The compact denotation of the fuzzy relative integer (x, y) (with x = {1/0, 1/1, 0.8/2, 0.5/3, 0.2/4} and y = {1/0, 1/1, 0.9/2}) is:

(x, y)c = {1/0, 0.9/−1, 0.8/0, 0.5/1, 0.2/2}c.

As an example, for a level of 0.9 we get x0.9 = 1 while y0.9 = 2. As a consequence, the α-cut of (x, y) at level 0.9 is x0.9 − y0.9 = −1. The assignment function of (x, y) is represented by Figure 4. ♦

Figure 4. Assignment function of the gradual relative integer (x, y)

If x and y are two gradual relative integers, the addition x + y and the multiplication x × y are respectively defined by the classes (x+ + y+, x− + y−) and ((x+ × y+) + (x− × y−), (x+ × y−) + (x− × y+)). The addition is commutative, associative, and has a neutral element, denoted by 0ℤf, defined by the class {(x, x) | x ∈ ℕf}. Each fuzzy relative integer (x+, x−) has an opposite, denoted by −x = (x−, x+). This is remarkable because in the framework of usual fuzzy numbers this property is not always satisfied. It can be easily checked that the product in ℤf is commutative, associative, and distributive over the addition. The neutral element is the fuzzy relative integer ({1/0, 1/1}, {1/0}). Therefore we conclude that (ℤf, +, ×) forms a ring.
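The α-cut characterization makes gradual integers easy to sketch in Python; the class and helper below are ours:

```python
class GradualInt:
    """A gradual natural integer given by its assignment function alpha -> |F_alpha|."""
    def __init__(self, membership_degrees):
        self.degrees = sorted(membership_degrees, reverse=True)

    def cut(self, alpha):
        """x_alpha: the cardinality of the alpha-cut (number of degrees >= alpha)."""
        return sum(1 for d in self.degrees if d >= alpha)

def extend(op, x, y):
    """Extend an integer operation # to gradual integers: (x # y)_alpha = x_alpha # y_alpha."""
    return lambda alpha: op(x.cut(alpha), y.cut(alpha))

# |F| for F = {1/x1, 1/x2, 0.8/x3, 0.6/x4}: the FGCount {1/0, 1/1, 1/2, 0.8/3, 0.6/4}.
F = GradualInt([1, 1, 0.8, 0.6])
print(F.cut(0.7))                               # 3, as in the example above
add = extend(lambda a, b: a + b, F, F)
print(add(0.7))                                 # 6
```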
Gradual Rational Numbers

The question is now to define an inverse for each gradual integer and to build up the set of gradual rational numbers. We define ℤf* as the set of gradual integers x such that ∀α ∈ ]0, 1], xα ≠ 0, and ℛ' as the equivalence relation such that: ∀(x, y) and (x', y') ∈ ℤf × ℤf*, [x, y] ℛ' [x', y'] iff x × y' = x' × y. The set of gradual rational numbers ℚf is defined by the quotient set (ℤf × ℤf*) / ℛ'. A gradual rational number x can also be represented by a simpler compact representation (denoted by xc) obtained by enumerating the values, which are rationals, associated with the different α-cuts. The assignment function ax of x is a function from ]0, 1] to ℚ defined by:

∀ α ∈ ]0, 1], ax(α) = reduce((xn+α − xn−α) ÷ (xd+α − xd−α)),

where xn and xd denote the numerator and denominator of x, and the operator reduce means that the rational is reduced to its canonical form.

Gradual Truth Value

This section proposes a computation to determine the truth value obtained when applying a fuzzy predicate on a gradual number. Let x be an element of ℕf, ℤf, or ℚf; its assignment function ax is defined by ∀ α ∈ ]0, 1], ax(α) = xα. If T is a fuzzy predicate, the application of the predicate T on x produces a global satisfaction S (called a gradual truth value) characterized by the assignment function defined by:

∀ α ∈ ]0, 1], aS(α) = T(xα) = T(ax(α)).

For a given level α, aS(α) represents the satisfaction of the corresponding α-cut of the gradual number. In other words, for a given level α, the gradual number satisfies predicate T at degree aS(α).

Figure 5. The fuzzy predicate high

Example. We consider the fuzzy predicate high defined by Figure 5. Suppose the number of young employees is the gradual integer x = {1/15, 0.7/20, 0.2/25} (which means that 15 employees are completely young, 5 employees have the same age and are young at level 0.7, whereas 5 other people are rather not young since their level of youth is estimated at 0.2). The assignment function for x is the following:

∀α ∈ ]0, 0.2], ax(α) = 25, ∀α ∈ ]0.2, 0.7], ax(α) = 20, ∀α ∈ ]0.7, 1], ax(α) = 15.

The application of the predicate high on the gradual integer x produces a global satisfaction S whose assignment function is defined by ∀ α ∈ ]0, 1], aS(α) = T(xα) = T(ax(α)). We get the gradual truth value given by Figure 6. ♦

Figure 6. Gradual truth value corresponding to a global satisfaction S

This gradual truth value shows the different results associated with the different α-cuts. When
referring to the previous example and considering level 0.8, the fuzzy cardinality x states that the cardinality of this α-cut is 15 (x0.8 = 15). Since µhigh(15) = 0.25, this cardinality satisfies high at degree 0.25. It can be checked that aS(0.8) = 0.25.
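The level-by-level application of a predicate can be sketched directly from the assignment functions; the shape chosen for high is an assumption, made consistent with the degree 0.25 obtained for 15 employees in the text:

```python
def gradual_truth(predicate, assignment, levels):
    """a_S(alpha) = T(x_alpha): apply a fuzzy predicate to a gradual number, level by level."""
    return {alpha: predicate(assignment(alpha)) for alpha in levels}

def high(c):
    # An assumed shape for the predicate 'high', consistent with mu_high(15) = 0.25:
    # 0 below 10, rising linearly to 1 at 30.
    return min(1.0, max(0.0, (c - 10) / 20))

def x_assign(alpha):
    # x = {1/15, 0.7/20, 0.2/25}: 25 for alpha <= 0.2, 20 for alpha <= 0.7, 15 otherwise.
    return 25 if alpha <= 0.2 else 20 if alpha <= 0.7 else 15

S = gradual_truth(high, x_assign, [0.2, 0.7, 0.8, 1.0])
print(S)  # {0.2: 0.75, 0.7: 0.5, 0.8: 0.25, 1.0: 0.25}
```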
Interpretation of Quantified Statements Using Gradual Numbers

The section titled Quantified Statements of Type "Q X are A" considers the evaluation of a quantified statement of type "Q X are A," while the section titled Quantified Statements of Type "Q B X are A" Where Q Is Relative is interested in the evaluation of statements of type "Q B X are A," where Q is relative. Each of these computations provides a gradual truth value. As a consequence, the section titled A Scalar Truth Value for the Interpretation proposes a scalar interpretation computed from this gradual truth value.
In case of a relative linguistic quantifier, the truth value of “Q X are A” is given by the satisfaction of the linguistic quantifier into the proportion of elements which are A. We get: ∀ α ∈ [0, 1], µS(α) = µQ (c(α)/n) = µQ (|A(X)α|/n), where n is the cardinality of set X. Example. We consider the statement “about 3 X are A” where X = {x1, x2 , x3, x4} such that µA(x1) = µA(x2) = 1, µA(x3) = 0.8, µA(x4) = 0.6. The linguistic quantifier about 3 is given by Figure 7. The gradual truth value for “about 3 X are A” (defined by: ∀ α ∈ [0, 1], µS (α) = µQ (c(α))) is given by Figure 8. This gradual truth value provides the satisfaction obtained for the different α-cuts of A(X) (set made of elements from X which satisfy fuzzy condition A). As an example, µS(0.7) = µQ (|A(X)0.7|) = µQ (3) = 1. ♦
Quantified Statements of Type “Q X are A” The gradual cardinality of the fuzzy set A(X) made of elements from X which satisfy A is a FGCount denoted c and belongs to ℕf. When Q is absolute, the gradual truth value for “Q X are A” is given by the satisfaction of a fuzzy condition (a constraint represented by the quantifier) for that gradual number. As described in the section titled Gradual �������� Truth Value��������� , we get:
Figure 7. A representation for the quantifier about 3 (µabout 3(n) plotted against n)
∀ α ∈ [0, 1], µS(α) = µQ (c(α)). From the definition of the FGCount, we get: ∀ α ∈ [0, 1], µS(α) = µQ (|A(X)α|). In other words, the fuzzy truth value S expresses the satisfaction of each α-cut of A(X) with respect to the linguistic quantifier.
Figure 8. A fuzzy truth value for "about 3 X are A" (µS(α) plotted against α)
Evaluation of Quantified Statements Using Gradual Numbers
Quantified Statements of Type “Q B X are A” Where Q Is Relative
∀ α ∈ [0, max x∈X µB(x)], µS(α) = µQ(p(α)) = µQ(|(A∩B)(X)α|/|B(X)α|).
The truth value of "Q B X are A" (Q being relative) is given by the satisfaction of the linguistic quantifier by the proportion of elements which are A among the elements which satisfy B. This proportion is a ratio between two gradual integers:
The fuzzy truth value S expresses the satisfaction of each α-cut of A(X) and (A∩B)(X) with respect to the linguistic quantifier. The value α is viewed as a quality threshold for the satisfactions with respect to A and B. When the minimum is chosen as the norm to define (A∩B)(X), the value of µS(α) states that: "among the elements which satisfy B at least at level α, the proportion of elements x with µA(x) ≥ α is in agreement with Q" (since we have (A∩B)(X)α = A(X)α∩B(X)α). In other words, µS(α) is the truth value of the quantified statement when considering the two interpretations A(X)α and B(X)α. In addition, the fuzzy truth value S is not defined when α > max x∈X µB(x). A first attitude is to normalize B and A∩B, or to employ the degree of orness defined by Yager and Kacprzyk (1997) so that µS(α) = orness(Q). A second attitude, which will be considered in this chapter, is to assume that µS(α) = 0 in that case.
p = c/d, where:
• c is the cardinality (FGCount) of the fuzzy set (A∩B)(X) made of elements from X which satisfy fuzzy condition A and fuzzy condition B (∀x in X, µA∩B(x) = min(µA(x), µB(x))),
• d is the cardinality (FGCount) of the fuzzy set B(X) made of elements from X which satisfy fuzzy condition B.
The gradual rational number c/d is defined by the couple (c, d). A canonical representation for c/d is: ∀ α ∈ [0, 1], p(α) = c(α)/d(α). This canonical definition is defined only when d(α) ≠ 0. The cardinality c (resp. d) being that of the fuzzy set (A∩B)(X) (resp. B(X)), we get: ∀ α ∈ [0, 1], p(α) = |(A∩B)(X)α|/|B(X)α|, where |B(X)α| ≠ 0. It means that p(α) is not defined for α > max x∈X µB(x), and we can write: ∀ α ∈ [0, max x∈X µB(x)], p(α) = |(A∩B)(X)α|/|B(X)α|.
The gradual truth value for "Q B X are A" is given by the satisfaction of the constraint represented by the quantifier for that gradual proportion. According to the results introduced in the section titled Gradual Truth Value, a gradual truth value S is obtained:
Example. We consider the statement "about half B X are A" where X = {x1, x2, x3, x4}. The satisfaction degrees are given by Table 2. The linguistic quantifier about half is given by Figure 9. The gradual truth value for "about half B X are A" is given by Figure 10. As an example, we get µS(0.6) = 1/3 because |(A∩B)(X)0.6|/|B(X)0.6| = 2/3 and µQ(2/3) = 1/3. The truth value of the statement "about half elements in {x such that µB(x) ≥ 0.6} are in {x such that µA(x) ≥ 0.6}" is 1/3. ♦
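For illustration, the gradual truth value of this example can be recomputed from Table 2 in Python (a sketch, not part of the chapter). The analytic form of about half (triangular with support [0.25, 0.75] and peak 0.5) is an assumption chosen so that µQ(2/3) = 1/3, as in the example:

```python
# Illustrative sketch of the gradual truth value for "about half B X are A"
# over the data of Table 2; the minimum is used as the norm for A∩B.

MU_B = {"x1": 1.0, "x2": 0.9, "x3": 0.7, "x4": 0.3}
MU_A = {"x1": 0.8, "x2": 0.3, "x3": 1.0, "x4": 1.0}
MU_AB = {x: min(MU_A[x], MU_B[x]) for x in MU_B}

def mu_q(p):
    """Assumed 'about half': triangular, support [0.25, 0.75], peak 0.5."""
    return max(0.0, 1.0 - 4.0 * abs(p - 0.5))

def cut_card(mu, alpha):
    """Cardinality of the alpha-cut of a fuzzy set."""
    return sum(1 for d in mu.values() if d >= alpha)

def mu_s(alpha):
    """mu_S(alpha); by convention 0 when the alpha-cut of B(X) is empty."""
    nb = cut_card(MU_B, alpha)
    return mu_q(cut_card(MU_AB, alpha) / nb) if nb else 0.0

print(round(mu_s(0.6), 4))  # |(A∩B)(X)_0.6| / |B(X)_0.6| = 2/3 -> 0.3333
```

Scanning mu_s over the levels of Table 2 reproduces the stepwise truth value of Figure 10.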
A Scalar Truth Value for the Interpretation The fuzzy truth value S computed in the previous section gathers the satisfactions of the different α-cuts with respect to the linguistic quantifier. This fuzzy truth value can be defuzzified in order to obtain a scalar evaluation (set in [0, 1]). Various
Table 2. Satisfaction with respect to B and A

            x1     x2     x3     x4
µB(xi)      1      0.9    0.7    0.3
µA(xi)      0.8    0.3    1      1
µA∩B(xi)    0.8    0.3    0.7    0.3
Figure 9. A representation for the quantifier about half (µabout half(p) plotted against proportion p)
Figure 10. A fuzzy truth value for “about half B X are A”
interpretations can be associated to this defuzzification, and we consider the following one (since it is the most natural): "the more α-cuts highly satisfy the constraint defined by the linguistic quantifier, the higher the scalar interpretation." Obviously, when the scalar interpretation is 1, each α-cut fully satisfies the constraint. When dealing with a quantified statement of type "Q X are A," a scalar evaluation of 1 means that whatever the chosen interpretation for A(X) (the set made of elements from X which satisfy A), its cardinality is in agreement with the linguistic quantifier (i.e., ∀ α, µQ(|A(X)α|) = 1 or µQ(|A(X)α|/n) = 1). Otherwise, the higher the scalar evaluation, the more interpretations of A(X) exist with a high satisfaction with respect to the linguistic quantifier. When dealing with a quantified statement of type "Q B X are A," the scalar evaluation is also interpreted in terms of α-cuts, that is, in terms of interpretations of fuzzy sets. For a given level α, the degree µS(α) provided by the gradual truth value represents the satisfaction of the quantifier with respect to the proportion |(A∩B)(X)α|/|B(X)α| (µS(α) is the truth value of the quantified statement when considering the two interpretations (A∩B)(X)α and B(X)α). The scalar value aggregates the different satisfactions provided by the different levels, and a scalar evaluation of 1 means that whatever the chosen quality threshold α, the proportion is in complete agreement with Q. Otherwise, the higher the scalar evaluation, the more quality thresholds exist such that the proportion highly satisfies Q. In the section titled A Quantitative Approach, we consider a quantitative defuzzification (since it is based on an additive measure, a surface), while in the section titled A Qualitative Approach, we consider a qualitative defuzzification (since it is based on a non-additive process). The section titled Satisfaction of Properties situates the results provided by these two defuzzifications with respect to the properties introduced in the section titled About the Proposed Approaches to Evaluate Quantified Statements.
A Quantitative Approach
In this approach, the surface of the fuzzy truth value is delivered to the user. The scalar interpretation is then (Liétard & Rocacher, 2005):

δ = (∫₀¹ µS(α) · p · α^(p−1) dα)^(1/p).
When p = 1, the value δ is the area delimited by the function µS. Since this function is a stepwise function, we get: δ = (α1 – 0) * µS(α1) + (α2 – α1) * µS(α2) + ... + (1 – αn) * µS(1), where the discontinuity points are (α1, µS(α1)), ..., (αn, µS(αn)) with α1 < α2 < ... < αn. Example. We consider the statement "about half B X are A" and the fuzzy truth value given by Figure 10. We compute: δ = (0.7 – 0.3) * 1/3 + (0.8 – 0.7) * 1 = 0.233. The scalar result is rather low. When referring to Table 2, it seems that the proportion of elements which are A among the B elements is close to 2/3. A low result for "about half B X are A" is coherent since the proportion 2/3 poorly satisfies the constraint about half. ♦ It has been shown (Liétard & Rocacher, 2005) that, when dealing with quantified statements of type "Q X are A," this approach is a generalization of the OWA based interpretation (introduced in Interpretation by the OWA Operator). In addition, the next proof shows that when considering "Q B X are A" statements and when B is normalized, this defuzzification leads to the GD method introduced in The GD Method for "Q B X are A" Statements (when B is not normalized, the two methods differ since the GD method imposes normalizing B, while the gradual truth value associates the value 0 when the α-cut of B(X) is not defined). Proof. In case of a "Q B X are A" statement, the discontinuity points (αi, µS(αi)) of the gradual truth value are associated to the αi values where the quantities µS(αi) vary. In other words:
• the αi values come from the set D = {µA∩B(x) where x is in X} ∪ {µB(x) where x is in X},
• µS(αi) = µQ(|(A∩B)(X)αi|/|B(X)αi|).
The defuzzification gives: δ = (α1 – 0) * µQ(|(A∩B)(X)α1|/|B(X)α1|) + (α2 – α1) * µQ(|(A∩B)(X)α2|/|B(X)α2|) + ... + (αn – αn-1) * µS(1), where the values from D are denoted α1 < α2 < ... < αn. This expression is clearly that of an interpretation using the GD method (cf. The GD Method for "Q B X are A" Statements).
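As an illustrative sketch (not from the chapter), the stepwise computation of δ for p = 1 can be coded directly from the discontinuity points. The points below encode the fuzzy truth value of "about half B X are A" (steps 0, 1/3, 1, 0); these step values are reconstructed from Table 2 and are stated here as an assumption:

```python
# Quantitative defuzzification for p = 1: the area under the stepwise
# function mu_S, given its discontinuity points (alpha_i, mu_S(alpha_i)).
# mu_S is constant on each left-open interval ]alpha_{i-1}, alpha_i].

def quantitative_delta(points):
    """points: (alpha_i, mu_S(alpha_i)) sorted by alpha, ending at alpha = 1."""
    delta, prev = 0.0, 0.0
    for alpha, mu in points:
        delta += (alpha - prev) * mu  # width of the step times its height
        prev = alpha
    return delta

# Assumed fuzzy truth value of "about half B X are A" (example above).
points = [(0.3, 0.0), (0.7, 1/3), (0.8, 1.0), (1.0, 0.0)]
print(round(quantitative_delta(points), 3))  # (0.7-0.3)*1/3 + (0.8-0.7)*1 = 0.233
```

This reproduces the δ = 0.233 computed in the example.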
A Qualitative Approach
According to this defuzzification, the scalar interpretation takes into consideration two aspects:
• a guaranteed (minimal) satisfaction value β associated with the α-cuts (β must be as high as possible),
• the repartition of β among the α-cuts (β should be attained by as many α-cuts as possible).
Obviously, these two aspects are in opposition since, in general, the higher β, the smaller the repartition. The scalar interpretation δ reflects a compromise between these two aspects and we get: δ = max β in [0,1] min(β, each(β)), where each(β) means “for each level α, µS (α) ≥ β.” A definition of each(β) delivering a degree is the more convenient (Bosc & Liétard, 2005) and we propose to sum the lengths of intervals (of levels) where the threshold β is reached: each(
)=
∑(
] i , j ] such that ∀ ∈ ] i , j ], S (
j
− i ). ) ≥
The higher each(β), the more numerous the levels α for which µS (α) ≥ β. In particular, each(β) equals 1 means that for each level α, µS (α) is larger than (or equal to) β. In addition, from a computational point of view, the definition of δ needs to handle an infinity of
values β. However, it is possible (Bosc & Liétard, 2005) to restrict the computations to the β values belonging to the set of "effective" µS(α) values: δ = max {β | ∃ α such that β = µS(α)} min(β, each(β)).
Example. We consider the statement "about half B X are A" and the fuzzy truth value given by Figure 10. The values β to be considered are 1/3 and 1. Furthermore: each(1/3) = 0.5, each(1) = 0.1.
The discontinuity points of the gradual truth value are: (α1, µS(α1)), ..., (αn, µS(αn)) with α1 < α2 < ... < αn. We first demonstrate the validity of the properties related to "Q X are A" statements, then that of the properties related to "Q B X are A" statements.
Properties Related to “Q X are A” Statements
We get δ = max (min(1/3, 0.5), min(1, 0.1)) = 1/3. As in the previous example, a low result for “about half B X are A” is coherent. ♦ It has been shown (Bosc & Liétard, 2005) that when dealing with quantified statements of type “Q X are A” (Q increasing), this defuzzification leads to the Sugeno fuzzy integral based approach introduced in the section titled The Probabilistic Approach (GD Method).
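The qualitative compromise can be sketched the same way (an illustration, not from the chapter; the discontinuity points again encode the "about half B X are A" truth value reconstructed from Table 2 and are an assumption). Only the effective values β = µS(α) need to be scanned:

```python
# Qualitative defuzzification: delta = max over beta of min(beta, each(beta)),
# where each(beta) sums the lengths of the level intervals on which mu_S
# reaches beta. mu_S is constant on each interval ]alpha_{i-1}, alpha_i].

def qualitative_delta(points):
    """points: (alpha_i, mu_S(alpha_i)) sorted by alpha, ending at alpha = 1."""
    best = 0.0
    for beta in {mu for _, mu in points if mu > 0}:  # effective beta values
        each, prev = 0.0, 0.0
        for alpha, mu in points:
            if mu >= beta:
                each += alpha - prev  # interval ]prev, alpha] reaches beta
            prev = alpha
        best = max(best, min(beta, each))
    return best

points = [(0.3, 0.0), (0.7, 1/3), (0.8, 1.0), (1.0, 0.0)]
print(round(qualitative_delta(points), 3))  # max(min(1/3, 0.5), min(1, 0.1)) = 0.333
```

This reproduces each(1/3) = 0.5, each(1) = 0.1, and δ = 1/3 from the example.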
Satisfaction of Properties
This section situates the results provided by these two defuzzifications with respect to the properties introduced in About the Proposed Approaches to Evaluate Quantified Statements. All the properties are satisfied, except property 7, which holds only when the set made of elements from X which satisfy B is normalized. We recall that the quantitative approach delivers:
In case of a "Q X are A" statement, the αi values come from the set D = {µA(x) where x is in X} and µS(αi) = µQ(|A(X)αi|) (or µQ(|A(X)αi|/n) in case of a relative quantifier, with n the cardinality of set X). The different properties to be satisfied are:
Property 1. If predicate A is crisp, the evaluation must deliver µQ(|A(X)|) in case of an absolute quantifier and µQ(|A(X)|/n) in case of a relative quantifier (where A(X) is the crisp set made of elements from X which satisfy A and n is the cardinality of the crisp set X).
Proof. When A is crisp, D is a singleton ({1}) and the only discontinuity point of the gradual truth value (cf. Figure 11) is (1, µQ(|A(X)|)) (or (1, µQ(|A(X)|/n))).
Figure 11. The gradual truth value associated to property 1
δ = (α1 – 0) * µS(α1) + (α2 – α1) * µS(α2) + ... + (1 – αn) * µS(1), while the qualitative approach delivers δ = max {β | ∃ α such that β = µS(α)} min(β, each(β)), where each(β) is defined as previously.
The quantitative approach delivers (1 – 0) * µQ(|A(X)|) (or (1 – 0) * µQ(|A(X)|/n) when Q is relative) and property 1 holds. Concerning the qualitative approach, we demonstrate the validity of property 1 only in the case of an absolute quantifier. In case of a relative quantifier, it is necessary to change each expression µQ(|A(X)|) into µQ(|A(X)|/n) and the demonstration remains valid. When dealing with the qualitative approach, there is only one value β to be considered. This value equals µQ(|A(X)|) and each(β) = 1 (since for every level α in [0,1], µS(α) = µQ(|A(X)α|) = µQ(|A(X)|) = β). As a consequence, the result of the qualitative approach is min(β, each(β)) = min(µQ(|A(X)|), 1) and the property is valid.
Property 2. The evaluation is coherent with the universal and existential quantifiers. It means the evaluation of "Q X are A" is ∨x∈X µA(x) when Q is ∃ and ∧x∈X µA(x) when Q is ∀ (∨ and ∧ being respectively a co-norm and a norm).
Proof. The universal quantifier is relative and defined by µ∀(1) = 1 and, for any k in [0,1[, µ∀(k) = 0. The gradual truth value is defined by:
• µS(α) = µ∀(|A(X)α|/n) = 1 when |A(X)α|/n = 1, which means when α is smaller than the minimum of the membership degrees (denoted α1). This value α1 can be equal to 0 (when there exists at least one element x with µA(x) = 0),
Figure 12. The gradual truth value associated to the universal quantifier
• µS(α) = 0 otherwise.
As a consequence, we obtain the gradual truth value given by Figure 12. The fuzzy truth value has a unique discontinuity point (α1, 1) and the quantitative approach delivers δ = (α1 – 0) * 1 = α1, which is the minimum of the membership degrees. Property 2 is then satisfied using the minimum as a norm. When dealing with the qualitative approach, there is only one value β to be considered. This value equals β = 1 with each(β) = α1. As a consequence, the result is min(β, each(β)) = α1 and the property is valid. The existential quantifier is absolute and defined by µ∃(0) = 0 and, for any k ≠ 0, µ∃(k) = 1. The discontinuity points of the gradual truth value are: (α1, 1), ..., (αn, 1) (see Figure 13), where αn is the highest degree among the µA(x)s. The quantitative approach delivers: δ = (α1 – 0) * 1 + (α2 – α1) * 1 + ... + (αn – αn-1) * 1 + (1 – αn) * 0 = αn, which is the maximum of the membership degrees. Property 2 is then satisfied using the maximum as a co-norm. When dealing with the qualitative approach, there is only one value β to be considered. This value equals β = 1 with each(β) = αn. As a consequence, the result is min(β, each(β)) = αn and the property is valid.
Property 3. The evaluation is coherent with quantifier inclusion. Given two quantifiers Q and Q' such that Q ⊆ Q' (∀x, µQ(x) ≤ µQ'(x)), the
Figure 13. The gradual truth value corresponding to the existential quantifier
evaluation of "Q X are A" cannot be larger than that of "Q' X are A."
Proof. The gradual truth value for "Q X are A" is denoted S, while that associated to "Q' X are A" is denoted S'. Since we have ∀x, µQ(x) ≤ µQ'(x), it implies ∀ α in [0, 1], µS(α) ≤ µS'(α) (since the two quantified statements are dealing with the same set X and the same fuzzy predicate A). We denote δ and δ' the respective evaluations of "Q X are A" and "Q' X are A." If the quantitative approach is chosen, we have
δ = ∫₀¹ µS(α) dα and δ' = ∫₀¹ µS'(α) dα.

Figure 14. The gradual truth value associated to property 4
As a consequence, δ ≤ δ’ and property 3 is valid. If the qualitative approach is chosen: δ = ma x each(
β i n [0,1]
)=
min(β, each(β)), with
∑(
] i , j ] such that ∀ ∈ ] i , j ], S (
j
− i) ) ≥
δ'= max β in [0,1] min(β, each’(β)),with each' (
)=
αi values are coming from the set D ={µA∩B(x) where x is in X}∪ �{µB(x) where x is in X}, µS(αi) = µQ(|(A∩B)(X)αi| /|B(X)αi|).
∑(
] i , j ] such that ∀ ∈ ] i , j ], S' (
j
− i) ) ≥
Since ∀ α in [0, 1], µS(α) ≤ µS'(α), we have each(β) ≤ each'(β), which gives δ ≤ δ' and property 3 is demonstrated.
Properties Related to "Q B X are A" Statements
In case of a "Q B X are A" statement, the discontinuity points (αi, µS(αi)) of the gradual truth value are associated to the αi values where the quantities µS(αi) vary. In other words:
• the αi values come from the set D = {µA∩B(x) where x is in X} ∪ {µB(x) where x is in X},
• µS(αi) = µQ(|(A∩B)(X)αi|/|B(X)αi|).
The different properties to be satisfied are:
Property 4. If A and B are crisp and Q is relative, the evaluation must deliver µQ(|A(X) ∩ B(X)|/|B(X)|), where A(X) (resp. B(X)) is the set made of elements from X which satisfy A (resp. B).
Proof. When A and B are crisp, D is a singleton ({1}) and the only discontinuity point of the gradual truth value is (1, µQ(|(A∩B)(X)|/|B(X)|)), where (A∩B)(X) and B(X) are crisp sets (see Figure 14). The quantitative approach delivers (1 – 0) * µQ(|(A∩B)(X)|/|B(X)|) and property 4 holds. When dealing with the qualitative approach, there is only one value β to be considered. This value β equals µQ(|(A∩B)(X)|/|B(X)|) and each(β) = 1. As a consequence, the result of the qualitative approach is min(β, each(β)) = min(µQ(|(A∩B)(X)|/|B(X)|), 1) and the property is valid.
Property 5. When B is a Boolean predicate, the evaluation of "Q B X are A" is similar to that of "Q B(X) are A," where B(X) is the (crisp) set made of elements from X which satisfy B.
Proof. This proof shows that the gradual truth value S associated to "Q B X are A" and the gradual
truth value S' associated to "Q B(X) are A" are exactly the same: ∀ α in [0,1], µS(α) = µS'(α). When B is a Boolean predicate, B(X)α is the crisp set B(X) for any level α. As a consequence, µS(α) = µQ(|(A∩B)(X)α|/|B(X)|). Since (A∩B)(X)α can be rewritten A(X)α∩B(X), we have:
Conclusion
Property 7. If A(X) ∩ B(X) = ∅ (where A(X) (resp. B(X)) is the set made of elements from X which satisfy A (resp. B)), then the evaluation must return the value µQ(0).
This chapter takes place at the crossroads of quantified statement evaluation and the fuzzy arithmetic introduced in Rocacher and Bosc (2003a, 2003b, 2003c, 2005). It shows that fuzzy arithmetic allows the evaluation of quantified statements of type "Q X are A" and "Q B X are A." The evaluation can be either a fuzzy truth value or a scalar value obtained by the defuzzification of the fuzzy value. Two types of scalar values can be distinguished: the first one corresponds to a quantitative view of the fuzzy value, the second one to a qualitative view. When dealing with quantified statements of type "Q X are A," the two scalar values are respectively generalizations of the OWA based interpretation and the Sugeno integral based interpretation. When dealing with "Q B X are A" statements, our approach presents the advantage of providing a theoretical framework for computation. It is the first attempt to set this evaluation in the framework of an extended arithmetic and algebra. This aspect is very important since the properties provided by the algebraic framework hold, and we expect to obtain more interesting properties for the qualitative and quantitative approaches (in addition to the ones already stated in this chapter). As a consequence, further studies may concern the comparison of the qualitative and quantitative approaches in terms of properties. In addition, since they are both summaries of the same evaluation (in the form of a gradual number), they should not differ significantly.
We show this property holds only when B(X) is normalized.
References
µS(α) = µQ(|A(X)α∩B(X)|/|B(X)|). It means that µS(α) is restricted to the elements belonging to B(X), as is the case for the "Q B(X) are A" statement, and we obviously get: ∀α in [0,1], µS(α) = µS'(α).
Property 6. If the set of elements which are B is included in the set of A elements, Q is relative and B is normalized, then the evaluation of "Q B X are A" is µQ(1) (since 100% of the B elements are A due to the inclusion).
Proof. When the set of elements which are B is included in the set of A elements, we have µB(x) ≤ µA(x) for any element x from X. As a consequence, B(X)α ⊆ A(X)α for any level α, and hence ∀α in [0, max x∈X µB(x)], µS(α) = µQ(1). Since B(X) is normalized, ∀α in [0,1], µS(α) = µQ(1) and it is obvious to show that the two defuzzifications give µQ(1) as final result.
Proof. When A(X) ∩ B(X) = ∅, we have (A∩B)(X)α = ∅ for any level α in [0,1]. As a consequence, ∀α in [0, max x∈X µB(x)], µS(α) = µQ(0). When B(X) is a normalized fuzzy set, we get ∀α in [0,1], µS(α) = µQ(0) and it is obvious to show that the two defuzzifications give µQ(0) as final result.
Barro, S., Bugarin, A., Cariñena, P., & Diaz-Hermida, F. (2003). A framework for fuzzy quantification models analysis. IEEE Transactions on Fuzzy Systems, 11, 89-99.
Blanco, I., Delgado, M., Martín-Bautista, M. J., Sánchez, D., & Vila, M. P. (2002). Quantifier guided aggregation of fuzzy criteria with associated importances. In T. Calvo, R. Mesiar, & G. Mayor (Eds.),
Aggregation operators: New trends and applications (Studies on Fuzziness and Soft Computing Series, pp. 272-290). Physica-Verlag.
Bosc, P., & Liétard, L. (1993). On the extension of the OWA operator to evaluate some quantifications. In Proceedings of the 1st European Congress on Fuzzy and Intelligent Technologies (EUFIT'93) (pp. 332-338), Aachen, Germany.
Bosc, P., & Liétard, L. (1994a). Monotonous quantifications and Sugeno fuzzy integrals. In Proceedings of the 5th IPMU Conference (pp. 1281-1286), Paris, France.
Bosc, P., & Liétard, L. (1994b). Monotonic quantified statements and fuzzy integrals. In NAFIPS/IFIS/NASA'94 Joint Conference (pp. 8-12), San Antonio, Texas.
Bosc, P., & Liétard, L. (2005). A general technique to measure gradual properties of fuzzy sets. In Proceedings of the 10th International Fuzzy Systems Association (IFSA) Congress, Beijing, China.
Bosc, P., & Pivert, O. (1992). Some approaches for relational databases flexible querying. Journal of Intelligent Information Systems, 1, 323-354.
Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17.
Delgado, M., Sanchez, D., & Amparo M. V. (2002). A probabilistic definition of a nonconvex fuzzy cardinality. Fuzzy Sets and Systems, 126, 177-190.
Delgado, M., Sanchez, D., & Vila, M. P. (2000). Fuzzy cardinality based evaluation of quantified sentences. International Journal of Approximate Reasoning, 23, 23-66.
Diaz-Hermida, F., Bugarin, A., & Barro, S. (2003). Definition and classification of semi-fuzzy quantifiers for the evaluation of fuzzy quantified sentences. International Journal of Approximate Reasoning, 34, 49-88.
Diaz-Hermida, F., Bugarin, A., Cariñena, P., & Barro, S. (2004). Voting-model based evaluation of fuzzy
quantified sentences: A general framework. Fuzzy Sets and Systems, 146(1), 97-120.
Dubois, D., Godo, L., De Mantaras, R. L., & Prade, H. (1993). Qualitative reasoning with imprecise probabilities. Journal of Intelligent Information Systems, 2, 319-363.
Dubois, D., & Prade, H. (1985). Fuzzy cardinality and the modeling of imprecise quantification. Fuzzy Sets and Systems, 16, 199-230.
Dubois, D., & Prade, H. (1987). Fuzzy numbers: An overview. Analysis of Fuzzy Information, Mathematics and Logics, I, 3-39.
Dubois, D., & Prade, H. (1988a). On fuzzy syllogisms. Computational Intelligence, 4, 171-179.
Dubois, D., & Prade, H. (2005). Fuzzy elements in a fuzzy set. In Proceedings of the 10th International Fuzzy Systems Association (IFSA) Congress, Beijing, China.
Dubois, D., Prade, H., & Testemale, C. (1988b). Weighted fuzzy pattern matching. Fuzzy Sets and Systems, 28, 315-331.
Fan, Z. P., & Chen, X. (2005). Consensus measures and adjusting inconsistency of linguistic preference relations in group decision making. In Fuzzy Systems and Knowledge Discovery (pp. 130-139). Berlin/Heidelberg: Springer.
Galindo, J. (2005). New characteristics in FSQL, a fuzzy SQL for fuzzy databases. WSEAS Transactions on Information Science and Applications, 2(2), 161-169.
Galindo, J. (2007). FSQL (fuzzy SQL): A fuzzy query language. Retrieved February 6, 2008, from http://www.lcc.uma.es/~ppgg/FSQL
Galindo, J., Medina, J. M., Pons, O., & Cubero, J. C. (1998). A server for fuzzy SQL queries. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems (pp. 164-174). Springer.
Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: IGI Publishing.
Glockner, I. (1997). DFS: An axiomatic approach to fuzzy quantification (Tech. Rep. No. TR97-06). University of Bielefeld.
Glockner, I. (2004a). Fuzzy quantifiers in natural language: Semantics and computational models. Der Andere Verlag (Germany).
Glockner, I. (2004b). Evaluation of quantified propositions in generalized models of fuzzy quantification. International Journal of Approximate Reasoning, 37, 93-126.
Kacprzyck, J. (1991). Fuzzy linguistic quantifiers in decision making and control. In Proceedings of the International Fuzzy Engineering Symposium (IFES'91) (pp. 800-811), Yokohama, Japan.
Kacprzyck, J., & Iwanski, C. (1992). Fuzzy logic with linguistic quantifiers in inductive learning. In L. A. Zadeh & J. Kacprzyk (Eds.), Fuzzy logic for the management of uncertainty (pp. 465-478). John Wiley and Sons.
Kacprzyck, J., Yager, R. R., & Zadrozny, S. (2006). Fuzzy linguistic summaries of databases for an efficient business data analysis and decision support. In Knowledge Discovery for Business Information Systems (pp. 129-159). The Netherlands: Springer.
Laurent, A., Marsala, C., & Bouchon-Meunier, B. (2003). Improvement of the interpretability of fuzzy rule based systems: Quantifiers, similarities and aggregators. In Modelling with words (pp. 102-123). Berlin/Heidelberg: Springer.
Liétard, L., & Rocacher, D. (2005). A generalization of the OWA operator to evaluate non monotonic quantifiers. In Proceedings of the 2005 Rencontres Francophones sur la Logique Floue et ses Applications (LFA'05).
Liu, Y., & Kerre, E. (1998a). An overview of fuzzy quantifiers (I): Interpretations. Fuzzy Sets and Systems, 95, 1-21.
Liu, Y., & Kerre, E. (1998b). An overview of fuzzy quantifiers (II): Reasoning and applications. Fuzzy Sets and Systems, 95, 135-146.
Losada, D. E., Díaz-Hermida, F., & Bugarín, A. (2006). Semi-fuzzy quantifiers for information retrieval. In Soft Computing in Web Information Retrieval (pp. 195-220). Berlin/Heidelberg: Springer.
Loureiro Ralha, J. C., & Ghedini Ralha, C. (2004). Towards a natural way of reasoning. In Advances in Artificial Intelligence–SBIA 2004 (pp. 114-123). Berlin/Heidelberg: Springer.
Malczewski, J., & Rinner, C. (2005). Exploring multicriteria decision strategies in GIS with linguistic quantifiers: A case study of residential quality evaluation. Journal of Geographical Systems, 7(2), 249-268.
Mizumoto, M., Fukami, S., & Tanaka, K. (1979). Fuzzy conditional inferences and fuzzy inferences with fuzzy quantifiers. In Proceedings of the 6th International Joint Conference on Artificial Intelligence (pp. 589-591), Tokyo, Japan.
Prade, H. (1990). A two-layer fuzzy pattern matching procedure for the evaluation of conditions involving vague quantifiers. Journal of Intelligent and Robotic Systems, 3, 93-101.
Rasmussen, D., & Yager, R. R. (1997). A fuzzy SQL summary language for data discovery. In D. Dubois, H. Prade, & R. R. Yager (Eds.), Fuzzy information engineering: A guided tour of applications (pp. 253-264). New York: Wiley.
Rocacher, D. (2003). On fuzzy bags and their application to flexible querying. Fuzzy Sets and Systems, 140(1), 93-110.
Rocacher, D., & Bosc, P. (2003a). About Zf, the set of fuzzy relative integers, and the definition of fuzzy bags on Zf. Lecture Notes in Computer Science, 2715, 95-102. Springer-Verlag.
Rocacher, D., & Bosc, P. (2003b). Entiers relatifs flous et multi-ensembles flous. In Rencontres Francophones sur la Logique Floue et ses Applications (LFA'03) (pp. 253-260).
Rocacher, D., & Bosc, P. (2003c). Sur la définition des nombres rationnels flous. In Rencontres Francophones sur la Logique Floue et ses Applications (LFA'03) (pp. 261-268).
Rocacher, D., & Bosc, P. (2005). The set of fuzzy rational numbers and flexible querying. Fuzzy Sets and Systems, 155(3), 317-339.
Yager, R. R., & Kacprzyk, J. (1997). The ordered weighted averaging operators: Theory and applications. Boston: Kluwer.
Sanchez, E. (1988). Fuzzy quantifiers in syllogisms, direct versus inverse computation. Fuzzy Sets and Systems, 28, 305-312.
Ying, M. (2006). Linguistic quantifiers modeled by Sugeno integrals. Artificial Intel., 170, 581-606.
Sicilia, M. A., Díaz, P., Aedo, I., & García, E. (2002). Fuzzy linguistic summaries in rule-based adaptive hypermedia systems. In Adaptive Hypermedia and Adaptive Web-Based Systems Second Int. Conference, Malaga, Spain. Vila, M. A., Cubero, J. C., Medina, J. M., & Pons, O. (1997). Using OWA operator in flexible query processing. In The Ordered Weighted Averaging Operators: Theory, Methodology and Applications (pp. 258-274). Wygralak, M. (1999). Questions of cardinality of finite fuzzy sets. Fuzzy Sets and Systems, 102, 185-210. Yager, R. R (1982). A new approach to the summarization of data. Information Sciences, 28, 69-86. Yager, R. R. (1983a). Quantifiers in the formulation of multiple objective decision functions. Information Sciences, 31, 107-139. Yager, R. R. (1983b). Quantified propositions in a linguistic logic. International Journal of Man-Machine Studies, 19, 195-227. Yager, R. R. (1984). General multiple-objective decision functions and linguistically quantified statements. International Journal of Man-Machine Studies, 21, 389-400. Yager, R. R. (1988). On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Transactions on Systems, Man, and Cybernetics, 18, 183-190. Yager, R. R. (1992). On a semantics for neural networks based on fuzzy quantifiers. International Journal of Intelligent Systems, 7, 765-786. Yager, R. R. (1993). Families of OWA operators. Fuzzy Sets and Systems, 59, 125‑148.
Zadeh, L. A. (1983). A computational approach to fuzzy quantifiers in natural languages. Computers and Mathematics with Applications, 9, 149-183.
Key Terms
Fuzzy Predicate: Predicate defined by a fuzzy set. A fuzzy predicate delivers a degree of satisfaction.
Gradual Integer: Integer which takes the form of a fuzzy subset of the set of naturals (interpreted as a conjunction). Such integers differ from fuzzy numbers, which are interpreted as disjunctions of candidates.
Gradual Rational Number: Gradual number interpreted as a conjunction and defined as the ratio of two gradual relative integers.
Gradual Relative Integer: Gradual number represented by a fuzzy subset of the set of relatives (interpreted as a conjunction). It is defined as the subtraction of two gradual integers.
Linguistic Quantifiers: Quantifiers defined by linguistic expressions like "around 5" or "most of." Such quantifiers allow an intermediate attitude between the conjunction (expressed by the universal quantifier ∀) and the disjunction (expressed by the existential quantifier ∃).
OWA Operator: Ordered Weighted Average Operator. The inputs are assumed to be sorted, and the weights of this average are associated to the input data depending on their rank (weight w1 is associated to the largest input, weight w2 to the second largest input, and so forth).
Sugeno Fuzzy Integral: Aggregation operator which can be viewed as a compromise between two aspects: (1) a certain quantity (a fuzzy measure) and (2) a quality of information (a fuzzy set).
Chapter XI
FSQL and SQLf: Towards a Standard in Fuzzy Databases

Angélica Urrutia, Universidad Católica del Maule, Chile
Leonid Tineo, Universidad Simón Bolivar, Venezuela
Claudia Gonzalez, Universidad Simón Bolivar, Venezuela
Abstract

At present, FSQL and SQLf are the main fuzzy-logic-based extensions proposed for SQL. It would be very interesting to integrate them into a standard for fuzzy databases. The issue is what to take from each proposal. In this chapter, we analyze FSQL and SQLf, comparing them in several ways: approach direction, fuzzy components, system architecture, satisfaction degree, evaluation mechanisms, and experimental performance. We observe that there are powerful and interesting features in both proposals that could be combined in a unified language for fuzzy relational databases.
Introduction

In order to give greater flexibility to relational database management systems (RDBMS), different languages and models have been conceived with the incorporation of fuzzy logic concepts into information treatment. Two outstanding proposals in the application of fuzzy logic to databases are FSQL (Galindo, 1999, 2007) and SQLf (Bosc & Pivert, 1995a). This chapter presents a comparison of these two proposals from different points of view.
FSQL was created in order to allow the treatment of uncertainty in fuzzy RDBMS. It allows the representation and manipulation of precise and vague data. It distinguishes three data categories: crisp, referential ordered, and referential not ordered. It uses possibility distributions and similarity relations for the representation of vague data, following the GEFRED model (Medina, 1994; Medina, Pons, & Vila, 1994). For the manipulation of these data, FSQL extends some components of SQL with elements of fuzzy logic. It includes the use of possibility and necessity measures.
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
A catalogue named FMB has been conceived around FSQL to represent vague data and linguistic terms in a relational database. Additionally, FuzzyEER, an extension of the EER model (Extended Entity-Relationship), has been conceived to allow the conceptual design of databases that incorporate vague data (Urrutia, 2003; Urrutia, Galindo, Jiménez, & Piattini, 2006; Urrutia, Galindo, & Piattini, 2002). A mechanism for the translation of a conceptual scheme in FuzzyEER to FMB has been established (Galindo, Urrutia, & Piattini, 2006). There exist two known implementations of FSQL at the present time, one in Oracle (Galindo, 1999, 2007) and the other in PostgreSQL (Maraboli & Abarzua, 2006). SQLf was conceived in order to represent vague requirements in queries to relational databases. It includes extensions based on fuzzy logic for all the elements of the SQL standards up to SQL3. In this language, query conditions may involve diverse linguistic user-defined terms that are specified through an extension of the DDL. SQLf allows fuzzy queries over precise data, producing discriminated answers. That is to say, each row in the answer has associated with it its satisfaction degree of the vague requirement represented by the query. In order to evaluate queries in SQLf, it has been proposed to take advantage of the existing connections between fuzzy and classic sets. From a fuzzy query, the derivation principle allows obtaining a derived precise query. The processing of the fuzzy query is made on the result of the derived query. There are two known SQLf implementations, both on Oracle. The comparison made in this work concerns the following aspects: variety in the use of fuzzy logic elements; satisfaction degree semantics of the answer set; evaluation mechanisms for query processing; proposed architectures for the implementation; and experimental performance analysis with current prototypes.
With the research work presented in this chapter, we open the way for the integration of FSQL and SQLf towards a new standard for fuzziness treatment in databases. This chapter has been organized as follows:
The next section gives a basic background on fuzzy sets. You can read more about this in the first chapter of this handbook. In the following section, we present the approach directions of FSQL and SQLf prior to pointing out the fuzzy components of both languages. Then, we give a general view of the architecture of the SQLf and FSQL implementations. We also explain in one section the use of satisfaction (fulfillment) degrees in these languages. Evaluation mechanisms for fuzzy queries, according to the two proposals, are discussed, along with an experimental performance analysis of existing prototypes. Finally, we present some conclusions and future trends of this work.
Fuzzy Sets Background Fuzzy sets were introduced in Zadeh (1965) to model fuzzy classes in control systems, and their use has been expanded to different domains: mathematics, classification, pattern matching, artificial intelligence, and so forth. In the first chapter of this volume, Galindo introduces fuzzy logic and fuzzy databases. See also the overview chapter by Kacprzyk, Zadrozny, De Tré, and De Caluwe in this book about fuzzy approaches to flexible database querying.
Fuzzy Sets

Fuzzy set theory stems from classic set theory, adding a membership function to the elements of the set, defined in such a way that each element is assigned a real number between 0 and 1 (Zadeh, 1965, 1978). A fuzzy set A over the universe of discourse U is defined by means of a membership function µA:U → [0,1]. This function indicates the degree to which the element u is included in the concept represented by the fuzzy set. The degree 0 means that the element is completely excluded from the set, while the degree 1 means that it is completely included. It is also possible to represent a fuzzy set with a set of pairs.
A = {µA(u)/u : u ∈ U, µA(u) ∈ [0,1]}    (1)
Linguistic Label

A linguistic label is a natural language word that expresses or identifies a fuzzy set. In our everyday life we use many linguistic labels to express abstract concepts such as "young," "old," "cold," "hot," "cheap," "expensive," and so forth. This intuitive definition not only varies from one person to another and depends on the moment, but also varies with the context in which it is applied. For example, the linguistic label "high" does not mean the same in the phrases "a high person" and "a high building."

Example 1: Suppose we express the qualitative concept "young" by means of a fuzzy set, where the X axis represents the universe of discourse "age" (in natural numbers) and the Y axis represents the membership degrees in the interval [0,1]. The fuzzy set representing that concept could be expressed in the following way (considering a discrete universe):

Young = 1/0 + ... + 1/25 + 0.9/26 + 0.8/27 + 0.7/28 + 0.6/29 + 0.5/30 + ... + 0.1/34
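Read as a piecewise-linear function, the "young" set of Example 1 can be sketched as follows (the function name and the implied cut-off at age 35 are our reading of the ellipsis, not part of the original definition):

```python
def mu_young(age: int) -> float:
    """Membership of `age` in the fuzzy set "young" from Example 1:
    fully young up to 25, decreasing linearly to 0 at age 35 (assumed)."""
    if age <= 25:
        return 1.0
    if age >= 35:
        return 0.0
    return round((35 - age) / 10, 1)
```

For instance, this reproduces the listed degrees 0.9/26 through 0.1/34.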
Fuzzy Number

The concept of fuzzy number was introduced in Zadeh (1978) with the purpose of analyzing and manipulating approximate numeric values, for example, "near 0" and "almost 5." The concept has been refined (Dubois & Prade, 1985, 1998), and several definitions exist.

Definition 1: Let A be a fuzzy set in X and µA : X → [0,1] its membership function. A is a fuzzy number if µA is convex and upper semicontinuous and the support of A is bounded. These requirements can be relaxed. Some authors add the constraint of being normalized
in the definition, that is, sup(µA(x)) = 1. The general form of the membership function of a fuzzy number A can be seen in Figure 1 and can be defined as:

µA(x) = rA(x)   if x ∈ [α, β)
        h       if x ∈ [β, γ]
        sA(x)   if x ∈ (γ, δ]
        0       otherwise
where rA:X → [0,1], sA:X → [0,1], rA is increasing, sA is decreasing, and rA(β) = h = sA(γ) with h ∈ (0,1] and α, β, γ, δ ∈ X. The number h is called the height of the fuzzy number, and the interval [β, γ] is the kernel. A particular case of fuzzy numbers is obtained when we consider the functions rA and sA as linear functions. This type of function is often used. We call this type of fuzzy number triangular or trapezoidal. We will usually work with normalized fuzzy numbers, for which h = 1, and in this case we will be able to characterize a normalized trapezoidal fuzzy number A using only the four necessary numbers: A ≡ (α, β, γ, δ).
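A normalized trapezoidal fuzzy number A ≡ (α, β, γ, δ) with linear edges can be sketched directly from the four numbers (the argument names a-d stand for α-δ):

```python
def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Membership in a normalized trapezoidal fuzzy number A = (a, b, c, d):
    rises linearly on [a, b), equals 1 on [b, c], falls linearly on (c, d]."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:                       # rising edge r_A
        return (x - a) / (b - a)
    return (d - x) / (d - c)        # falling edge s_A
```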
Fuzzy Logic

Fuzzy set theory is the base of fuzzy logic. In this logic, the truth-value of a sentence (or satisfaction degree) lies in the real interval [0,1]. The value 0 represents completely false, and 1 completely true. The truth-value of a sentence s will be denoted as µ(s). This logic allows giving an interpretation to linguistic terms:
Figure 1. General fuzzy number (height h, support [α, δ], kernel [β, γ])
•	Predicates (synonyms of linguistic labels) are the atomic components of this logic, defined by a membership function on a fuzzy set. For example, linguistic terms such as "young," "tall," "heavy," and "low" are predicates.
•	Modifiers, linguistic terms that allow defining modified fuzzy predicates, are interpreted by means of transformations of the membership function. In this category are the natural language adverbs, for example, "very," "relatively," and "extremely."
•	Comparators, kinds of fuzzy predicates defined on pairs of elements, establish fuzzy comparisons; for example, "much greater than," "approximately equal to," and "close to" are fuzzy comparators.
•	Connectors are operators defined for combining fuzzy sentences. Fuzzy negation, conjunction, and disjunction are extensions of the classical ones. They preserve the existing correspondence with the set operations complement (negation), intersection, and union, respectively. Connectors may be classified by the number of their operands as unary (such as negation), binary (such as implication), or multi-ary (such as average).
•	Quantifiers are terms describing quantities such as "most of," "about a half," and "around 20." They are an extension of the classical existential and universal quantifiers. Two types of fuzzy quantifiers are distinguished: absolute and proportional (relative). Absolute quantifiers represent amounts that are absolute in nature, such as "about 5" or "more than 20." An absolute quantifier can be represented by a fuzzy subset Q, such that for any non-negative real p ∈ R+, the membership grade of p in Q (denoted by µQ(p)) indicates the degree to which the amount p is compatible with the quantifier represented by Q. Proportional or relative quantifiers, such as "at least half" or "most," can be represented by fuzzy subsets defined in the unit interval [0,1]. For any proportion p ∈ [0,1], µQ(p) indicates the degree to which the proportion p is compatible with the meaning of the quantifier.
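The absolute/proportional distinction can be made concrete with two illustrative membership functions (the breakpoints chosen below are our own, not values prescribed by either language):

```python
def about_5(p: float) -> float:
    """Absolute quantifier "about 5": triangular, peak at 5, support [3, 7]."""
    return max(0.0, 1 - abs(p - 5) / 2)

def most(p: float) -> float:
    """Proportional quantifier "most" over a proportion p in [0, 1]:
    0 up to 0.3, 1 from 0.8, linear in between."""
    return min(1.0, max(0.0, (p - 0.3) / 0.5))
```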
Definition 2: Functionally, linguistic quantifiers are usually of one of three types: increasing quantifiers (such as "at least n," "all," "most") are characterized by Q(a) ≤ Q(b) for all a ≤ b
the relevance degrees of the fuzzy conditions A1, ..., An, respectively. For example, "Clark satisfies around 3 of: good salary, great account, excellent interpersonal skills, very high preparation, relatively young age, with relevances 1, 1, 0.5, 0.8, 0.4." The evaluation of quantified statements is studied in this handbook in the chapter by Liétard and Rocacher.

The satisfaction degree, or truth-value, of basic sentences may be obtained in a simple way from the membership functions of predicates, modified predicates, or comparators. In the case of combined sentences, the truth-value is obtained by applying the operators defining the connectors to the results of the simpler conditions. For quantified sentences, the satisfaction degree calculation is more complex. In fact, there are several interpretations of these sentences that lead to different degrees for a given sentence (Bosc & Pivert, 1995; Delgado, Sánchez, & Vila, 2000; Dubois & Prade, 1998; Galindo, Urrutia, Carrasco, & Piattini, 2004b; Galindo et al., 2006; Tineo, 2006; Zadeh, 1995).

This logic is a multivalued logic whose main characteristics are (Zadeh, 1978):

•	In fuzzy logic, exact reasoning is considered a specific case of approximate reasoning. Any logical system can be converted into terms of fuzzy logic.
•	In fuzzy logic, knowledge is interpreted as a set of flexible or fuzzy restrictions over a set of variables. Inference is considered a process of propagation of those restrictions: the process by which a result is reached, consequences are obtained, or one thing is deduced from another.
•	In fuzzy logic, all problems are problems of degree.

From this simple concept, a complete mathematical and computing theory has been developed which facilitates the solution of certain problems (Blanco, 2001; Medina, 1994; Pedrycz & Gomide, 1998).
Fuzzy logic has been applied to a multitude of objectives such as control systems, modeling, simulation, pattern recognition, information or knowledge systems, computer vision, artificial intelligence, artificial life, and so forth.
Possibility Theory

A fuzzy set may also be used to represent the imprecise value of a data item. In this case, the membership function of the fuzzy set is said to be a possibility distribution, measuring the possibility of the actual values. When predicates or comparators are applied to fuzzy numbers or fuzzy labels, the truth-value of a fuzzy logic sentence becomes an imprecise value. In this case, we say that we use a possibilistic logic. The satisfaction degree is obtained with the application of the extension principle. This principle extends any operator over a domain for use over possibility distributions (imprecise data values) in this same domain (Medina et al., 1994; Tineo, 1998). In possibilistic logic, it is also possible to use two measures for the truth-value of a sentence, instead of an imprecise value. These are the possibility and necessity measures. Possibility is a measure of belief, while necessity is a measure of certainty.

Definition 3: The possibility measure of a logic sentence s, Π(s), is the maximum value in the possibility distribution for the sentence's fuzzy truth-value µ(s), that is, Π(s) = sup(µ(s)). The necessity measure of a logic sentence s, Ν(s), is the complement to one of the maximum value in the possibility distribution for the fuzzy truth-value of the negation of the given sentence µ(not s), that is, Ν(s) = 1 − sup(µ(not s)).

In the case of sentences using only precise values with fuzzy logic terms, the possibility measure in possibilistic logic coincides with the satisfaction degree in fuzzy logic. Extended explanations about the possibility and necessity measures and the extension principle are included in the first chapter of this handbook.
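For imprecise data, Definition 3 can be instantiated over a discrete possibility distribution using the standard sup-min construction (a simplified sketch; the distribution and the "young" predicate below are illustrative assumptions):

```python
def possibility(pi: dict, pred) -> float:
    """Pi(pred) = sup_x min(pi(x), mu_pred(x)): how possible it is that the
    imprecisely known value (possibility distribution pi) satisfies pred."""
    return max(min(p, pred(x)) for x, p in pi.items())

def necessity(pi: dict, pred) -> float:
    """N(pred) = 1 - Pi(not pred): how certain the satisfaction is."""
    return 1.0 - max(min(p, 1.0 - pred(x)) for x, p in pi.items())

# Example: an age known only as "possibly 24 (degree 1.0) or 28 (degree 0.6)".
age_pi = {24: 1.0, 28: 0.6}
young = lambda x: 1.0 if x <= 25 else max(0.0, (35 - x) / 10)
```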
Approach Direction

Database implementation and use comprises three levels: the conceptual, the logical, and the
physical. The conceptual level deals with the abstraction of data in the universe of the database. The logical level consists of the representation of these concepts in a database management system's native language or model, and the high-level specification of user requirements. The physical level corresponds to the storage structures and access paths with their tuning parameters. Both FSQL (Galindo, 1999, 2007; Galindo, Medina, Pons, & Cubero, 1998a; Galindo, Medina, Pons, Vila, & Cubero, 1998b; Galindo, Medina, Vila, & Pons, 1998c; Galindo et al., 2006) and SQLf (Bosc & Pivert, 1992, 1994, 1995a; Goncalves & Tineo, 2001a, 2001b, 2006a, 2006b; Tineo, 2005, 2006) are database proposals at the logical level, since they are extensions of SQL. But their focuses lie in different approaches. One possible approach is to deal with concept representation in a data model. Another approach is to deal with user requirement specification. Both approaches could be mixed to different degrees in the FSQL and SQLf proposals. There are two main ways of adding fuzzy-set-based capabilities to database systems. The first is to allow the representation of imprecise data (Bosc & Pivert, 1994). The second is to allow the expression of flexible requirements over regular databases (Bosc & Pivert, 1991). It is also possible to combine these two ways, that is, to support both fuzzy data and fuzzy queries. Let us talk first about the SQLf approach: many systems in massive public and private use could benefit from fuzzy query capabilities. We might mention criminal justice systems, voter registration, international banking information, and tourism service systems. For some authors (Cox, 1995; Fagin, 1999), it seems very useful to allow flexible querying on data contained in existing databases, which contain precise data. This is the focus of SQLf.
SQLf has been conceived to solve the rigidity problem of classical database querying systems. SQLf is intended to provide more flexibility in querying by means of fuzzy sets. This approach has been proved to be the most general for user preference
based querying (Bosc & Pivert, 1992; Goncalves & Tineo, 2006a, 2006b). The design of SQLf is oriented to providing a large variety of preference-based querying structures. This language allows the use of fuzzy conditions in any place where SQL (ANSI, 1986, 1989) allows a classic (Boolean) condition. Moreover, fuzzy conditions in SQLf may use any kind of linguistic term with a fuzzy-set-based interpretation. These terms are predicates, modifiers, comparators, connectors, and quantifiers. The SQLf definition has been updated in order to support features introduced in the standards SQL-92 (ANSI, 1992; Goncalves & Tineo, 2001a) and SQL-99 (Goncalves & Tineo, 2001b; Melton, 1993). SQLf gives high flexibility to users in the following sense: they may create or define (Tineo, 1998) their own fuzzy terms in an arbitrary way and use them in querying any database. There is no intervention of a database designer or administrator in the creation and use of fuzzy terms. This is consistent with the intention of providing a language for querying existing databases, providing fuzzy querying capabilities over classic relational databases. On the other hand, the approach of FSQL is the following: the fuzzy relational model uses a fuzzy degree in each row or tuple. The model is based on the similarity relations of Buckles and Petry (1992, 1984) and the relational models with possibility distributions of Umano, Fukami, Mizumoto, and Tanaka (1980) and other authors (Prade & Testemale, 1987; Zemankova-Leech & Kaendel, 1984). Medina's (1994) doctoral thesis embraced generalizations of fuzzy models. Medina proposes a conceptual framework for fuzzy representation called GEFRED (GEneralized Model for Fuzzy Relational Databases) and a language called FSQL (Fuzzy SQL).
In the same research group, a young computer engineer, José Galindo (1999), started his doctoral research under the supervision of Medina, in order to improve the relational algebra of the GEFRED model, to define a fuzzy relational calculus, and to implement an FSQL server. In fact, the possibility and
necessity measures, shown by Dubois and Prade (1985, 1998), do not only allow the construction of two fuzzy comparators; 14 of them are defined in FSQL. In Medina (1994) and Medina et al. (1994), the GEFRED model was proposed for FRDB. The GEFRED model represents a synthesis among the different models which have appeared to deal with the problem of the representation and management of fuzzy information in relational databases. One of the main advantages of this model is that it consists of a general abstraction that allows dealing with different approaches, even when these may seem disparate. The GEFRED model is based on the definitions of Generalized Fuzzy Domain (D) and Generalized Fuzzy Relation (R).

Definition 4: Let U be the discourse domain or universe and ℘(U) the set of all possibility distributions defined over U, including those which define the Unknown and Undefined types. Let NULL be another type. The Generalized Fuzzy Domain is D ⊆ ℘(U) ∪ NULL. The Unknown, Undefined, and NULL types are defined according to the Umano-Fukami model.

Definition 5: A Generalized Fuzzy Relation R is given by two sets, "Head" (H) and "Body" (B), R = (H, B), defined as follows. The "Head" consists of a fixed set of attribute-domain-compatibility terms (where the last is optional), H = {(A1: D1 [, C1]), ..., (An: Dn [, Cn])}, where each attribute Aj has an underlying fuzzy domain, not necessarily different, Dj (j = 1, 2, ..., n). Cj is a "compatibility attribute" which takes values in the range [0,1]. The "Body" consists of a set of different generalized fuzzy tuples, where each tuple is composed of a set of attribute-value-degree terms (the degree is optional), B = {(A1: di1 [, ci1]), ..., (An: din [, cin])} with i = 1, 2, ..., m, where m is the number of tuples in the relation, and where dij represents the domain value for tuple i and attribute Aj, and cij is the compatibility degree associated with this value.

Also, the GEFRED model defines fuzzy comparators, which are general comparators based on
any existing classical comparator (>, <, =, etc.), but it does not consolidate the definition of each one. The only requirement established is that the fuzzy comparator should respect the classical comparator outcomes when comparing possibility distributions expressing crisp values (like 1/x with x belonging to X). FSQL is built considering the components of GEFRED. The FSQL language extends the SQL language (Galindo et al., 1998a, 2006) to allow flexible queries. The SELECT command has been extended in order to express flexible queries including: linguistic labels, fuzzy comparators, fulfillment thresholds (THOLD clause), the function CDEG, the character %, fuzzy constants, conditions with IS, and so forth. A great part of these components will be discussed in this work and in the chapter by Ben Hassine et al. in this book. FSQL extends the data types of a traditional database: fuzzy attribute values are stored in the relations with a special format. The fuzzy attributes are classified by the system into four types:

•	Fuzzy Attributes Type 1: These attributes are totally crisp (traditional), but they have some linguistic trapezoidal labels defined on them, which allow us to make query conditions on these attributes more flexible.
•	Fuzzy Attributes Type 2: These attributes admit crisp data as well as possibility distributions over an ordered underlying domain.
•	Fuzzy Attributes Type 3: These attributes do not have an ordered underlying domain, for instance, hair color. On these attributes, some labels are defined, and on these labels, a similarity relation has to be defined.
•	Fuzzy Attributes Type 4: These attributes are like Type 3 attributes but without a similarity relation.

Also in Galindo et al. (2006), another five data types incorporating fuzzy degrees are presented.
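As a rough illustration of Definition 5 (and of Type 2 attributes holding possibility distributions), a generalized fuzzy tuple can be sketched as attribute-to-(value, optional compatibility degree) pairs; the schema, names, and values below are entirely invented for illustration:

```python
# Each attribute stores a (value, compatibility degree) pair, per Definition 5.
# A degree of None stands for an omitted c_ij. All names/values are invented.
employee = {
    "name":   ("Clark", None),            # crisp value, no degree stored
    "salary": ("high", 0.8),              # linguistic label, compatibility 0.8
    "age":    ({24: 1.0, 28: 0.6}, 0.6),  # Type 2 possibility distribution
}

def compatibility(row: dict, attr: str) -> float:
    """Return the compatibility degree c_ij, defaulting to 1.0 if omitted."""
    _value, degree = row[attr]
    return 1.0 if degree is None else degree
```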
Fuzzy Logic Components

Fuzzy logic allows giving an interpretation to linguistic terms known as fuzzy predicates, modifiers, fuzzy comparators, connectors, and fuzzy quantifiers. Fuzzy predicates are often named fuzzy labels. These fuzzy logic components are defined in FSQL and SQLf with some differences that we present hereafter. The main difference between SQLf and FSQL is in the intervention of a database designer for linguistic terms. In SQLf, the user is free to create arbitrary linguistic terms and use them in any place where this kind of term may be used, without any other semantic restriction. In FSQL, some linguistic terms are built into the system, and the others must be defined at database design time, specifying their context of use. SQLf is inclined to give the user more flexibility, while FSQL is inclined to give more coherent semantics to the use of fuzzy terms from the design. SQLf is oriented to querying classic databases, while FSQL is oriented to querying, building, and managing fuzzy or classic databases.
Fuzzy Predicates or Linguistic Labels

SQLf provides three forms to define a predicate: through a trapezium, by extension, and by arithmetic expression. The trapezium-shaped function is defined by giving the four x values of its inflection points. The extension-defined function is given by a finite set of pairs µ(x)/x. The arithmetic-expression-defined functions are specified with an arbitrary expression over the variable x in the domain, whose value is in the real unit interval. SQLf allows fuzzy predicates to be defined over structured data such as row (or tuple) types and time stamps from SQL2. FSQL also allows the definition of predicates with trapezium- or extension-specified membership functions, but it does not consider membership functions defined by general arithmetic expressions. In the FSQL bibliography, fuzzy predicates are rather called linguistic labels. FSQL allows the use of these labels in any place where a data value is allowed, for example, using any comparison operator on either side, or as the value of an imprecise datum. On the other hand, SQLf allows their use only on the right side of an equality comparison, which is interpreted as the measure in which the value meets the predicate.

Fuzzy Modifiers

SQLf has two predefined modifiers:

•	The predefined modifier NOT is interpreted as µNOT P(x) = 1 − µP(x).
•	Antonyms like "small" and "big" or "young" and "old" are related through the predefined modifier ant, whose semantics is µant P(x) = µP(M − x) with x ∈ [0, M].

In SQLf, the user may create fuzzy modifiers defined by powers, translations, and triangular norms or co-norms:

•	When a modifier is defined as a power of the membership function, the modified predicate is defined as µmod P(x) = (µP(x))^n.
•	When the modifier is defined by a triangular norm or co-norm θ, it must be non-idempotent, and the modified predicate is defined as µmod P(x) = µPn(x) = (P θ ... θ P). If θ is a triangular norm, the modifier is a contractor. On the other hand, if θ is a triangular co-norm, the modifier is a dilator.
•	When the modifier is seen in terms of a translation, it is defined as µmod P(x) = µP(x + δ) with δ ∈ ℜ; this represents a translation in the x axis.

FSQL has just four built-in modifiers:

•	Norm: this modifier normalizes the fuzzy value, dividing the original membership function by the height of the fuzzy value.
•	Conc_Dilat: this modifier is parameterized with a real number argument p. If p > 1, this function returns a concentrated version of the fuzzy value; the membership function of this version takes on relatively smaller values, being raised to the power p. Usually, p = 2. If p ∈ (0, 1), this modifier dilates the fuzzy value; the membership function of this version takes on relatively greater values, being raised to the power p. Usually, p = 0.5 (the square root).
•	More_Contrast: also uses a parameter p. This modifier is the contrast intensification function, and it returns the fuzzy value with more contrast. Membership values lower than 0.5 are diminished, while grades of membership above 0.5 are elevated.
•	Fuzzification: this function has a complementary effect to that of intensification. These operations are defined with a parameter p (usually p = 2).
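The concentration/dilation and contrast behaviors can be sketched for a single membership degree as follows (the function names are ours, and the piecewise contrast formula follows the usual textbook definition; FSQL applies these to whole fuzzy values, not single degrees):

```python
def conc_dilat(mu: float, p: float = 2.0) -> float:
    """Power modifier: p > 1 concentrates (degrees shrink), p in (0,1)
    dilates (degrees grow). Sketch of FSQL's Conc_Dilat idea."""
    return mu ** p

def more_contrast(mu: float, p: float = 2.0) -> float:
    """Contrast intensification: degrees below 0.5 are pushed down,
    degrees above 0.5 are pushed up (classic textbook form)."""
    if mu <= 0.5:
        return (2 * mu) ** p / 2
    return 1 - (2 * (1 - mu)) ** p / 2
```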
Fuzzy Comparators

In addition to the typical comparators (=, >, >=, ...) (EQ, GT, GEQ), SQLf and FSQL include fuzzy comparators. The fuzzy comparators for SQLf are user-defined. In the case of a numeric domain, a fuzzy comparator may be defined by means of a distance measure. Allowed distance measures are the difference and the quotient. The satisfaction degree of the comparison is given by the membership of this distance in a user-given fuzzy set. In the case of a scalar domain, it is also possible to define fuzzy comparators by extension, listing the related pairs with their corresponding satisfaction degrees. Comparison is always established between regular (crisp) data values. SQLf allows fuzzy comparators to be defined over structured data such as row (or tuple) types and time stamps from SQL2. A useful user-defined comparator is "similar" in the colors domain. The definition of the similar comparator consists mainly in defining the "similarity" grades between some colors = {Black, Brown, White} with the following data:

        Black  Brown  White
Black   1      0.7    0
Brown   0.7    1      0.1
White   0      0.1    1
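The matrix above can be read directly as a symmetric lookup table; a minimal sketch of the user-defined "similar" comparator:

```python
# Similarity relation for the colors domain above; only one triangle is
# stored and lookups are made symmetric.
SIMILAR = {
    ("Black", "Black"): 1.0, ("Black", "Brown"): 0.7, ("Black", "White"): 0.0,
    ("Brown", "Brown"): 1.0, ("Brown", "White"): 0.1,
    ("White", "White"): 1.0,
}

def similar(a: str, b: str) -> float:
    """Symmetric lookup of the similarity degree between two colors."""
    return SIMILAR.get((a, b), SIMILAR.get((b, a), 0.0))
```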
On the other hand, FSQL allows fuzzy comparison between both crisp and fuzzy data items.
FSQL has 18 built-in fuzzy comparators (see Table 4 in the chapter by Ben Hassine et al. in this volume). Six of them are defined as possibility measures for the extended versions of =, >, >=, <, <=, <> (FEQ, FGT, FGEQ, FLT, FLEQ, FDIF). Two are particularly fuzzy: >> and <<, much greater than (MGT) and much less than (MLT). Eight other comparators have been conceived as the necessity-measure counterparts of the preceding possibility comparators (NFEQ, NFGT, NFGEQ, NFLT, NFLEQ, NFDIF, NMGT, NMLT). Additionally, there are the inclusion comparators: included-in (INCL) and fuzzy-included-in (FINCL). The comparison of fuzzy data types with a non-ordered referential (Types 3 and 4) is carried out in FSQL with a special version of fuzzy equal (FEQ) and fuzzy different (FDIF). To define the "similar" comparator of the previous example in FSQL, the similarity relation is necessary. It allows specifying the grades defined above. Then, the FEQ and FDIF comparators can be used between colors.
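A possibility-based comparator such as FEQ is commonly computed, for discrete representations, as the supremum of the pairwise minimum of the two membership functions; the sketch below assumes this standard sup-min formulation (the actual FSQL implementation may differ):

```python
def feq(a: dict, b: dict) -> float:
    """Possibility-based fuzzy equality over discrete fuzzy values given as
    {x: degree} mappings: Pi(A = B) = sup_x min(muA(x), muB(x))."""
    common = set(a) & set(b)
    return max((min(a[x], b[x]) for x in common), default=0.0)
```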
Connectors

In SQLf, the usual connectors conjunction and disjunction are predefined with the triangular norm min and its respective co-norm max; the unary connector negation is also provided, interpreted as the complement to one. SQLf allows using mean operators as multi-ary connectors. Included means are: arithmetic mean, geometric mean, harmonic mean, weighted mean, and ordered weighted mean. SQLf provides users the capability of creating their own connectors. These are specified using an arithmetic expression on the variables x and y for the left-side and right-side operands, respectively. Users may define, for example, the implication, adopting their preferred interpretation. Another interesting example is the use of triangular norms and co-norms different from the usual min and max. FSQL has only conjunction, disjunction, and negation as connectors. Nevertheless, users may choose the interpretation to be used for them. Negation must be a negation function. Conjunction may use triangular norms: "minimum," "product," "drastic product," "bounded product p," "Einstein product," "Hamacher product p," and so forth. Disjunction may use triangular co-norms: "maximum," "sum-product," "drastic sum," "bounded sum p," "Einstein sum," and so forth. It should be noted that, for the sake of simplicity, if the norm needs some argument p, it is included after the name. It seems very useful to have predefined triangular norms and co-norms. By default, the usual min, max, and complement-to-one operators are used. The most important t-norms and t-conorms are defined in Chapter I of this handbook.
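Choosing an interpretation for conjunction and disjunction amounts to choosing a t-norm/t-conorm pair; a small illustrative table (names here are generic, not FSQL's exact identifiers):

```python
T_NORMS = {                        # conjunction interpretations
    "minimum": min,
    "product": lambda x, y: x * y,
    "bounded product": lambda x, y: max(0.0, x + y - 1),
}
T_CONORMS = {                      # disjunction interpretations
    "maximum": max,
    "probabilistic sum": lambda x, y: x + y - x * y,
    "bounded sum": lambda x, y: min(1.0, x + y),
}
```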
Quantifiers

SQLf has a set of predefined quantifiers such as at least x, at most x, most, a few, around x and y, about x, at least half, and about a half, where x and y are user-chosen numeric parameters. If they wish, users can overwrite these fuzzy quantifiers in a context. Traditional quantifiers (exists, any, all) are also predefined and may be used in fuzzy quantified sentences. Users may also define their own fuzzy quantifiers. In any case, the representation of a fuzzy quantifier is a fuzzy number with a trapezium-shaped membership function. SQLf allows fuzzy quantifiers of any nature (absolute or proportional) and any behavior (increasing, decreasing, or unimodal). FSQL has many predefined quantifiers. They are fuzzy exists, most, almost all, about half, minority, about half of x, approximately x, twice x, approximately the xth part, less than the xth part, more than the xth part, approximately between half x and half y, approximately between twice x and twice y, and approximately between the xth and yth part. FSQL also allows absolute and proportional (relative) fuzzy quantifiers with any behavior. The database user may define specific fuzzy quantifiers, and these should be associated with an attribute, a table, or the whole system. The other classification is related to the behavior (increasing, decreasing, or unimodal) of the membership function and classifies it as triangular, singleton,
L, gamma, S, Gaussian, pseudo-exponential, trapezoidal, and extended trapezoidal. Although, theoretically FSQL accepts any of them, it only has implemented the extended trapezoidal function.
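A minimal sketch of such a trapezium-shaped quantifier follows; the labels and breakpoints are illustrative, not the predefined quantifiers of either language:

```python
def trapezium(a, b, c, d):
    """Membership function of a trapezoidal fuzzy quantifier:
    0 up to a, rising on [a, b], 1 on [b, c], falling on [c, d], 0 beyond d."""
    def mu(x):
        if b <= x <= c:
            return 1.0
        if x <= a or x >= d:
            return 0.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu

# A proportional, increasing quantifier like "most": its argument is a ratio in [0, 1].
most = trapezium(0.3, 0.8, 1.0, 1.0)

# An absolute, unimodal quantifier like "about 50": its argument is a count.
about_50 = trapezium(40, 48, 52, 60)
```

For example, about_50(50) is 1 (fully satisfactory) while about_50(44) is 0.5 (borderline), which matches the graded reading of "about 50" rather than a crisp interval.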
Combiner

A combiner operator receives one or more fuzzy predicates and returns a fuzzy predicate. FSQL defines combiner operators for union and intersection purposes:

UNION(fuzzy_values [, t-conorm]): returns the union of the fuzzy values (separated by commas), computed with the t-conorm indicated in the last argument. This last argument is optional; by default the maximum t-conorm is used. For example, UNION(Quality, $[4,5,6,7], "maximum").

INTERSECTION(fuzzy_values [, t-norm]): returns the intersection of the fuzzy values, computed with the t-norm indicated in the optional last argument. If this argument does not appear, the intersection uses the minimum t-norm.

FSQL also defines the fuzzy UNION, INTERSECT, and MINUS operations between two subqueries.
CARD(fuzzy_value)

This function returns the cardinality of a fuzzy value. For example, in order to retrieve the rows whose value in an attribute has less fuzziness than a fuzzy constant, a SELECT statement may include the following condition: CARD(Quality) < CARD(3+-2).
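A sketch of how union, intersection, and cardinality could operate on discrete fuzzy values: here a fuzzy value is represented as a dict from domain elements to membership degrees, and the cardinality is taken as the sigma-count (the sum of degrees), which is one common choice and not necessarily FSQL's internal one:

```python
def fuzzy_union(a, b, t_conorm=max):
    """Pointwise union of two discrete fuzzy values; default t-conorm is maximum."""
    return {x: t_conorm(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}

def fuzzy_intersection(a, b, t_norm=min):
    """Pointwise intersection; default t-norm is minimum."""
    return {x: t_norm(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}

def card(a):
    """Sigma-count cardinality: the sum of the membership degrees."""
    return sum(a.values())

# Invented example values over a small numeric domain.
quality = {3: 0.5, 4: 1.0, 5: 0.5}
label   = {4: 0.8, 5: 1.0, 6: 0.8}
u = fuzzy_union(quality, label)
i = fuzzy_intersection(quality, label)
```

Under the sigma-count reading, a narrower (less fuzzy) value has a smaller CARD, which is what the comparison CARD(Quality) < CARD(3+-2) exploits.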
Summary

Some fuzzy characteristics of SQLf and FSQL are summarized in Table 1. On the other hand, Jiménez et al. (2005) show a relationship between the fuzzy attribute types of SQLf and FSQL. FSQL has some other characteristics, defined in Galindo et al. (2006), that should be mentioned: DML statements other than SELECT (INSERT, DELETE, and UPDATE), different fuzzy constants that may be used in fuzzy queries, different formats for queries with fuzzy quantifiers, fuzzy quantifiers associated with columns, tables, or the system, fuzzy comparators in temporal databases (FSQL defines 18 fuzzy comparators for fuzzy time, extending the temporal comparators of TSQL2), the use of classical comparators with fuzzy attributes, and a DDL sublanguage with new objects that can be created (LABELS, QUANTIFIERS, etc.).

Table 1. Comparison of some characteristics in SQLf and FSQL

Fuzzy term | Characteristic | SQLf | FSQL
Predicates (user defined) | Trapezium | Yes | Yes
Predicates (user defined) | Extension | Yes | Yes
Predicates (user defined) | Arithmetic expressions | Yes | No
Predicates (predefined) | Complex | Yes | No
Modifiers | NOT | Yes | Yes
Modifiers | Antonym | Yes | No
Modifiers | POWER or CONC_DILAT | Yes | Yes
Modifiers | Norm | Yes | Yes
Modifiers | Translation | Yes | No
Modifiers | MORE_CONTRAST | No | Yes
Modifiers | Fuzzyfication | No | Yes
Comparators (predefined) | Possibility / Necessity | No | Yes
Comparators (predefined) | Long distance (MGT/MLT) | No | Yes
Comparators (predefined) | (Fuzzy) Inclusion | No | Yes
Comparators (user defined) | Distance | Yes | No
Comparators (user defined) | Extension | Yes | No
Connectors (predefined) | AND as min | Yes | Yes
Connectors (predefined) | OR as max | Yes | Yes
Connectors (predefined) | AND as other t-norm | No | Yes
Connectors (predefined) | OR as other t-conorm | No | Yes
Connectors (predefined) | NOT | Yes | Yes
Connectors (user defined) | Means | Yes | No
Connectors (user defined) | Arithmetic expression | Yes | No
Quantifiers (predefined) | Trapezium | Yes | Yes
Quantifiers (user defined) | Trapezium | No | Yes
Quantifiers (user defined) | Others | Yes | Yes
Fuzzy constants | Trapezium, approximate values, intervals, possibility distributions… | No | Yes
Fuzzy operators (predefined) | Union, Intersection, Minus | No | Yes
Fuzzy operators (predefined) | Card | No | Yes
System Architecture

The fuzzy language definition (SQLf or FSQL) is independent of the implementation. However, in this section we study the architectures of the current implementations. According to Blanco (2001) and Timarán (2001), extensions to SQL may be incorporated into an RDBMS through different types of coupling architectures, namely loose, mild, and tight:

• In a loose coupling architecture, the new features are integrated through a software layer on top of the RDBMS. Its main advantage is portability, which allows connecting to any RDBMS. Its disadvantages are scalability and performance. The scalability issue stems from the fact that tools with this architecture load the whole data set into memory, which limits the management of large amounts of data. The low performance results from records being copied one by one from the database address space to the tool address space.

• In a mild coupling, the new features are integrated through stored procedures written in a procedural language for relational databases, such as Oracle PL/SQL, or through external function calls. The advantage of this architecture is that it exploits the data scalability, administration, and manipulation capabilities of the RDBMS. It achieves better performance because it does not need to communicate with an external layer: data are managed directly inside the RDBMS. Its disadvantages are complexity, architecture development costs, the need to define and install the stored procedures and user-defined functions each time a new database instance is created, and low efficiency for intense calculations involving a large number of relations. In addition, portability is null when another RDBMS does not provide the same procedural language or does not accept external function calls.

• In a tight coupling, the new features are incorporated into the RDBMS inner core. In this architecture, it is necessary to extend the parser, rewriter, planner, and executor so that the RDBMS is able to compile, transform, optimize, and execute an extended SQL query. The main advantage of this architecture is that it solves the scalability and performance problems of the other architectures. Its disadvantages mainly concern portability, since this architecture is tied to a specific RDBMS, whose engine has to be extended.
The proposed implementation architecture for SQLf sits on top of an existing RDBMS (Bosc & Pivert, 1995b; Tineo, 2006) with a loose coupling. Extensions are integrated through a software layer implemented in a generic language outside the RDBMS. The RDBMS receives and executes the SQL query and sends the resulting data to the tool, where they are processed. The main advantage is portability, which allows connecting to any RDBMS; the disadvantages are scalability and performance. The actual implementation of SQLf runs on the Oracle 9i RDBMS and is programmed in Java. Its compatibility with the major operating systems (e.g., Linux, Solaris, and Windows) has been proved. The SQLf architecture is composed of several components organized in three layers: interface, logic, and data. The main components of this architecture are the following (Figure 2):

• Client Interface. It receives the user's fuzzy queries and term definitions, and it shows the final results of user operations: the fuzzy query answer.

• Instructions Dispatcher. It is responsible for delivering to the remaining modules the structures necessary for the execution of an instruction and for receiving their respective answers.

• Fuzzy Terms Catalog. It allows the specification of user-defined fuzzy terms and retrieves the definition of such terms so that they can be used by the sentence analysis and evaluation mechanisms. These terms are stored in a database.

• Sentence Analyzer. It analyzes the sentences introduced by the user, checking the syntactic and semantic correctness of the statements.
It performs the translations needed for the execution of the statements. In the case of fuzzy queries, it builds a tree structure of the fuzzy query that is used in the evaluation process.

• Evaluator/Calibrator. It evaluates fuzzy queries and interacts with the RDBMS to retrieve the database elements relevant to the query processing. To do so, it uses a regular SQL query provided by the Sentence Analyzer, computes the satisfaction degrees, and calibrates the answer of the fuzzy query, using the result of the regular query and the tree structure of the fuzzy query.

• Query Translator SQLf → SQL-92 and SQL-99. It applies the derivation principle query transformations, obtaining Boolean queries from fuzzy SQLf ones. These queries are written in standard SQL according to the SQL-92 and SQL-99 norms.

• Translation Instructions SQL-92 and SQL-99 → SQL-RDBMS. We have chosen to express regular queries in the SQL-92 and SQL-99 standards in order to make our system as portable as possible. Therefore, we need to translate regular queries into the specific RDBMS query language.
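The loose-coupling evaluation flow of these components can be sketched end to end; everything below is illustrative (the RDBMS is stubbed with an in-memory list, and all names are invented for the sketch):

```python
# Sketch of loose-coupling evaluation: a layer outside the RDBMS derives a
# regular (Boolean) query, lets the RDBMS execute it, then computes the
# satisfaction degrees and calibrates the answer.

def execute_regular_query(rows, predicate):
    """Stub standing in for the RDBMS: evaluates a Boolean predicate."""
    return [r for r in rows if predicate(r)]

def evaluate_fuzzy_query(rows, boolean_predicate, membership, threshold):
    # 1. The derived Boolean query retrieves the relevant rows.
    candidates = execute_regular_query(rows, boolean_predicate)
    # 2. The software layer computes satisfaction degrees outside the RDBMS.
    graded = [(membership(r), r) for r in candidates]
    # 3. Calibration: keep degrees >= threshold, in decreasing order.
    return sorted(((d, r) for d, r in graded if d >= threshold),
                  key=lambda pair: pair[0], reverse=True)

# Toy data and an invented fuzzy predicate "young".
people = [{"name": "a", "age": 22}, {"name": "b", "age": 33},
          {"name": "c", "age": 50}]
young = lambda r: 1.0 if r["age"] <= 25 else max(0.0, (40 - r["age"]) / 15)
answer = evaluate_fuzzy_query(people, lambda r: r["age"] < 40, young, 0.25)
```

The division of labor mirrors the components above: the Boolean predicate plays the role of the derived query, and steps 2 and 3 correspond to the Evaluator/Calibrator.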
The SQLf catalog (see Figure 3) stores information about fuzzy terms and other objects provided with fuzziness, such as fuzzy views and fuzzy assertions. This catalog must also keep information about users, databases, groups, and privileges. It has been conceptually conceived in the EER model and logically implemented in a relational database. On the other hand, FSQL (Galindo, 1999, 2007; Galindo et al., 2006) has been conceived with a mild coupling architecture: FSQL is integrated through stored procedures and functions. The advantage of this architecture is that it exploits the data scalability, administration, and manipulation capabilities of the RDBMS. It should have better performance, since it does not need to communicate with an external layer: data are managed directly inside the RDBMS.
Figure 2. SQLf implementation architecture
Figure 3. SQLf catalog
Its disadvantages include complexity, architecture development costs, the definition and installation of the stored procedures or functions each time a new database instance is created, and low efficiency for intense calculations with a large number of relations. In addition, portability is null in a mild coupling architecture. Actual implementations of FSQL run on Oracle and PostgreSQL. FSQL has a final interface, FQ, for Windows, developed in Visual Basic (Galindo, 1999, 2007), and other visual interfaces in Java, Visual FSQL (Galindo, Oliva, & Carrasco, 2004a; Oliva, 2003), for many other platforms. The components of the FSQL architecture are the following (Figure 4):

• Traditional Database: The data of our relations, with a special format to store the fuzzy attribute values. The fuzzy attributes are classified by the system into different data types, as explained above.

• Fuzzy Meta-Knowledge Base (FMB): It stores information about the fuzzy relational database (FRDB) in a relational format. It stores the attributes that admit fuzzy treatment and different information on each of them, depending on their fuzzy type.

• FSQL Server: It has been programmed entirely in SQL, in Oracle PL/SQL (Galindo, 1999) and PostgreSQL (Maraboli & Abarzua, 2006), and it includes three kinds of functions. The Translation Function (FSQL2SQL) carries out a lexical, syntactic, and semantic analysis of the FSQL query and translates it into a classic SQL sentence; the resulting SQL sentence includes references to the following kinds of functions. Representation Functions are used to show the fuzzy attributes in a way comprehensible to the user, not in the internally used format. Fuzzy Comparison Functions are used to compare fuzzy values and to calculate the compatibility degrees (CDEG function).

• FSQL Client: A simple and independent program that serves as an interface between the user and the FSQL Server. The user introduces the FSQL query, and the client program communicates with the server and the database in order to obtain the final results. Examples of general FSQL clients are FQ and Visual FSQL. An FSQL client for one concrete application was built by Barranco, Campaña, Medina, and Pons (2004).

Figure 4. FSQL implementation architecture (FRDB)

The FIRST-2 (Fuzzy Interface for Relation SysTems, v. 2) implements the FMB. A summary of these objects is shown in Table 2, giving the set of tables and views of the FMB and the utility of each one. The attributes and a detailed definition can be found in Galindo et al. (2006). Summarizing this analysis of the SQLf and FSQL architectures, we can see that each proposal has its benefits and disadvantages; the main trade-off is scalability and performance vs. portability. Each of the architectures is adequate to the approach of the respective SQL extension. It could be interesting to propose an architecture implementing these languages with tight coupling, that is, incorporating the extensions into the RDBMS inner core or query engine. In such an architecture, it is necessary to extend the parser, rewriter, planner, and executor so that the RDBMS is able to compile, transform, optimize, and execute an extended SQL query. The main advantage of this architecture is that it solves the scalability and performance problems of the other architectures; its disadvantages mainly relate to portability. In any case, SQLf and FSQL are both compatible with the main operating systems. On the other hand, the object-relational database model may be another interesting option, as shown in this book in the chapter by Barranco, Campaña, and Medina.
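As an illustration of the kind of computation performed by the fuzzy comparison functions mentioned above, the compatibility degree of a fuzzy-equal comparison between two trapezoidal possibility distributions can be sketched as the supremum of the minimum of their memberships (a standard possibility measure, approximated here by crude sampling; this is not FSQL's actual implementation, and the example values are invented):

```python
def trap_mu(t):
    """Membership function of a trapezoid t = (a, b, c, d)."""
    a, b, c, d = t
    def mu(x):
        if b <= x <= c:
            return 1.0
        if x <= a or x >= d:
            return 0.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu

def cdeg_feq(t1, t2, steps=2000):
    """Fuzzy-equal compatibility degree: sup over x of min(mu1(x), mu2(x)),
    approximated by sampling the union of both supports."""
    lo = min(t1[0], t2[0])
    hi = max(t1[3], t2[3])
    mu1, mu2 = trap_mu(t1), trap_mu(t2)
    return max(min(mu1(lo + (hi - lo) * k / steps),
                   mu2(lo + (hi - lo) * k / steps)) for k in range(steps + 1))

# Invented trapezoids for the labels "tall" and "about 175 cm".
tall = (170, 180, 200, 200)
about_175 = (170, 174, 176, 180)
degree = cdeg_feq(tall, about_175)   # partial overlap, so 0 < degree < 1
```

The exact supremum for these two trapezoids lies where the rising side of one crosses the falling side of the other (about 0.714 here); the sampled value approaches it as steps grows.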
Table 2. Tables and views of FIRST-2

N. | Table / View | Utility
1 | T. FUZZY_COL_LIST | List of fuzzy columns (or attributes).
2 | T. FUZZY_DEGRE_SIG | Fuzzy degree significances (or meanings).
3 | T. FUZZY_OBJECT_LIST | List of fuzzy objects of columns.
4 | T. FUZZY_LABEL_DEF | Trapezoidal definitions (for labels).
5 | T. FUZZY_APPROX_MUCH | Margin and M values (Types 1 and 2).
6 | T. FUZZY_NEARNESS_DEF | Similarity relations (Type 3).
7 | T. FUZZY_COMPATIBLE_COL | Compatible fuzzy attributes.
8 | T. FUZZY_QUALIFIERS_DEF | Qualifiers definition.
9 | T. FUZZY_DEGRE_COLS | Columns with an associated fuzzy degree.
10 | T. FUZZY_DEGRE_TABLE | Information about fuzzy degrees of tables.
11 | T. FUZZY_TABLE_QUANTIFIERS | Fuzzy quantifiers associated to tables.
12 | T. FUZZY_SYSTEM_QUANTIFIERS | Fuzzy quantifiers associated to the system.
13 | V. LABLES_FOR_OBJCOL | Trapezoidal labels for each attribute.
14 | V. LABELS_OBJ_T3 | Labels for types 3 and 4.
15 | V. ALL_COMPATIBLES_T34 | Compatible attributes of types 3 and 4.
Satisfaction Degree

Classical database querying systems suffer from a rigidity problem: users must specify their requirements by means of Boolean conditions in a query language such as SQL. In consequence, two problems arise. The first is known as "nondiscriminated answers": the result set contains the rows satisfying the query condition, but there is no discrimination between them, so users may be helpless in choosing the preferred answers. The second is known as "lost answers": crisp conditions usually leave out elements on the frontier. It is also possible to retrieve so many elements that the answer becomes useless to the user. SQLf (Bosc & Pivert, 1995a) has been conceived for solving these rigidity problems by using fuzzy sets in querying to express user preferences. Thus, the solution of a SQLf query is always a fuzzy result set or, in other terms, a fuzzy relation. Answer rows are always provided with membership degrees in the real interval (0, 1]. Rows in the answer are given in decreasing order of satisfaction degree; this order cannot be changed through an ORDER BY clause. These degrees are obtained as the satisfaction degree of each row with respect to the SQLf fuzzy query. The user does not specify how to calculate satisfaction degrees; that is specified by the SQLf semantics. Users need not explicitly ask for the satisfaction degree to be computed; it is done automatically by SQLf. Neither may users inhibit the degree calculation; it is part of the intrinsic fuzzy query semantics. Note that rows not satisfying the fuzzy query are completely excluded from the answer: they do not appear in the result set, since their satisfaction degrees are equal to 0. Despite this restriction, a fuzzy query might return a large number of rows, which could be undesirable for the user. Therefore, SQLf provides an answer calibration mechanism. It consists of a final optional clause in the SQLf query, the WITH CALIBRATION clause (Goncalves & Tineo, 2001a, 2001b). This clause may specify two kinds of answer calibration:

• Qualitative calibration consists of a threshold that specifies the minimum satisfaction degree a row must have in order to be in the result set. The threshold is, of course, a real number in the interval (0, 1]. For a given query, only one threshold may be specified; it applies to the result of the query and is not individually specified for each involved condition or predicate. For SQLf, there are evaluation mechanisms based on the distribution of this threshold over the components of the query structure. These mechanisms are based on the derivation principle (Tineo, 1998, 2006; Bosc & Pivert, 1995b).

• Quantitative calibration defines the number of desired answers. Let this number be denoted by k; then the result set contains the top k, in other words, the best k rows according to the satisfaction degree. As the user may imagine, the quantitative calibration is specified with a natural number.

In the original definition of SQLf by Bosc and Pivert (1995a), the calibration is specified in the SELECT clause. In a later work, Goncalves and Tineo (2001a, 2001b) extended the definition, adding the WITH CALIBRATION clause for the sake of language orthogonality. This clause may contain a real number in the interval (0, 1] (qualitative calibration), a natural number (quantitative calibration), or both. In the latter case, the answer is calibrated in both senses, qualitative and quantitative. FSQL works a little differently: the satisfaction degree of a condition is not explicitly given in the answer unless the user demands it. The following function exists for demanding the satisfaction degree: Function CDEG (
to the * of SQL, but this one also includes the columns for the satisfaction degrees of the attributes for which they are relevant. In the result, you will also find the function CDEG applied to each and every one of the fuzzy attributes appearing in the condition. FSQL allows the user to specify fulfillment thresholds (THOLD clause) at different levels of the query condition: for each simple fuzzy condition, a fulfillment threshold may be established (the default is 1) with the format
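The qualitative (threshold) and quantitative (top-k) calibrations described earlier reduce to filtering and truncating a degree-ordered answer; a minimal sketch:

```python
def calibrate(graded_rows, threshold=None, k=None):
    """graded_rows: list of (degree, row) pairs with degree in (0, 1].
    threshold: qualitative calibration, a real number in (0, 1];
    k: quantitative calibration, keep only the best k rows;
    both may be combined, as in the WITH CALIBRATION clause."""
    result = sorted(graded_rows, key=lambda pair: pair[0], reverse=True)
    if threshold is not None:
        result = [(d, r) for d, r in result if d >= threshold]
    if k is not None:
        result = result[:k]
    return result

rows = [(0.9, "r1"), (0.4, "r2"), (0.75, "r3"), (0.2, "r4")]
qualitative = calibrate(rows, threshold=0.5)      # rows with degree >= 0.5
quantitative = calibrate(rows, k=3)               # the best three rows
combined = calibrate(rows, threshold=0.5, k=1)    # both calibrations at once
```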
Evaluation Mechanism

For SQLf, four evaluation mechanisms have been proposed for fuzzy queries:

• Naive Program Strategy: has a high processing cost for query evaluation because it scans the whole relations involved in the query, without taking advantage of the query conditions.

• Sugeno Program Strategy: improves fuzzy quantified query processing by means of two failure conditions. This strategy has a better processing cost for query evaluation than the naive strategy because it uses heuristics to avoid a complete scan of the database.

• Program Derivation Strategy: based on an external program, like that of the naive strategy, but using derived queries to retrieve the relevant rows. This mechanism may be applied to any SQLf querying structure. It performs better than the Naive and Sugeno Program strategies because it further restricts the number of rows to be accessed.

• Query Derivation Strategy: takes advantage of the relationship between fuzzy conditions and regular ones, leaving the whole evaluation in the hands of the DBMS. This strategy is of mild coupling: it exploits the existing evaluation paths and optimization technology. Previous studies have shown experimentally the benefits of this strategy with respect to the others. Nevertheless, the current SQLf prototype does not use this strategy.
These two latter evaluation mechanisms are based on the derivation principle, proposed by Bosc and Pivert (1995a, 1995b) in order to define evaluation mechanisms that keep low the number of rows accessed in fuzzy querying. The main idea of these strategies is to take advantage of the existing relationships between fuzzy conditions and Boolean ones. Such relationships come from the concepts of support and α-cut of fuzzy sets. Later, Tineo (2006) studied the application of this principle to all SQLf querying structures, in both theoretical and practical ways. The principle states that, given a fuzzy query in SQLf, it is possible to derive a regular query obtaining the support of the fuzzy query result, or a close superset of it. In the case of a qualitative calibration with threshold α, the principle applies with the α-cut, which is more restrictive, instead of the support. For FSQL, evaluation is a more complex problem due to the presence of fuzzy data. In this case, efficient evaluation mechanisms such as one based on the SQLf derivation principle have not yet been conceived: the evaluation of FSQL is done with a naïve translation into a traditional SQL query, with the corresponding function calls for dealing with satisfaction degrees. It would be very interesting to study the application of the derivation principle in the case of fuzzy data.
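For a trapezoidal predicate, the derivation principle can be made concrete: the α-cut of a trapezoid (a, b, c, d) is the crisp interval [a + α(b − a), d − α(d − c)], which yields an ordinary Boolean BETWEEN condition that the RDBMS can evaluate. The code below is illustrative, not either prototype's implementation, and the trapezoid for "about 50" is one possible encoding of the graded example discussed earlier:

```python
def alpha_cut(trapezoid, alpha):
    """Crisp interval of values whose membership degree is >= alpha (0 < alpha <= 1)."""
    a, b, c, d = trapezoid
    return (a + alpha * (b - a), d - alpha * (d - c))

def derive_boolean_condition(column, trapezoid, alpha):
    """Derived regular condition for 'column IS trapezoid WITH CALIBRATION alpha'."""
    lo, hi = alpha_cut(trapezoid, alpha)
    return f"{column} BETWEEN {lo} AND {hi}"

# "about 50 copies": fully satisfactory on [49, 51], unacceptable outside [44, 56].
about_50 = (44, 49, 51, 56)
condition = derive_boolean_condition("quantity", about_50, 0.75)
```

With α = 1 the derived interval collapses to the core [49, 51]; with smaller α it widens toward the support, exactly the "support or close superset" behavior the principle describes.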
Performance Analysis

In order to present a comparative analysis of the common fuzzy logic elements included in both SQL extensions, an experimental study was performed. We carried out an initial performance analysis using the formal statistical model method. The idea of this method is to explain the influence of several considered factors on the values observed in the experiments. The importance of a factor is measured by the proportion of the total variation in the response that the factor explains. The FSQL and SQLf prototypes were studied in order to present a performance analysis of the logic components included in the implementations. Therefore, the queries include fuzzy predicates, modifiers, comparators, and connectors. Quantifiers could not be included because the prototypes did not implement a common subset of them. The results are those collected by the Oracle traces. These results do not take into account the time required to translate the fuzzy query into a relational one; they only measure the number of block accesses, the CPU time, and the elapsed time required to answer the relational query. The results were grouped according to their experimental types.
Linear Statistical Model

We have chosen a full factorial design for our experimental study. That is, we consider all the mentioned factors and all their levels. This kind of design allows studying the influence of each factor and of all factor interactions. We must take a value of the response variable for each possible combination of the factors at all their levels. The model is an expression of the observed values of the response variable as a combination of the influences of the experimental factor levels. A factor is a represented variable together with the different instances the variable can take. Each factor contributes to the observed response variable; these contributions are known as effects, and their measurement unit is the same as that of the response variable. The model expresses an observed value as the sum of the average of the observed values and the corresponding effects of the factor levels and of the combinations of factor levels. A general statistical linear model has the form:

y_i = C_0 + C_1 x_i1 + C_2 x_i2 + ... + C_n x_in + C_{n+1} x_i1 x_i2 + C_{n+2} x_i1 x_i3 + ... + C_m x_i(n-1) x_in + E_i    (2)

where:
y_i is the i-th observation of the response variable;
C_j is a constant that measures the j-th effect (the variable's relevance);
x_ij is the i-th observation of the j-th factor;
x_ik x_ij is the interaction between the k-th and j-th factors for the i-th observation;
E_i is the error of the i-th observation.

This model attempts to explain the influence of all factors and their interactions on the experimental
results. Nevertheless, some factors or interactions might have no significant influence. Therefore, this model can be adjusted during the analysis of the experimental results in order to obtain a model that better explains the experimental behavior. As explained before, replicas are not considered in the model. The stochastic analysis is performed using a statistical software tool.
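Such a linear model, with main effects and pairwise interactions, can be fitted by ordinary least squares. The following sketch uses pure Python and synthetic data (the factor names mirror those of this study, but the observations are made up, not the study's measurements):

```python
import itertools

def fit_linear_model(X, y):
    """Ordinary least squares via the normal equations (X^T X) coef = X^T y.
    X: list of observation rows, each starting with 1.0 for the intercept;
    y: list of observed responses. Solved by Gaussian elimination."""
    n = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(n)]
    for col in range(n):                         # forward elimination with pivoting
        pivot = max(range(col, n), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
        xty[col], xty[pivot] = xty[pivot], xty[col]
        for r in range(col + 1, n):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, n):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    coef = [0.0] * n                             # back substitution
    for r in reversed(range(n)):
        coef[r] = (xty[r] - sum(xtx[r][c] * coef[c] for c in range(r + 1, n))) / xtx[r][r]
    return coef

def design_row(tct, dban, conn):
    """Intercept, main effects, and all pairwise interactions."""
    main = [tct, dban, conn]
    inter = [a * b for a, b in itertools.combinations(main, 2)]
    return [1.0] + main + inter

# Synthetic check: generate observations from known coefficients and recover them.
true_coef = [2.0, 1.0, 0.5, -1.0, 0.1, 0.0, 0.2]
points = [(t, d, c) for t in (1.0, 2.0) for d in (10.0, 20.0) for c in (1.0, 5.0)]
X = [design_row(*p) for p in points]
y = [sum(c * v for c, v in zip(true_coef, row)) for row in X]
estimated = fit_linear_model(X, y)
```

In practice a statistical package would also report R-square, AIC, and residuals, the criteria used in the following sections to judge model adequacy.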
Experimental Design

The data for the experiment were generated using a random generator with a uniform distribution for each attribute of the table PEOPLE, with 2,700 records. This table contains some important characteristics of a group of people, stored through the classical attributes shown in Table 3. The experimental design consists of five experiments, one for each query type; two representative queries were designed for each query type to avoid the bias that could be introduced by the use of distinct fuzzy terms or operators. The five experiments are named:

1. Basic One-Relational Block with Simple Condition
2. Basic One-Relational Block with Conjunctive Condition
3. Basic One-Relational Block with Disjunctive Condition
4. Basic Multirelational Block with Complex Condition
5. Nested Block

Therefore, the experimental model used is full factorial, with the following parameters:

1) Observed variables: total elapsed time (TET), total CPU time (TCT), opened connections (Conn), disk block access number (DBAN)
2) Considered factors: engine (SQLfi, FSQL); threshold (low, 0.25; high, 0.75)
3) Repetitions: 2
4) Total number of experiments: for each query type, 2 queries * 2 engines * 2 thresholds = 8; for the 5 query types, a total of 40

Platform

A computer with a 2.0 GHz dual-core processor, 2 GB of RAM, Windows XP, and the Oracle 9i RDBMS was used to perform the experiments. The FSQL engine is implemented in Oracle 9i PL/SQL, while the SQLf engine is implemented in Java as a layer over Oracle 9i. Final user queries are addressed to the engines via a client application interface: for FSQL, we used the client interface made with Visual Basic, while the SQLf interface was made with Java. In order to take measures, we used the Oracle 9i trace and the tkprof utility. At present, there are several prototypes of SQLf developed by Tineo's team at Universidad Simón Bolívar (Venezuela). We used SQLfi V.4.0 here, and we will simply write SQLfi to refer to the SQLf engine used.

Experimental Results

We present the queries in FSQL syntax (Q1); their translation to SQLf is straightforward (Q2). These fuzzy queries are examples of Single Relation with Simple Fuzzy Condition Queries (see Exhibit 1). Tables 4 and 5 present a summary of the results collected from the experiments. The total elapsed time is plotted in Figure 5. This graphic presents the results for all the experiments, each identified with a label from E1 to E5; in each one, the user threshold is distinguished. It is easy to observe the difference in time between the two evaluated engines. Explanations of these times, with regard to other aspects presented in the previous tables, will be given in the following sections, where we present the linear model that explains the observed behavior of each experiment.
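The 40-run full factorial design described above can be enumerated directly:

```python
from itertools import product

query_types = ["E1", "E2", "E3", "E4", "E5"]
queries     = ["Q1", "Q2"]          # two representative queries per type
engines     = ["SQLfi", "FSQL"]
thresholds  = [0.25, 0.75]

# Every combination of factor levels: 5 * 2 * 2 * 2 = 40 experimental runs.
runs = list(product(query_types, queries, engines, thresholds))
```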
Basic One-Relational Block with Simple Condition

The following model adjusts the data collected from these experiments at 100%, according to the R-square and AIC statistical parameters and the residuals analysis:

TET_SQLfi = 5.68 + 3.62 TCT - 0.063 DBAN - 1.04 Conn - 0.24 Grade - 0.0096 TCT*DBAN + 0.01 DBAN*Conn

TET_FSQL = 37.42 + 3.62 TCT - 0.063 DBAN - 1.04 Conn - 0.24 Grade - 0.0096 TCT*DBAN + 0.01 DBAN*Conn

For experiment E1, Figure 5 shows a similar behavior for all the queries evaluated in the FSQL engine. It is remarkable that the average total elapsed time required by the FSQL engine for the queries with a calibration of 0.75 is less than that required for the queries with a calibration of 0.25. The total elapsed time required for the queries with calibration 0.75 is quite similar in both engines; instead, queries with calibration 0.25 achieve better performance in SQLfi. The number of accessed blocks and the CPU time required by the FSQL engine are always greater than those required by SQLfi, because the architecture of SQLfi has a strong logic layer, implemented in Java, that is responsible for degree calculation and filtering. FSQL, on the other hand, uses Oracle's PL/SQL to perform these operations; therefore, the CPU time consumed by the FSQL engine is greater than that required by SQLfi. Additionally, the metadata dictionary of the FSQL engine must include information for the storage of the fuzzy data types and must provide a general treatment of the fuzzy operations on these data types. Hence, when processing a query, the FSQL engine must consult several relational tables to perform the appropriate treatment of the data; the SQLfi engine avoids these operations because it only processes crisp relational data.
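Because the two fitted models share every slope and differ only in the intercept, the predicted engine effect on total elapsed time is a constant gap, independent of the regressor values; this can be checked directly (the regressor values below, including the Grade factor, are arbitrary illustrations):

```python
def tet(intercept, tct, dban, conn, grade):
    """Fitted total-elapsed-time model of the simple-condition experiment;
    coefficients taken from the model given in the text."""
    return (intercept + 3.62 * tct - 0.063 * dban - 1.04 * conn
            - 0.24 * grade - 0.0096 * tct * dban + 0.01 * dban * conn)

# Same (arbitrary) regressor values, the two engine intercepts from the text.
args = dict(tct=1.0, dban=100.0, conn=5.0, grade=0.5)
gap = tet(37.42, **args) - tet(5.68, **args)
# gap equals 37.42 - 5.68 = 31.74 seconds, whatever the regressors are
```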
Table 3. Data for the experiment

Attribute | Domain | Semantic
LANGUAGE | VARCHAR(10) | The language. The possible values are: English, French, Spanish, German, and Italian.
FNAME | VARCHAR(10) | First name.
LANAME | VARCHAR(10) | Last name.
DATE_OF_BIRTH | DATE | Date of birth.
COUNTRY | VARCHAR(10) | Country where the person resides.
EMAIL | VARCHAR(40) | E-mail.
SEX | VARCHAR(1) | Sex. The possible values are: M (male) and F (female).
AGE | NUMERIC(3) | Age.
CIVIL_STATE | VARCHAR(10) | Civil state. The possible initial values are: divorced, married, single, and widowed.
EYES | VARCHAR(20) | Eye color. The possible initial values are: black, blue, brown, and green.
HAIR | VARCHAR(20) | Hair color. The possible initial values are: black, blond, chesnutbrown, and redhaired.
WEIGHT | NUMERIC(10, 3) | Weight in kilograms.
STATURE | NUMERIC(10, 3) | Stature in centimeters.
RACE | VARCHAR(20) | Race. The possible initial values are: afroamerican, corn-coloured, indian, and white.
Exhibit 1.

FSQL Q1:
SELECT *, CDEG(*) FROM PEOPLE WHERE Age FEQ $Ancient THOLD 0.75;

SQLf Q2:
SELECT * FROM PEOPLE WHERE Age = Ancient WITH CALIBRATION 0.75;

FSQL Q1:
SELECT * FROM PEOPLE WHERE Weight FEQ $Weighted THOLD 0.75 AND Stature FEQ $Tall THOLD 0.75;

SQLf Q2:
SELECT * FROM PEOPLE WHERE Weight = Weighted AND Stature = Tall WITH CALIBRATION 0.75;
Table 4. Results obtained with a threshold of 0.25

Experiment | Repetition | Engine | TET (s) | TCT (s) | DBAN | Conn
E1 | 1 | SQLfi | 5.84 | 0.8 | 185 | 9
E1 | 1 | FSQL | 3.79 | 1.88 | 569 | 1
E1 | 2 | SQLfi | 0.92 | 0.48 | 71 | 9
E1 | 2 | FSQL | 3.96 | 1.94 | 565 | 1
E2 | 1 | SQLfi | 4.99 | 0.2 | 64 | 11
E2 | 1 | FSQL | 3.38 | 1.77 | 566 | 1
E2 | 2 | SQLfi | 4.79 | 0.18 | 103 | 11
E2 | 2 | FSQL | 3.71 | 2.03 | 563 | 1
E3 | 1 | SQLfi | 0.52 | 0.29 | 79 | 9
E3 | 1 | FSQL | 4.23 | 1.91 | 563 | 1
E3 | 2 | SQLfi | 0.28 | 0.19 | 23 | 7
E3 | 2 | FSQL | 3.91 | 1.99 | 565 | 1
E4 | 1 | SQLfi | 1.02 | 0.61 | 132 | 579
E4 | 1 | FSQL | 5.12 | 3.1 | 565 | 1
E4 | 2 | SQLfi | 0.78 | 0.48 | 125 | 213
E4 | 2 | FSQL | 5.26 | 3.3 | 565 | 1
E5 | 1 | SQLfi | 0.94 | 0.33 | 105 | 13
E5 | 1 | FSQL | 4.46 | 2.89 | 543 | 1
E5 | 2 | SQLfi | 1.06 | 0.49 | 127 | 19
E5 | 2 | FSQL | 5.07 | 3.13 | 558 | 1
Table 5. Results obtained with a threshold of 0.75

Experiment | Repetition | Engine | TET (s) | TCT (s) | DBAN | Conn
E1 | 1 | SQLfi | 0.98 | 0.38 | 75 | 7
E1 | 1 | FSQL | 4 | 1.89 | 564 | 1
E1 | 2 | SQLfi | 0.38 | 0.22 | 67 | 7
E1 | 2 | FSQL | 3.75 | 1.99 | 565 | 1
E2 | 1 | SQLfi | 0.99 | 0.25 | 81 | 9
E2 | 1 | FSQL | 3.61 | 1.94 | 542 | 1
E2 | 2 | SQLfi | 0.36 | 0.32 | 72 | 11
E2 | 2 | FSQL | 3.79 | 2.11 | 563 | 1
E3 | 1 | SQLfi | 0.07 | 0.05 | 56 | 7
E3 | 1 | FSQL | 3.96 | 1.94 | 565 | 1
E3 | 2 | SQLfi | 0.07 | 0.04 | 56 | 7
E3 | 2 | FSQL | 5.19 | 2.08 | 559 | 1
E4 | 1 | SQLfi | 0.55 | 0.29 | 59 | 137
E4 | 1 | FSQL | 4.98 | 3.3 | 564 | 1
E4 | 2 | SQLfi | 0.91 | 0.52 | 108 | 551
E4 | 2 | FSQL | 5.26 | 3.24 | 563 | 1
E5 | 1 | SQLfi | 1.19 | 0.96 | 132 | 541
E5 | 1 | FSQL | 10.1 | 1.44 | 507 | 1
E5 | 2 | SQLfi | 0.81 | 0.45 | 119 | 213
E5 | 2 | FSQL | 5.21 | 3.05 | 569 | 1
Basic One-Relational Block with Conjunctive Condition

For the experiments with the basic one-relational block with conjunctive condition, the following model fits the data collected from these experiments at 100%, according to the R-square and AIC statistical parameters and the residuals analysis:

TET_SQLfi = 15.5 - 42.57*TCT - 0.16*Conn - 0.018*DBAN + 0.018*Grade + 0.0778*TCT*DBAN + 0.011*DBAN*Conn
TET_FSQL = 86.44 - 42.57*TCT - 0.16*Conn - 0.018*DBAN + 0.018*Grade + 0.0778*TCT*DBAN + 0.011*DBAN*Conn
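Since the two fitted models share every slope coefficient and differ only in the intercept, the engine effect is a constant shift of the predicted total elapsed time. A small sketch evaluating the models as stated in the chapter (the argument names mirror the variables TCT, DBAN, Conn, and Grade; the sample values passed to it are illustrative):

```python
def predicted_tet(engine, tct, dban, conn, grade):
    """Evaluate the fitted regression for the conjunctive-condition block.
    Coefficients are taken from the chapter; only the intercept depends
    on the engine."""
    intercept = {"SQLfi": 15.5, "FSQL": 86.44}[engine]
    return (intercept - 42.57 * tct - 0.16 * conn - 0.018 * dban
            + 0.018 * grade + 0.0778 * tct * dban + 0.011 * dban * conn)

# At identical covariates, the two engines differ by a constant
# 86.44 - 15.5 = 70.94 seconds in this model.
gap = (predicted_tet("FSQL", 1.0, 100, 1, 0.75)
       - predicted_tet("SQLfi", 1.0, 100, 1, 0.75))
```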
Similarly, the behavior of both engines is quite similar in terms of total elapsed time for the queries with a calibration of 0.75, but in this case FSQL shows better performance than SQLfi. This could be explained by the SQLfi implementation, in which a database connection is opened and closed for each subquery. In this context, a subquery is every relational query that is needed to answer a fuzzy query, including queries to the metadata catalog or to user data. For this reason, even when few data-block accesses and little CPU processing are required, the time needed to establish the database connections must be considered. On the other hand, SQLfi solves queries with a calibration of 0.25 faster than FSQL. The reason is the architecture and the number of
Figure 5. Total elapsed time (in seconds) of the FSQL and SQLfi engines for experiments E1 to E5 with thresholds 0.25 and 0.75
returned rows. Queries with a calibration of 0.25 have a higher cardinality, and therefore the elapsed times of the FSQL engine include the time required to consult the metadata dictionary and to process the statements through PL/SQL procedures. Finally, it is remarkable that, in this and the previous experiments, the CPU time and the elapsed time are similar for all the queries evaluated in FSQL, but that is not the case for SQLfi.
Basic One-Relational Block with Disjunctive Condition

TET_SQLfi = 14.82 - 0.079*TCT - 0.22*DBAN - 2.027*Conn - 0.17*Grade + 0.0014*TCT*DBAN + 0.030*DBAN*Conn
TET_FSQL = 112.70 - 0.079*TCT - 0.22*DBAN - 2.027*Conn - 0.17*Grade + 0.0014*TCT*DBAN + 0.030*DBAN*Conn

This model also fits the data collected from these experiments at 100%, according to the R-square and AIC statistical parameters and the residuals
analysis. In these queries, the processing requirements of the FSQL engine are greater than those of SQLfi because of the overhead of the FSQL data catalog and of the engines' architectures. Additionally, in this experiment, the CPU time and the elapsed time for the SQLfi engine look similar because, although SQLfi generally requires more connections than FSQL, many of those connections were not required here.
Basic Multirelational Block with Complex Condition

TET_SQLfi = -7.05 + 45.98*TCT - 0.067*DBAN - 0.0016*Conn + 1.22*Grade - 0.08*TCT*DBAN - 0.00006*DBAN*Conn
TET_FSQL = 39.97 + 45.98*TCT - 0.067*DBAN - 0.0016*Conn + 1.22*Grade - 0.08*TCT*DBAN - 0.00006*DBAN*Conn

The model fits the data collected from these experiments at 100%, according to the R-square and AIC statistical parameters and the residuals analysis. The behavior of these queries is very
similar to the previous ones, but it is remarkable that the higher the complexity of the conditions, the smaller the distance between the processing times required by the two engines.
Nested Block

TET_SQLfi = -69.63 - 21.9*TCT - 0.13*DBAN - 0.043*Conn + 5.49*Grade + 0.022*TCT*DBAN + 0.00048*DBAN*Conn
TET_FSQL = -39.8 - 21.9*TCT - 0.13*DBAN - 0.043*Conn + 5.49*Grade + 0.022*TCT*DBAN + 0.00048*DBAN*Conn

The model fits the data collected from these experiments at 100%, according to the R-square and AIC statistical parameters and the residuals analysis. These results are interesting because of the remarkable nearness between the total elapsed times required by both engines for the second query with calibration 0.25. These results were achieved even though the FSQL engine performs its heaviest logical operations in Oracle, and they suggest a very good performance for this kind of query. This could be explained by the good evaluation mechanism for non-correlated queries provided by Oracle, since in FSQL this responsibility is delegated to the DBMS. Instead, the SQLfi engine creates a new query from the inner query for each row belonging to the outer query, because the naïve strategy was used in the implementation of nested queries; therefore, it is recommended that the derivation mechanism be used for nested queries in future works.
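The contrast between the naïve strategy and the single-evaluation treatment of a non-correlated subquery can be sketched schematically. The row format and the inner-query callables below are illustrative, not SQLfi's actual interfaces:

```python
def naive_nested(outer_rows, inner_for):
    """Naïve strategy: a fresh inner query is built and run for every
    outer row, so each outer row costs one extra round trip."""
    kept, inner_runs = [], 0
    for row in outer_rows:
        inner_runs += 1
        if row["id"] in inner_for(row):
            kept.append(row)
    return kept, inner_runs

def uncorrelated_nested(outer_rows, inner_once):
    """Non-correlated case: run the inner query once and reuse its result."""
    ids = inner_once()
    return [row for row in outer_rows if row["id"] in ids], 1

outer = [{"id": 1}, {"id": 2}, {"id": 3}]
kept_naive, runs_naive = naive_nested(outer, lambda row: {1, 3})
kept_once, runs_once = uncorrelated_nested(outer, lambda: {1, 3})
```

Both strategies return the same rows, but the naïve one issues as many inner evaluations as there are outer rows, which is the cost the derivation mechanism is expected to avoid.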
Summary

This performance analysis shows that it is sufficient to describe the total elapsed time required by the query evaluations in terms of the CPU time, the number of accessed blocks, the number of connections made to the database, the fuzzy engine, and the calibration. It is remarkable that these variables remain constant across all the statistical models; additionally, the coefficients of the interaction parameters are always positive and, although they are the smallest, their effect is strong because of their domain. The analysis suggests that the engines have different scopes. SQLfi provides better performance for non-complex queries with a calibration of 0.25, and its performance appears to depend on the query features. Instead, FSQL promises good performance for nested queries and suggests a stable performance. It also seems that the more complex a query, the smaller the distance between FSQL and SQLfi. Additionally, the wide range of fuzzy data that can be stored and processed in the FSQL engine generates an overhead in the treatment of fuzzy queries over crisp data, because more metadata dictionary tables must be consulted during query processing. On the other hand, the multiple connections created by the SQLfi engine, and their interaction with the number of accessed blocks, also influence the total elapsed time required by the engine, because many subqueries are generated and more resources are needed to answer one logical fuzzy query.
Conclusion and Future Trends

In this chapter, we have studied the two main proposals for fuzzy-set-based extensions to SQL: FSQL and SQLf. The comparison has been made from several points of view. We noticed that both proposals are quite solid. The differences found derive from the different approaches of the two languages. The approach of SQLf is to provide users with greater flexibility in querying precise data stored in relational databases. On the other hand, the FSQL approach is to manipulate imprecise information in a relational database. In this sense, SQLf gives more freedom to the user, whereas FSQL gives more strength to the coherence in the design and use of fuzzy data elements. Both languages allow the use of a large variety of fuzzy elements: predicates, modifiers, comparators, connectors, quantifiers, and operators. Most SQLf fuzzy logic elements are user defined, while most FSQL ones are built into the system. It would be useful to have a language with the large variety of built-in elements of FSQL but also with the great flexibility of SQLf. Although FSQL is devoted to the manipulation of fuzzy data, query answers are relational tables, and satisfaction degrees may be computed on user demand as projected attributes. On the other hand, in SQLf any query returns a fuzzy relation over well-known data. In SQLf, the satisfaction degree is always computed and is returned as an implicit quality of the answer, not as a projected attribute. It would be interesting to mix both proposals in order to allow fuzzy databases with both imprecise data and gradual membership. Nevertheless, it would also be desirable to allow the user to specify that, in some queries, the satisfaction degree of the answers is not relevant, but just the selected rows in a relational way. Both FSQL and SQLf allow users to specify desired thresholds for the satisfaction degree. SQLf does that in a global way, whereas FSQL allows this specification in each fuzzy condition. The latter seems to give more flexibility; nevertheless, it might not be clear what the combination of expressions with different thresholds means. The problem of query processing evaluation mechanisms has been deeply studied for SQLf. The main contribution in this direction has been the definition and application of the derivation principle. This principle takes advantage of existing connections between fuzzy sets and classic sets in order to keep the number of accessed rows low in query evaluation. Unfortunately, at the present time there are no studies of efficient evaluation mechanisms for FSQL. In future works, it will be convenient to explore the application of the derivation principle in the presence of fuzzy data.
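As an illustration of the derivation principle mentioned above, a fuzzy condition with a threshold can be rewritten into a crisp range condition that the DBMS evaluates natively, so that only candidate rows are fetched. The sketch below assumes a trapezoidal predicate; the predicate's parameters and the table and column names are hypothetical:

```python
def alpha_cut_interval(a, b, c, d, lam):
    """Crisp interval on which a trapezoidal predicate (a, b, c, d)
    reaches membership degree >= lam (with 0 < lam <= 1).  This is the
    kind of fuzzy-to-crisp rewriting the derivation principle relies on."""
    return a + lam * (b - a), d - lam * (d - c)

# A fuzzy condition "qty is about 50" with threshold 0.75 is derived
# into a crisp BETWEEN, pushed down to the DBMS (illustrative schema).
lo, hi = alpha_cut_interval(40, 48, 52, 60, 0.75)
sql = f"SELECT * FROM orders WHERE qty BETWEEN {lo} AND {hi}"
```

Only rows inside the derived interval can have a degree above the threshold, so the fuzzy degree needs to be computed for these rows alone.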
At the present time, the proposed architecture for the implementation of SQLf is a loosely coupled architecture that has advantages in portability but serious problems with scalability and performance. On the other hand, the proposed architecture for
FSQL is a mildly coupled one that has advantages in scalability but problems in portability and performance. It would be useful to propose tightly coupled architectures for future implementations of fuzzy database languages. The initial experimental performance analysis presented here has shown better performance of the SQLf prototype with respect to FSQL; this is probably due to the fact that SQLf queries are evaluated applying the derivation principle. Nevertheless, the SQLf prototype has shown less scalability than the FSQL implementation, probably because of its loose-coupling strategy. The application of the derivation principle within a mildly coupled architecture would provide better performance and scalability; the problem then would be portability. In future works, we will integrate SQLf and FSQL into a single language that incorporates the elements of both languages, according to the analysis made here. We will also update this integrated language to cover all the features of the emerging standard SQL:200n. A very important problem to study in this fuzzy database language is the application of the derivation principle and its implementation with different coupling architectures.
Acknowledgment

The authors would like to thank Dr. José Galindo (University of Málaga, Spanish projects TIN2006-14285, TIN2006-07262 and TIC-1570) for his useful revision of this article, as well as his comments, which contributed to the final formulation of this work. Supported by internal project number 81201 (2006-2007) of the Catholic University of the Maule, Chile. This work is supported in part by the Venezuelan Governmental Foundation for Science, Innovation and Technology, FONACIT Grant G-2005000278. The main purpose of this work is the glory of God: So whether you eat or drink or whatever you do.
References

ANSI. (1986). American national standard for information systems: Database language SQL (ANSI X3.135-1986). New York: American National Standards Institute.
ANSI. (1989). Database language SQL with integrity enhancement (ANSI X3.135-1989). New York: American National Standards Institute.
ANSI. (1992). Database language SQL (ANSI X3.135-1992). New York: American National Standards Institute.
Barranco, C. D., Campaña, J., Medina, J. M., & Pons, O. (2004). ImmoSoftWeb: A Web based fuzzy application for real estate management. In Proceedings of the Advances in Web Intelligence 2nd International Atlantic Web Intelligence Conference, AWIC 2004 (LNCS 3034, pp. 196-206). Heidelberg: Springer-Verlag.
Blanco, I. (2001). Deducción en bases de datos relacionales difusas. Doctoral thesis, Universidad de Granada, España.
Bosc, P., & Pivert, O. (1991). Fuzzy querying in conventional databases. In L. Zadeh & J. Kacprzyk (Eds.), Fuzzy logic for the management of uncertainty (pp. 645-671). John Wiley.
Bosc, P., & Pivert, O. (1992). Some approaches for relational databases flexible querying. International Journal of Intelligent Systems, 1(3-4), 323-354.
Bosc, P., & Pivert, O. (1994). Imprecise data management and flexible querying in databases. In R. Yager & L. Zadeh (Eds.), Fuzzy sets, neural networks and soft computing (pp. 386-395). Van Nostrand Reinhold.
Bosc, P., & Pivert, O. (1995a). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3(1).
Bosc, P., & Pivert, O. (1995b). On the efficiency of the alpha-cut distribution method to evaluate simple fuzzy relational queries. In B. Bouchon-Meunier, R. R. Yager, & L. A. Zadeh (Eds.), Advances in fuzzy systems: Applications and theory (Vol. 4: Fuzzy logic and soft computing, pp. 251-260). World Scientific.
Buckles, B. P., & Petry, F. E. (1982). A fuzzy representation of data for relational databases. Fuzzy Sets and Systems, 7, 213-226.
Buckles, B. P., & Petry, F. E. (1984). Extending the fuzzy database with fuzzy numbers. Information Sciences, 34, 145-155.
Cox, E. (1995). Relational database queries using fuzzy logic. Artificial Intelligent Expert, pp. 23-29.
Delgado, M., Sánchez, D., & Vila, A. (2000). Fuzzy cardinality based evaluation of quantified sentences. International Journal of Approximate Reasoning, 23, 23-66.
Dubois, D., & Prade, H. (1985). Fuzzy cardinality and the modeling of imprecise quantification. Fuzzy Sets and Systems, 16, 190-230.
Dubois, D., & Prade, H. (1998). Théorie des possibilités: Applications à la représentation des connaissances en informatique (2nd ed.). Masson.
Fagin, R. (1999). Combining fuzzy information from multiple systems. Journal of Computer and System Sciences, 58, 83-99.
Galindo, J. (1999). Tratamiento de la imprecisión en bases de datos relacionales: Extensión del modelo y adaptación de los SGBD actuales. Doctoral thesis, University of Granada, Spain. Retrieved February 7, 2008, from http://www.lcc.uma.es
Galindo, J. (2007). FSQL (fuzzy SQL): A fuzzy query language. Retrieved February 7, 2008, from http://www.lcc.uma.es/~ppgg/FSQL
Galindo, J., Medina, J. M., Pons, O., & Cubero, J. C. (1998a). A server for fuzzy SQL queries. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems (LNAI 1495, pp. 164-174). Springer. Retrieved February 7, 2008, from http://www.springerlink.com/content/ddyttjwpn31hxer4/
Galindo, J., Medina, J. M., Pons, O., Vila, M. A., & Cubero, J. C. (1998b, March). A prototype for a fuzzy relational database. Demo session in the 8th International Conference on Extending Database Technology, Valencia, Spain.
Galindo, J., Medina, J. M., Vila, M. A., & Pons, O. (1998c, October). Fuzzy comparators for flexible queries to databases. In Proceedings of the 6th Iberoamerican Conference on Artificial Intelligence, IBERAMIA'98 (pp. 29-41), Lisbon, Portugal.
Galindo, J., Oliva, R. F., & Carrasco, R. A. (2004a, September). Acceso Web a bases de datos difusas: Un cliente visual de fuzzy SQL. Paper presented at the XIII Congreso Español sobre Tecnologías y Lógica Fuzzy (ESTYLF'2004), Jaén, Spain.
Galindo, J., Urrutia, A., Carrasco, R., & Piattini, M. (2004b, December). Relaxing constraints in enhanced entity-relationship models using fuzzy quantifiers. IEEE Transactions on Fuzzy Systems, 12(6), 780-796.
Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group Publishing.
Goncalves, M., & Tineo, L. (2001a). SQLf flexible querying language extension by means of the norm SQL2. In Proceedings of the 10th IEEE International Conference on Fuzzy Systems, Fuzz-IEEE 2001 (Vol. 1), Melbourne, Australia.
Goncalves, M., & Tineo, L. (2001b). SQLf3: An extension of SQLf with SQL3 features. In Proceedings of the 10th IEEE International Conference on Fuzzy Systems, Fuzz-IEEE 2001 (Vol. 1), Melbourne, Australia.
Goncalves, M., & Tineo, L. (2006a). SQLf vs. Skyline: Expressivity and performance. In Proceedings of the 15th IEEE International Conference on Fuzzy Systems, Fuzz-IEEE 2006 (pp. 2062-2067), Vancouver, Canada.
Goncalves, M., & Tineo, L. (2006b). Towards flexible Skyline queries. In Proceedings of the XXX
Conferencia Latinoamericana de Informática, CLEI 2006, Santiago, Chile.
Jiménez, L., Urrutia, A., Galindo, J., & Zaraté, P. (2005). Implementación de una base de datos relacional difusa: Un caso en la industria del cartón. Revista Colombiana de Computación, 6(2), 48-58. Retrieved February 7, 2008, from http://www.unab.edu.co/editorialunab/revistas/rcc/rev62.htm
Maraboli, R., & Abarzua, J. (2006). FSQL-f: Representación y consulta por medio del lenguaje PL/PGSQL de información imperfecta. Doctoral thesis, Universidad Católica del Maule, Chile.
Medina, J. M. (1994, May). Bases de datos relacionales difusas: Modelo teórico y aspectos de su implementación. Doctoral thesis, University of Granada.
Medina, J., Pons, O., & Vila, M. (1994). GEFRED: A generalized model of fuzzy relational databases. Information Sciences, 77(6), 87-109.
Melton, J. (1993). ISO/ANSI working draft: Database language SQL (SQL3, X3H2-93-091/ISO DBL YOK-003). ISO/ANSI.
Oliva, R. F. (2003). Visual FSQL: Gestión visual de bases de datos difusas en ORACLE a través de Internet usando FSQL. Proyecto Fin de Carrera, directed by J. Galindo, Ingeniería Superior en Informática, University of Málaga. Retrieved February 7, 2008, from http://www.lcc.uma.es/~ppgg/PFC
Pedrycz, W., & Gomide, F. (1998). An introduction to fuzzy sets: Analysis and design (A Bradford Book). The MIT Press.
Prade, H., & Testemale, C. (1987). Fuzzy relational databases: Representational issues and reduction using similarity measures. Journal of the American Society of Information Sciences, 38(2), 118-128.
Timarán, R. (2001). Arquitecturas de integración del proceso de descubrimiento de conocimiento con sistemas de gestión de bases de datos: Un estado del arte. Revista Ingeniería y Competitividad, 3(2). Universidad del Valle, Colombia.
Tineo, L. (1998). Interrogaciones flexibles en bases de datos relacionales. Trabajo de ascenso para optar a la categoría de Profesor Agregado, Universidad Simón Bolívar, Venezuela.
Tineo, L. (2000). Extending RDBMS for allowing fuzzy quantified queries. In M. Revell (Ed.), Lecture Notes in Computer Science (Vol. 1873, pp. 407-416). Berlin: Springer-Verlag.
Tineo, L. (2005). Una contribución a la interrogación flexible de bases de datos: Evaluación de consultas cuantificadas difusas. Doctoral thesis, Universidad Simón Bolívar, Sartenejas, Venezuela. Retrieved February 7, 2008, from http://xica.bd.cesma.usb.ve/sqlfiv4
Tineo, L. (2006). Una contribución a la interrogación flexible de bases de datos relacionales: Evaluación de consultas cuantificadas. Doctoral thesis, Universidad Simón Bolívar, Caracas, Venezuela.
Umano, M., Fukami, S., Mizumoto, M., & Tanaka, K. (1980). Retrieval processing from fuzzy databases (Tech. Rep. of IECE of Japan, Vol. 80, No. 204 on automata and languages, pp. 45-54, AL80-50). IECE.
Urrutia, A. (2003). Definición de un modelo conceptual para bases de datos difusas. Doctoral thesis, University of Castilla-La Mancha, Spain.
Urrutia, A., Galindo, J., Jiménez, L., & Piattini, M. (2006). Data modeling: Dealing with uncertainty in fuzzy logic (IFIP, Vol. 214, pp. 201-217). Springer Science and Business Media.
Urrutia, A., Galindo, J., & Piattini, M. (2002). Modeling data using fuzzy attributes. In Proceedings of the XXII International Conference of the Chilean Computer Science Society (pp. 117-123). IEEE Computer Society Press.
Urrutia, A., Galindo, J., & Piattini, M. (2003). Propuesta de un modelo conceptual difuso. Libro de Ingeniería de Software (pp. 51-76). Ediciones Cyted Ritos2.
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353.
Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1, 3-28.
Zemankova-Leech, M., & Kandel, A. (1984). Fuzzy relational databases: A key to expert systems. Köln: Verlag TÜV Rheinland.
Key Terms

CDEG (Compatibility Degree): FSQL predefined function intended for the computation of the satisfaction degrees of fuzzy conditions. In queries involving fuzzy data items, this function may be used in order to project the satisfaction degree as part of the resulting table.

DDL (Data Definition Language): The statements of this language enable the creation and modification of the structures in which the data will be stored. Examples of DDL statements are CREATE (to create objects of the database: tables, views, etc.), DROP (to remove objects), ALTER (to modify objects), and statements for security controls, indexes, and the control of the physical storage of the data.

Derivation Principle: SQLf's theoretical and practical basis for evaluation strategies that keep the extra cost of fuzzy query processing low, taking advantage of existing relations between fuzzy conditions and crisp ones.

DML (Data Manipulation Language): The DML statements (or sentences) enable the querying and the modification of the data stored in the database. Examples of this kind of statement are SELECT, INSERT, DELETE, and UPDATE.

FSQL: Extension of SQL with fuzzy-set-based features that allows the storage of fuzzy data values and their use in any place where SQL allows using crisp data values, providing imprecise and uncertain data manipulation in classic and fuzzy relational databases.
Fuzzy Condition: Condition using fuzzy values and/or fuzzy comparators, whose truth is fuzzy. The fuzzy values may be fuzzy attributes or fuzzy constants, like linguistic labels or "approximately 8" (expressed by #8 in FSQL). Fuzzy comparators express fuzzy relations between two values, for example, "approximately equal," "fuzzy greater than," "much greater than," and so on. The expression of preferences may be considered another kind of fuzzy condition. The fulfillment of a fuzzy condition is usually a value in the interval [0,1], giving a fuzzy fulfillment degree for each fuzzy condition.

Fuzzy Degrees: Fuzzy attributes whose domain is usually the interval [0,1], although other values are also permitted, such as possibility distributions (usually over this unit interval), which, in turn, may be related to specific linguistic labels (like "a lot," "normal," etc.). In order to keep it simple, usually only degrees in the interval [0,1] are used, because the other option offers no great advantage and a greater technical and semantic complexity.

SQL (Structured Query Language): Relational database querying language, as essentially developed by Chamberlin et al. (1974, 1976). In 1986, the American National Standards Institute (ANSI) and the International Standards Organization (ISO) published the standard SQL-86 or SQL1 (ANSI, 1986). In 1989, an extension of the SQL standard, called SQL-89, was published, and SQL2 or SQL-92 was published in 1992 (ANSI, 1992).

SQLf: Extension of SQL with fuzzy-set-based features that allows using a fuzzy condition in any place where SQL allows a Boolean condition, providing fuzzy querying capabilities on classic relational databases and giving fuzzy result sets as query answers.
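As a small illustration of the fuzzy constants mentioned in the Fuzzy Condition entry, one possible (hypothetical) reading of "#8" is a triangular membership function centred on 8; the margin chosen below is illustrative, not an FSQL default:

```python
def approx(n, margin=2.0):
    """Hypothetical reading of a fuzzy constant #n ("approximately n")
    as a triangular membership function centred on n."""
    def mu(x):
        return max(0.0, 1.0 - abs(x - n) / margin)
    return mu

mu8 = approx(8)
degree = mu8(7)  # degree to which the stored crisp value 7 fulfils "#8"
```

A stored value of exactly 8 fulfils the condition with degree 1, while values farther than the margin from 8 fulfil it with degree 0, which matches the [0,1] fulfillment degrees described above.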
Chapter XII
Hierarchical Fuzzy Sets to Query Possibilistic Databases Rallou Thomopoulos INRA, France Patrice Buche INRA, France Ollivier Haemmerlé IRIT, France
Abstract

Within the framework of flexible querying of possibilistic databases, based on the fuzzy set theory, this chapter focuses on the case where the vocabulary used both in the querying language and in the data is hierarchically organized, which occurs in systems that use ontologies. We give an overview of previous works concerning two issues: first, flexible querying of imprecise data in the relational model; second, the introduction of fuzziness in hierarchies. Concerning the latter point, we develop an aspect where there is a lack of study in the current literature: fuzzy sets whose definition domains are hierarchies. Hence, we propose the concept of hierarchical fuzzy set and present its properties. We present its application in the MIEL flexible querying system for the querying of two imprecise relational databases, including user interfaces and experimental results.
Introduction

In flexible querying systems, fuzzy sets are used to represent preferences in selection criteria. For instance, in the framework of a database about microbiological risk assessment in foods, the users may ask for milk as a first choice or yogurt as a second choice. In possibilistic databases, an imprecise datum is represented by a possibility distribution. For instance, in some kinds of human diseases,
the bacterium Escherichia coli is suspected to be responsible, but other bacteria like Listeria are not excluded. Behind those two different purposes, the same homogeneous formalism is used: the fuzzy set theory. In both cases, a relation order is defined on a domain of values. In this chapter, we study the case when the domain of values is not “flat” but hierarchically organized, using the “kind of” relation. For instance, food products, like milk or yogurt, are part of a hierarchy of substrates, in
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
which whole milk is a kind of milk. In the same way, the bacteria Escherichia coli and Shigella are part of a hierarchy of micro-organisms. We call a fuzzy set defined on a hierarchy a hierarchical fuzzy set (HFS). Contrary to the classical case when the domain of values is "flat," in this case, the assumption that the values are independent does not hold. Two order relations (the preference/possibility order relation and the "kind of" relation) must be made consistent. Several issues thus have to be addressed:

•	Does the preference/possibility degree associated with a given value in a fuzzy set have implications on the degrees associated with other values of the domain, particularly more specific or more general values?
•	What would be the meaning of two comparable values (in the sense of the "kind of" relation) associated with different preference/possibility degrees?
•	Can the "kind of" relation be used to enlarge the user's query in order to obtain more answers while respecting the preference order defined by the user in the selection criteria?
We have designed and realized two instances (for two different relational databases) of a flexible querying system, called MIEL,1 involving hierarchical fuzzy sets. Both databases contain imprecise data and deal with risk assessment in food, respectively, microbial risk and chemical risk. The need for flexible querying, imprecise data representation, and studying fuzzy sets when the domain of values is hierarchically organized is justified, in both databases, by three characteristics of the data:

•	Although composed of several thousand entries (10 thousand for the microbial database and 50 thousand for the chemical database), data are not abundant enough to answer every query, and therefore there is a need for flexible querying in order to complement exact answers with pertinent (i.e., semantically close) answers.
•	Data include imprecise values. For instance, the level of contamination of a given food by a given contaminant is not precisely known, but is included in a given interval or is inferior to a given threshold.
•	Symbolic data are often organized in taxonomies: for example, taxonomies of food products (Ireland & Moller, 2000), of bacteria (Ballows, Truper, Dworkin, Harder, & Schleifer, 1992), and so forth.
The MIEL fuzzy querying system has been especially designed for end users who are not specialists in computer science. They express their queries through a set of prewritten queries we call views. These views can be complemented by the users through the simple graphical user interface of the MIEL system. That interface allows the users to specify their projection attributes and their selection criteria. The taxonomies of the symbolic data can also be browsed by the end users in order to express their selection criteria as hierarchical fuzzy sets. In this chapter, we first provide some background on the topic and recall some broad definitions useful for understanding the main focus of the chapter. Second, we define and explain the concept of hierarchical fuzzy set and compare it to the bibliography. Third, we present the MIEL flexible querying system, which uses the concept of hierarchical fuzzy set. Fourth, we present the instantiations of the MIEL system for the querying of two imprecise databases in the field of risk assessment in food and give some experimental results. Fifth, current projects and future trends are presented, and then we conclude this chapter.
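A minimal sketch of the query-enlargement idea behind hierarchical fuzzy sets: a preference degree attached to a term of the taxonomy also applies to its specializations through the "kind of" relation. The hierarchy and degrees below are illustrative, not the chapter's actual data or algorithm.

```python
# Illustrative "kind of" hierarchy: each term points to its direct generalization.
kind_of = {"whole milk": "milk", "skim milk": "milk",
           "milk": "dairy", "yogurt": "dairy"}

def is_kind_of(term, ancestor):
    """True if term equals ancestor or specializes it via 'kind of' links."""
    while term is not None:
        if term == ancestor:
            return True
        term = kind_of.get(term)
    return False

def enlarge(preferences, domain):
    """Give each term of the domain the best degree inherited from a
    preferred ancestor (0.0 when no preferred ancestor exists)."""
    return {t: max((deg for p, deg in preferences.items() if is_kind_of(t, p)),
                   default=0.0)
            for t in domain}

prefs = {"milk": 1.0, "yogurt": 0.5}  # milk as first choice, else yogurt
degrees = enlarge(prefs, ["whole milk", "skim milk", "yogurt", "dairy"])
```

Under this propagation, whole milk and skim milk inherit the full preference given to milk, while the more general term dairy, which specializes nothing preferred, gets degree 0.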
Background

We are concerned in this chapter with the combination of two topics: first, flexible querying of imprecise data, which includes flexible querying techniques, the representation of imprecise data, and the combination of both previous topics in the
framework of the relational model; second, the introduction of fuzziness in hierarchies. We will finish this section by recalling the basics of fuzzy sets required to understand this chapter.
Flexible Querying of Imprecise Data

Flexible Querying Techniques

The classical implicit assumption in database management systems is the closed world assumption: a fact which is not present in the database is assumed to be false. For example, let us consider the following facts stored in a given database: "Whole milk is contaminated by Listeria and Salmonella" and "Skim milk is contaminated by Escherichia coli." The query "Which are the contaminants not present in whole milk?" will retrieve "Escherichia coli." This assumption is problematic because, in real applications, it is often impossible to gather all the available information on a given subject. For example, in the field of risk assessment in food, it is difficult to gather information, in particular because of confidentiality problems. Consequently, it is important to be able to "relax" the closed world assumption in order to consider that the lack of an answer does not mean that the answer is negative, but rather that it is unknown: this corresponds to the open world assumption (OWA). It comes down to considering that a database can be incomplete and that some queries may have an empty answer as a result. To avoid this drawback, one may propose to the user:

•	Querying tools which retrieve information that is semantically close from the database;
•	Models, parameterized with semantically close information found in the database, to estimate lacking information.
The first proposal, which is the one we consider in this chapter, has been studied in two different ways: by the expression of preferences in the selection criteria of a query and by the generalization
of the selection criteria. Those mechanisms permit one to complement an exact answer, potentially empty, with semantically close answers which have been judged pertinent. In the first family of approaches, the querying system does not check if information stored in the database verifies a selection criterion, but to what extent it satisfies the selection criterion. This implies an ordering of the answers. Three kinds of works have been proposed to solve this problem: the use of secondary criteria in Lacroix and Lavency (1987), the definition of similarity distances (Ichikawa & Hirakawa, 1986; Motro, 1988), and the expression of linguistic preferences in Rabitti and Savino (1990). It has been shown (Bosc & Pivert, 1992; Bosc, Lietard, & Pivert, 1994) that all those proposals can be restated in a unique formalism: the expression of preferences by fuzzy sets. This formalism permits the user to distinguish ideal values from acceptable values for a given criterion. A pertinence degree is associated with each answer corresponding to the query: it measures the adequacy degree of the answer to the fuzzy selection criteria of the query. In the second family of approaches, the query is modified to become more general (Motro, 1984). Consequently, the querying system retrieves the exact answers completed by other pertinent answers. In a first category of works, a hierarchy of concepts is used to generalize the query when the answer is empty (Fargues, 1989; Bidault, Froidevaux, & Safar, 2000). In a second category of works, when the selection criterion is expressed by a fuzzy set, several techniques have been proposed to generalize it. Dubois and Prade (1995) propose to use a similarity relation defined on the domain of values.
If the fuzzy set is defined on a numerical domain, Bosc, HadjAli, and Pivert (2004) propose a fuzzy generalization operator using a proximity relation between two values based on the calculus of their quotient. Motivated by applications in the field of food risk, where information is structured according to hierarchical symbolic data, we propose to combine both families of approaches, which are complementary. We will develop this idea in the
Hierarchical Fuzzy Sets to Query Possibilistic Databases
concept of hierarchical fuzzy set presented in the main focus of the chapter.
Representation of Imprecise Data

In the context of database management systems, Codd (1979) has been one of the first to take into account the notion of imprecise datum in the framework of the relational model. He introduced the concept of null value, representing the value of an attribute which is unknown or has no sense in the record where it is stored. Lipski (1979, 1981) has extended Codd's approach, which was binary (complete knowledge or complete ignorance), in order to be able to express partial knowledge. He introduced the notion of plausible values, represented by an exclusive disjunction of possible values. The theory of possibility (Zadeh, 1978) has been used in the framework of the relational model by Prade (1984) and Prade and Testemale (1984) to extend Codd's and Lipski's approaches by introducing an order on the possible values. In our system, we propose a representation of imprecise data in relational databases based on the theory of possibility, close to the representation used in FSQL (Galindo, Medina, Pons, & Cubero, 1998).
Fuzzy Querying in the Framework of the Relational Model

The expression of queries using fuzzy values has already been studied in the framework of the relational database model. Theoretical studies have been proposed to extend the SQL language by introducing fuzzy predicates processed on crisp information (Bosc & Pivert, 1995), and implementations have been proposed, such as the FQUERY97 system (Zadrozny & Kacprzyk, 1998) under the QBE-like Microsoft Access graphical environment and the FSQL system (Galindo et al., 1998) under the Oracle Relational Database Management System (RDBMS). Moreover, as the FSQL system permits the representation of imprecise data, its querying system is able to compare a fuzzy predicate with an imprecise datum. In those previous works, the user
has to build the query flexibility: for instance, in FSQL, the user has to specify in the query whether a fuzzy join or a standard one is used. Those systems are more or less dedicated to computer science specialists, even if the FSQL system, for example, interprets some fuzzy concepts such as "approximate," "interval," or "crisp" in a very understandable way. As we mentioned in the introduction, the aim of our system is to help any user make a fuzzy query against a database schema. This is the reason why we have decided to develop our own fuzzy querying system, MIEL, which will be presented in the main focus of this chapter.
Introducing Fuzziness in Hierarchies

Introducing fuzziness in a hierarchy can be seen in different ways. In our case, the issue is to be able to define an order relation (represented by degrees that express preferences or possibility) on a hierarchically organized set of elements, on which a partial order is thus already defined by the "kind of" relation. The aim is thus to properly define, and reason with, a fuzzy set whose definition domain is a hierarchy. This issue is not trivial, since the degree associated with an element must be coherent with those associated with its sub-elements or super-elements, and the literature is currently lacking on this subject. In the bibliography concerning fuzzy methods, we have identified three main categories of papers which present some similarities with our work; two are quite distant from our concern, and the third one is closer and confirms some of the ideas we propose in this chapter. We can distinguish, especially in recent research:

•	the use of linguistic labels in ontologies. In studies about possibilistic ontologies (Loiseau, Boughanem, & Prade, 2005), each term of an ontology is considered a linguistic label and has an associated fuzzy description. Fuzzy pattern matching between different ontologies is then computed using these fuzzy descriptions. This approach is related to those concerning the introduction of fuzzy attribute values in the object model (Rossazza, Dubois, & Prade, 1998);
•	the use of fuzzy relations between the terms of a thesaurus. Studies about fuzzy thesauri have discussed different natures of relations between concepts, where relations are gradual and moderated by degrees. Fuzzy thesauri have been considered, for instance, in Miyamoto and Nakayama (1986) and De Cock and Nikravesh (2004). In this approach, a query composed of a set of terms is enlarged to similar terms thanks to fuzzy pseudothesauri. Similarity is based on the co-occurrence frequency of terms in a given set of documents;
•	the use of a fuzzy conceptual structure for document indexing and user query expression in the framework of information retrieval (Baziz, Boughanem, Prade, & Pasi, 2006; Boughanem, Pasi, & Prade, 2004). The conceptual structure is hierarchical and encodes the knowledge of the topical domain of the considered documents. In this approach, the evaluation of conjunctive queries is based on the comparison of the minimal subtrees containing the two sets of nodes corresponding to the concepts expressed in the document and the query, respectively.
However, in our context, the terms of the hierarchy and the relations between terms are not fuzzy, contrary to the first two categories of papers. Therefore, we could not draw inspiration from those works to solve the questions we mentioned at the beginning of the introduction. We found more analogies with the third category of papers, where, as in our approach, the terms of the hierarchy and the relations between terms are not fuzzy. Even if the interpretation of the weights is different, and therefore leads to a different evaluation procedure, these authors show that the completion of the fuzzy sets representing the query and the document description, using the "kind of" relation of the conceptual structure, leads to better results. This idea is close to the notion
of generalization of HFS we will present in this chapter.
Basics of Fuzzy Sets

We briefly present fuzzy sets, which will be used in the following to represent the required values in a flexible query or the possible values in an imprecise datum. We also introduce comparisons between fuzzy sets that will be used to compare an imprecise datum to a flexible query. Fuzzy sets (Zadeh, 1965) were introduced to represent concepts that are not strictly delimited, like "young" or "far." Unlike the case of a classic set, an element may belong partially to a fuzzy set.

Definition 1: A fuzzy set A on a domain X is defined by a membership function μA from X to [0, 1] that associates with each element x of X the degree to which x belongs to A. The domain X may be continuous or discrete.

Figure 1 presents two examples: the fuzzy sets ProductPreferences and ResponsibleBacterium. They are also denoted, respectively, 1/Milk + 0.5/Yoghourt and 1/Escherichia coli + 0.7/Shigella, a notation which indicates the degree associated with each element. These fuzzy sets are user-defined, during the choice of the querying selection criteria (ProductPreferences) or during the entry of an imprecise datum (ResponsibleBacterium). In the following, we focus on two different comparisons between fuzzy sets: the inclusion relation, which we use to determine in a binary way whether an imprecise datum is an answer to a flexible query or not, and fuzzy pattern matching, which allows one to determine in a gradual way whether an imprecise datum somehow answers a flexible query. In the most commonly used inclusion relation between fuzzy sets, a fuzzy set A (in our case, an imprecise datum) is included in B (in our case, a flexible query) if its membership function is "below" the membership function of B, that is, if each element that somehow belongs to A belongs at least as much to B. More formally:
Figure 1. The fuzzy sets ProductPreferences and ResponsibleBacterium
Definition 2: Let A and B be two fuzzy sets defined on a domain X. A is included in B (denoted A ⊆ B) if and only if their membership functions μA and μB satisfy the condition: ∀x ∈ X, μA(x) ≤ μB(x).
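As an illustration (our own sketch, not code from the chapter), discrete fuzzy sets can be represented as dictionaries mapping each element to its membership degree, with absent elements implicitly at degree 0; the inclusion test of Definition 2 then becomes a pointwise comparison:

```python
# Our own sketch (not the chapter's implementation): discrete fuzzy sets as
# dicts mapping elements to membership degrees; missing elements have degree 0.

def included(a, b):
    """Definition 2 (Zadeh inclusion): A is included in B iff
    mu_A(x) <= mu_B(x) for every x of the domain."""
    return all(a.get(x, 0.0) <= b.get(x, 0.0) for x in set(a) | set(b))

# The two fuzzy sets of Figure 1:
product_preferences = {"Milk": 1.0, "Yoghourt": 0.5}
responsible_bacterium = {"Escherichia coli": 1.0, "Shigella": 0.7}

print(included({"Milk": 0.4}, product_preferences))  # True
print(included(product_preferences, {"Milk": 0.4}))  # False
```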
In fuzzy pattern matching (Dubois & Prade, 1995), two scalar measures are classically used to evaluate the compatibility between an imprecise datum and a flexible query: (1) a possibility degree of matching (Zadeh, 1978) and (2) a necessity degree of matching (Dubois & Prade, 1988).

Definition 3: Let Q and D be two fuzzy sets defined on a domain X, representing respectively a flexible query and an imprecise datum. D is compatible with Q with the possibility degree Π(Q, D) and the necessity degree N(Q, D):

•	the possibility degree of matching between Q and D, denoted Π(Q, D), is an "optimistic" degree of overlapping that measures the maximum compatibility between Q and D, and is defined by Π(Q, D) = sup_{x∈X} min(μQ(x), μD(x));
•	the necessity degree of matching between Q and D, denoted N(Q, D), is a "pessimistic" degree of inclusion that estimates the extent to which it is certain that D is compatible with Q, and is defined by N(Q, D) = inf_{x∈X} max(μQ(x), 1 − μD(x)).
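Definition 3 can be sketched with the same dictionary encoding of fuzzy sets (our own illustration, not the chapter's code):

```python
# A sketch of Definition 3 (not the chapter's code): fuzzy sets as dicts
# from element to degree, with degree 0 for absent elements.

def possibility(q, d):
    """Pi(Q, D) = sup_x min(mu_Q(x), mu_D(x)) -- optimistic overlap."""
    xs = set(q) | set(d)
    return max((min(q.get(x, 0.0), d.get(x, 0.0)) for x in xs), default=0.0)

def necessity(q, d):
    """N(Q, D) = inf_x max(mu_Q(x), 1 - mu_D(x)) -- pessimistic inclusion.
    Elements outside both supports contribute max(0, 1) = 1, so iterating
    over the union of the supports is enough."""
    xs = set(q) | set(d)
    return min((max(q.get(x, 0.0), 1.0 - d.get(x, 0.0)) for x in xs), default=1.0)

query = {"Milk": 1.0, "Yoghourt": 0.5}  # flexible query Q
datum = {"Milk": 0.6, "Cheese": 0.4}    # imprecise datum D
print(round(possibility(query, datum), 6))  # 0.6
print(round(necessity(query, datum), 6))    # 0.6
```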
Main Focus of the Chapter

In the first part of this section, we introduce the concept of hierarchical fuzzy set. Then, in the second part, we present the MIEL flexible querying system, which uses the concept of HFS. In the third part, we present two applications that rely on the MIEL querying system, including examples of Graphical User Interfaces (GUIs) illustrating use cases of the MIEL querying system and experimental results.
Hierarchical Fuzzy Set

In this part, we propose a definition of a hierarchical fuzzy set (HFS) and specify its semantics. Then, we explain why and how we compute the closure of a HFS. Next, we extend to HFS the comparison operations between fuzzy sets recalled in the background section. We show that the notion of closure makes it possible to group hierarchical fuzzy sets into equivalence classes and that each equivalence class has a unique representative, called minimal. Finally, we propose a method of generalization of a HFS based on the minimal HFS.
Definition and Semantics

For a given selection attribute whose domain of values is hierarchized, users always express their preferences on a subset of the domain. Indeed, they only choose the elements they are interested in and implicitly consider that: (1) the elements more specific than those they have chosen must
Figure 2. Example of a hierarchy
be taken into account by the system and (2) the other elements must not be taken into account (the noncomparable elements, for example). In the following, we say that an element elt of the domain is more general than an element elt' (denoted by elt' ≤ elt) if elt' is a predecessor of elt in the partial order induced by the hierarchy. An example of such a hierarchy is given in Figure 2 (for instance, Meat ≤ Substrate). A hierarchical fuzzy set is then defined as follows:

Definition 4: A hierarchical fuzzy set is a fuzzy set whose definition domain is a subset of the elements of a finite hierarchy partially ordered by the "kind of" relation.

Figure 3. Example of a HFS

Example 1: The example of HFS shown in Figure 3 has for definition domain the set of elements {Whole milk, Half-skim milk, Skim milk}, which is a subset of the set of the elements belonging to
the hierarchy presented in Figure 2. This HFS may also be noted 1.0/Whole milk + 0.9/Half-skim milk + 0.8/Skim milk. We can note that no restriction has been imposed concerning the elements that compose the definition domain of a hierarchical fuzzy set. In particular, the user may associate a given degree d with an element elt and another degree d' with an element elt’ more specific than elt. d' ≤ d represents a semantic of restriction for elt’ compared to elt, whereas d' ≥ d represents a semantic of reinforcement for elt’ compared to elt. For example, if there is a particular interest in Skim milk because the user studies the properties of low fat products, but also wants to retrieve complementary information about other kinds of milk, these preferences can be expressed using, for instance, the following fuzzy set: 1/Skim milk + 0.5/Milk. In this example, the element Skim milk has a greater degree than the more general element Milk, which corresponds to a semantic of reinforcement for Skim milk compared to Milk. On the contrary, if the user is interested in all kinds of milk, but to a lesser extent in Condensed milk because of its smaller water content, the preferences can be expressed using the following fuzzy set: 1/Milk + 0.2/Condensed milk. In this case, the element Condensed milk has a smaller degree than the more general element Milk, which corresponds to a semantic of restriction for Condensed milk compared to Milk.
Closure

We can make two remarks concerning the use of hierarchical fuzzy sets:

•	The first one is semantic. Let 1/Skim milk + 0.5/Milk be an expression of preferences in a query. We can note that this hierarchical fuzzy set implicitly gives information about elements of the hierarchy other than Skim milk and Milk. For instance, one can deduce that the user does not expect results concerning products like meat or vegetable, even if the degree 0 has not explicitly been associated with these products. One may also assume that any kind of skim milk (sterilized, pasteurized, raw skim milk, for example) interests the user with the degree 1.
•	The second one is operational. The problem arising from Definition 4 is that two different fuzzy sets on the same hierarchy do not necessarily have the same definition domain, which means they cannot be compared using the classic comparison operations of fuzzy set theory (see Definitions 2 and 3). For example, 1/Skim milk + 0.5/Milk and 1/Milk + 0.2/Condensed milk are defined on two different subsets of the hierarchy of Figure 2 and thus are not comparable.
These remarks led us to introduce the concept of closure of a hierarchical fuzzy set, which is a developed form defined on the whole hierarchy. Intuitively, in the closure of a hierarchical fuzzy set, the "kind of" relation is taken into account by propagating the degree associated with an element to its sub-elements (more specific elements) in the hierarchy. For instance, in a query, if the user is interested in the element Milk, we consider that all kinds of Milk (Whole milk, Skim milk, Pasteurized milk, and so forth) are of interest. On the contrary, we consider that the super-elements (more general elements) of Milk in the hierarchy (Milk product, Substrate, and so forth) are too general to be relevant for the user's query.
Definition 5: Let F be a hierarchical fuzzy set defined on a subset D of the elements of a hierarchy H. Its membership function is denoted μF. The closure of F, denoted clos(F), is a hierarchical fuzzy set defined on the whole set of elements of H, and its membership function μclos(F) is defined as follows. For each element elt of H, let Eelt = {elt1, ..., eltn} be the set of the closest super-elements of elt in D (in the broad sense, i.e., elti ≥ elt):

•	if Eelt is not empty, μclos(F)(elt) = max_{1≤i≤n} μF(elti);
•	otherwise, μclos(F)(elt) = 0.

In other words, the closure of a hierarchical fuzzy set F is built according to the following rules. For each element elt of H:

•	if elt belongs to F, then elt keeps the same degree in the closure of F (case where Eelt = {elt});
•	if elt has a unique smallest super-element elt1 in F, then the degree associated with elt1 is propagated to elt in the closure of F (case where Eelt = {elt1} with elt1 > elt);
•	if elt has several smallest super-elements {elt1, ..., eltn} in F, with different degrees, a choice has to be made concerning the degree that will be associated with elt in the closure. The proposition made in Definition 5 consists in choosing the maximum of the degrees associated with {elt1, ..., eltn}. This choice is discussed in the following;
•	all the other elements of H, that is, those that are more general than, or not comparable with, the elements of F, are considered irrelevant. The degree 0 is associated with them (case where Eelt = ∅).
Example 2: Figure 4 shows an example of closure presented on the hierarchy. The elements of the HFS and their associated membership degrees appear in bold italic.
Figure 4. Closure of the hierarchical fuzzy set 0.8/Milk + 1/Whole milk + 0.3/Condensed milk
In the hierarchical fuzzy set 0.8/Milk + 1/Whole milk + 0.3/Condensed milk of Figure 4, the user has associated the degree 1 with Whole milk but only 0.3 with Condensed milk. The maximum of these two degrees is thus associated with their common sub-element Condensed whole milk in the closure. The case of Sweetened condensed milk is different: the user has associated the degree 0.8 with Milk but has given a restriction on the more specific element Condensed milk (degree 0.3). As Sweetened condensed milk is a kind of Condensed milk, it inherits the degree associated with Condensed milk, that is, 0.3. In the case where an element elt of the hierarchy, which does not appear in the initial hierarchical fuzzy set, has several smallest super-elements that appear in the hierarchical fuzzy set with different degrees, associating the maximum of these degrees with elt in the closure is a choice that may be discussed. We distinguish two cases:

•	if the hierarchical fuzzy set expresses preferences in a query, the choice of the maximum allows us not to exclude any possible answer (the possibility and the necessity degrees of matching can be higher). In real cases, the lack of answers to a query generally makes this choice preferable, because it consists in enlarging the query rather than restricting it. This is actually the case in our project;
•	if the hierarchical fuzzy set represents an ill-known datum, the choice of the maximum allows us to preserve all the possible values of the datum, but it also makes the datum less specific. We chose this solution in order to homogenize the treatment of queries and data. In a way, it also participates in enlarging the query, as a less specific datum may share more common values with the query (the possibility degree of matching can thus be higher, although the necessity degree can decrease).
We have shown in Thomopoulos, Buche, and Haemmerlé (2006) that computing the closure clos(F) of a fuzzy set F defined on a domain dom(F) ⊂ H has a complexity in O(|H|·|dom(F)|²), provided that the comparison of two elements of the hierarchy can be done in constant time. Generally, the definition domain of F is limited to a few elements, so that the actual computing time remains moderate. The closure operation has been implemented in the MIEL querying system. For a given query, MIEL computes the closures of the HFS associated with the selection attributes before submitting the query to the RDBMS.
Comparisons of HFS

The introduction of the concept of closure allows all the fuzzy sets that are defined on a given hierarchy to have the same definition domain (the whole hierarchy) and thus to be compared using the classical comparison operations between fuzzy sets.

Definition 6: Let F1 and F2 be two hierarchical fuzzy sets defined on the same hierarchy. Then:

1.	F1 ⊆ F2 if clos(F1) ⊆ clos(F2);
2.	the possibility degree of matching between F1 and F2, Π(F1, F2), is defined as Π(clos(F1), clos(F2));
3.	the necessity degree of matching between F1 and F2, N(F1, F2), is defined as N(clos(F1), clos(F2)).
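Point 1 of Definition 6 can be checked on the fuzzy sets of Example 3 (a self-contained sketch with our own helper names; the closure computation of Definition 5 is restated so the snippet runs on its own):

```python
# Hypothetical fragment of the Figure 2 hierarchy (child -> direct supers);
# the encoding and helper names are ours, not the chapter's implementation.
PARENTS = {"Substrate": [], "Milk": ["Substrate"],
           "Skim milk": ["Milk"], "Whole milk": ["Milk"],
           "Condensed milk": ["Milk"]}

def supers(elt):
    seen, stack = set(), [elt]
    while stack:
        e = stack.pop()
        if e not in seen:
            seen.add(e)
            stack.extend(PARENTS[e])
    return seen

def closure(f):
    """Definition 5: degree of elt = max over its closest super-elements in dom(F)."""
    out = {}
    for elt in PARENTS:
        cands = [d for d in f if d in supers(elt)]
        closest = [c for c in cands
                   if not any(o != c and c in supers(o) for o in cands)]
        out[elt] = max((f[c] for c in closest), default=0.0)
    return out

def hfs_included(f1, f2):
    """Definition 6(1): F1 is included in F2 iff clos(F1) is included in clos(F2)."""
    c1, c2 = closure(f1), closure(f2)
    return all(c1[x] <= c2[x] for x in PARENTS)

f1 = {"Skim milk": 1.0, "Milk": 0.2}
f2 = {"Milk": 1.0, "Condensed milk": 0.5}
print(hfs_included(f1, f2))  # True, as in Example 3
print(hfs_included(f2, f1))  # False
```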
Example 3: We present in Figure 5 the closures of the hierarchical fuzzy sets 1/Skim milk + 0.2/Milk and 1/Milk + 0.5/Condensed milk. The elements of
the HFS and their associated membership degrees appear in bold italic. Their comparison shows that 1/Skim milk + 0.2/Milk is included in 1/Milk + 0.5/Condensed milk because the membership function of the former associates lower degrees with every element of the hierarchy.
Minimal HFS

In the previous section, we saw that each hierarchical fuzzy set has an associated closure that is defined on the whole hierarchy. We now focus on the fact that two different hierarchical fuzzy sets, defined on the same hierarchy, can have the same closure, as in the following examples. The hierarchical fuzzy sets Substrate1 = 1/Milk and Substrate2 = 1/Milk + 1/Skim milk have the same closure: the degree 1 is associated with Milk and every more specific element, and the degree 0 is associated with all the other elements of the hierarchy.
Figure 5. The closures of the hierarchical fuzzy sets 1/Skim milk + 0.2/Milk (upper part) and 1/Milk + 0.5/Condensed milk (lower part)
Figure 6. Common closure of the hierarchical fuzzy sets Substrate3 and Substrate4
The hierarchical fuzzy sets Substrate3 = 1/Milk + 0.8/Whole milk + 1/Pasteurized milk and Substrate4 = 1/Milk + 0.8/Whole milk + 1/Whole pasteurized milk also have the same closure, represented in Figure 6. Such hierarchical fuzzy sets form equivalence classes with respect to their closures. We can note that Substrate2 contains the same element as Substrate1, with the same degree, and also one more element (Skim milk, with the degree 1). The degree associated with this additional element is the same as in the closure of Substrate1. We say that the element Skim milk is deducible in Substrate2.

Definition 7: Let F be a hierarchical fuzzy set, with dom(F) = {elt1, ..., eltj, ..., eltn}, and F−j the fuzzy set resulting from the restriction of F to the domain dom(F) \ {eltj}. eltj is deducible in F if μclos(F−j)(eltj) = μF(eltj).

As a first intuition, we could say that removing a deducible element from a hierarchical fuzzy set allows one to eliminate redundant information. But an element being deducible in F does not necessarily mean that removing it from F will have no consequence on the closure: removing elt from F
will not impact the degree associated with elt itself in the closure, but it may impact the degrees of the sub-elements of elt in the closure. For instance, the element Pasteurized milk is deducible in Substrate3, according to Definition 7. Removing 1/Pasteurized milk from Substrate3 would not modify the degree of Pasteurized milk itself in the resulting closure, but it would modify the degree of its sub-element Whole pasteurized milk (which would have the degree 0.8 instead of 1). Thus, this remark leads us to the following definition of a minimal hierarchical fuzzy set.

Definition 8: In a given equivalence class (that is, for a given closure C), a hierarchical fuzzy set is said to be minimal if its closure is C and if none of the elements of its domain is deducible (here the term "minimal" does not refer to cardinality).

Example 4: The hierarchical fuzzy sets Substrate1 and Substrate4 are minimal (none of their elements is deducible), contrary to Substrate2 and Substrate3.

We propose an algorithm, given in Exhibit 1, to calculate a minimal hierarchical fuzzy set.

Exhibit 1. Calculation of a minimal fuzzy set mnl having a given closure C

Begin
	mnl ← ∅
	If clos(mnl) = C Then
		stop (case where C is the hierarchical fuzzy set that associates the degree 0 with every element of the hierarchy)
	Else
		let lin be an order such that each element of the hierarchy is examined after its super-elements (that is, a linear extension of the opposite of the order induced by the "kind of" relation)
		Repeat
			elt ← next element according to lin
			If μclos(mnl)(elt) ≠ μC(elt) Then
				mnl ← mnl ∪ {elt}
				μmnl(elt) ← μC(elt)
			Endif
		Until clos(mnl) = C
	Endif
End

We have proven in Thomopoulos et al. (2006) that the stopping condition of this algorithm is always reached and that the HFS obtained with this algorithm is minimal. Computing the minimal fuzzy set mnl of a given closure C defined on a hierarchy H has a complexity in O(|H|·|dom(mnl)|²). Moreover, we have also proven in Thomopoulos et al. (2006) that the minimal HFS is unique for a given closure.
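Exhibit 1 can be sketched in Python on a small hypothetical hierarchy (our own encoding, not the chapter's code; the closure function restates Definition 5 so the snippet is self-contained):

```python
# A sketch of Exhibit 1 on a toy hierarchy (child -> direct super-elements).
PARENTS = {"Substrate": [], "Milk": ["Substrate"],
           "Skim milk": ["Milk"], "Whole milk": ["Milk"]}

def supers(elt):
    seen, stack = set(), [elt]
    while stack:
        e = stack.pop()
        if e not in seen:
            seen.add(e)
            stack.extend(PARENTS[e])
    return seen

def closure(f):
    """Definition 5, restated so this snippet runs on its own."""
    out = {}
    for elt in PARENTS:
        cands = [d for d in f if d in supers(elt)]
        closest = [c for c in cands
                   if not any(o != c and c in supers(o) for o in cands)]
        out[elt] = max((f[c] for c in closest), default=0.0)
    return out

def minimal(c):
    """Exhibit 1: rebuild the unique minimal HFS whose closure is c, scanning
    elements so that every super-element is examined before its sub-elements."""
    order, left = [], set(PARENTS)
    while left:  # a linear extension: parents before children
        e = next(x for x in sorted(left)
                 if all(p not in left for p in PARENTS[x]))
        order.append(e)
        left.discard(e)
    mnl = {}
    for elt in order:
        if closure(mnl) == c:  # stopping condition of Exhibit 1
            break
        if closure(mnl)[elt] != c[elt]:
            mnl[elt] = c[elt]
    return mnl

# 1/Milk + 1/Skim milk contains the deducible element Skim milk;
# its minimal representative is 1/Milk.
print(minimal(closure({"Milk": 1.0, "Skim milk": 1.0})))  # {'Milk': 1.0}
```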
Generalization of a HFS

Using a HFS representing preferences in a query does not guarantee retrieving an adequate number of answers. A complementary solution to retrieve pertinent answers in addition to the exact answers consists in generalizing the HFS. The approaches proposed in the bibliography to generalize fuzzy sets on flat domains, already presented in the background section, are not well adapted to HFS. Some approaches only concern fuzzy sets defined on a numerical domain (Bosc et al., 2004; Bouchon-Meunier & Yao, 1992). Tolerant fuzzy pattern matching (Dubois & Prade, 1995) uses a
similarity relation between elements to enlarge the preferences, but it does not take into account the case of hierarchically organized domains. For instance, elements may be added to the support of the fuzzy set in the enlargement mechanism, while elements more specific than those may remain outside of it, which is a major drawback for hierarchical domains (see Buche, Dervin, Haemmerlé, & Thomopoulos, 2005, for more details). In this section, rather than a unique solution, we propose a methodology to generalize a hierarchical fuzzy set expressing preferences. First, we define an elementary generalization operation of a HFS. Then, we introduce the notion of generalization rule, which permits parameterizing the generalization of a HFS using several criteria. Finally, we propose a generalization operation which applies several elementary generalizations iteratively.

Elementary generalization of a HFS: The elementary generalization of a HFS consists in
creating, given a hierarchical fuzzy set F, a more general hierarchical fuzzy set Fg, with the meaning of the inclusion relation extended to HFS. The proof of this property can be found in Thomopoulos et al. (2006).

Definition 9: The elementary generalization of a HFS F is an operation that creates from F a hierarchical fuzzy set Fg obtained by adding a super-element of an element elt of dom(F), denoted eltg, with a given membership degree dg. The element eltg must satisfy the following condition: eltg may neither be an element of dom(F) nor be more specific than any element of dom(F).

Example 5: Let F be the following hierarchical fuzzy set: F = 1/Condensed whole milk + 0.5/Cheese. For elt = Condensed whole milk, we consider in the hierarchy of Figure 2 the super-element eltg = Milk and dg = 0.2. We obtain: Fg = 1/Condensed whole milk + 0.5/Cheese + 0.2/Milk.

Generalization rule: We consider that the generalization of a HFS F essentially depends on three parameters: (1) which elements of F will be generalized, and in which order; (2) for a given element of F, which super-elements will be considered for the generalization; and (3) how the membership degree associated with such a super-element is determined. A generalization rule determines those three parameters.

Definition 10: A generalization rule Rg is a 3-tuple (ord, gen, calc), where:

•	ord is a total traversal order through the elements of a hierarchical fuzzy set F defined on a hierarchy H;
•	gen is a mapping that associates a set of more general elements in H with each element elt in dom(F);
•	calc is a mapping that associates a degree between 0 and 1 with each pair (elt, eltg) such that elt ∈ dom(F) and eltg ∈ gen(elt).
Example 6:

•	ord may be, for instance, an order through the elements of F by decreasing degrees. This choice allows one to generalize in priority the elements of F that have the highest degrees, that is, the elements for which the user has expressed the highest preference;
•	gen(elt) may be, for instance, the set of smallest super-elements of elt in the hierarchy; this choice permits minimizing the risk of obtaining too general answers;
•	calc(elt, eltg) = (min {μF(x) | x ∈ dom(F), μF(x) > 0}) × μF(elt) × 0.9 is an example of a mapping that permits retrieving in priority the elements specified by the user.
Each element of F does not necessarily have a more general element that may be added to F by the generalization operation: as we saw previously in Definition 9, this more general element must satisfy a condition. Hence we define the notion of generalizable element of F, according to a given generalization rule.

Definition 11: Let F be a hierarchical fuzzy set. An element elt of dom(F) is said to be generalizable in F, according to a generalization rule Rg, if elt has a more general element eltg in gen(elt) that satisfies the condition: eltg may neither be an element of dom(F) nor be more specific than any element of dom(F).

Generalization of a HFS: As we saw in the section titled Closure, the MIEL querying system computes the closures of the HFS belonging to a query before submitting it to the RDBMS. Consequently, two queries using two different HFS which belong to the same equivalence class retrieve the same answer. In order to preserve this property when MIEL performs the generalization of a HFS, this operation is not directly processed on the HFS, but on the minimal HFS, the unique representative of the equivalence class to which the HFS belongs.
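The calc mapping of Example 6 is easy to check numerically (a sketch with an assumed dict representation of F; it reproduces the degrees used in Example 7):

```python
# A numeric check of the calc mapping from Example 6 (the dict encoding of F
# is our own assumption): calc multiplies the smallest positive degree of F
# by the degree of the element being generalized, damped by 0.9.

def calc(f, elt):
    return min(v for v in f.values() if v > 0) * f[elt] * 0.9

f = {"Whole milk": 1.0, "Half skim milk": 0.8, "Yoghourt": 0.2}
print(round(calc(f, "Whole milk"), 3))  # 0.18, the degree given to Milk in Example 7
print(round(calc(f, "Yoghourt"), 3))    # 0.036, the degree given to Milk product
```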
Definition 12: The generalization of a hierarchical fuzzy set F, according to a generalization rule Rg, denoted gen(F), is an operation that provides a hierarchical fuzzy set Fg obtained as follows:

•	we call 0-degree generalization of F, denoted F0, the minimal fuzzy set that is equivalent to F;
•	let Fn be the n-degree generalization of F: if there exists an element elt, first element (with the meaning of the order ord) of dom(F0) ⊆ dom(Fn), generalizable in Fn according to Rg, then Fn+1 is obtained by an elementary generalization of Fn according to Rg, in which elt is that first element of dom(F0) ⊆ dom(Fn) generalizable in Fn and dg = calc(elt, eltg); if not, the generalization of F is the fuzzy set Fg = Fn.
Example 7: Let Rg be the generalization rule proposed in Example 6 and F the following hierarchical fuzzy set: F = 1/Whole milk + 1/Condensed whole milk + 0.8/Half skim milk + 0.2/Yoghourt.
We have proven in Thomopoulos et al. (2006) that the number of iterations of the generalization operation is finite and that the fuzzy set Fg obtained by generalization is more general than F with the meaning of the inclusion relation extended to HFS.
MIEL Querying System

In this section, we present the MIEL flexible querying system, which uses the concept of HFS. The MIEL graphical user interface allows the users to specify a query. Such a query is expressed in a view (selected by the user from a list of available views). The users also specify in the query a set of projection attributes and a set of selection criteria. Then the MIEL user interface sends the MIEL query to the relational subsystem. The relational subsystem adapts the query to the formalism it uses (an SQL query), then asks the RDBMS query processor to execute the query. Finally, the answers to the query are returned to the MIEL interface, which presents them to the users. First, we present the choices we made in the design of the MIEL data model. Then, we successively present the MIEL query language and the MIEL query processing.
• F0, the minimal fuzzy set that is equivalent to F, is the following: F0 = 1/Whole milk + 0.8/Half skim milk + 0.2/Yoghourt;
• the first generalizable element of F0, in the order ord, is Whole milk, as calc(Whole milk, Milk) = min{x∈dom(F), F(x)>0} F(x) × F(Whole milk) × 0.9 = 0.2 × 1.0 × 0.9 = 0.18; the generalization provides F1 = 1/Whole milk + 0.8/Half skim milk + 0.2/Yoghourt + 0.18/Milk;
• the first element of dom(F0) generalizable in F1 is Yoghourt, as calc(Yoghourt, Milk product) = 0.2 × 0.2 × 0.9 = 0.036; the generalization provides F2 = 1/Whole milk + 0.8/Half skim milk + 0.2/Yoghourt + 0.18/Milk + 0.036/Milk product;
• there is no element of dom(F0) generalizable in F2, so Fg = F2.

MIEL Data Model

The MIEL data model is composed of an abstract data model, called the ontology, and several concrete data models which depend on the actual data model chosen to store the data (RDB or other formalisms). In this chapter, we are only concerned with the ontology and with the RDB concrete data model, that is, the RDB schema.

Ontology of the MIEL Data Model: The ontology contains the knowledge of the domain used by the MIEL system. The basic notion of the ontology is the concept of attribute, which must be understood in its classic database meaning. In order to take into account the imprecision of the values stored in the data of the MIEL system, we propose to use, instead of crisp values, imprecise values expressed as possibility distributions represented by fuzzy sets (see Zadeh, 1978). A variation domain and a definition domain are associated with each attribute. The variation domain corresponds to the universe of discourse; the definition domain is the set of fuzzy sets which can be defined on the variation domain: it corresponds to the actual domain in the classical database meaning.

Definition 13: A is the finite set of attributes of the MIEL data model. Each attribute a ∈ A is characterized by its type Type(a), its variation domain domv(a), and its definition domain dom(a). The type Type(a) of an attribute a can be numerical, symbolic, or hierarchized. Depending on its type, the variation domain domv(a) of an attribute a is:

• if Type(a) is numerical, domv(a) is defined as a subset of ℜ, the set of the real values;
• if Type(a) is symbolic, domv(a) is defined as a set of symbolic constants;
• if Type(a) is hierarchized, domv(a) is defined as a set of symbolic constants together with a partial order defined on it.

Figure 7. A part of the variation domain of the attribute Substrate [hierarchy: Substrate covers Milk and Meat; Milk covers Pasteurized milk, Whole milk, Skim milk, and Half skim milk; Pasteurized milk and Whole milk cover Pasteurized whole milk; Meat covers Beef, Poultry, and Pork]
Figure 8. Two examples of imprecise values [pH_value: a trapezoidal fuzzy set on domv(pH) with characteristic points 4, 5, 7, 9; Substrate_value: a discrete fuzzy set assigning degrees 1, 0.7, and 0.5 to Milk, Whole Milk, and Skim Milk]
In all cases, dom(a) is defined as the set of all the possible fuzzy sets on domv(a).

Definition 14: The value of an attribute a belongs to dom(a) and is denoted t(a). It is a map π of domv(a) to [0,1]. We denote π(x) the degree of possibility that the effective value of a is x (x ∈ domv(a)).

Example 8: The variation domain of the numerical attribute pH is the interval [0,14] on ℜ. The variation domain of the symbolic attribute Author could be the set {S.Ajjarapu, C.P.Rivituso, M.Zwietering}. A part of the variation domain of the hierarchized attribute Substrate is represented in Figure 7. The value pH_value of Figure 8 schematizes an example of value for the attribute pH (that value belongs to dom(pH): it is a map of domv(pH) into [0,1]). The value Substrate_value of Figure 8 schematizes an example of value for the attribute Substrate (that value belongs to dom(Substrate): it is a map of domv(Substrate) into [0,1]; the elements of domv(Substrate) having a degree equal to 0 are not represented).

Note that in our data model, we consider that all the values are imprecise values. The case of a crisp value for an attribute a is a particular case of an imprecise value, such that ∃x ∈ domv(a) [π(x) = 1 and ∀y ≠ x, π(y) = 0]. For simplicity, and since it corresponds to the application needs, we chose to limit the representation of numerical values to trapezoidal functions in the actual database. These trapezoidal functions are stored by means of four characteristic points defining the limits of the support and the kernel of the fuzzy set. In the example of Figure 8, these four characteristic points are [4, 5, 7, 9].

Schema of the relational database: In the following, we do not present in detail the relational database schema (which is a classic RDB schema), but we focus on the choices we have made in order
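The trapezoidal representation can be made concrete with a small Python sketch: a numerical fuzzy value is stored as its four characteristic points and its membership map π is reconstructed on demand (the function names are illustrative, not part of the MIEL system).

```python
# Minimal sketch of the trapezoidal representation of numerical values:
# (MinSupp, MinKer, MaxKer, MaxSupp) are the four characteristic points
# delimiting the support and the kernel of the fuzzy set.

def trapezoidal(min_supp, min_ker, max_ker, max_supp):
    def pi(x):
        if x < min_supp or x > max_supp:
            return 0.0          # outside the support
        if min_ker <= x <= max_ker:
            return 1.0          # inside the kernel
        if x < min_ker:         # rising edge of the trapezoid
            return (x - min_supp) / (min_ker - min_supp)
        return (max_supp - x) / (max_supp - max_ker)  # falling edge
    return pi

# pH_value of Figure 8, with characteristic points [4, 5, 7, 9]
ph_value = trapezoidal(4, 5, 7, 9)
print(ph_value(6))    # 1.0: 6 lies in the kernel [5, 7]
print(ph_value(4.5))  # 0.5: halfway up the rising edge [4, 5]
print(ph_value(8))    # 0.5: halfway down the falling edge [7, 9]
```

A crisp value is the degenerate case where the four points coincide, e.g. trapezoidal(6, 6, 6, 6), matching the second row of Figure 9.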
Figure 9. The upper table presents an example of relation referencing numerical fuzzy values. The lower table contains a part of the relation NumericalFuzzySet which stores the actual numerical fuzzy sets (the second row corresponds to a crisp value in this example).

ExpeId | Substrate | FuzzyPHId
10 | Pork | 200
11 | Skim Milk | 231

FuzzySetId | MinSupp | MinKer | MaxKer | MaxSupp
200 | 4 | 5 | 6 | 7
231 | 6 | 6 | 6 | 6
to map the ontology of the MIEL data model presented previously onto the RDB schema. We present how the attributes and their variation domains are represented in the RDB. We successively consider the way of representing an attribute belonging to the ontology of the MIEL data model, when that attribute is respectively of type numerical, symbolic, or hierarchized.
Representation of a numerical attribute: The representation of a value of a numerical attribute in the relational schema is done by means of a row of an additional table which contains the unique identifier of the numerical fuzzy set and four attributes which correspond to the four characteristic points of the trapezoidal function. Existing techniques (see Galindo, Urrutia, & Piattini, 2006) could be used to represent other kinds of values (such as ranges, unknown, undefined, etc.).

Example 9: The tables in Figure 9 present an example of a numerical attribute represented in the relational database.

Representation of a symbolic attribute: The representation of a value of a symbolic attribute a of A in the relational schema is done by means of one or several rows of an additional table which contains three columns: the unique identifier of the fuzzy set, an element of domv(a), and its associated membership degree in that fuzzy set. Recall that a fuzzy set on a symbolic variation domain is defined as a set of pairs (element, degree). In addition, the variation domain of each attribute a of A of type symbolic used in the relational schema is stored in a reference table which contains all the possible values that compose domv(a).

Example 10: The tables in Figure 10 present an example of the value of a symbolic attribute represented in the relational database.

Representation of a hierarchized attribute: The representation of a value of a hierarchized attribute of A in the relational schema is done in exactly the same way as the representation of a symbolic attribute (see above). The variation domain of each attribute a of A of type hierarchized used in the relational schema is stored in two specific tables: a table which contains all the possible values that compose domv(a) and a table which contains all the pairs {vi, vj} of the cover relation of the partial order of domv(a).

Example 11: The tables presented in Figure 11 are partial instances of relations Ref_Substrate and Hier_Substrate describing the hierarchized variation domain for substrates in the relational schema.

When an attribute is known to hold only "crisp" values (for example, the substrate in Figure 9), the database designers have used classic database attributes of type real, integer, or string instead of fuzzy values. As presented above, the ontology of the MIEL data model is stored in specific tables of the relational database schema, because we have to enforce referential integrity control on the data we store.

Figure 10. The upper table presents an example of relation referencing symbolic fuzzy values. The bottom table contains a part of the relation SubstrateOrigin which contains symbolic fuzzy sets (only one fuzzy set is represented in this example).

Substrate | FuzzyOriginId
Pork | 100

FuzzyOriginId | Country | Degree
100 | USA | 1
100 | Germany | 1
100 | France | 0.8

Figure 11. The upper table presents a part of relation Ref_Substrate. The bottom table presents a part of relation Hier_Substrate.

Substrate
Milk
Full milk
Pasteurized milk
Pasteurized full milk

SubstrateSup | SubstrateInf
Milk | Full milk
Milk | Pasteurized milk
Pasteurized milk | Pasteurized full milk
Full milk | Pasteurized full milk

Flexible Query Language

A query asked on the MIEL system is expressed in the MIEL query language through the MIEL graphical user interface. In the following, we present the notions we use in a way close to domain relational calculus (Ullman, 1988). We use this query language in a data integration system in which data are stored using different models (the relational model, the conceptual graph model, or the XML model presented in the section titled Future Trends). This is the reason why a MIEL query is always expressed in a view. It provides two advantages: (1) the user always interacts with a unique querying language and does not need to know which models are used to store the data; and (2) the data integration system is extensible: to add a new data model, one only has to design a new mediator which translates a MIEL query expressed in a view into a query adapted to the data model.

The notion of view: A view is a usual notion in relational databases: it is a virtual table built from the actual tables of the relational database schema by means of a query. In the MIEL system, a set of views (which are prewritten queries) is proposed to the users in order to hide the complexity of the database schema.

Definition 15: A view V on n (n > 0) queryable attributes a1, …, an of the MIEL ontology is defined by V = {a1, …, an | PV(a1, …, an)}, where PV is a predicate which characterizes the construction of the view.

Example 12: The view OneFactorExperience is defined on five attributes:
OneFactorExperience = {Substrate, PathogenicGerm, pH, Factor, ResponseType | POneFactorExperience(Substrate, PathogenicGerm, pH, Factor, ResponseType)}. The predicate POneFactorExperience defines the way the attributes involved in the view are linked together. That view characterizes the result of experimentations in which only one factor is controlled (for example, the temperature). The response type can be, for example, the growth speed of a pathogenic germ in a given substrate.
Expression of a query: A query in the MIEL system is a specialization of a given view by the end user, who specifies a set of projection attributes as a subset of the queryable attributes of the view and a set of conjunctive selection criteria on some other attributes.

Definition 16: A query Q asked on a view V is defined by:

Q = {a1, …, ap | ∃aq+1, …, an (PV(a1, …, an) ∧ (ap+1 ≈ vp+1) ∧ … ∧ (aq ≈ vq))}

where PV is the predicate which characterizes the view V, a1, …, ap are the projection attributes, ap+1, …, aq are the selection attributes and vp+1, …, vq their respective values given as selection values by the user, gp+1, …, gq are Boolean values specifying whether the fuzzy sets must be generalized or not, and (Πmin, Nmin) ∈ [0,1]² is its minimum possibility and necessity degrees. The attributes aq+1, …, an are the queryable attributes of the view which are not used in that query.

In the previous definition, the comparator ≈ stands for "approximately equal" and will be interpreted in the answer by the two classical scalar measures used in fuzzy pattern matching (Dubois & Prade, 1995) to evaluate the compatibility between an imprecise datum and a fuzzy set representing the selection values.

Example 13: The query Q is expressed in the view OneFactorExperience:

Q = {Substrate, PathogenicGerm, pH, Factor, ResponseType | POneFactorExperience(Substrate, PathogenicGerm, pH, Factor, ResponseType) ∧ (Substrate ≈ SubstratePreferences) ∧ (pH ≈ pHPreferences)}

where the fuzzy sets SubstratePreferences and pHPreferences are presented in Figure 12 and their associated Boolean values in the query are false (no generalization). The minimum possibility and necessity degrees are set to Πmin = 0.8 and Nmin = 0.0.

Figure 12. Preferences expressed by the user [pHPreferences: the trapezoidal fuzzy set with characteristic points 4, 5, 6, 7; SubstratePreferences: degrees 1, 0.9, and 0.8 for Whole milk, Skim milk, and Half skim milk]

The answers: An answer A to a query Q in the MIEL system is a set of tuples. Each tuple is composed of values (which are fuzzy sets, as presented in Definition 14) and satisfies the selection criteria of the query. If the generalization of a selection value is required in the query, the way this operation is processed depends on the type of the selection attribute. For an attribute of type numerical or symbolic, the generalization is processed as in Dubois and Prade (1995). When the attribute is of type hierarchized, the generalization of the hierarchical fuzzy set is processed as described in Definition 12 of the section titled Generalization of a HFS. In the following definition, for any type of selection attribute, we denote by gen(F) the generalization of the fuzzy set F used as fuzzy selection value.

Definition 17: Let Q = {a1, …, ap | ∃aq+1, …, an (PV(a1, …, an) ∧ (ap+1 ≈ vp+1) ∧ … ∧ (aq ≈ vq))} be a query, gp+1, …, gq be its Boolean values, and (Πmin, Nmin) be its minimum possibility and necessity degrees. The answer A to the query Q is A = {t1, …, tr}, the set of tuples of the form {ti[a1], …, ti[ap]}1≤i≤r, such that every tuple ti of A satisfies all the selection criteria of Q, with

πi = t(Π(fp+1(vp+1), π(ap+1)), …, Π(fq(vq), π(aq)))

and

ni = t(N(fp+1(vp+1), π(ap+1)), …, N(fq(vq), π(aq)))

their respective possibility and necessity degrees of matching (as defined in Definition 3), such that πi ≥ Πmin and ni ≥ Nmin, where t is a t-norm and fj(vj) = gen(vj) if gj is true and vj otherwise. We have chosen the min operation to implement the t-norm t, which is a classical choice to represent the conjunction, but of course other t-norms may be used.

Example 14: An example of answer corresponding to the query of Example 13 expressed in the view OneFactorExperience is given in Figure 13.
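The two classical scalar measures interpreting ≈ can be sketched directly. On a finite domain, the possibility of matching is Π(f, π) = max_x min(f(x), π(x)) and the necessity is N(f, π) = min_x max(f(x), 1 − π(x)) (Dubois & Prade, 1995). The dict-based encoding of fuzzy sets below is an illustrative assumption of ours.

```python
# Sketch of the two scalar measures of fuzzy pattern matching used to
# interpret the comparator "approximately equal" between a fuzzy
# selection value f and an imprecise datum pi on a finite domain.

def possibility(f, pi, domain):
    return max(min(f.get(x, 0.0), pi.get(x, 0.0)) for x in domain)

def necessity(f, pi, domain):
    return min(max(f.get(x, 0.0), 1.0 - pi.get(x, 0.0)) for x in domain)

domain = ["Whole milk", "Skim milk", "Half skim milk", "Pork"]
# SubstratePreferences of Figure 12
preferences = {"Whole milk": 1.0, "Skim milk": 0.9, "Half skim milk": 0.8}
datum = {"Half skim milk": 1.0}  # a crisp imprecise datum
print(possibility(preferences, datum, domain))  # 0.8
print(necessity(preferences, datum, domain))    # 0.8
```

For a crisp datum the two measures coincide with the preference degree of the stored value, which is why a crisp Half skim milk datum scores 0.8 against SubstratePreferences.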
Query Processing

The MIEL user interface sends the MIEL query to the relational subsystem. The relational subsystem adapts the MIEL query to the formalism it uses: an SQL query. The views in the relational database of the MIEL system are SQL queries. In the actual implementation of the MIEL system, the views are stored in a specific table of the database called LViews, in which each tuple represents a view and is composed of four columns: IdView is the unique identifier of the view, SelectPart is the list of projection attributes, FromPart is the list of relations involved in the view, and WherePart is the list of join predicates between those relations. We did not use the SQL notion of view because it is not implemented in all RDBMS, notably MySQL, which is very popular.

Example 15: The SQL query which corresponds to the view SubstrateList defined in Figure 14 is: select P.Title, S.Substrate from Publication P, Substrate S where P.IdPub = S.IdPub.

The query processing in the database subsystem proceeds as follows:

1. Selection of the view corresponding to the query;
2. For each selection attribute of type hierarchized, computation of the fuzzy set closure and optionally, if it is requested in the query, computation of the generalization of the hierarchical fuzzy set;
3. For each selection attribute of type numerical or symbolic, optionally, if it is requested in the query, computation of the generalization of the fuzzy set;
Figure 13. Part of an answer

(Π, N) | Substrate | PathogenicGerm | pH [min, max] | Factor | ResponseType
(1.0, 1.0) | Whole milk | Bacillus Cereus | [5.1, 5.2] | Temperature | Temporal cinetic
(0.9, 0.9) | Half skim milk | Listeria | [5.0, 5.4] | Temperature | Growth speed
(0.8, 0.0) | Skim milk | Listeria | [6.0, 8.0] | Temperature | Level of contamination
Figure 14. A partial instance of relation LViews

IdView | SelectPart | FromPart | WherePart
SubstrateList | P.Title, S.Substrate | Publication P, Substrate S | P.IdPub=S.IdPub

4. Transformation of the fuzzy values of the selection criteria into classic SQL conditions (we call that process "defuzzification"). This transformation depends on the type of the selection attribute. If the type is hierarchized, a list of queried values is built from the list of elements which belong to the fuzzy set closure. If the type is symbolic, a list of queried values is built from the list of elements which belong to the associated fuzzy set. If the type is numerical, a list of Boolean conditions checks both cases of overlapping between the fuzzy set which represents the selection criterion and the one representing the imprecise datum (see Haemmerlé, Buche, & Thomopoulos, 2007, for more details);
5. Completion of the SQL query corresponding to the view in order to build the actual "defuzzified" SQL query;
6. Submission of the SQL query to a standard relational database management system (Oracle or PostgreSQL in the present version);
7. Computation of the adequation degree of each tuple of the answer.

Example 16: We assume that the query asked through the MIEL graphical user interface is: {Title, Substrate | SubstrateList(Title, Substrate) ∧ (Substrate ≈ SubstratePreferences)}. The selected view is that of Example 15 and the fuzzy set SubstratePreferences is that of Example 13. Using the hierarchy of Figure 7, the fuzzy set closure and the defuzzification of the selection criterion lead to the actual selection criterion: Substrate in ('Whole milk', 'Skim milk', 'Half skim milk', 'Pasteurized whole milk'). The SQL query submitted to the RDBMS query processor is then: select P.Title, S.Substrate from Publication P, Substrate S where P.IdPub = S.IdPub and S.Substrate in ('Whole milk', 'Skim milk', 'Half skim milk', 'Pasteurized whole milk').
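Steps 4-5 for a hierarchized criterion can be sketched as follows: the elements of the fuzzy set closure become a SQL IN list appended to the stored view parts. This is an illustrative sketch, not the MIEL implementation; the helper name defuzzify_sql is ours and quoting/escaping is deliberately simplified.

```python
# Illustrative "defuzzification" of a hierarchized selection criterion:
# complete the SELECT/FROM/WHERE parts of an LViews row with an IN list
# built from the elements of the fuzzy set closure.

def defuzzify_sql(view, closure_elements, column):
    values = ", ".join("'%s'" % v for v in closure_elements)
    return ("select %s from %s where %s and %s in (%s)"
            % (view["SelectPart"], view["FromPart"],
               view["WherePart"], column, values))

substrate_list = {  # the LViews row of Figure 14
    "SelectPart": "P.Title, S.Substrate",
    "FromPart": "Publication P, Substrate S",
    "WherePart": "P.IdPub = S.IdPub",
}
closure = ["Whole milk", "Skim milk", "Half skim milk",
           "Pasteurized whole milk"]
sql = defuzzify_sql(substrate_list, closure, "S.Substrate")
# sql is the "defuzzified" SQL query of Example 16
```

Note that the matching degrees of step 7 cannot be computed by the RDBMS itself: the degrees attached to the closure elements are kept aside and reattached to the returned tuples.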
Applications and Experimentation

In this part, we present two applications based on the MIEL querying system. Then examples of GUI illustrating a use case are given. Finally, we give some experimental results about the closure and the generalization of a HFS.
Presentation of Two Applications

We have designed and implemented two instances of the MIEL flexible querying system, involving hierarchical fuzzy sets, for two different relational databases:

• the Sym'Previus (http://www.symprevius.org/) database, which contains around 10,000 data entries about the behavior of microbial contaminants in foods. This system has been developed with industrial partners (Danone, Bongrain, Pernod-Ricard, …) and governmental institutions (the French Ministry of Agriculture and Fisheries);
• the Mét@risk (http://metarisk.inapg.inra.fr/) database, which contains around 50,000 data entries about chemical contamination in foods. This system has been developed by the Mét@risk INRA research unit with a national governmental institution (the French Ministry of Agriculture and Fisheries) and an international institution (the WHO Gems Food).
Both systems are operational and a copy is accessible from the Web:
• Sym'Previus database: http://www.symprevius.org/ (contact olivier.couvert@adria.tm.fr to obtain an access)
• Mét@risk database: http://idot.inapg.fr/mielContaminant/web/ (contact Buche@agroparistech.fr to obtain an access)
The Sym'Previus ontology contains two attributes of type hierarchized. For the attribute Substrate, the ontology contains 507 terms with a maximum specialization depth of 7. For the attribute MicrobialContaminant (microorganism), the ontology contains 167 terms, also with a maximum specialization depth of 7. The Mét@risk ontology contains seven attributes of type hierarchized. Six attributes describe the food: ProductType (481 terms), FoodSource (2239 terms), CookingMethod (42 terms), TreatmentApplied (628 terms), PackingMedium (37 terms), and ContainerOrWrapping (285 terms). For those attributes, the maximum specialization depth is 6. For the attribute ChemicalContaminant, the ontology contains 222 terms with a specialization depth of 1.
Examples of GUI

We present some dialogues with the user through an example of query. Although those dialogs could of course be enhanced to help users introduce fuzzy sets, they have been judged sufficiently "intuitive" by our microbiologist partners, who helped us design the GUI. The user wants to query the view OneFactorExperience (see Example 12). Five queryable attributes are available: Substrate, PathogenicGerm, pH, Factor, ResponseType. The user expresses preferences about the Substrate using the HFS (1.0/Cheese: soft + 0.9/Cheese) and about the pH using the numerical fuzzy set [4, 5, 6, 7]. Figure 15 shows the window in which the user expresses the HFS about the Substrate. The part of the ontology concerning the attribute Substrate can be accessed by the user in the frame Hierarchy of food products. The user's preferences are registered in the list boxes of the frame Set of
values. The list boxes also give access to the ontology by alphabetic order. In the frame Hierarchy of food products, the button Vizualise the choice asks the MIEL system to color in red the terms of the ontology which belong to the fuzzy set closure and in purple the terms obtained by a generalization of the fuzzy set. This last operation is computed by the system when the check box Extended selection of the frame Set of values is marked. Figure 16 shows the window which permits one to define a fuzzy set of a numerical type. It is used in the example to specify pH preferences. A part of the answer to the query is presented in Figure 17. The first column contains the possibility degree of matching (see Definition 17) of each tuple. For instance, the first answer is about the behavior of Escherichia coli in Cheese and has been published in Reitsma (1996). Data have been collected in a farm in the USA. The second answer is about the behavior of Bacillus cereus in Processed cheese and has been collected by an anonymous industrial partner of the Sym’Previus network. As both answers correspond to a cheese or a kind of cheese (Processed cheese), they are associated with the degree 0.9. Answers can be downloaded on the user’s computer in Excel format for further manipulations.
Figure 15. Graphical user interface to register a HFS

Figure 16. Graphical user interface to register a fuzzy set of type numerical

Figure 17. An example of answer to a query in the view OneFactorExperience

Experimentation

We have conducted an experiment in order to evaluate the efficiency of the closure and the generalization operations. With the experts of the domain, we have defined 7 test queries which cover around 10% of the Sym'Previus database entries (1,132 data entries among 10,000). The same query has been executed three times: without the computation of the closure (denoted standard queries), with the computation of the closure, and with the computation of the closure and the generalization. Exact answers correspond to the answers obtained either by the standard queries or by the queries with the computation of the closure. All the answers obtained with the computation of the closure have been considered as correct by the experts. They represent 99% of the exact answers (1% of the exact answers correspond to the standard queries). This is an excellent result for the closure operation. Among the results obtained by the generalization operation, the answers judged pertinent by the experts (80% of the total number of answers obtained by generalization) are those which have the highest matching degrees (between 0.6 and 0.8), whereas the answers judged nonpertinent (20%) have degrees that go from 0.6 down to 0.2. An essential remark is that the value 0.6 can thus be considered as a threshold above which results are classified as pertinent, and below which results are classified as nonpertinent by the experts. The evaluation results are thus also very good for the generalization method, as (1) pertinent results can be clearly identified using their matching degrees and (2) generalization results bring an important amount of complementary results (56% of the total number of results regarded as exact or pertinent).
Future Trends

In this section, we expose in more detail the continuation of this project within our own research team. In this chapter, we have presented the concept of hierarchical fuzzy set and its application to querying a possibilistic database, thanks to the MIEL flexible querying system, in the framework of the relational model. The concept of hierarchical fuzzy set has also been implemented in two other representation formalisms, the conceptual graph model and XML, in the framework of the design of a data integration system. Our data
integration system integrates three subsystems: a relational database (RDB) subsystem, a conceptual graph base (CGDB) subsystem and an XML base (XMLDB) subsystem. The relational database contains the stable, well-structured part of the information. The conceptual graph base contains the weakly structured pieces of information which do not fit the relational schema. As changing a relational schema is quite an expensive operation, we decided to use an additional base in order to store information that was not expected when the schema of the database was designed, but that is useful nevertheless. We chose to use the conceptual graph model for many reasons, including (1) its graph structure, which appeared as a flexible way of representing complementary information, and (2) its readability for a nonspecialist. The XML base contains information found semi-automatically on the Web by the AQWEB tool. AQWEB scans the Web, retrieves and filters documents which “look like” scientific publications. Tables containing scientific data are extracted automatically from each document and stored in an XML document. In order to be able to query efficiently the XML documents containing data tables, those tables are annotated using the domain ontology. For instance, AQWEB tries to match each term of the table with the closest terms of the ontology. XML documents including annotations are stored in an XML native database to enhance their querying. The data stored in the three bases are expressed in different formalisms, but conform to a single domain ontology. That domain ontology consists of a set of attributes and their associated variation domain (see Definition 13). The user expresses a query in the MIEL language. This query is then sent simultaneously to the three subsystems which transform the MIEL query into a formalism adapted to each subsystem (an SQL query in the RDB subsystem, a conceptual graph query in the CGDB subsystem, and an Xquery query in the XMLDB subsystem). 
The mediators which process this task have been presented in Haemmerlé et al. (2007) for the CGDB subsystem and in Buche, Dibie-Bartélemy, Haemmerlé, and Hignette (2006) for the first version of the XMLDB subsystem. The query is executed by each subsystem and the answers are retrieved by the MIEL graphical interface to be presented in a homogeneous way to the user. In the framework of the French national project WebContent (http://www.webcontent.fr), we are currently working on a new version of the AQWEB tool. In particular, we produce a new annotation represented as a fuzzy set, associating terms of the ontology with their similarity to the term of the Web table (Hignette, Buche, Dibie-Bartélemy, & Haemmerlé, 2005). Consequently, we will have to consider that a hierarchical fuzzy set associated with an attribute of type hierarchized may have not only two but three different semantics: a semantics of preference or of possibility distribution (the ones we studied in this chapter), but also a semantics of similarity. We will have to take into account all the consequences of this extension in the new version of the MIEL XMLDB mediator, which will be developed in the framework of the WebContent project.
Conclusion

Fuzzy sets are used both in flexible querying, to allow the expression of user preferences, and in possibilistic databases, to represent imprecise data by means of possibility distributions. In this chapter, we have focused on the case where these fuzzy sets are defined on hierarchically organized domains. Such domains are widely used in ontology-based systems. Defining fuzzy sets on hierarchically organized domains is not a trivial issue, since the degree associated with an element in such a fuzzy set must be coherent with those associated with sub-elements or super-elements, and compatible with reasoning using fuzzy set operations. We first proposed an overview of existing works in two fields whose combination is the core of the chapter: flexible querying of imprecise data and fuzziness in hierarchies. Then we introduced the
concept of hierarchical fuzzy set and presented its properties: two ways of defining it, on a part of a hierarchy or by computing its closure on the whole hierarchy; the extension of fuzzy set operations to hierarchical fuzzy sets, based on the HFS closures; the existence of equivalence classes composed of hierarchical fuzzy sets that share the same closure; the existence of a unique representative which has a property of minimality within each equivalence class; the generalization of a hierarchical fuzzy set, based on its equivalent minimal fuzzy set and useful for flexible querying purposes. We presented the framework of the MIEL flexible querying system, its data model, its query language, and its query processing. It implements the concept of hierarchical fuzzy set and is used for the querying of two different relational databases, containing imprecise data, in the domain of risk assessment in food products. We illustrated the chapter with the examples of the two applications, their graphical user interface, and an experimental evaluation. This evaluation shows that 99% of the total number of exact answers is obtained thanks to the closure computation and that 56% of the total number of results regarded as exact or pertinent is obtained by the generalization mechanisms presented in the chapter. Finally, we outlined some future trends concerning both hierarchical fuzzy sets and the MIEL flexible querying system. The essential point to retain from this chapter appears to us as being the relevance of hierarchical fuzzy sets for ontology-based systems, which extends their use beyond the framework of a particular application or a particular representation formalism.
Key Terms

Flexible Querying: Methods for querying a database that enhance standard querying expressiveness in various ways, such as the expression of user's preferences, query generalization, and so forth, in order to facilitate the extraction of relevant data.

Fuzzy Set: A mapping from a universe of discourse—the definition domain of the fuzzy set—into the interval [0,1]. The concept of fuzzy set extends the notion of Boolean membership to a set to the notion of degree of membership.

Hierarchical Fuzzy Set: A fuzzy set whose definition domain is a part of a hierarchy.

Hierarchy: A set of elements that are partially ordered by the "kind of" relation.

MIEL Language: A flexible querying language which permits expressing a conjunctive query in a given view. Current implementations have been done under the Oracle and PostgreSQL RDBMSs.

MIEL Query: A conjunctive query where the selection value associated with a queried attribute is expressed by a fuzzy set representing preferences.

Ontology: A formalization of the description of domain knowledge at a conceptual level.

Possibilistic Database: A database that contains ill-known data represented by means of possibility theory.

Possibility Distribution: A fuzzy set whose semantics represents the possible ordered values of an imprecise datum; only one of these values is the effective—but ill-known—value of the datum.

Query Generalization: An operation that creates, from a given query Q1, a query Q2 such that Q1 is included in Q2; that is, the answers to Q1 are included in the answers to Q2 for any database.
Endnotes
1. MIEL is a French acronym for Extended Database Search Tool.
2. With the meaning of the « kind of » relation.
3. World Health Organization.
Chapter XIII

Query Expansion by Taxonomy

Troels Andreasen, Roskilde University, Denmark
Henrik Bulskov, Roskilde University, Denmark
Abstract The use of taxonomies and ontologies as a foundation for enhancing textual information base access has recently gained increased attention in the field of information retrieval. The objective is to provide a domain model of an application domain where key concepts are organized and related. If queries and information base objects can be mapped to this, then the ontology may provide a valuable basis for a means of query evaluation that matches conceptual content rather than just strings, words, and numbers. This chapter presents an overview of the use of taxonomies and ontologies in querying with a special emphasis on similarity derived from the ontology. The notion of ontology is briefly introduced and similarity is surveyed. The former can be considered a generalization of taxonomy, while the latter can be seen as an interpretation where aspects of formal reasoning are ignored and replaced by measures reflecting how close concepts are connected, thereby significantly enhancing performance. In turn, similarity measures can be used in conceptual querying. Queries can be expanded with similar concepts, thereby causing query evaluation to be based on concepts from the domain model rather than on words in the query.
Introduction Information retrieval (IR) deals with access to information as well as its representation, storage, and organization. The overall goal of an IR process is to retrieve the information relevant to a given request. The criteria for complete success are the retrieval of all the relevant documents1 stored in a given system and the rejection of all the nonrelevant ones. Thus, the notion of relevance is at the heart of IR, and the retrieved documents are
those found to be most relevant to the given request under the conditions available (representation, search strategy, and query). The set of relevant documents includes the documents that are likely to contain the information desired by the user and the selection of these is typically based merely on bag-of-words descriptions of the documents. Thus, finding relevant documents in an information retrieval system (IRS) obviously involves uncertainty, not only with regard to the interpretations that document descriptions represent but also due to possible interpretations of user requests.
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
A number of different approaches have been introduced over the years in order to handle the problem of uncertainty in IR. Among these are variants of the vector retrieval model, the probabilistic retrieval model, and the extended Boolean retrieval model. One important and very promising extension of the Boolean model is the fuzzy retrieval model. In this model, documents may be more or less relevant, given a query, and "uncertainty is an inherent part of the decision problem since the criteria for determining the answer are not altogether clear" (Lucarella, 1990). To a large extent, everyday users, as well as researchers and developers, have recognized the limitations of standard keyword-based IRSs, for example, search engines on the Internet. Even when advanced search options designed to facilitate increased precision and recall are available, they are not of much help to average users, who generally avoid them due to poor usability and high perceived difficulty of use (Bandos & Resnick, 2002). One obvious alternative to standard keyword-based search is a less rigid natural language interpretation of queries, an idea that goes almost as far back as the idea of natural language processing does. Many natural language query approaches have been presented and it appears that recent approaches applying shallow parsing, for example, Penev and Wong (2006), might contribute to improved search. Another important direction concerns domain knowledge processing. Most prominent among knowledge-based approaches is probably the use of taxonomies and ontologies, which recently has gained increased attention in the field of IR.2 Taxonomies and ontologies organize key concepts of an application domain and provide semantics through relations connecting concepts.
Taxonomies, which can be considered controlled vocabularies that describe objects and relations between objects, are special cases of more general ontologies that include richer semantic relationships as well as rules for specification and formation. The main focus in this chapter is query evaluation based on domain knowledge captured by
taxonomies and ontologies. The general idea is to provide a mapping of concepts extracted from queries and documents into an ontology and to utilize this during query evaluation to obtain a matching on conceptual content rather than on just strings, words, and numbers. The emphasis in this chapter is more on how queries are interpreted and evaluated and less on how queries are expressed. Query expressions may simply be a set of keywords or concepts, may apply some more advanced operators, or may consist of controlled natural language. The important issues are the identification of concepts in queries and documents as well as the conceptual level at which they are compared. Thus, to provide ontology-based querying, we need a methodology where we can extract key concepts from queries and documents as well as a technique to compare descriptions to evaluate the degree to which they match. The objective of the former is to provide conceptual descriptions of queries and documents and the purpose of the latter is to permit query evaluation based on these conceptual descriptions. As regards extraction of key concepts from queries and documents as descriptions that indicate semantic content, it should be noted that when refraining from full semantic analysis, it is possible to produce parsers that can perform efficiently on large volumes of data. A very simplified two-phase processing principle, for example, was implemented in the OntoQuery project (Andreasen, Jensen, Nilsson, Paggio, Pedersen, & Thomsen, 2002) where the first phase was noun phrase bracketing and the second phase an extract of concepts from the individual noun phrases. A naive but powerful second phase is to extract nouns and adjectives only and combine them into “noun CHR adjective” pattern concepts (where CHR represents a “characterized by” relation). 
Thus, for instance, for the phrase "the black dog," the parser may produce the following concept: "dog CHR black." We will not go into detail on key concept extraction in this chapter but rather turn our focus to evaluation with special attention to comparison of descriptions. In order to compare conceptual descriptions, a means for measuring the extent of concept correspondence is needed. Concept correspondence is typically measured by means of similarity measures reflecting connectivity in a taxonomy/ontology that models the domain. Description correspondence calls for some kind of combination or aggregation over concept correspondence. A common approach to combining concept correspondence into description correspondence is expansion of queries, that is, adding to the initial query terms a collection of the most similar terms and then evaluating the expanded query. The query answer will then, in addition to exactly matching documents, include documents with descriptions that are only similar to the initial query description. The remainder of this chapter is organized as follows. First, a brief introduction to ontology provides the background for the main focus, conceptual expansion. Special attention is given specifically to concept algebra, an ontology representation formalism for representing so-called generative ontologies. Next, we focus on conceptual similarity, where a range of similarity measures based on taxonomies/ontologies are described, categorized, and discussed. Based on the conceptual similarity introduced, we then focus on principles for the expansion and evaluation of queries at a conceptual rather than a word-based level. Finally, we conclude with a summary. Although the general aim in this chapter is flexibility in querying, our focus is narrowed down to ontology-based similarity and the use of this in expansion of queries to document databases. For a general introduction to flexible querying, we refer to the chapter on this topic authored by Kacprzyk, Zadrożny, de Tré and de Caluwe.
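As a toy illustration of the two-phase extraction idea (the chapter gives no implementation; the adjective lexicon, stop-word list, and function name below are our own simplifications standing in for real noun-phrase bracketing and part-of-speech tagging):

```python
# Toy sketch of two-phase extraction: strip determiners from a noun phrase,
# then combine adjectives and the head noun into "noun CHR adjective" concepts.
# The tiny ADJECTIVES lexicon is a hypothetical stand-in for a real tagger.
ADJECTIVES = {"black", "old", "large", "big"}
STOPWORDS = {"the", "a", "an"}

def extract_concept(noun_phrase: str) -> str:
    """Map a simple noun phrase to a concept of the form 'noun[CHR: adj, ...]'."""
    words = [w for w in noun_phrase.lower().split() if w not in STOPWORDS]
    adjs = [w for w in words if w in ADJECTIVES]
    nouns = [w for w in words if w not in ADJECTIVES]
    head = nouns[-1]  # take the last remaining noun as the head of the phrase
    if not adjs:
        return head
    return f"{head}[{', '.join('CHR: ' + a for a in adjs)}]"

print(extract_concept("the black dog"))  # dog[CHR: black]
print(extract_concept("an old large fortification"))
```

For phrases with several adjectives, each contributes its own CHR attribution, mirroring the multiple-attribution notation used later in the chapter.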
Ontology

In the field of philosophy, where metaphysics is the study of being or existence, ontology refers to a systematic explanation that includes the definition and classification of entities and their properties. In recent years, ontology has appeared as a new concept in knowledge engineering and in computer
science contexts, but with a modified and narrow usage that refers to a data model for a given domain comprising a set of concepts and the relationships between them. To avoid confusion, Guarino and Giaretta (1995) suggest distinguishing between two concepts, with “Ontology” referring to the philosophical definition and “ontology” referring to the knowledge engineering definition. For the latter, the most widely used definition of the term is from (Gruber, 1993), “an explicit specification of a conceptualization,” or the slightly modified version by Borst (1997), “a formal specification of a shared conceptualization.” In the literature on ontology applications, the terms “ontology” and “taxonomy” are often used interchangeably, which indicates that no general consensus exists regarding what is required to justify these terms. There does, however, seem to be agreement that taxonomy is a special case of ontology. To understand more closely what actually constitutes an ontology, consult the literature on ontology classification. Lassila and McGuinness (2001) introduce a distinction spanning from simple linguistic resources, such as controlled vocabularies, on the one end, to expressive formal systems that can express general logical constraints on the other. The more and more widely used WordNet (Fellbaum, 1998; Miller, 1995), a simple terminological ontology, is closer to the linguistic resource end of the spectrum, while the well-known Cyc3 (Lenat, 1995; Lenat & Guha, 1990) is an example of a more expressive and axiomatized formal system with definitions stated in logic. In addition, WordNet has only a small number of relations, while Cyc has thousands. For different levels of detail, expressiveness, and formality, various languages for ontology representation have been proposed. 
Most of them introduce some kind of specialization/generalization hierarchy for classes (or concepts) and share structural similarities, including notions for the representation of instances, attributes, and relations. Consequently, due to the large number of different languages, several attempts have been made to produce generic representation models.
One of these is Open Knowledge Base Connectivity (OKBC), a protocol for accessing knowledge bases that provides a generic framework compatible with many existing knowledge-representation systems (Chaudhri, Farquhar, Fikes, Karp, & Rice, 1998). In OKBC, classes are sets or collections of instances and are connected by specialization/generalization relations that form a taxonomy. The properties of a class, denoting attributes and relationships are called slots, which are obviously inherited from superclasses to subclasses. In addition to instances, classes, and slots, OKBC introduces facets for constraining expressions, for instance, to restrict the set of allowed values for a slot. In essence, the most important distinction among ontology languages is whether the framework is axiomatic or not, and the kind of reasoning that can be performed. When targeting query applications, however, this difference is not necessarily of much importance. Common for many approaches to ontology applications related to large-scale query evaluation is that the reasoning capabilities of the ontology framework are ignored. The most important aspect of the ontology that can contribute to enhanced querying is obviously similarity as derived from the ontology. Indications or measures of this can of course be derived from the reasoning in the ontology, for instance, subsumption reasoning. However, since any ontology structured around a specialization/generalization hierarchy can be simplified to a graph or a network reflecting connectivity among concepts, we can shortcut the derivation of similarity by directly evaluating the graph connectivity; for instance, a shorter path indicates greater similarity. Thus, in connection with querying and the derivation of similarity measures, we do not need to deal with detailed aspects of the ontology formalism, which means the choice of framework and representation language becomes less important. 
The tendency is to refrain from reasoning and use a heuristic approach instead in the interpretation of the ontology (for instance, rather than subsumption reasoning, the heuristics, “the shorter the connecting path between two concepts is, the more similar they are considered to be,” is applied).
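The shortest-path heuristic can be sketched as a breadth-first search over an undirected IS-A graph; the tiny edge set and the 1/(1 + distance) transform below are illustrative assumptions of ours, not prescriptions from the text:

```python
from collections import deque

# Hypothetical toy IS-A graph; each pair connects a concept to its direct
# parent. Treating edges as undirected, path length serves as a distance.
EDGES = {
    ("cathedral", "church"), ("abbey", "church"),
    ("church", "building"), ("fortress", "building"),
}

def path_similarity(a: str, b: str) -> float:
    """Heuristic: the shorter the connecting path, the more similar (1/(1+d))."""
    graph = {}
    for x, y in EDGES:
        graph.setdefault(x, set()).add(y)
        graph.setdefault(y, set()).add(x)
    seen, frontier, dist = {a}, deque([(a, 0)]), None
    while frontier:
        node, d = frontier.popleft()
        if node == b:
            dist = d
            break
        for n in graph.get(node, ()):
            if n not in seen:
                seen.add(n)
                frontier.append((n, d + 1))
    return 0.0 if dist is None else 1.0 / (1.0 + dist)

print(path_similarity("cathedral", "abbey"))  # path of length 2 -> 1/3
```

Siblings under a common parent (cathedral, abbey) thus come out more similar than concepts connected only through a distant ancestor (cathedral, fortress).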
One specific problem we have to deal with is that some of the concepts found in queries and documents do not directly appear in the ontology, and thus cannot be mapped to the ontology. For instance, if the concept “a black dog” appears in a document, but “dog” appears only without this specialization in the ontology, we cannot map it to the ontology. We can of course just ignore the color property and only consider what is found as an instance of the dog class. However, if the concepts “black” and “dog” are explicitly specified in the ontology, one solution is to introduce “generativity” and infer the concept “a black dog” from the concepts available, and thus avoid losing important information. An ontology framework is defined as “generative” when newly discovered concepts can be represented and situated in the ontology. Whether the ontology is generative or not is probably independent of the exact representation formalism chosen. However, in some frameworks, this is an inherent aspect. One such framework is presented below.
An Algebraic Representation of Ontologies

The foundation of the generative ontology framework we define is a basis ontology that defines concepts and relations among concepts of the domain in use. This ontology can be obtained from, for instance, knowledge engineers, known ontologies like WordNet, and so forth. The basis ontology situates a set of atomic term concepts, A, in a concept inclusion lattice ordered by the concept inclusion relation, also called IS-A. A concept language (description language) defines a set of well-formed concepts, including both atomic and compound term concepts, where compound concepts are defined as concepts created by relating concepts in the ontology together. The concept language used here, Ontolog (Nilsson, 2001), is a lattice-algebraic description language. Its basic elements are concepts and binary relations between concepts. The algebra introduces
two closed operations on concept expressions ϕ and ψ:

• Conceptual sum (ϕ + ψ), interpreted as the concept being ϕ or ψ,
• Conceptual product (ϕ × ψ), interpreted as the concept being ϕ and ψ,

also called join and meet, respectively. Relationships between concepts are introduced algebraically by means of a binary operator (:) known as the Peirce product (r : ϕ), which combines a relation r with an expression ϕ. The Peirce product is used as a factor in conceptual products, as in x × (r : y), which can be rewritten to form the feature structure x[r : y], where [r : y] is an attribution of the concept x. Compound concepts can then be formed by attribution. Given atomic concepts A and semantic relations R, the set of well-formed terms L is:

L = A ∪ {x[r1 : y1, ..., rn : yn] | x ∈ A, ri ∈ R, yi ∈ L}.   (1)

Compound concepts can thus have multiple as well as nested attributions. For instance, with R = {WRT, CHR, CBY, TMP, LOC, …} and A = {entity, physical entity, abstract entity, location, town, cathedral, old}, we get:

L = {entity, physical entity, abstract entity, location, town, cathedral, …, cathedral[LOC: town; CHR: old], cathedral[LOC: town[CHR: old]], …}.

Sources for knowledge base ontologies may have various forms. Typically, a taxonomy can be supplemented with, for instance, word and term lists as well as dictionaries for the definition of vocabularies and for handling morphology. WordNet, well known and widely used, is an interesting and useful resource for general ontologies. We will not go into detail here on the modeling, but will simply assume the presence of a taxonomy T over the set of atomic concepts A. T and A express the domain and world knowledge provided. Given a concept inclusion lattice ordered by the IS-A relation where, for instance, a IS-A b and b IS-A c, transitivity means that we can infer that a IS-A c. The transitive closure over such a concept inclusion lattice would then be all the transitive relations we can infer from the lattice, for example, {a IS-A b, b IS-A c, a IS-A c}, for all concepts. Based on T', the transitive closure of T, we can generalize to an inclusion relation "≤" over all well-formed terms of the language L with the following:

"≤" = T' ∪ {<x[..., r : z], y[...]> | <x[...], y[...]> ∈ T'}
     ∪ {<x[..., r : z], y[..., r : z]> | <x[...], y[...]> ∈ T'}
     ∪ {<x[..., r : z], x[..., r : w]> | <z, w> ∈ T'},   (2)
where ... denotes zero or more attributes of the form ri : ci. The general ontology O = (L, ≤, R) thus encompasses a set of well-formed expressions, L, derived in the concept language from a set of atomic concepts, A, an inclusion relation generalized from the IS-A relation in T, and a supplementary set of semantic relations R. For r ∈ R, x[r : y] ≤ x, and x[r : y] is in relation r to y. Observe that O is generative and that L is therefore potentially infinite. The indexing process extracts concepts from documents and maps them into the general ontology, resulting in an ontology comprised of concepts from the base ontology plus the compound concepts found in the documents and the relations between the concepts. One interesting subset from this ontology is the ontology (subontology) that consists of only concepts found in the document base and their upwards expansion. By upwards expansion, we mean all the concepts reachable upwards in the ontology from a given concept or set of concepts. We denote this subontology as an instantiated ontology due to the fact that the foundation is a set of concept instances. Given a general ontology, O = (L, ≤, R), and a set of concepts, C, the instantiated ontology OC = (LC, ≤C, R) is a restriction of O to cover only the concepts in C and corresponds to the upwards expansion LC of C in O:

LC = C ∪ {x | y ∈ C, y ≤ x},   (3)

and the inclusion relation, ≤C, for OC:

≤C = {<x, y> | x, y ∈ LC, x ≤ y}.   (4)
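Equation (3), the upwards expansion, can be sketched directly; the toy IS-A table below (child to direct parents) follows the WordNet-style abbey example discussed in the text, and the function and table names are ours:

```python
# Illustrative sketch: compute the upwards expansion L_C of a concept set C
# over a toy IS-A taxonomy, i.e., C plus every concept reachable upwards,
# as in equation (3).
IS_A = {  # child -> set of direct parents
    "abbey": {"church"},
    "cathedral": {"church"},
    "church": {"place_of_worship"},
    "place_of_worship": {"building"},
    "building": {"structure"},
    "structure": {"artifact"},
    "artifact": {"physical_entity"},
    "physical_entity": {"entity"},
    "fortress": {"fortification"},
    "fortification": {"structure"},
}

def upward_expansion(concepts: set[str]) -> set[str]:
    """Return C union all concepts reachable upwards via IS-A (eq. 3)."""
    expanded, todo = set(concepts), list(concepts)
    while todo:
        for parent in IS_A.get(todo.pop(), ()):
            if parent not in expanded:
                expanded.add(parent)
                todo.append(parent)
    return expanded

print(sorted(upward_expansion({"abbey"})))
```

For {"abbey"} this yields exactly the set listed in the chapter's example: abbey, church, place_of_worship, building, structure, artifact, physical_entity, entity.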
Figure 1 shows an example of an instantiated ontology, where solid lines show the IS-A relation, dotted lines show attribution with semantic
relations, and the solid grayed lines indicate that concepts were removed on the IS-A path to save space (in the figure). The general ontology is based on (and includes) WordNet and the ontology shown is "instantiated" with respect to the following set of concepts: C = {cathedral[LOC:town[CHR:old]], abbey, fortification[CHR:large, CHR:old], stockade, fortress[CHR:big]}. LC defines all the concepts and all the connections in the instantiated ontology shown in Figure 1. That is, each of the concepts in C is expanded upwards; for instance, for the concept "abbey" we get the concepts: {"abbey", "church", "place_of_worship", "building", "structure", "artifact", "physical_entity", "entity"}. We then union all the sets found from each of the concepts in C to obtain LC, and connect them by the relations found in the general ontology to obtain ≤C. Instantiated ontologies can be formed over any set of concepts. Above, it is defined for the concepts found in the document base, but a query, a query answer, and so forth could also form the basis for instantiating a general ontology. One interesting aspect along this line is the possibility of using instantiated ontologies for visualization, although this will not be discussed further in this chapter.

Figure 1. An instantiated ontology based on a WordNet ontology over the set of instantiated concepts {cathedral[LOC:town[CHR:old]], abbey, fortress[CHR:big], stockade, fortification[CHR:large, CHR:old]}. Dotted edges show attribution with semantic relations and the grayed lines indicate that concepts were removed to save space.
Similarity

In the context of information retrieval, the most significant challenge is to connect relevant stored documents to posed queries. In classical information retrieval where the need is to compare a bag-of-words (corresponding to a query as well as document descriptions), we need a single uniform means for comparing these structures. In the simplest (strict) form, the only relevant documents to a given query are those that contain all the properties (keywords) of the query. However, in modern information retrieval systems, the interpretation is normally not strict in this sense and documents are instead graduated and ranked according to degrees of resemblance to the query. Thus, in classical information retrieval, the foremost need is a similarity measure that compares a query with a set of documents. The introduction of concepts and ontologies in information retrieval necessitates, however, different views on the measurement of similarity
between queries and documents. The specialization/generalization hierarchy and the semantic relations of ontologies should influence the relevance judgment of documents in relation to queries, which similarity measures should thus reflect. Obviously, this calls for similarity measuring degrees of resemblance between concepts. Once established, similarities among concepts can be combined into resemblances between queries and documents, that is, description similarity. Below, we consider concept similarity, and in the following section, we return to descriptions. The problem of formalizing and quantifying the intuitive notion of semantic similarity between concepts has a long history going back at least to Aristotle in philosophy, psychology, and artificial intelligence (Budanitsky & Hirst, 2006). A variety of different synonyms are used sometimes interchangeably and sometimes with different meanings in the literature, for instance, likeness, resemblance, affinity, relatedness, and closeness. Some attempts have been made to differentiate between some of these terms, for example, by Resnik (1995), who defines similarity as a special case of relatedness, but in this chapter, we will refrain from this distinction and only use similarity and its opposite: distance. Major sources from which concept similarity can be derived are taxonomic structure and corpus-based statistics. Any ontology structured around a specialization/generalization hierarchy can be simplified to a graph or a network reflecting connectivity among concepts. Thus, we can obviously shortcut the derivation of similarity by evaluating graph connectivity directly, and this, sometimes supported by statistics, is exactly what most similarity measures do. Statistics in this connection are mainly frequency based probabilities and correlate concepts from a suitable text corpus (co-occurrence).
Similarity Measures Below we present and discuss diverse similarity measures based on ontologies. In the presentation,
Query Expansion by Taxonomy
we use a broad categorization, mainly into taxonomic structure-based and corpus-based measures, where the former ignores the corpus and the latter takes the corpus, and to some degree, also the taxonomic structure into account. This categorization is inspired by other similar categorizations, for example, Cross (2004). The issue here is to use conceptual similarity measures to expand queries with similar concepts, and one approach is to expand queries into fuzzy sets using similarity as membership. If in addition, documents are represented by fuzzy sets of the concepts extracted, then queries can be evaluated by comparison of fuzzy sets, for instance, by the relative sigma count (Lucarella, 1990):
eval(d, q) = ∑c∈C min(μd(c), μq(c)) / ∑c∈C μq(c),  (5)
where C is the set of concepts in the ontology and μd(c) and μq(c) are the membership functions for the fuzzy sets of concepts representing document d and query q, respectively. In general, concepts in queries can then be expanded by the function: similar(c) = ∑c′∈C sim(c, c′)/c′,
(6)
where sim(c1, c2) is some normalized similarity measure. Naturally, there are a number of different techniques for normalizing a similarity measure, as we shall see in the following sections, but a simple and general normalization is, for example:
sim(c1, c2) = SIM(c1, c2) / max x,y∈C (SIM(x, y)),  (7)

where SIM(c1, c2) is the un-normalized measure. Similarly, any measure of distance, DIST, can be normalized by:

dist(c1, c2) = DIST(c1, c2) / max x,y∈C (DIST(x, y)),  (8)
and transformed into a measure of similarity by sim(c1 , c2 ) = 1 − dist (c1 , c2 ) . Thus, we now have a general expansion mechanism for use with any similarity or distance measure based on taxonomies/ontologies and a simple query evaluation technique to match fuzzy sets. Improvements of the latter will be discussed later, but in the following two sections, we present and discuss a number of similarity measures. The variety of approaches for evaluating measures of similarity can be grouped roughly into theoretical studies, comparison to human judgment, and applicability in specific natural language applications. In this chapter, we will look briefly at the former by comparing different properties among the measures. We will discuss measures below in light of a range of basic intuitive properties, but the intent here is not to provide a complete evaluation that holds every property against every measure, but rather to give a characterization of each described measure, so most properties will be introduced together with the measures. Several properties for similarity and distance measures have been described in the literature. Not all appear to be clear modeling guidelines for the development of measures, but rather appear to be rationalized after recognized facts. Lin (1998) attempts to create a measure of similarity that is both universally applicable to arbitrary objects and theoretically justified, and hence not tied to particular applications, domains, resources, or a specific knowledge representation. In this regard, he introduces the properties of commonality, difference, and identity. These are quite general properties—the more common the more similar, the more different the less similar, and identity
leads to the maximum possible similarity. The properties are only rarely violated, but similarity measures are in most cases based on either commonality or difference, which means one of them is not obeyed, or is undefined. Before continuing with the description of measures, we would like to emphasize two important properties that require special attention. First, similarity is often considered to be inherently symmetric. Tversky (1977) argues against this consideration, stating that similarity judgments can be regarded as extensions of similarity statements, that is, statements of the type "a is like b," which is obviously directional. He gives a number of examples: "the portrait resembles the person" and not "the person resembles the portrait," "the son resembles the father" and not "the father resembles the son." In addition, it seems that when similarity is derived from a partial order, as in taxonomic inclusion, symmetry even becomes counterintuitive for cases where the order is defined. If two concepts are connected such that one is a specialization of the other, A < B (A specializes B), then we would expect that sim(A, B) < sim(B, A) instead. If "plankton" is a specialization of "organism," then we would fully accept any kind of plankton as a kind of organism, while we cannot expect any kind of organism to be plankton. Hence, we will view similarity measures in light of whether they comply with the property introduced in Andreasen, Bulskov, and Knappe (2003): •
Generalization property: Concept inclusion implies reduced similarity in the direction of the inclusion.
Second, relying on edges in the taxonomy to represent uniform distances is a widely acknowledged problem. Consider the two pairs of concepts: (1) “pot plant” and “garden plant” and (2) “physical entity” and “abstract entity.” Intuitively, we would expect similarity for the first pair to be higher than the similarity for the second, since the first pair of concepts is much more specific (Sussna, 1993). This means that the distance represented by an
Figure 2. The generalization property implies that sim(D,B) < sim(B,D). The depth-relative property implies that sim(D,E) ≥ sim(C,F).
edge should be reduced with the increasing depth (number of edges from the top) of the location of the edge: •
Depth property: The distance represented by an edge is influenced by the depth of the location of the edge in the ontology—deeper locations means shorter distance.
Both of the above-mentioned properties are illustrated in Figure 2. Below, we will discuss taxonomic and corpus-based measures in light of these two properties as well as other more specific properties.
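Taken together, equations (5)-(8) form a small pipeline: normalize a raw measure, expand a concept into a fuzzy set of similar concepts, and evaluate a document against an expanded query. A minimal sketch in Python, where all names (normalize, similar, eval_doc) and the dictionary encoding of fuzzy sets are our own illustrative choices:

```python
# Sketch of the pipeline in equations (5)-(8); a fuzzy set is represented
# as a dict mapping concepts to membership degrees.

def normalize(SIM, concepts):
    """Equation (7): divide by the largest un-normalized similarity."""
    m = max(SIM(x, y) for x in concepts for y in concepts)
    return lambda c1, c2: SIM(c1, c2) / m

def similar(c, concepts, sim):
    """Equation (6): expand concept c into the fuzzy set {sim(c, c')/c'}."""
    return {c2: sim(c, c2) for c2 in concepts}

def eval_doc(mu_d, mu_q):
    """Equation (5): relative sigma count of document d against query q."""
    overlap = sum(min(mu_d.get(c, 0.0), mu_q.get(c, 0.0)) for c in mu_q)
    return overlap / sum(mu_q.values())
```

Any un-normalized SIM can be plugged in; the same sketch works for a distance measure after the inversion sim = 1 − dist.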
Taxonomic Structure-Based Measures The similarity measures presented in this section can roughly be divided into two groups: first, a number of measures based on distance, with the number of edges on the shortest path between concepts as the basic idea, and second, two measures that in principle consider all connecting paths. Shortest path length. One obvious way to measure similarity in a taxonomy given its graphical representation is simply to evaluate the distance
between the nodes corresponding to the items being compared, where a shorter distance implies higher similarity. In Rada, Mili, Bicknell, and Blettner (1989), a simple approach based on the shortest path length is presented. The principal assumption is that the number of edges between terms in a taxonomy (only the IS-A relation) is a measure of conceptual distance between concepts:

DISTRada(ci, cj) = minimal number of edges on a path from ci to cj  (9)
The shortest path length similarity measure complies with neither the generalization property nor the depth property, since the measure is symmetric and is based on the notion of a uniform length of edges. As an example of this, consider the ontology in Figure 1, where DISTRada ( fortress, cathedral ) = DISTRada (cathedral , fortress ) = 6 ,
and DISTRada ( fortress, cathedral ) = DISTRada (region, property ) ,
indicating symmetry and no depth property, respectively. Hirst and St-Onge. Another measure based on the shortest path length is presented by Hirst and St-Onge (1998). The measure uses not only the IS-A relation, but also PART-OF (each in both directions, corresponding to all relations between nouns in WordNet (Miller, 1990), that is, hypernymy, hyponymy, holonymy, and meronymy). Their measure is also influenced by the number of changes in direction on the paths. The weight of a path is expressed by the following formula:

SIMHirst&St-Onge(ci, cj) = C − path length − k × number of changes in direction  (10)
where C is an arbitrary fixed constant corresponding to the maximum accepted value for similarity, path length is the shortest path length, and k is a reduction weight, the cost, so to speak, of turns on the path. Two concepts are semantically similar if they are connected by a path that is not longer than the fixed constant C and whose direction does not change too often. This measure is basically a shortest path length measure and thus complies with the same properties as the shortest path length measure. If we again consider the ontology in Figure 1 and use C = 8 and k = 1, then

SIMHirst&St-Onge(fortress, cathedral) = 8 − 6 − 1 × 1 = 1,

due to a shortest path length of 6 and one shift in direction, while the same shortest path length without any shift in direction
due to a shortest path length = 6 and one shift in direction, while the same shortest path length without any shift in direction SIM Hirst&St −Onge ( physical _ entity , cathedral ) = 8 − 6 − 1 × 0 = 2
has a higher similarity. Weighted shortest path. One simple generalization of the shortest path length, the weighted shortest path, is presented in Bulskov, Knappe, and Andreasen (2002). This measure assigns weights to the IS-A relation and thus, in order to obey the generalization property, provides the possibility of differentiating between generalizations and specializations. It is argued that in information retrieval, concept inclusion IS-A intuitively implies strong similarity in the opposite direction of inclusion (specialization), but also that generalization should contribute to similarity. This can be achieved by assigning different weights to the two directions of the IS-A relation. The distinction is expressed in two parameters σ, γ ∈ [0,1] controlling similarity of immediate specialization and generalization, respectively. For a path P = (p1, …, pn) between the nodes (concepts) c1 and c2, with c1 = p1 and c2 = pn, the number of specializations is s(P) = |{i | pi ISA pi+1}| and the number of generalizations is g(P) = |{i | pi+1 ISA pi}|. If P1, …, Pm are all paths connecting c1 and c2, then the degree to which c2 is similar to c1 is defined as:
Figure 3. An ontology transformed into a directed weighted graph with the immediate specialization and generalization similarity values σ = 0.9 and γ = 0.4 as weights. Similarity is derived as the maximum (multiplicative) weighted path length, and thus simWSP("poodle", "alsatian") = 0.4 × 0.9 = 0.36.
simWSP(c1, c2) = max j=1,…,m { σ^s(Pj) × γ^g(Pj) }  (11)

An example ontology is given in Figure 3 with σ = 0.9 and γ = 0.4. The weighted shortest path measure is a generalization of the shortest path length measure and would therefore hold the same properties, but the weighting of edges puts the measure in accordance with the generalization property. Consider Figure 1 with σ = 0.9 and γ = 0.4; then:

simWSP(cathedral, fortress) = 0.9² × 0.4⁴ = 0.021,

while

simWSP(fortress, cathedral) = 0.9⁴ × 0.4² = 0.105,
and thus the generalization property is obeyed. Sussna's depth-relative scaling. In his depth-relative scaling approach, Sussna (1993) considers for every relation r also its inverse, r′, as a separate relation. Each relation r from concept c1 to c2 is weighted with a value in the range [min_r, max_r]. This is the so-called type-specific fanout factor w, which depends on the number n_r(c1) of edges of type r leaving c1:

w(c1 →r c2) = max_r − (max_r − min_r) / n_r(c1).  (12)

The type-specific fanout factor reflects the dilution of the strength of the connotation between the source and the target concept; thus, the more edges that leave a node, the less that node's edges contribute to similarity. The two inverse weights are averaged and scaled by the depth d of the edge in the overall taxonomy, which is motivated by the observation that sibling concepts deeper in the taxonomy appear to be more closely related than those higher in the taxonomy. The distance between adjacent nodes c1 and c2 is computed as:

DISTsussna(c1, c2) = (w(c1 →r c2) + w(c2 →r′ c1)) / (2 × max(d(c1), d(c2)))  (13)

where r is the relation that holds between c1 and c2, and r′ is its inverse. The semantic distance between two arbitrary concepts c1 and c2 is computed as the sum of distances between the pairs of adjacent concepts along the shortest path connecting them. The depth property was introduced by Sussna in connection with this measure, so obviously this property is obeyed. However, since this is a variation of shortest path length, the generalization property is violated. Given that min_IS-A = 1 and max_IS-A = 2, the distance between "cathedral" and "fortress" in Figure 1 is computed by first finding DISTsussna for all adjacent concepts along the shortest path connecting "cathedral" and "fortress":
DISTsussna(cathedral, church) = (2 − 1/1 + 2 − 1/2) / (2 × 7) = .178
DISTsussna(church, place_of_worship) = (2 − 1/1 + 2 − 1/1) / (2 × 6) = .167
DISTsussna(place_of_worship, building) = (2 − 1/1 + 2 − 1/1) / (2 × 5) = .200
DISTsussna(building, structure) = (2 − 1/1 + 2 − 1/2) / (2 × 4) = .312
DISTsussna(structure, defensive_structure) = (2 − 1/2 + 2 − 1/1) / (2 × 5) = .250
DISTsussna(defensive_structure, fortress) = (2 − 1/2 + 2 − 1/1) / (2 × 6) = .208

and then summing these edge distances: .178 + .167 + .200 + .312 + .250 + .208 = 1.315. Wu and Palmer's conceptual similarity. Wu and Palmer (1994) propose conceptual similarity in their paper on the semantic representation of verbs in computer systems and its impact on lexical selection problems in machine translation. Wu and Palmer define conceptual similarity between a pair of concepts c1 and c2 as:

simWu&Palmer(c1, c2) = 2 × N3 / (N1 + N2 + 2 × N3),  (14)

where N1 and N2 are the numbers of nodes on a path from c1, respectively c2, to their least upper bound concept, while N3 is the number of nodes from the latter to the topmost node in the tree. Notice that sim(c, c) = 1 and similarity reduces with an increase of the denominator, that is, with the increase of the path connecting c1 and c2. Due to N3, the measure is also depth-relative, but like Sussna's measure, it is not in accordance with the generalization property. As an example, the similarity between "cathedral" and "fortress" in Figure 1 is:

simWu&Palmer(cathedral, fortress) = 2 × 4 / (5 + 3 + 2 × 4) = .500
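Wu and Palmer's measure (14) can be sketched as follows for a tree-shaped taxonomy given as a child-to-parent map. Counting conventions (nodes versus edges) vary in the literature; this sketch counts edges from each concept to the least upper bound and nodes from there to the root, so absolute values may differ slightly from the worked example above. All names are our own:

```python
# Sketch of Wu and Palmer's conceptual similarity (14) on a tree taxonomy.

def path_to_root(c, parent):
    """Node sequence from c up to the root of the tree."""
    path = [c]
    while c in parent:
        c = parent[c]
        path.append(c)
    return path

def sim_wu_palmer(c1, c2, parent):
    p1, p2 = path_to_root(c1, parent), path_to_root(c2, parent)
    lub = next(c for c in p1 if c in set(p2))  # least upper bound
    n1 = p1.index(lub)                         # edges from c1 up to lub
    n2 = p2.index(lub)                         # edges from c2 up to lub
    n3 = len(path_to_root(lub, parent))        # nodes from lub to the root
    return 2 * n3 / (n1 + n2 + 2 * n3)
```

With this convention sim(c, c) = 1 holds, and similarity decreases as the connecting path grows, as noted above.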
Leacock and Chodorow’s normalized path length. Leacock and Chodorow (1998) propose
normalized path length for measuring semantic similarity as the shortest path using IS-A hierarchies for nouns in WordNet (Miller, 1990). The proposed measure determines the semantic similarity between two synsets (concepts) by finding the shortest path and scaling by the depth of the taxonomy:

SIMLeacock&Chodorow(c1, c2) = −log( Np(c1, c2) / (2D) ),  (15)

where c1 and c2 represent the two concepts, Np(c1, c2) denotes the shortest path length between the synsets (measured in nodes), and D is the maximum depth of the taxonomy. Neither the generalization nor the depth property holds for this measure. The computation of the similarity between "cathedral" and "fortress" in Figure 1 is:

SIMLeacock&Chodorow(cathedral, fortress) = −log( 7 / (2 × 10) ) = .456
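Both Rada's distance (9) and Leacock and Chodorow's measure (15) reduce to a shortest path computation, sketched here as a breadth-first search over an undirected IS-A graph. The adjacency-list representation and function names are our own; the example above uses base-10 logarithms:

```python
# Sketch of shortest-path-based measures: Rada's edge count (9) via BFS,
# and Leacock and Chodorow's log-scaled variant (15) on top of it.
from collections import deque
import math

def shortest_path_edges(graph, c1, c2):
    """Equation (9): minimal number of edges between c1 and c2."""
    seen, frontier = {c1}, deque([(c1, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == c2:
            return dist
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return None  # not connected

def sim_leacock_chodorow(graph, c1, c2, max_depth):
    """Equation (15); the chapter measures Np in nodes (= edges + 1)."""
    np_nodes = shortest_path_edges(graph, c1, c2) + 1
    return -math.log10(np_nodes / (2 * max_depth))
```

For the Figure 1 example, Np = 7 nodes and D = 10 give −log10(7/20) ≈ .456, matching the value above.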
Shared nodes. All the approaches that have been presented until now only take into account one path in measuring similarity. Consequently, when two concepts are connected by multiple paths, only one path, typically the shortest, contributes to their similarity. Especially in cases where semantic relations are used in addition to inclusion, their influence appears to be significant so that the sharing of attributes or properties can contribute to similarity. One obvious approach is to consider all possible connections (no matter what the type of relation) between concepts c1 and c2. Concepts, for instance, may be connected directly through inclusion and also through an attribute dimension, as in cat[CHR: black] and poodle[CHR: black], or we might have multiple paths due to multiple inheritance. A simplified approach that still reflects the existence of multiple paths is shown in the following. In the shared nodes similarity measure introduced in Knappe, Bulskov, and Andreasen (2006), concepts are compared on the basis of the more general concepts they share rather than on all the
paths connecting them. What is shared is simply the intersection between the sets of upwards reachable nodes in the network for the concepts being compared. The intuition is that the more nodes they share, the more similar they are. The "decomposition" contribution τ(c1) to the set of upwards reachable nodes is:

τ(c1) = {c1} ∪ {c2 | c1 ≤ c2[…, r:c3] ∨ c1 ≤ c3[…, r:c2], c2 ∈ L, c3 ∈ L, r ∈ R}  (16)

while the "taxonomic" contribution can be derived from ω(C), the transitive closure of a set of concepts C with respect to ≤:

ω(C) = {c2 | c2 ∈ C ∨ (c1 ∈ C ∧ c1 ISA c2)}  (17)

with α(c1) = ω(τ(c1)) as the set of nodes (upwards) reachable from c1 in an ontology. α(c1) ∩ α(c2) is the set of reachable nodes shared by c1 and c2 and is thus an indication of how similar c1 and c2 are. The suggested parameterized similarity is:

simSharedNodes(c1, c2) = δ × |α(c1) ∩ α(c2)| / |α(c1)| + (1 − δ) × |α(c1) ∩ α(c2)| / |α(c2)|  (18)

where δ ∈ [0,1] determines a bias towards more or less influence from each of the compared nodes. If δ = 0, the location (depth) of c1 is ignored, and if δ = 1, the location of c2 is ignored. For the concepts "cathedral" and "fortress" in Figure 1 with δ = .8, the similarity is:

|α(cathedral)| = |{cathedral, …, structure, artifact, physical_entity, entity}| = 8
|α(fortress)| = |{fortress, …, structure, artifact, physical_entity, entity}| = 6
|α(cathedral) ∩ α(fortress)| = |{structure, artifact, physical_entity, entity}| = 4

simSharedNodes(cathedral, fortress) = .8 × 4/8 + (1 − .8) × 4/6 = .533

Weighted shared nodes (WSN). Intuition tells us that when deriving similarity using the notion of shared nodes, not all nodes are equally important. If we want two concepts to be more similar when they have an immediate subsuming concept (e.g., cat[CHR:black] and cat[CHR:brown]) than when they only share an attribute (e.g., cat[CHR:black] and dog[CHR:black]), we must differentiate and cannot simply define α(c) as a crisp set. The following is a generalization to fuzzy set based similarity (Andreasen, Knappe, & Bulskov, 2005), denoted weighted shared nodes (WSN). First, notice that α(c) can be derived as follows. Let the triple (c1, c2, r) be the edge of type r from concept c1 to concept c2, let E be the set of all edges in the ontology, and let T be the top concept; then α can be expressed:

α(T) = {T}
α(c1) = {c1} ∪ (∪(c1,c2,r)∈E α(c2))  (19)

A simple modification that generalizes α(c) to a fuzzy set is obtained through a function weight(r) ∈ [0,1] that attaches a weight to each relation type r. With this function, we can generalize to:

α(T) = {1/T}
α(c1) = {1/c1} ∪ (∪(c1,c2,r)∈E ∑c′∈α(c2) weight(r) × μα(c2)(c′) / c′)  (20)

α(c) is thus the fuzzy set of nodes reachable from concept c, modified by the relation weights weight(r). The measure of semantic similarity between two concepts is then defined to be proportional to the number of nodes shared by the concepts, but where nodes are weighted according to the semantic relation by which they are reached. For instance, from the ontology in Figure 4, assuming relation weights weight(IS-A) = 1, weight(CHR) = 0.5, and weight(CBY) = 0.5, see Box 1. For concept similarity, the parameterized expression (18) can still be used, applying the minimum for fuzzy intersection and the sum for fuzzy cardinality. Thus, we have, for instance:
Figure 4. Dotted edges show attribution with semantic relations in the above ontology
Box 1.

α(dog[CHR:black]) = {1/dog[CHR:black] + 1/dog + 1/animal + 0.5/black + 0.5/color + 1/anything}
α(cat[CHR:black]) = {1/cat[CHR:black] + 1/cat + 1/animal + 0.5/black + 0.5/color + 1/anything}
|α(dog[CHR:black])| = |α(cat[CHR:black])| = 5

α(dog[CHR:black]) ∩ α(cat[CHR:black]) = {0.5/black + 0.5/color + 1/animal + 1/anything}
|α(dog[CHR:black]) ∩ α(cat[CHR:black])| = 3.0

Thus, for the concepts "dog[CHR:black]" and "cat[CHR:black]" in Figure 4 with δ = .8, the similarity is:

simWSN(dog[CHR:black], cat[CHR:black]) = .8 × 3/5 + (1 − .8) × 3/5 = .6
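The Box 1 computation can be sketched with the ontology given as a set of typed edges and memberships propagated multiplicatively along the best upward path. The representation and names are our own, but with the relation weights above the sketch reproduces the Box 1 figures:

```python
# Sketch of weighted shared nodes (18)-(20): alpha(c) is the fuzzy set of
# upwards reachable nodes, each weighted by the relations on its best path.

def alpha(c, edges, weight):
    reached, frontier = {c: 1.0}, [c]
    while frontier:
        c1 = frontier.pop()
        for (src, c2, r) in edges:
            m = reached[c1] * weight[r]
            if src == c1 and reached.get(c2, 0.0) < m:
                reached[c2] = m
                frontier.append(c2)
    return reached

def sim_wsn(c1, c2, edges, weight, delta=0.8):
    """Equation (18) with fuzzy intersection (min) and sigma count (sum)."""
    a1, a2 = alpha(c1, edges, weight), alpha(c2, edges, weight)
    shared = sum(min(a1[c], a2[c]) for c in a1.keys() & a2.keys())
    return delta * shared / sum(a1.values()) + (1 - delta) * shared / sum(a2.values())
```

On the Figure 4 fragment with weight(IS-A) = 1 and weight(CHR) = 0.5, both cardinalities come out as 5, the shared part as 3.0, and the similarity as .6, as in Box 1.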
The weighting of edges may immediately be difficult to determine, but most importantly it permits differentiating between the key ordering relation, IS-A, and the other semantic relations when calculating similarity. The weighted shared nodes measure complies with all the defined properties discussed.
Corpus-Based Measures The similarity measures presented so far use knowledge solely captured by the ontology (or taxonomy) to compute a measure of similarity. In this section, we present three approaches that incorporate corpus analysis as an additional and
qualitatively different knowledge source. The knowledge revealed by the corpus analysis is used to augment the information already present in the ontologies or taxonomies. Resnik's information content. Resnik (1999) argues that one problem with edge-counting approaches is that they typically rely on edges representing uniform distances. One indication of similarity between two concepts is the extent to which they share information, which for a taxonomy can be determined by the relative position of their least upper bound. This indication seems to be captured by edge-counting approaches, for instance, the shortest path length approach presented above. However, edge-counting approaches in general do not comply with the depth property, since edges typically reflect uniform distances and the position in the hierarchy of the least upper bound is not taken into account. Resnik combines the taxonomic structure with empirical probability estimates in his measure of information content, applying knowledge from a corpus about the probabilities (based on frequencies) of senses to express non-uniform distances. Let C denote the set of concepts in a taxonomy that permits multiple inheritance, and associate with each concept c ∈ C the probability p(c) of encountering an instance of concept c. Following the standard definition from Shannon and Weaver's (1949) information theory, the information content of c is then −log p(c). For a pair of concepts c1 and c2, their similarity can be defined as:

simresnik(c1, c2) = max c∈S(c1,c2) [−log p(c)],  (21)

where S(c1, c2) is the set of least upper bounds in the taxonomy of c1 and c2. p(c) is monotonically nondecreasing as one moves up in the taxonomy, and if c1 IS-A c2, then p(c1) ≤ p(c2). Resnik's measure is depth-relative but ignores the actual path length (it considers only the least upper bound), which makes it only weakly dependent on the taxonomic structure.
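Resnik's measure (21) can be sketched as follows, with corpus-derived probabilities supplied directly and multiple inheritance supported through a parents map (our own representation). Since p(c) is nondecreasing upwards, maximizing −log p(c) over all common ancestors selects the same value as restricting to the least upper bounds:

```python
# Sketch of Resnik's information content measure (21); p maps each concept
# to its corpus-estimated probability.
import math

def ancestors(c, parents):
    """c together with every concept reachable upwards (multiple inheritance)."""
    out, stack = {c}, [c]
    while stack:
        for a in parents.get(stack.pop(), ()):
            if a not in out:
                out.add(a)
                stack.append(a)
    return out

def sim_resnik(c1, c2, parents, p):
    common = ancestors(c1, parents) & ancestors(c2, parents)
    return max(-math.log10(p[c]) for c in common)
```

With probabilities chosen so that −log p(person) = 2.005, both the "doctor1"/"nurse2" and the "adult"/"nurse2" pairs of the Figure 5 example get similarity 2.005, illustrating the measure's insensitivity to path length.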
Consider the ontology given in Figure 5 and the concepts "doctor1" and "nurse2." Their similarity is equal to the information content of their least upper bound, "person," which is 2.005. Observe that the similarity between, for instance, "adult" and "nurse2" is also 2.005, since their least upper bound is also "person." Jiang and Conrath's combined approach. Another approach based on Resnik's idea is suggested by Jiang and Conrath (1997). This approach synthesizes edge-counting and information content into a combined model by adding the latter as a corrective factor. The general formula for the edge weight between a child concept cc and a parent concept cp, taking into account factors such as local density in the taxonomy, node depth, and link type, is:

wt(cc, cp) = (β + (1 − β) × Ē/E(cp)) × ((d(cp) + 1)/d(cp))^α × LS(cc, cp) × T(cc, cp)  (22)
where d(cp) is the depth of the concept cp in the taxonomy, E(cp) is the number of children of cp (the local density), Ē is the average density in the entire taxonomy, LS(cc, cp) is the strength of the edge between cc and cp, and T(cc, cp) is the edge relation/type factor. The parameters α (α ≥ 0) and β (0 ≤ β ≤ 1) control the influence of concept depth and density, respectively. The semantic distance between two nodes is then defined as the summation of edge weights along the shortest path between them:

distJiang&Conrath(c1, c2) = ∑c∈{path(c1,c2) − LSuper(c1,c2)} wt(c, parent(c))  (23)
where path(c1, c2) is the set of all nodes along the shortest path between concepts c1 and c2, parent(c) is the parent node of c, and LSuper(c1, c2) is the lowest superordinate (least upper bound) on the path between c1 and c2. The computation of this
Figure 5. A fragment from WordNet (Resnik, 1999)
measure is similar to how Sussna's measure is computed, as a summation of edge weights along the shortest path. If we consider the ontology in Figure 5, use α = .5 and β = .3, and assume that the depth of "person" is 3 and that the average density Ē is 4, the edge weights from "doctor1" to "nurse2" are:

distJiang&Conrath(doctor1, health_professional) = (.3 + (1 − .3) × 4/2) × ((6 + 1)/6)^.5 = 1.836
distJiang&Conrath(health_professional, professional) = (.3 + (1 − .3) × 4/2) × ((5 + 1)/5)^.5 = 1.862
distJiang&Conrath(professional, adult) = (.3 + (1 − .3) × 4/1) × ((4 + 1)/4)^.5 = 3.466
distJiang&Conrath(adult, person) = (.3 + (1 − .3) × 4/5) × ((3 + 1)/3)^.5 = 0.993
distJiang&Conrath(guardian, person) = (.3 + (1 − .3) × 4/5) × ((3 + 1)/3)^.5 = 0.993
distJiang&Conrath(nurse2, guardian) = (.3 + (1 − .3) × 4/1) × ((5 + 1)/5)^.5 = 3.460

and the distance = 1.836 + 1.862 + 3.466 + 0.993 + 0.993 + 3.460 = 12.610. But since this measure uses the structure, the distance between "adult" and "nurse2" = 0.993 + 0.993 + 3.460 = 5.446 is less than the distance for "doctor1" and "nurse2."

Lin's universal measure. Lin (1997, 1998) defines a measure of similarity claimed to be both universally applicable to arbitrary objects and theoretically justified. Upon recognizing that known measures generally are tied to a particular application domain or resource, he argues for the need for a measure that does not presume a specific kind of knowledge representation and that is derived from a set of assumptions rather than directly from a formula. His measure of similarity between two concepts in a taxonomy is defined as:
simLin(c1, c2) = 2 × log p(LUB(c1, c2)) / (log p(c1) + log p(c2)),  (24)
where LUB(c1, c2) is the least upper bound of c1 and c2 and where p(x) can be estimated based on statistics from a sense tagged corpus (e.g., Resnik’s information content). Compared to Resnik, Lin’s measure is also influenced by the actual connecting path.
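On pre-computed probabilities, Lin's measure (24) is a one-liner; the function name and the assumption that the probabilities and the least upper bound are given directly are ours:

```python
# Sketch of Lin's universal measure (24); p(c1), p(c2), and p(LUB(c1, c2))
# are corpus-estimated probabilities supplied by the caller.
import math

def sim_lin(p_c1, p_c2, p_lub):
    """Equation (24) on pre-computed probabilities (base-10 logs)."""
    return 2 * math.log10(p_lub) / (math.log10(p_c1) + math.log10(p_c2))
```

With the Figure 5 values, sim_lin(.0018, .0001, .2491) reproduces the .179 of the worked example below (up to rounding).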
Consider again the ontology given in Figure 5 and the concepts "doctor1" and "nurse2." Their similarity is equal to:

simLin(doctor1, nurse2) = 2 × log .2491 / (log .0018 + log .0001) = .179,
while the similarity between “adult” and “nurse2” is:
simLin(adult, nurse2) = 2 × log .2491 / (log .0208 + log .0001) = .215,
and reflect that “adult” is closer (more similar) to “nurse2” than “doctor1.” A generic instance-based approach. In the previous section, we defined instantiated ontology as the restriction of a general ontology to a given set of concepts. If this set of concepts is exactly the same as the ones appearing in our corpus (= all concepts in the set of targeted documents, when speaking about information retrieval), then the instantiated ontology can certainly be claimed to be corpus-based. Thus, any purely taxonomic structure type similarity measure modeled on top of this instantiated ontology becomes also corpus-based. The presented measures in the above sections vary by the properties they possess, the sources they are drawn from, and, of course, by their elegance. But, first and foremost, deciding the most appropriate measure for a given domain and a given application is an issue for the knowledge engineer. It appears that fuzzification sometimes almost suggests itself as a convenient formalism when speaking about similarity, a tendency which is even stronger when turning to description correspondence as derived from concept similarity since this is mainly a matter of aggregation. This is the topic of the next section.
Comparing Measures The measures presented above include similarity as well as distance measures, normalized as well as un-normalized. As indicated in the introduction to this section, when a common scale is given, normalization is straightforward, and once measures are normalized, switching between similarity and distance is a simple matter of inversion (for instance, when normalized to [0,1], we have similarity = 1 − distance). However, it should be noted that normalization does not necessarily lead to a comparable scale; being similar to the degree 0.8 under one measure is not necessarily the same as being similar to the degree 0.8 under another. The only, and of course most interesting, way to compare measures is therefore the order in which they rank a set of terms as similar to a given term, that is, with reference to (6), the descending order of the fuzzy set similar(x) for a given term x. It appears that fuzzification sometimes almost suggests itself as a convenient formalism when speaking about similarity, a tendency which is even stronger when turning to description correspondence as derived from concept similarity, since this is mainly a matter of aggregation. This is the topic of the next section.
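The ranking-based comparison described above can be sketched as follows; the function names and the encoding of measures as plain functions are our own:

```python
# Sketch of comparing measures by the orderings they induce: two measures
# agree on a term x if they rank all other concepts identically.

def ranking(x, concepts, sim):
    """Descending order of the fuzzy set similar(x), cf. equation (6)."""
    return sorted((c for c in concepts if c != x),
                  key=lambda c: sim(x, c), reverse=True)

def same_order(x, concepts, sim_a, sim_b):
    """True if the two measures induce the same ranking around x."""
    return ranking(x, concepts, sim_a) == ranking(x, concepts, sim_b)
```

A measure rescaled by a constant induces the same ranking as the original, whereas a measure with a genuinely different ordering does not, which is exactly the distinction drawn above between comparable scales and comparable orders.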
Expansion In the present approach, ontology-based querying relies on the comparison of a description of the query with descriptions of texts from the database. Queries and texts are mapped onto descriptors organized in structures called "descriptions." The processing of queries is facilitated by the matching of descriptions. The key issue when dealing with evaluation approaches that take uncertainty into account is a relaxed interpretation of the request, considering not only explicitly specified but also similar terms/concepts/descriptions, in order to retrieve not only exactly matching but also similar objects/documents.
Many factors influence what is in the end considered similar. When ontologies are the origin of similarity measures, the interpretation of the ontology is also an issue. For instance, in some of the measures described in the previous section, the exact weighting of edges in the ontology has great influence on what is considered similar. Some approaches impose even more refined interpretations of ontologies. In this connection, it is especially worth mentioning the pattern matching approach suggested by Loiseau, Boughanem, and Prade (2005), which deals with possibilistic ontologies that allow distinguishing necessity and possibility in the determination of similarity. Another aspect is the exact part of the ontology taken into consideration in determining what is similar. We have briefly mentioned the notion of the instantiated ontology above as a subject for ontology-based evaluation. This is covered in more detail in Andreasen, Bulskov, and Knappe (2005). An approach where similarity is based on degrees of inclusion among weighted subtrees is described in Baziz, Boughanem, Loiseau, and Prade (2007). In this approach, a query subtree is compared with a set of target document subtrees. Below, we will restrict ourselves to what is probably the most common approach leading from similar terms to similar objects, namely query expansion. When dealing with generative ontologies allowing compound concepts, descriptions are not unique and may vary in level of detail, ability to combine, and structure. For instance, for the phrase "The noisy black dog is chasing the cat," the following, listed according to increasing accuracy, are possible descriptions: {noise, black, dog, cat} {{noise, black, dog}, {cat}} {{noise, dog[CHR:black]}, {cat}} {noise[CBY:dog[CHR:black]], cat}. Thus, not surprisingly, with the same aspects represented, we obtain more accurate descriptions by combining these into compound descriptors.
To index the information base, we have to decide on the underlying description structure. A
straightforward approach is to define descriptors as single concepts rather than sets (as in the first and the last description example above): D = {d1, … , dn},
(25)
where d1, … , dn are single concepts. This description structure applies for queries as well, and a query thus has the form Q = {q1, … , qn}. The general idea is to capture similarity reflecting the domain knowledge from the ontology in query evaluation, and for this purpose, to use the derived similarity measures rather than to reason on the ontology. A simple approach is to employ a function

similar(x) = ∑y sim(x, y)/y

denoting a fuzzy set of concepts similar to x. The function similar can be applied either to the descriptors of D or, preferably, to the descriptors of Q (since there are many Ds and only one Q). Now, the first objective is to introduce appropriate principles for similarity evaluation and for aggregation. Before continuing the discussion of general evaluation principles, this issue is examined in the next subsection.
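As a concrete illustration, the similar function can be sketched as a mapping from a concept to a fuzzy set of neighbours, here represented as a Python dict of membership degrees. The similarity table and concept names below are invented for illustration; they are not the chapter's WSN measure.

```python
def similar(x, sim, universe, threshold=0.0):
    """Fuzzy set of concepts similar to x: {y: sim(x, y)},
    keeping only degrees at or above the threshold."""
    return {y: sim(x, y) for y in universe if sim(x, y) >= threshold}

# Toy symmetric similarity table (hypothetical values).
_SIM = {
    ("dog", "dog"): 1.0,
    ("dog", "cat"): 0.45,
    ("dog", "animal"): 0.52,
}

def toy_sim(x, y):
    """Look up the pair in either order; unrelated concepts get 0."""
    return _SIM.get((x, y), _SIM.get((y, x), 0.0))

universe = ["dog", "cat", "animal"]
print(similar("dog", toy_sim, universe, threshold=0.4))
# {'dog': 1.0, 'cat': 0.45, 'animal': 0.52}
```

Raising the threshold prunes the expansion, mirroring the 0.4 cutoff used in the Box 2 example later in the text.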
Aggregation

A query Q is represented by a description {q1, … , qn} and we assume that the value qi(D) ∈ [0,1] is the degree to which the text object with description D satisfies the descriptor qi. The overall valuation ValQ(D) of object D with respect to Q is obtained as an aggregation of {q1(D), … , qn(D)}. An obvious choice here is to adopt the ordered weighted averaging (OWA) introduced in Yager (1988). OWA aggregates n values a1, …, an by means of an ordering vector W = [w1, …, wn], applying w1 to the highest value among a1, …, an, w2 to the next highest value, and so forth. The weights are restricted to wj ∈ [0,1] and ∑j=1..n wj = 1, and the aggregation of the values a1, …, an is:

FW(a1, …, an) = ∑j=1..n wj bj,
(26)
where bj is the j'th largest among a1, …, an, and b1, …, bn is thus the descending ordering of the values a1, …, an. By modifying W, we can obtain different aggregations; for instance, F(1,0,0,…) corresponds to the maximum, F(1/n,1/n,…) becomes the average, and F(0,0,…,1) the minimum. The OWA aggregation principle is very flexible and may be further refined by including importance weighting in the form of an n-vector M = <m1, …, mn>, mj ∈ [0,1], where, for instance, M = <1, 0.8, 0.8, …> gives more importance to the first argument, while there is no distinction with M = <1, 1, …>. With reference to the order weighting W = [w1, …, wn], a simple approach to applying importance to the aggregate of the values a1, …, an is to multiply these by the importance weights m1, …, mn before aggregation:

FM,W(a1, …, an) = FW(m1·a1, …, mn·an). (27)
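A minimal sketch of OWA aggregation with importance weighting as in Formulas (26) and (27); the input degrees below are arbitrary illustrative values.

```python
def owa(values, weights):
    """Ordered weighted averaging, Formula (26): apply the order weights
    to the values sorted in descending order."""
    assert abs(sum(weights) - 1.0) < 1e-9  # weights must sum to 1
    b = sorted(values, reverse=True)
    return sum(w * x for w, x in zip(weights, b))

def owa_with_importance(values, importances, weights):
    """Formula (27): scale each value by its importance before OWA."""
    return owa([m * a for m, a in zip(importances, values)], weights)

vals = [0.2, 0.9, 0.5]
print(owa(vals, [1, 0, 0]))                 # maximum -> 0.9
print(round(owa(vals, [1/3, 1/3, 1/3]), 3)) # average -> 0.533
print(owa(vals, [0, 0, 1]))                 # minimum -> 0.2
```

The three weight vectors reproduce the special cases named in the text: maximum, average, and minimum.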
OWA aggregation conveniently accommodates “linguistic quantifiers,” modeled by an increasing function K: [0,1] → [0,1] with K(0) = 0 and K(1) = 1, such that the order weights are prescribed as:

wj = K(j/n) − K((j−1)/n),
(28)
A quantifier, EXISTS, can, for instance, be modeled by K(x) = 1 for x > 0, FOR-ALL by K(x) = 0 for x < 1, and SOME by K(x) = x, while one possibility (of many) to introduce MOST is by a power of SOME, for example, K(x) = x³. Thus, we assume the general query expression:

Q = <q1, …, qn : M : K>,
(29)
where q1 , …, qn are the query descriptors, M specifies their importance weighting, and K specifies a linguistic quantifier, thereby indicating an order weighting. So with qi(D) as the degrees to which D satisfies the descriptor qi, the corresponding generalized valuation function is (compare with Formula (27)):
ValQ (D) = FM,w(K)(q1(D) , …, qn(D)),
(30)
where w is a function that maps a quantifier K to an n-vector w(K) ∈ [0,1]ⁿ of order weights (for instance, w(ALL) = (0, …, 0, 1)). A hierarchical approach to aggregation, generalizing OWA, is introduced in Yager (2000). Basically, hierarchical aggregation extends OWA to capture nested expressions. Query attributes may be grouped for individual aggregation, and the language is orthogonal in the sense that aggregated values may appear as arguments to aggregations. Thus, queries may be viewed as hierarchies. To illustrate, consider the following nested query expression:

< q1(D),
  < q2(D), q3(D),
    < q4(D), q5(D), q6(D) : M3 : K3 >
  : M2 : K2 >
: M1 : K1 >
(31)
Again, qi(D) ∈ [0,1] measures the degree to which descriptor qi conforms to the text object with description D, while Mj and Kj are the importance and quantifier applied in the j'th aggregate. In the expression above, M1 : K1 parameterizes aggregation at the outermost level, over the two components q1(D) and the middle aggregate. M2 : K2 parameterizes aggregation of the three components q2(D), q3(D), and the innermost aggregate, while M3 : K3 parameterizes aggregation of the three components q4(D), q5(D), and q6(D).
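The quantifier-derived order weights of Formula (28) and the hierarchical evaluation of nested expressions like (31) can be sketched together: a node is evaluated bottom-up, with its order weights derived from its quantifier. The quantifier definitions follow the text; the degrees are illustrative values, not taken from the chapter.

```python
def quantifier_weights(K, n):
    """Formula (28): w_j = K(j/n) - K((j-1)/n), for j = 1..n."""
    return [K(j / n) - K((j - 1) / n) for j in range(1, n + 1)]

EXIST = lambda x: 1.0 if x > 0 else 0.0     # order weights -> maximum
SOME = lambda x: x                          # order weights -> average
FOR_ALL = lambda x: 0.0 if x < 1 else 1.0   # order weights -> minimum

def owa(values, weights):
    b = sorted(values, reverse=True)
    return sum(w * x for w, x in zip(weights, b))

def evaluate(node):
    """Evaluate a nested aggregate like expression (31).
    A node is either a number (a degree q_i(D)) or a tuple
    (children, importances M, quantifier K)."""
    if isinstance(node, (int, float)):
        return node
    children, M, K = node
    scaled = [m * evaluate(c) for m, c in zip(M, children)]
    return owa(scaled, quantifier_weights(K, len(scaled)))

# Innermost EXIST (max), middle SOME (average), outermost FOR_ALL (min).
inner = ([0.9, 0.6, 0.3], [1, 1, 1], EXIST)    # -> 0.9
middle = ([0.5, 0.7, inner], [1, 1, 1], SOME)  # -> (0.5+0.7+0.9)/3 = 0.7
outer = ([0.8, middle], [1, 1], FOR_ALL)       # -> min(0.8, 0.7) = 0.7
print(round(evaluate(outer), 4))  # 0.7
```

Each aggregate can thus mix its own quantifier and importance vector, exactly the orthogonality that the hierarchical language provides.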
Query Evaluation Approaches

On top of the OWA aggregation principle and the extended hierarchical version of OWA, we can distinguish two major cases of description structures: simple un-nested sets and nested sets, the former perfectly handled by OWA aggregation and the latter by hierarchical aggregation.
Aggregation on Un-Nested Descriptions

The simple set-of-descriptors structure for descriptions in Formula (25) admits a straightforward valuation approach for a similarity query:

Qsim = <q1, …, qn : (1,1,…) : SOME>. (32)

The aggregation here is simple in that importance is not distinguished, and SOME, corresponding to the simple average, is used as quantifier. An example of a valuation is:

ValQsim(D) = F(1,1,…),w(SOME)(q1(D), …, qn(D)), (33)

with individual query-descriptor valuation functions:

qi(D) = maximumj {μ : μ/dj ∈ similar(qi), dj ∈ D}. (34)

To illustrate, assume a weighted shared node similarity and consider again Figure 4. In continuation of the WSN example from the previous section, assume ρ = 0.8 and consider the query Q = <dog[CHR:black], noise>. With a threshold for similar of 0.4, we have what is shown in Box 2.
With the example valuation function (33), thus giving all query terms equal importance and taking the simple arithmetic average as aggregation, the following are examples of valuations of the query Q = <dog[CHR:black], noise>:

ValQsim({noise[CBY:dog]}) = 0.90
ValQsim({noise[CBY:dog[CHR:black]]}) = 0.87
ValQsim({dog, noise}) = 0.84
ValQsim({black, dog, noise}) = 0.72.

That ValQsim({noise[CBY:dog]}) = 0.90 is derived according to Formula (34) as max(0.42, 0.90) = 0.9, while ValQsim({dog, noise}) = 0.84 is the average (according to (33)) of the degrees to which dog[CHR:black] and noise, respectively, correspond to the document represented by {dog, noise}:

ValQsim({dog, noise}) = (max(0.68, 0.42) + max(0.47, 1))/2 = 0.84.
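The {dog, noise} computation above can be replayed in a few lines; the similar sets below copy only the Box 2 entries needed for this document, and formula numbers refer to (33) and (34) in the text.

```python
# Only the Box 2 entries relevant to the document {dog, noise}.
similar_sets = {
    "dog[CHR:black]": {"dog": 0.68, "noise[CBY:dog]": 0.42},
    "noise": {"noise": 1.0, "dog": 0.47},
}

def q_valuation(qi, D):
    """Formula (34): best degree of any document descriptor in similar(qi)."""
    degrees = [deg for d, deg in similar_sets[qi].items() if d in D]
    return max(degrees, default=0.0)

def val_some(query, D):
    """Formula (33) with SOME: simple average of the descriptor valuations."""
    return sum(q_valuation(qi, D) for qi in query) / len(query)

D = {"dog", "noise"}
result = val_some(["dog[CHR:black]", "noise"], D)
print(round(result, 2))  # 0.84, i.e., (max(0.68, 0.42) + max(0.47, 1))/2
```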
Nested Aggregation on Un-Nested Descriptions

An alternative is to expand the query Q to a nested expression:
Box 2. similar(dog [CHR:black]) = 1.00/dog [CHR:black]+0.7/dog [CHR:brown]+ 0.68/dog+0.6/cat [CHR:black]+ 0.58/noise [CBY:dog [CHR:black]]+0.52/animal+ 0.45/cat +0.45/black+0.42/noise [CBY:dog] similar(noise) = 1.00/noise+0.90/noise [CBY:dog]+ 0.87/noise [CBY:dog [CHR:black]]+ 0.60/anything+0.50/animal+0.50/color+ 0.47/cat +0.47/black+0.47/dog+0.47/brown+ 0.44/cat [CHR:black]+0.44/dog [CHR:black]+ 0.44/dog [CHR:brown].
ValQsim (D) = << q11(D) , …, q1k1 (D) : M1 : K1 >, < q21(D) , …, q2k2 (D) : M2 : K2 >, … , < qn1(D) , …, qnkn (D) : Mn : Kn >, : M0 : K0 >,
(35)
where for each qi we set <μi1/qi1, …, μiki/qiki> = similar(qi) and use as individual valuation:

qij(D) = μij when qij ∈ {d1, …, dm}, and 0 otherwise.
(36)
In the event that we use equal importance and the following combination of quantifiers:

ValQsim(D) = << q11(D), …, q1k1(D) : (1,1,…) : EXIST >, < q21(D), …, q2k2(D) : (1,1,…) : EXIST >, …, < qn1(D), …, qnkn(D) : (1,1,…) : EXIST > : (1,1,…) : SOME >, (37)

we get a valuation identical to that of Formula (33). Nested expressions, however, facilitate importance adjustment in connection with query expansion, according to the kinds of relations contributing to the expansion. Assigning 1.0 importance to IS-A and 0.5 importance to CHR would, for the query Q = <dog[CHR:black], noise>, lead to the following expansion (compare with Figure 4):

ValQsim(D) = << qdog[CHR:black](D), qdog(D), qblack(D), … : (1,1,0.5,…) : EXIST >, < qnoise(D), … : (1,1,…) : EXIST > : (1,1,…) : SOME >. (38)

Nested expressions are thus a way of distinguishing the influence of different kinds of relations on similarity.
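A sketch of the relation-weighted expansion behind (38): each expanded term carries the relation that produced it, and that relation's importance scales the term's degree before the EXIST (max) aggregation. The relation names, importance values, and expansion degrees are assumptions for illustration.

```python
# Hypothetical importance per expansion relation, as in the 1.0/0.5 example.
RELATION_IMPORTANCE = {"orig": 1.0, "isa": 1.0, "chr": 0.5}

# Expansion of "dog[CHR:black]": (term, degree, producing relation).
expansion = [
    ("dog[CHR:black]", 1.0, "orig"),
    ("dog", 0.68, "isa"),
    ("black", 0.45, "chr"),
]

def exist_valuation(expansion, D):
    """EXIST over the expansion: max of the importance-scaled degrees
    of those expanded terms that occur in the document descriptors D."""
    degrees = [RELATION_IMPORTANCE[rel] * deg
               for term, deg, rel in expansion if term in D]
    return max(degrees, default=0.0)

print(exist_valuation(expansion, {"black", "cat"}))  # 0.5 * 0.45 = 0.225
```

A match obtained only through the CHR relation thus counts half as much as one obtained through IS-A, which is exactly the discrimination the nested form makes possible.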
Aggregation on Nested Descriptions In some cases, when text is processed by partial analysis as indicated earlier, an intrinsic structure appears as the most obvious choice for the description. The parser used in the project reported
on here is a two-phase parser, grouping the words of a sentence into groups corresponding to noun phrases in the first phase, and deriving compound descriptors from the words in each noun phrase individually in the second phase. Thus, we have as an intrinsic structure from the first phase a set of sets (or lists) of words. If we could always extract a unique compound concept as descriptor from an inner set, the resulting intrinsic structure from the second phase would be the single set as assumed above. However, in many cases this is not possible, and we would therefore lose information by flattening to a single set. This suggests that descriptions should be sets of sets of descriptors, such that the query structure becomes:

Q = <Q1, …, Qn> = <<q11, …, q1k1>, …, <qn1, …, qnkn>>, (39)

where the Qi's are sets of descriptors qij, j = 1, …, ki, and a text index is:

D = {D1, …, Dm} = {{d11, …, d1l1}, …, {dm1, …, dmlm}},
(40)
where the Di's are sets of descriptors dij, j = 1, …, li. This, however, demands a modified valuation, and since, in this case, the initial query expression is nested, a valuation over a nested aggregation also becomes the obvious choice. Note first that the grouping of descriptors in descriptions has the obvious interpretation of a closer binding of descriptors within a group compared to across different groups. So we cannot individually evaluate each qij(D), but have to compare at the level of the groups, for instance, by a restrictive quantification over qi1(Dj), …, qiki(Dj) and an EXIST quantification over j to get the best matching Dj for a given Qi. A valuation can thus be:

ValQsim(D) = <<< q11(D1), …, q1k1(D1) : M11 : MOST >,
…, < qn1(D1), …, qnkn(D1) : Mn1 : MOST > : M1 : EXIST >, …, << q11(Dm), …, q1k1(Dm) : M11 : MOST >, …, < qn1(Dm), …, qnkn(Dm) : Mn1 : MOST > : Mm : EXIST > : M0 : SOME >. (41)

The individual query-descriptor valuation functions can be set to:

qij(Dk) = maximuml {μ : μ/dkl ∈ similar(qij)}. (42)

As opposed to the single-set description example above, the qij's in this instance are the original descriptors from the query. While the choices of inner quantifiers are significant for a correct interpretation, the choice of SOME at the outer level for the component description is just one of many possible choices for reflecting the user's preference for the overall aggregation.
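Following the prose interpretation above (MOST within a query group, EXIST over the target groups Dj to find the best match for each query group Qi, SOME across query groups), a valuation in the spirit of (41)-(42) can be sketched as follows. The equality-based similarity and the example descriptors are placeholders, and the EXIST-over-j reading is one of the possible readings of (41).

```python
def owa(values, weights):
    b = sorted(values, reverse=True)
    return sum(w * x for w, x in zip(weights, b))

def most_weights(n):
    """Order weights for MOST, with K(x) = x^3 as suggested in the text."""
    K = lambda x: x ** 3
    return [K(j / n) - K((j - 1) / n) for j in range(1, n + 1)]

def group_match(Qi, Dj, sim):
    """MOST aggregation of the per-descriptor best matches (formula 42)."""
    vals = [max((sim(q, d) for d in Dj), default=0.0) for q in Qi]
    return owa(vals, most_weights(len(vals)))

def val_nested(Q, D, sim):
    """EXIST (max) over the target groups D_j for each query group Q_i,
    then SOME (average) over the query groups."""
    per_group = [max(group_match(Qi, Dj, sim) for Dj in D) for Qi in Q]
    return sum(per_group) / len(per_group)

# Placeholder similarity: 1.0 on identical descriptors, 0 otherwise.
sim = lambda x, y: 1.0 if x == y else 0.0
Q = [["dog", "black"], ["noise"]]
D = [["dog", "black", "cat"], ["noise"]]
print(val_nested(Q, D, sim))  # 1.0: every query group is fully matched by some D_j
```

Replacing the equality-based sim with an ontology-derived measure (and the outer SOME with another quantifier) yields the other variants the text alludes to.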
Conclusion

The emphasis in this chapter has been on a specific application of knowledge structures like taxonomies and ontologies, namely the expansion of queries. Ontologies, as a generalization of taxonomies, were briefly surveyed; similarity was introduced as the key to avoiding reasoning while still reflecting ontological knowledge; and approaches to query expansion and comparison at the level of descriptions were discussed. The general idea is to provide retrieval guided by domain-specific knowledge as comprised by the ontology. As far as ontologies are concerned, we have, in addition to the survey, presented a specific lattice-algebraic representation formalism. We consider this formalism appropriate for the purpose, partly because it easily generalizes to generative ontologies and also captures derived concepts (generativity is in fact inherent), and partly because
so-called instantiated ontologies can be derived by simple means using this formalism. When it comes to similarity measures, there are many alternatives, which is also the case for the properties proposed to characterize these measures. In conclusion, it is our view that taxonomic structure should play a key role as a source for similarity. The fact that Resnik ignores the path length below the least upper bound, for instance, appears to be too coarse-grained. Corpus statistics, on the other hand, should be taken into account whenever available. Regarding the simple generic approach presented in this chapter, taking instantiated ontologies as a source for similarity would most probably in many cases give better results, regardless of the taxonomic measure applied. The connectivity in the taxonomic structure becomes especially interesting in connection with document retrieval when this structure reflects the actual content of the document base. However, it appears that more sophisticated statistics, like Resnik's original idea of applying information theory, have great potential, especially in combination with more thorough taxonomic excerpts. Moreover, it is probably also worth considering alternatives that include regular distributional similarity, as discussed in Mohammad and Hirst (2006) and Weeds and Weir (2005), as well as considering possibilities for combining these approaches with ontology-based approaches. Query expansion is first of all a matter of comparison at the level of descriptions. The query is represented by a single description, which is then compared with descriptions in the information base referring to documents. The most obvious way to realize this is by expanding the query to embed similar concepts, provided, of course, that the evaluation principle can aggregate the degree of match appropriately.
It appears, however, that a more detailed interpretation of the query expression leading to a description reflecting structure (formal language queries) and/or semantics (NL queries) has interesting potential and should be investigated further. The flexible hierarchical aggregation
can be applied to embed the quantification, logical connectives, and importance specification of a formal query language, as well as syntax and semantics for the NL query, such as noun phrase structuring and part-of-speech, opening up a more refined interpretation.
References

Andreasen, T., Bulskov, H., & Knappe, R. (2003). Similarity for conceptual querying. Paper presented at the 18th International Symposium on Computer and Information Sciences, Antalya, Turkey (pp. 268-275).

Andreasen, T., Bulskov, H., & Knappe, R. (2005). On automatic modeling and use of domain-specific ontologies. Paper presented at the 15th International Symposium on Methodologies for Intelligent Systems, Saratoga Springs, New York (pp. 74-82).

Andreasen, T., Jensen, P. A., Nilsson, J. F., Paggio, P., Pedersen, B. S., & Thomsen, H. E. (2002). Ontological extraction of content for text querying. Paper presented at the 6th International Conference on Applications of Natural Language to Information Systems-Revised Papers, Stockholm, Sweden (pp. 123-136).

Andreasen, T., Knappe, R., & Bulskov, H. (2005). Domain-specific similarity and retrieval. Paper presented at the 11th International Fuzzy Systems Association World Congress, Beijing, China (pp. 496-502).

Baeza-Yates, R. A., & Ribeiro-Neto, B. A. (1999). Modern information retrieval. ACM Press/Addison-Wesley.

Bandos, J. A., & Resnick, M. L. (2002). Understanding query formation in the use of Internet search engines. Paper presented at the Human Factors and Ergonomics Society 46th Annual Meeting (pp. 1291-1296).

Baziz, M., Boughanem, M., Loiseau, Y., & Prade, H. (2007). Fuzzy logic and ontology-based information retrieval. In P. P. Wang, D. Ruan, & E. E. Kerre (Eds.), Fuzzy logic: A spectrum of theoretical and practical issues (pp. 193-218). Springer.

Berners-Lee, T. (1998). Semantic Web roadmap. Retrieved February 8, 2008, from http://www.w3.org/DesignIssues/Semantic.html

Borst, W. N. (1997). Construction of engineering ontologies for knowledge sharing and reuse. Enschede, The Netherlands: Centre for Telematics and Information Technology.

Budanitsky, A., & Hirst, G. (2006). Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 1-35.

Bulskov, H., Knappe, R., & Andreasen, T. (2002). On measuring similarity for conceptual querying. Paper presented at the 5th International Conference on Flexible Query Answering Systems, Copenhagen, Denmark (pp. 100-111).

Chaudhri, V. K., Farquhar, A., Fikes, R., Karp, P. D., & Rice, J. P. (1998). OKBC: A programmatic foundation for knowledge base interoperability. Paper presented at the 15th National Conference on Artificial Intelligence, Madison, Wisconsin (pp. 600-607).

Cross, V. (2004). Fuzzy semantic distance measures between ontological concepts. Paper presented at the International Conference of the North American Fuzzy Information Processing Society.

Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. The MIT Press.

Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), 199-220.

Guarino, N., & Giaretta, P. (1995, April). Ontologies and knowledge bases: Towards a terminological clarification. Paper presented at Towards Very Large Knowledge Bases, Amsterdam, The Netherlands (pp. 25-32).

Hirst, G., & St-Onge, D. (1998). Lexical chains as representations of context for the detection and correction of malapropisms. In C. Fellbaum (Ed.), WordNet: An electronic lexical database. MIT Press.

Jiang, J., & Conrath, D. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, Taiwan (pp. 19-33).

Knappe, R., Bulskov, H., & Andreasen, T. (2006). Perspectives on ontology-based querying. International Journal of Intelligent Systems.

Lassila, O., & McGuinness, D. (2001). The role of frame-based representation on the Semantic Web (Tech. Rep. No. KLS-01-02). Stanford, CA: Knowledge Systems Laboratory, Stanford University.

Leacock, C., & Chodorow, M. (1998). Combining local context and WordNet similarity for word sense identification. In C. Fellbaum (Ed.), WordNet: An electronic lexical database. MIT Press.

Lenat, D. B. (1995). CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11), 33-39.

Lenat, D., & Guha, R. V. (1990). Building large knowledge-based systems: Representation and inference in the Cyc Project. Addison-Wesley.

Lin, D. (1997). Using syntactic dependency as local context to resolve word sense ambiguity. Paper presented at the Annual Meeting of the Association for Computational Linguistics (pp. 64-71).

Lin, D. (1998). An information-theoretic definition of similarity. Paper presented at the International Conference on Machine Learning (pp. 296-304).

Loiseau, Y., Boughanem, M., & Prade, H. (2005). Evaluation of term-based queries using possibilistic ontologies. In E. Herrera-Viedma, G. Pasi, & F. Crestani (Eds.), Soft computing for information retrieval on the Web. Springer-Verlag.

Lucarella, D. (1990). Uncertainty in information retrieval: An approach based on fuzzy sets. In Proceedings of the International Conference on Computers and Communications (pp. 809-814).

Miller, G. A. (1990). WordNet: An online lexical database. International Journal of Lexicography, 3(4).

Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39-41.

Mohammad, S., & Hirst, G. (2006). Distributional measures of concept-distance: A task-oriented evaluation. Paper presented at the Conference on Empirical Methods in Natural Language Processing (pp. 35-43).

Nilsson, J. F. (2001). A logico-algebraic framework for ontologies: ONTOLOG. In Proceedings of the First International OntoQuery Workshop, Department of Business Communication and Information Science, Kolding, Denmark (pp. 11-38).

Penev, A., & Wong, R. (2006). Shallow NLP techniques for Internet search. Paper presented at the 29th Australasian Computer Science Conference, Hobart, Tasmania, Australia (pp. 167-176).

Rada, R., Mili, H., Bicknell, E., & Blettner, M. (1989). Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1), 17-30.

Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. Paper presented at the International Joint Conference on Artificial Intelligence (pp. 448-453).

Resnik, P. (1999). Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11, 95-130.

Shannon, C. E., & Weaver, W. (1949). A mathematical theory of communication. Urbana, IL: University of Illinois Press.

Sussna, M. (1993, November). Word sense disambiguation for free-text indexing using a massive semantic network. In Proceedings of the 2nd International Conference on Information and Knowledge Management, New York, NY (pp. 67-74).

Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327-352.

Weeds, J., & Weir, D. (2005). Co-occurrence retrieval: A flexible framework for lexical distributional similarity. Computational Linguistics, 31(4), 439-476.

Wu, Z., & Palmer, M. (1994). Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Morristown, NJ (pp. 133-138).

Yager, R. R. (1988). On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Transactions on Systems, Man, and Cybernetics, 18(1), 183-190.

Yager, R. R. (2000). A hierarchical document retrieval language. Information Retrieval, 3(4), 357-377.
Key Terms

Description: A description for a text unit (document, paragraph, sentence) is the set of index terms related to it.

Ontology: An ontology specifies a conceptualization, that is, a structure of related concepts for a given domain.

Ontology-Based Querying: Evaluation of queries against a database utilizing an ontology describing the domain of the database.

Precision: The proportion of retrieved documents that are relevant, out of all documents retrieved.

Query Expansion: Given a similarity relation over query terms, expansion of a query refers to the addition of similar terms to the query, leading to a relaxed query and an extended answer.

Recall: The proportion of relevant documents that are retrieved, out of all relevant documents available.

Similarity: Similarity refers to the nearness or proximity of concepts.

Taxonomy: A taxonomy is a hierarchical structure displaying parent-child relationships (a classification). A taxonomy extends a vocabulary and is a special case of the more general ontology.
Section III
Implementation, Data Models, Fuzzy Attributes, and Applications
Chapter XIV
How to Achieve Fuzzy Relational Databases Managing Fuzzy Data and Metadata

Mohamed Ali Ben Hassine, Tunis El Manar University, Tunisia
Amel Grissa Touzi, Tunis El Manar University, Tunisia
José Galindo, University of Málaga, Spain
Habib Ounelli, Tunis El Manar University, Tunisia
Abstract

Fuzzy relational databases have been introduced to deal with uncertain or incomplete information, demonstrating the efficiency of processing fuzzy queries. For these reasons, many organizations aim to integrate flexible querying to handle imprecise data or to use fuzzy data mining tools, minimizing the transformation costs. The best solution is to offer a smooth migration towards this technology. This chapter presents a migration approach from relational databases towards fuzzy relational databases. This migration is divided into three strategies. The first one, named “partial migration,” is useful basically to include fuzzy queries in classic databases without changing existing data. It needs some definitions (fuzzy metaknowledge) in order to treat fuzzy queries written in the FSQL language (Fuzzy SQL). The second one, named “total migration,” offers, in addition to flexible querying, a real fuzzy database, with the possibility to store imprecise data. This strategy requires a modification of schemas, data, and eventually programs. The third strategy is a mixture of the previous strategies, generally used as a temporary step, easier and faster than the total migration.
Introduction

New enterprise information systems are requested to be flexible and efficient in order to cope with rapidly changing business environments and advancement of services. An information system that develops its structure and functionality in a continuous, self-organized, adaptive, and interactive way can use many sources of incoming information and can perform intelligent tasks such as language
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
learning, reasoning with uncertainty, decision making, and more. According to Bellman and Zadeh (1970), “much of the decision making in the real world takes place in an environment in which the goals, the constraints, and the consequences of possible actions are not known precisely.” Management often makes decisions based on incomplete, vague, or uncertain information. In our context, the data which are processed by the application system and accumulated over the lifetime of the system may be inconsistent and may not express reality. In fact, one of the features of human reasoning is that it may use imprecise or incomplete information, and in the real world there exists a lot of this kind of fuzzy information. Hence, we can assert that in our everyday life we use several linguistic labels to express abstract concepts such as young, old, cold, hot, cheap, and so forth. Therefore, human-computer interfaces should be able to understand fuzzy information, which is very usual in many human applications. However, the majority of existing information systems deal with crisp data through crisp database systems (Elmasri & Navathe, 2006; Silberschatz, Korth, & Sudarshan, 2006). In this scenario, fuzzy techniques have proven to be successful principles for modeling such imprecise data and also for effective data retrieval. Accordingly, fuzzy databases (FDBs) have been introduced to deal with uncertain or incomplete information in many applications, demonstrating the efficiency of processing fuzzy queries even in classical or regular databases. Besides, FDBs allow storing fuzzy values, and of course, they should allow fuzzy queries using fuzzy or nonfuzzy data (Bosc, 1999; De Caluwe & De Tré, 2007; Galindo, Urrutia, & Piattini, 2006; Petry, 1996). Facing this situation, many organizations aim to integrate flexible querying to handle imprecise data or to use fuzzy data mining tools, minimizing the transformation costs.
A solution for the existing (old) systems is migration, that is, moving the applications and the database to a new platform and technologies. Migration of old systems, or legacy systems, may be an expensive and complex process. It allows legacy systems to be moved to
new environments with the new business requirements, while retaining the functionality and data of the original legacy systems. In this context, the migration towards FDBs, which constitutes a step to introduce imprecise data in an information system, does not only constitute the adoption of a new technology, but also, and especially, the adoption of a new paradigm. Consequently, it constitutes a new culture of development of information systems, and this book is evidence of the current interest and the promising future of this paradigm and its multiple fields. However, with important amounts invested in the development of relational systems, in the enrollment and training of “traditional” programmers, and so forth, enterprises appear reticent to invest important sums in the mastery of a new fuzzy paradigm. The best solution is to offer a smooth migration toward this technology, allowing them to keep the existing data, schemas, and applications, while integrating the different fuzzy concepts to benefit from fuzzy information processing. It will lower the costs of the transformations and will encourage the enterprises to adopt the concept of fuzzy relational databases (FRDBs). Moreover, although the migration of information systems constitutes a very important research domain, there is a limited number of migration methods between two specific systems. We mention some examples (e.g., Behm, Geppert, & Dittrich, 1997; Henrard, Hick, Thiran, & Hainaut, 2002; Menhoudj & OuHalima, 1996). To our knowledge, the migration of relational databases (RDB) towards FRDB has not yet been studied. FDBs allow storing fuzzy values and, besides, they allow making fuzzy queries using fuzzy or nonfuzzy data. It should be noted that classic querying is qualified as “Boolean querying,” although some systems use a trivalued logic with the three values true, false, and null, where null indicates that the condition result is unknown because some data is unknown.
The user formulates a query usually with a condition, for example, in SQL, which returns a list of rows, when the condition is true. This querying system constitutes a hindrance for
several applications because we cannot know if one row satisfies the query better than another row. Besides, traditional querying does not make it possible for the end user to use vague linguistic terms in the query condition or to use fuzzy quantifiers such as “almost all” or “approximately the half.” Many works have been proposed in the literature to introduce flexibility into database querying, both in crisp and fuzzy databases (Bosc, Liétard, & Pivert, 1998; Bosc & Pivert, 1995, 1997, 2000; Dubois & Prade, 1997; Galindo, Medina, & Aranda, 1999; Galindo, Medina, Pons, & Cubero, 1998; Galindo et al., 2006; Kacprzyk & Zadrożny, 1995, 2001; Tahani, 1977; Umano & Fukami, 1994). The essential idea in these works consists in adding an additional layer to the classic DBMS (database management systems) to evaluate fuzzy predicates. In this book, the reader can find a chapter by Zadrożny, de Tré, de Caluwe, and Kacprzyk with an interesting review of fuzzy querying proposals. Also, this book includes other chapters with new applications and new advances in the field of fuzzy queries. Some examples are the chapter by Takači and Škrbić about priorities in queries, the chapter by Dubois and Prade about bipolar queries, and the chapter by Barranco, Campaña, and Medina using a fuzzy object-relational database model. Among various published propositions for different fuzzy database models, we mention the one by Medina, Pons, and Vila (1995), who introduced the GEFRED model, an eclectic synthesis of other previous models. In 1995, Bosc and Pivert introduced the first version of a language handling flexible queries, named SQLf.
In their turn, Medina, Pons, and Vila (1994b) proposed the FSQL language, which was later extended (Galindo, 1999, 2005; Galindo et al., 1998, 2006). Although the basic target of FSQL is similar to that of the SQLf language, FSQL allows fuzzy queries both in crisp and fuzzy databases, and it presents new definitions such as many fuzzy comparators, fuzzy attributes (including fuzzy time), and fuzzy constants. It allows the creation of new fuzzy objects such as labels, quantifiers, and so forth. There is another
chapter by Urrutia, Tineo, and González studying both proposals. This chapter presents a new approach for the migration from RDB towards FRDB with FSQL. The aim of this migration is to permit an easy mapping of the existing data, schemas, and programs, while integrating the different fuzzy concepts. Therefore, all valid SQL queries remain useful in the fuzzy query language FSQL (fuzzy SQL). This approach studies the RDB transformations essentially at the level of the schemas (physical and conceptual), the data, and, less specifically, the applications. First, we present a very brief overview of fuzzy sets, and then we present basic concepts about FRDB. Afterwards, we present our three migration strategies. The first one, named "partial migration," is useful only to include fuzzy queries in classic databases without changing existing data. The second one, named "total migration," offers, in addition to flexible querying, the possibility to store imprecise data. The third strategy is a mixture of the previous two. Finally, we outline some conclusions and suggest some future research lines.
Introduction to Fuzzy Sets

The theory of fuzzy sets stems from the classic theory of sets, adding a membership function to the set, defined in such a way that each element is assigned a real number between 0 and 1. In 1965, professor L.A. Zadeh defined the concept of fuzzy sets, and since then many works and applications have appeared (Pedrycz & Gomide, 1998). We give here only the most basic notions; for a better introduction, read the first chapter of this handbook. A fuzzy set (or fuzzy subset) A is defined by means of a membership function µA(u), which indicates the degree to which the element u is included in the concept represented by A. The fuzzy set A over a universe of discourse U can also be represented as a set of pairs given by:

A = {µA(u)/u : u ∈ U, µA(u) ∈ [0,1]}
(1)
How to Achieve Fuzzy Relational Databases Managing Fuzzy Data and Metadata
where µA is the membership function and µA(u) is the membership degree of the element u in the fuzzy set A. If µA(u)=0, u does not belong at all to the fuzzy set A; if µA(u)=1, u belongs totally to the fuzzy set A. For example, if we consider the linguistic variable height_of_a_person, then three fuzzy subsets could be defined, identified by three labels, Short, Medium-height, and Tall, with membership functions µShort(u), µMedium-height(u), and µTall(u), respectively, where u takes values in the referential of this attribute (or underlying domain), which would be the positive real numbers (expressing the height in centimetres). On the other hand, for domains with a non-ordered referential, a similarity function can be defined, which can be used to measure the similarity or resemblance between every two elements of the domain. Usually, the similarity values are normalized in the interval [0,1], where 0 means "totally different" and 1 means "totally alike" (or equal). Thus, a similarity relationship is a fuzzy relation that can be seen as a function sr, so that:

sr : D×D → [0,1], with sr(di, dj) ∈ [0,1] for all di, dj ∈ D

(2)
where D is the domain of the defined labels. We can assume that sr is a symmetrical function, that is, sr(di, dj) = sr(dj, di), as this is the most usual case, although it does not necessarily have to be this way. We can also construct possibility distributions (or fuzzy sets) on the labels of D, extending the possibilities for expressing imprecise values (Zadeh, 1978), in such a way that each value di ∈ D has a degree of truth or possibility pi associated to it, obtaining expressions for specific values that can be expressed generically as:

{pi/di : pi ∈ [0,1], di ∈ D}
(3)
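The three notions above — a trapezoidal membership function over an ordered referential, a symmetric similarity relation sr over a non-ordered label domain, and a possibility distribution over labels as in expression (3) — can be sketched in a few lines of Python. All concrete labels, breakpoints, and similarity values below are illustrative assumptions, not taken from this chapter:

```python
def trapezoid(a, b, c, d):
    """Return a trapezoidal membership function defined by a <= b <= c <= d:
    degree 0 outside [a, d], degree 1 on [b, c], linear in between."""
    def mu(u):
        if u <= a or u >= d:
            return 0.0
        if b <= u <= c:
            return 1.0
        if u < b:                       # rising edge between a and b
            return (u - a) / (b - a)
        return (d - u) / (d - c)        # falling edge between c and d

    return mu

# Hypothetical label for the linguistic variable height_of_a_person (in cm).
mu_tall = trapezoid(170, 180, 250, 251)

# A similarity relation sr over a non-ordered domain of hair-colour labels;
# the degrees are made up for the example.
sr = {("dark", "brown"): 0.6, ("dark", "blond"): 0.1, ("brown", "blond"): 0.4}

def similarity(x, y):
    """Symmetric similarity: sr(x, y) = sr(y, x), and sr(x, x) = 1."""
    if x == y:
        return 1.0
    return sr.get((x, y), sr.get((y, x), 0.0))

# A possibility distribution over labels, in the shape of expression (3).
hair = {"dark": 1.0, "brown": 0.4}
```

For instance, `mu_tall(175)` evaluates to 0.5 on the rising edge, while `similarity("brown", "dark")` looks up the symmetric pair and returns 0.6.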
The domains with ordered and non-ordered referentials can adequately represent concepts of
"imprecision" using fuzzy sets theory. It should be noted that many of these natural concepts depend, to a greater or lesser degree, on the context and on the person that expresses them. From this simple concept, a complete mathematical and computing theory has been developed which facilitates the solution of certain problems. Fuzzy logic has been applied to a multitude of objectives such as control systems, modeling, simulation, pattern recognition, information or knowledge systems (databases, expert systems, etc.), computer vision, artificial intelligence, artificial life, and so forth.
Introduction to Fuzzy Relational Databases

The first chapter of this handbook includes a brief introduction to this topic, explaining some basic models. We give here a brief overview in order to facilitate the reading of this chapter. "The term imprecision encompasses various meanings, which might be interesting to highlight. It alludes to the facts that the information available can be incomplete (vague), that we don't know whether the information is true (uncertainty), that we are totally unaware of the information (unknown), or that such information is not applicable to a given entity (undefined). Usually, the total ignorance is represented with a NULL value. Sometimes these meanings are not disjunctive and can be combined in certain types of information" (Galindo et al., 2006, p. 45). This imprecision has been studied in order to elaborate systems, databases, and, consequently, applications which support this kind of information. Most works studying imprecision in information have used possibility, similarity, and fuzzy techniques. Research on FDBs has been conducted for about 20 years and has concentrated mainly on the following areas: flexible querying in classical databases, extending classical data models in order to achieve fuzzy databases (including, of course, fuzzy queries on these fuzzy databases and fuzzy conceptual modeling tools), fuzzy data mining
techniques, and applications of these advances in real databases. All these different issues have been studied in different chapters of this volume and also in many other publications (De Caluwe & De Tré, 2007; Bosc, 1999; Bosc et al., 1998; Galindo et al., 2006; Petry, 1996). The querying of an FRDB, contrary to classical querying, allows users to use fuzzy linguistic labels (also named linguistic terms) and to express their preferences to better qualify the data they wish to get. An example of a flexible query, also named in this context a fuzzy query, would be "list of the young employees, well paid and working in a department with a big budget." This query contains the fuzzy linguistic labels "young," "well paid," and "big budget." These labels are words, in natural language, that express or identify a fuzzy set that may or may not be formally defined. In fact, the flexibility of a query reflects the preferences of the end user. This is manifested by using a fuzzy set representation to express a flexible selection criterion. The extent to which an object in the database satisfies a request then becomes a matter of degree. The end user provides a set of attribute values (fuzzy labels), which are fully acceptable for the user, and a list of minimum thresholds for each of these attributes. With these elements, a fuzzy condition is built for the fuzzy query. Then, the fuzzy querying system ranks the retrieved items according to their fulfillment degree or level of acceptability. Some approaches, the so-called bipolar queries, need both the fuzzy condition (or fuzzy constraint) and the positive preferences or wishes, which are less compulsory. (A very interesting chapter about bipolar queries, by Dubois and Prade, may be found in this volume.) Hence, the interests of fuzzy queries for a user are twofold:

1. A better representation of the user's preferences while allowing the use of imprecise predicates.
2. Obtaining the necessary information in order to rank the answers contained in the database according to the degree to which they satisfy
the query. This helps to avoid empty sets of answers when queries are too restrictive, as well as overly large, unordered sets of answers when queries are too permissive. This preface led us to establish the definition of FRDB as an extension of RDB. This extension introduces fuzzy predicates or fuzzy conditions in the shape of linguistic expressions that, in flexible querying, permit a range of answers (each one with its membership degree) in order to offer the user all the intermediate variations between the completely satisfactory answers and the completely unsatisfactory ones (Bosc et al., 1998). Yoshikane Takahashi (1993, p. 122) defined an FRDB as "an enhanced RDB that allows fuzzy attribute values and fuzzy truth values; both of these are expressed as fuzzy sets." Thus, a fuzzy database is a database which is able to deal with uncertain or incomplete information using fuzzy logic. There are many ways of adding flexibility to fuzzy databases. The simplest technique is to add a fuzzy membership degree to each record, that is, an attribute in the range [0,1]. However, there are other kinds of databases allowing fuzzy values to be stored in a fuzzy attribute, using fuzzy sets or possibility distributions, or fuzzy degrees associated to some attributes and with different meanings (membership degree, importance degree, fulfillment degree...). The main models are those of Prade-Testemale (1987), Umano-Fukami (Umano, 1982; Umano & Fukami, 1994), Buckles-Petry (1982), Zemankova-Kandel (1985), and GEFRED by Medina-Pons-Vila (1994a). This chapter deals mainly with the GEFRED model (GEneralised model for Fuzzy RElational Databases) and some later extensions (Galindo et al., 2006). This model constitutes an eclectic synthesis of the various models published so far, with the aim of dealing with the problem of representation and treatment of fuzzy information by using RDB.
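The ranking step described above — score every row, drop those below the user's minimum threshold, and return the rest best-first — can be sketched as follows. The employee data, the attribute names, and the membership function chosen for "young" are all illustrative assumptions:

```python
def fuzzy_rank(rows, degree_fn, threshold=0.0):
    """Score each row with a fulfillment degree in [0, 1], discard rows
    below the threshold (and fully unsatisfactory rows), rank best-first."""
    scored = [(degree_fn(row), row) for row in rows]
    kept = [(d, row) for d, row in scored if d >= threshold and d > 0]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)

employees = [{"name": "Ana", "age": 28}, {"name": "Luis", "age": 47}]

# A made-up decreasing membership function for the label "young":
# full membership below age 25, none above age 40.
young = lambda r: max(0.0, min(1.0, (40 - r["age"]) / 15))

ranking = fuzzy_rank(employees, young, threshold=0.3)
```

With this data, only Ana passes the threshold, with a fulfillment degree of 0.8; Luis is excluded rather than returned with a degree of zero, which is exactly the behavior that distinguishes a fuzzy ranking from a crisp Boolean filter.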
One of the major advantages of this model is that it consists of a general abstraction that allows for the use of various approaches,
Table 1. Data types in the GEFRED model

1. A single scalar (e.g., Behavior=Good, represented by the possibility distribution 1/Good).
2. A single number (e.g., Age=28, represented by the possibility distribution 1/28).
3. A set of mutually exclusive possible scalar assignations (e.g., Behavior={Bad, Good}, represented by {1/Bad, 1/Good}).
4. A set of mutually exclusive possible numeric assignations (e.g., Age={20, 21}, represented by {1/20, 1/21}).
5. A possibility distribution in a scalar domain (e.g., Behavior={0.6/Bad, 1.0/Regular}).
6. A possibility distribution in a numeric domain (e.g., Age={0.4/23, 1.0/24, 0.8/25}, fuzzy numbers or linguistic labels).
7. A real number belonging to [0, 1], referring to a degree of matching (e.g., Quality=0.9).
8. UNKNOWN value, with possibility distribution {1/u : u ∈ U}, where U is the considered domain.
9. UNDEFINED value, with possibility distribution {0/u : u ∈ U}, where U is the considered domain.
10. NULL value, given by NULL={1/Unknown, 1/Undefined}.
regardless of how different they might look. In fact, it is based on the generalized fuzzy domain and the generalized fuzzy relation, which respectively include classic domains and classic relations. The original data types supported by this model are shown in Table 1.
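Some of the data types in Table 1 can be sketched as Python dictionaries mapping domain values to possibility degrees; the finite domain U used here is an assumption for the example:

```python
# An illustrative finite scalar domain for the attribute Behavior.
U = ["Bad", "Regular", "Good"]

# Rows 8 and 9 of Table 1: constant possibility distributions.
UNKNOWN = {u: 1.0 for u in U}    # every value of U is fully possible
UNDEFINED = {u: 0.0 for u in U}  # no value of U is applicable

behavior_scalar = {"Good": 1.0}                  # row 1: 1/Good
behavior_set = {"Bad": 1.0, "Good": 1.0}         # row 3: {1/Bad, 1/Good}
behavior_dist = {"Bad": 0.6, "Regular": 1.0}     # row 5: {0.6/Bad, 1.0/Regular}
```

This representation makes the intuition behind rows 8-10 visible: UNKNOWN says "it could be anything," UNDEFINED says "no value applies," and NULL (total ignorance) combines both possibilities.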
Preliminary Concepts

In order to implement a system which represents and manipulates "imprecise" information, Medina et al. (1995) developed the FIRST (fuzzy interface for relational systems) architecture, which has since been enhanced as FIRST-2 (Galindo, Urrutia, & Piattini, 2004b, 2006). It has been built on top of client-server DBMS architectures, such as Oracle1 and PostgreSQL2 (Galindo, 2007; Maraboli & Abarzua, 2006). It extends the existing structure and adds new components to handle fuzzy information. This architecture adds a server, named the FSQL server, which translates flexible queries written in FSQL into a language comprehensible to the host DBMS (SQL). FSQL is an extension of the popular SQL language designed to express fuzzy characteristics, especially in fuzzy queries, with many fuzzy concepts (fuzzy conditions, fuzzy comparators, fulfillment degrees, fuzzy constants, fuzzy quantifiers, fuzzy attributes, etc.). The first
versions of FSQL were developed during the last decade of the 20th century (Galindo et al., 1998; Medina et al., 1994b), and the most recent version is defined by Galindo et al. (2006). In the following subsections, we present this language and the supported fuzzy attribute types. The RDBMS (relational DBMS) dictionary or catalog, the part of the system that stores information about the data collected in the database and other information (such as users, data structures, data control, etc.), is extended in order to collect the necessary information related to the imprecise nature of the new data (fuzzy attributes, their types, their objects such as labels, quantifiers, etc.). This extension, named the fuzzy metaknowledge base (FMB), is organized following the prevailing philosophy of the host RDBMS catalog. In this chapter, we designate by fuzzy RDBMS (FRDBMS) the addition of the FSQL server and the FIRST-2 methodology to the RDBMS.
Fuzzy Attributes

In order to model fuzzy attributes, we distinguish between two classes: fuzzy attributes whose fuzzy values are fuzzy sets (or possibility distributions), and fuzzy attributes whose values are fuzzy degrees. Each class includes some
different fuzzy data types (Galindo et al., 2006; Urrutia, Galindo, & Piattini, 2002).
Fuzzy Sets as Fuzzy Values

These fuzzy attributes may be classified into four data types. This classification is performed taking into account the type of referential or underlying domain. In all of them, the values Unknown, Undefined, and Null are included:

• Fuzzy Attributes Type 1 (FTYPE1): These are attributes with "precise data," classic or crisp (traditional, with no imprecision). However, we can define linguistic labels over them, and we can use them in fuzzy queries. This type of attribute is represented in the same way as precise data, but it can be transformed or manipulated using fuzzy conditions. This type is useful for extending a traditional database, allowing fuzzy queries to be made about classic data, for example, enquiries of the kind "Give me employees that earn a lot more than the minimum salary."
• Fuzzy Attributes Type 2 (FTYPE2): These are attributes that gather "imprecise data over an ordered referential." These attributes admit, as Table 2 shows, both crisp and fuzzy data, in the form of possibility distributions over an underlying ordered domain (fuzzy sets). This is an extension of Type 1 that does now allow the storage of imprecise information, such as "he is approximately 2 metres tall." For the sake of simplicity, the most complex of these fuzzy sets are supposed to be trapezoidal functions (Figure 1).

Figure 1. Trapezoidal, linear, and normalized distribution function (membership rises linearly from 0 at a to 1 at b, stays at 1 between b and c, and falls linearly back to 0 at d, over the universe U)

• Fuzzy Attributes Type 3 (FTYPE3): These are attributes over "data of discrete non-ordered domain with analogy." In these attributes, some labels are defined (e.g., "blond," "red," "brown," etc.) that are scalars with a similarity (or proximity) relationship defined over them, so that this relationship indicates to what extent each pair of labels resemble each other. They also allow possibility distributions (or fuzzy sets) over this domain, for example, the value (1/dark, 0.4/brown), which expresses that a certain person is more likely to be dark-haired than brown-haired. Note that the underlying domain of these fuzzy sets is the set of labels, and this set is non-ordered.
• Fuzzy Attributes Type 4 (FTYPE4): These attributes are defined in the same way as Type 3 attributes, without it being necessary for a similarity relationship to exist between the labels.
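One plausible way to compare two Type 3 values (possibility distributions over a non-ordered label domain) for fuzzy equality is to take the best-matching pair of labels, limited both by the two possibility degrees and by the similarity between the labels. This is a sketch under that assumption, not the exact FSQL definition; the similarity degrees are made up:

```python
# Illustrative similarity relation over hair-colour labels.
SR = {("dark", "brown"): 0.6, ("dark", "blond"): 0.1, ("brown", "blond"): 0.4}

def similarity(l1, l2):
    if l1 == l2:
        return 1.0
    return SR.get((l1, l2), SR.get((l2, l1), 0.0))

def fuzzy_equal_type3(v1, v2):
    """Sup over label pairs of min(p1, p2, sr(l1, l2)): how possible it is
    that both distributions denote (similar) values."""
    return max(
        min(p1, p2, similarity(l1, l2))
        for l1, p1 in v1.items()
        for l2, p2 in v2.items()
    )

degree = fuzzy_equal_type3({"dark": 1.0, "brown": 0.4}, {"brown": 1.0})
```

For the example value (1/dark, 0.4/brown) compared against a crisp "brown," the best pair is (dark, brown) limited by their similarity 0.6, which beats the exact match on "brown" limited by its possibility 0.4.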
Fuzzy Degrees as Fuzzy Values

The domain of these degrees is usually the interval [0,1], although other values are also permitted, such as a possibility distribution (usually over this unit interval). The meanings of these degrees are varied and depend on their use, and the processing of the data will differ depending on the meaning. The most important possible meanings of the degrees used by some authors are the fulfillment degree, uncertainty degree, possibility degree, and importance degree. The most typical kind of degree is one associated to each tuple in a relation (Type 7), with the meaning of the membership degree of each tuple to the relation. Another typical degree is the fulfillment degree associated to each tuple in the resulting relation after a fuzzy query. In this volume, there are some chapters about these kinds of relations (see, for example, the ranked tables in the chapter by Belohlavek and Vychodil, or the fulfillment degrees in the chapter by Voglozin, Raschia, Ughetto, and Mouaddib).
Sometimes it is useful to associate a fuzzy degree to only one attribute (Type 5) or to only a concrete set of attributes (Type 6), for example, in order to measure the truth, the importance, or the vagueness. Finally, in some applications, a fuzzy degree with its own fuzzy meaning (Type 8) is useful in order to measure a fuzzy characteristic of each item in the relation, such as the danger of a medicine or the brightness of a concrete material.
Representation of Fuzzy Attributes The representation is different according to the fuzzy attribute type. Fuzzy attributes Type 1 are represented as usual attributes because they do not allow fuzzy values. Fuzzy attributes Type 2 need five (or more) classic attributes: One stores the kind of value (Table 2), and the other four store the crisp values representing the fuzzy value. Note
Table 2. Kind of values of fuzzy attributes Type 2

Number          Kind of values
0, 1, 2         UNKNOWN, UNDEFINED, NULL
3               Crisp: d
4               Label: label_identifier
5               Interval: [n, m]
6               Approximate value: d
7               Trapezoidal value: [a, b, c, d]
8               Approx. value with explicit margin: d±m
9, 10, 11, 12   Possibility distributions (different formats)
Table 3. Kind of values of fuzzy attributes Types 3 and 4

Number    Kind of values
0, 1, 2   UNKNOWN, UNDEFINED, NULL
3         Simple value: Degree/Label
4         Possibility distribution: Degree1/Label1 + ... + Degreen/Labeln
in Table 2 that trapezoidal fuzzy values (Figure 1) need the other four values. An approximate value (approximately d, d±margin) is represented with a triangular function centered in d (degree 1) and with degree 0 in d–margin and d+margin, where the value margin depends on the context, as we will see later. Other approximate values (number 8) use their own margin m. Finally, we can also represent possibility distributions in Type 2 attributes. Some of them (numbers 9 and 10) use only the four attributes defined previously, but we define here two new and more flexible possibilities:

• Number 11: Discontinuous possibility distribution, given as a list of points with the format p1/v1, …, pn/vn, where the pi are the possibility degrees and the vi are the values with such degrees. Note that we need 2n attributes (instead of the four) for storing a possibility distribution with n terms. The rest of the values have a degree of zero.
• Number 12: Continuous possibility distribution, given as a list of points with the same format p1/v1, …, pn/vn. Again, 2n attributes are needed for storing a possibility distribution with n terms, but now the stored possibility distribution represents a continuous, piecewise linear function: between vi and vi+1 there is a straight line joining each two consecutive points.
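The column layout described above — one classic attribute for the kind of value (Table 2) plus four value columns — can be sketched with a small encoder. The type codes follow Table 2, but which column holds which parameter for each code is an assumption made for illustration:

```python
def encode_type2(kind, *args, margin=5):
    """Encode a fuzzy value of a Type 2 attribute into five classic columns:
    (type_code, F1, F2, F3, F4). Layout per code is hypothetical."""
    if kind == "crisp":           # code 3: exact value d, other columns unused
        (d,) = args
        return (3, d, None, None, None)
    if kind == "approx":          # code 6: triangle d-margin, d, d+margin,
        (d,) = args               # with a context-dependent margin
        return (6, d - margin, d, d, d + margin)
    if kind == "trapezoid":       # code 7: the four points [a, b, c, d] of Figure 1
        a, b, c, d = args
        return (7, a, b, c, d)
    raise ValueError("unsupported kind: " + kind)

# "He is approximately 2 metres tall", stored in centimetres.
row = encode_type2("approx", 200)
```

Storing fuzzy values as a type code plus a fixed set of crisp columns is what lets the FMB approach sit on top of an unmodified relational engine: the host RDBMS only ever sees ordinary numeric columns.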
Fuzzy attributes Type 3 need 2n+1 attributes: one stores the kind of value (Table 3), and the others (2n) may store a possibility distribution, where n is the maximum number of elements (degree/label). Note in Table 3 that number 3 needs only two values, but number 4 needs 2n values. The value n must be defined for each fuzzy attribute Type 3, and it is stored in the FMB (see the following section). Fuzzy attributes Type 4 are represented just like Type 3; the difference between them is shown in the next section. Fuzzy degrees (Types 5, 6, 7, and 8) are represented using a classic numeric attribute because their domain is the interval [0,1].
The FSQL Language

The FSQL language (Galindo, 2005; Galindo et al., 2006; Galindo, Aranda, Caro, Guevara, & Aguayo, 2002) is a true extension of SQL which allows fuzzy data manipulation, such as fuzzy queries. This means that all statements valid in SQL are also valid in FSQL. In addition, FSQL incorporates some novelties to permit the inexact processing of information. This chapter will only provide a summary of the main extensions added to this language:

• Linguistic labels: If an attribute is capable of fuzzy treatment, then linguistic labels can be defined on it. These labels are preceded by the symbol $ to distinguish them easily. There are two types of labels, used with different fuzzy attribute types:
  1. Labels for attributes with an ordered underlying domain (fuzzy attributes Types 1 and 2): every label of this type has an associated trapezoidal possibility distribution in the FMB. This possibility distribution is generally trapezoidal, linear, and normalized, as shown in Figure 1.
  2. Labels for attributes with a non-ordered fuzzy domain (fuzzy attributes Types 3 and 4): here, a similarity relation may be defined between each two labels in the domain, and it should be stored in the FMB.
• Fuzzy comparators: Besides the typical comparators (=, >, etc.), FSQL includes all the fuzzy comparators shown in Table 4. As in SQL, fuzzy comparators compare one column with one constant or two columns of the same (or compatible) type. As possibility comparators are more general (less restrictive) than necessity comparators, necessity comparators retrieve fewer tuples, and these tuples necessarily comply with the condition (whereas with possibility comparators, the tuples only possibly comply with the condition, without any absolute certainty). It is necessary to note that fuzzy attributes Type 2 can be compared with crisp values, but always with the FSQL language.
• Function CDEG: The function CDEG (compatibility degree) may be used with an attribute as its argument. It computes the fulfillment degree of the condition of the query for the specific attribute in the argument. We can use CDEG(*) to obtain the fulfillment degree of each tuple (with all of its attributes, not just one of them) in the condition. If logic operators (NOT, AND, OR) appear in the condition, the calculation of this compatibility degree is carried out, by default, using the traditional negation, the minimum t-norm,
Table 4. Fuzzy comparators for FSQL (fuzzy SQL): 16 in the possibility/necessity family and 2 in the inclusion family

Possibility         Necessity             Significance
FEQ or F=           NFEQ or NF=           Possibly/Necessarily Fuzzy Equal to…
FDIF, F!= or F<>    NFDIF, NF!= or NF<>   Possibly/Necessarily Fuzzy Different from…
FGT or F>           NFGT or NF>           Possibly/Necessarily Fuzzy Greater Than…
FGEQ or F>=         NFGEQ or NF>=         Possibly/Necessarily Fuzzy Greater than or Equal to…
FLT or F<           NFLT or NF<           Possibly/Necessarily Fuzzy Less Than…
FLEQ or F<=         NFLEQ or NF<=         Possibly/Necessarily Fuzzy Less than or Equal to…
MGT or F>>          NMGT or NF>>          Possibly/Necessarily Much Greater Than…
MLT or F<<          NMLT or NF<<          Possibly/Necessarily Much Less Than…
FINCL               INCL                  Fuzzy Included in… / Included in…
and the maximum s-norm (or t-conorm), but the user may change these operators.

• Fulfillment thresholds: For each simple condition, a fulfillment threshold τ may be established (the default is 1).
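The default connectives used by CDEG for compound conditions can be sketched directly: traditional negation, the minimum t-norm for AND, and the maximum s-norm for OR. The per-predicate degrees below are illustrative values, as if already computed for one tuple:

```python
# Default CDEG connectives for compound fuzzy conditions.
def cdeg_not(x):
    return 1.0 - x          # traditional negation

def cdeg_and(x, y):
    return min(x, y)        # minimum t-norm

def cdeg_or(x, y):
    return max(x, y)        # maximum s-norm (t-conorm)

# Illustrative degrees for one tuple's simple predicates.
young, well_paid, big_budget = 0.7, 0.4, 0.9

# CDEG of: young AND (well_paid OR NOT big_budget)
degree = cdeg_and(young, cdeg_or(well_paid, cdeg_not(big_budget)))
```

Here the negated predicate contributes only 0.1, so the OR resolves to 0.4, and the final compatibility degree is min(0.7, 0.4) = 0.4; with a fulfillment threshold above 0.4, this tuple would be excluded from the answer.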
The FuzzyEER Model

A database design methodology has three phases: conceptual design, logical design, and physical design. This study looks at conceptual design, the first phase, during which the analysis of the database requirements takes place. The base requirements are independent of the data model that we use. For the conceptual design, the entity/relationship (ER) model or the enhanced ER (EER) model is usually used. Originally, the
conceptual level allows the use of elementary types of data which are called classical or crisp. These data types include numerical, alphanumerical, and binary data. However, the conceptual model does not always include these data types because they are usually not very important, so their definition is normally included in the data dictionary model. On the other hand, several works have been proposed in the literature to introduce fuzzy concepts into database modeling. The conceptual modeling tool used in this work is the fuzzy enhanced entity relationship (FuzzyEER) model (Galindo, Urrutia, Carrasco, & Piattini, 2004c; Galindo, Urrutia, & Piattini, 2004a, 2006; Urrutia et al., 2002). This model extends the enhanced entity relationship (EER) model with fuzzy semantics and fuzzy notations to represent imprecision and uncertainty in entities, attributes, and relationships using fuzzy sets and necessity-possibility measures. The basic concepts introduced in this model are fuzzy attributes, fuzzy entities, fuzzy relations, fuzzy degrees, and fuzzy constraints, to mention only a few here. We present in Example 5 (Figure 9) a simplified FuzzyEER conceptual schema. The book Fuzzy Databases: Modeling, Design and Implementation (Galindo et al., 2006) presents more details about this model.
A Migration Approach Towards FRDBs Designing a system that is able to make use of quantitative and qualitative data for real world applications is a challenging problem. Traditional systems produce representational descriptions that are often not very useful to the human expert. Using classic logic, it is possible to deal only with information that is totally true or totally false; it is not possible to handle information inherent to a problem that is imprecise or incomplete, but this type of information contains data that would allow a better solution to the problem. In the section titled Introduction to Fuzzy Sets, we saw that fuzzy logic is an extension of
the classic systems (Zadeh, 1992). Fuzzy logic is the logic behind approximate, rather than exact, reasoning. Its importance lies in the fact that many types of human reasoning, particularly reasoning based on expert knowledge, are by nature approximate. Note the great potential of membership degrees: they allow something qualitative (fuzzy) to be expressed quantitatively. Besides, better communication can be attained through fuzzy logic because of its ability to utilize natural language in the form of linguistic variables (Zadeh, 1975, 1983). Closer to our context, database technology has an extremely successful track record as a backbone of information technology throughout the last three decades. To introduce imprecise data, global information should be managed as fuzzy, and the best solution is to offer a smooth migration toward this technology. The migration towards FDBs, or fuzzy migration, does not only constitute the adoption of a new technology but also, and especially, the adoption of a new paradigm. Consequently, it constitutes a new culture of development of information systems. In fact, the fuzzy migration of information systems consists in modifying or replacing one or more of their components: database, architecture, interfaces, applications, and so forth; generally, the modification of one of these components can require modifications of some others. This chapter is about the migration from crisp (relational) databases towards FDBs in order to introduce imprecise information into current information systems. This fuzzy migration consists in deriving a new database from a legacy database and in adapting data, metadata, and the software components accordingly. This migration, generally due to the appearance of new needs in the enterprise, must answer these requirements while keeping the content of the information unaltered.
Once the legacy database has been migrated, the legacy programs must be changed in such a way that they access the new database instead of the old data. The definition of the fuzzy migration concept cited above may involve several problems, such as:
• Modifying the schemas requires very detailed knowledge of the data organization (data types, constraints, etc.).
• The source database is generally poorly documented.
• Establishing the correspondences between the two databases is difficult.
• The source and target database models can be incompatible.
• The values in the FMB (metadata) must be chosen after thorough studies.
• The communication protocols between the database and its applications are generally hidden.
• The administrator and at least some database users need some knowledge about fuzzy logic.
• Software using fuzzy information must be designed with care, especially if it will be used by regular users.
Related Work

Although information systems migration constitutes a very important research domain, there is a limited number of migration methods. For example, Tilley and Smith (1996) discuss current issues and trends in legacy system re-engineering from several perspectives (engineering, system, software, managerial, evolutionary, and maintenance). The authors propose a framework to place re-engineering in the context of evolutionary systems. The butterfly methodology (Wu, Lawless, Bisbal, Richardson, Grimson, Wade, & O'Sullivan, 1997) provides a migration methodology and a generic toolkit to aid engineers in the process of migrating legacy systems. Unlike the incremental strategy, this methodology eliminates the need for interoperability between the legacy and target systems. Closer to our subject, the Varlet project (Jahnke & Wadsack, 1999) adopts a process that consists of two phases. In the first one, the different parts of the original database are analyzed to obtain a logical schema for the implemented physical schema. In the second phase, this logical schema
is transformed into a conceptual one, which is the basis for modification or migration activities. The approach of Jeusfeld and Johnen (1994) is divided into three parts: mapping of the original schema into a metamodel, rearrangement of the intermediate representation, and production of the target schema. Some works also address the migration between two specific systems. Among those, Menhoudj and Ou-Halima (1996) present a method to migrate the data of a legacy system into an RDB management system. Behm et al. (1997) describe a general approach to migrate RDBs to object technology. Henrard et al. (2002) and Cleve (2004) present strategies to migrate data-intensive applications from a legacy data management system to a modern one, taking as an example the conversion of COBOL files into an SQL database. In the context of fuzzy databases, few fuzzy database implementations have been developed in real and running systems (Galindo et al., 2006; Goncalves & Tineo, 2006; Kacprzyk & Zadrożny, 1995, 1999, 2000, 2001; Tineo, 2000). More information about these approaches is available in this handbook in the chapters by Zadrożny et al. and Urrutia et al. However, we are not aware of studies about the migration to fuzzy databases from classical or regular ones (Ben Hassine, Ounelli, Touzi, & Galindo, 2007). This chapter covers this gap.
Presentation of Our Approach Basically, our approach consists in achieving some fuzzy characteristics in any existing database, mainly fuzzy queries. We study how to optionally migrate the data stored in RDBs towards FRDBs. This approach is addressed mainly to database administrators (DBA), and it is intended to meet the following requirements:

• to provide methodical support for the migration process,
• to assist the DBA in transforming relational schemas and databases,
• to allow the DBA to choose the attributes able to store imprecise data and/or be interrogated with flexible queries,
• to assist the DBA with the list of required metadata,
• to exploit the full set of FRDB features,
• to cover properties of the migration itself such as correctness and completeness.
We adopted three strategies in our migration approach, answering users’ needs:

• Partial migration: Its goal is to keep the existing data, schema, and applications. The main benefit of this migration is flexible querying, but some fuzzy data mining methods could also be implemented on crisp data.
• Total migration: Its goal is to store imprecise values and to benefit from flexible querying and fuzzy data mining on fuzzy data.
• Easy total migration: Its goal is to store imprecise values, to benefit from flexible querying and fuzzy data mining on fuzzy data, and to keep the existing data and applications with the minimum required modifications.
Consequently, all users’ needs are covered. In the first strategy, the existing schemas and data of the database remain unchanged; only the FMB and the FSQL server are added, to process the flexible queries. The second strategy aims at two operations in one: modeling imprecise data and interrogating the database with flexible queries. It requires a modification of the schemas, the data, and possibly the programs of the database. The third strategy is a mix of the previous strategies; the basic idea is simply to add the required fuzzy attributes, without replacing the previously existing classic attributes. This introduces a redundancy problem, but if space is not an issue, it is easy to manage.
Figure 2. Definition of labels on the FTYPE1 attribute Salary
Partial Migration The principle of this migration is based on the fact that everything that is valid in SQL remains valid in FSQL. This migration is destined for designers who want to keep their RDB while taking advantage of fuzzy queries. The existing schemas and data of the database remain unchanged. There is only the addition of at least two elements (Galindo et al., 2006):

1. The metabase, named FMB, consisting of 12 tables with metadata about the fuzzy information (for some applications, we can use fewer than these 12 tables).
2. The FSQL server, to process fuzzy queries. This server assures the translation of FSQL statements into SQL, a language the DBMS understands.
Managing Fuzzy Metadata At the level of the database, the tables of the FMB are created in order to store the information about the attributes likely to be interrogated by flexible queries. These attributes, and only these attributes, must be declared in the FMB as type FTYPE1. The choice of this type is justified by two reasons:
[Figure 2 plots the trapezoidal labels $Low, $Medium, and $High over the axis SALARY × 1000 €, with breakpoints around 0.85, 1.2, 1.7, and 2.2.]
A. Consistency: FTYPE1 attributes store only the previously existing crisp data. We must respect the following rules:

• All the attributes remain unchanged, including the identifier attributes.
• The quantifiable attributes (with an ordered domain on which trapezoidal possibility distributions can be defined, that is, of type number, real, etc.) which will be interrogated by fuzzy queries must be declared as FTYPE1 in the FMB: The attributes are not modified, but they must be included in the FMB as FTYPE1.
• The fuzzy attribute characteristics, detailed in Table 6, rows 1 and 2, should be included in the FMB. This is optional, but it is mandatory if we want to use such characteristics.

B. Flexibility: Fuzzy attributes FTYPE1 permit fuzzy queries allowing the use of:

• Fuzzy constants (see The FSQL Language section) like linguistic labels ($Hot), approximate values (#30), the
values UNKNOWN, UNDEFINED, and NULL, and trapezoidal possibility distributions, as well as, of course, crisp values.
• Fuzzy comparators, useful in this kind of attribute (the entire Table 4).
• Fulfillment thresholds, fuzzy set operators (fuzzy union, fuzzy intersection, and fuzzy minus), and functions to modify fuzzy constants (concentration, dilatation, contrast intensification, etc.).

Table 5. Examples of fuzzy querying on a fuzzy attribute Type 1 (Salary)

Type          | Query                                          | Meaning
------------- | ---------------------------------------------- | --------------------------------------------------------
Classic (SQL) | SELECT * FROM Employee WHERE Salary = 2200;    | List of the employees with a salary equal to 2200 euros.
Fuzzy (FSQL)  | SELECT * FROM Employee WHERE Salary FEQ #2200; | List of the employees with a salary near to 2200 euros.
Fuzzy (FSQL)  | SELECT % FROM Employee WHERE Salary FEQ $High; | List of the employees with a high salary.
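The semantics of these fuzzy constants can be made concrete with a small sketch. This is not the FSQL implementation, only an illustration of how a trapezoidal label and an approximate constant (#v) could be evaluated; the parameters chosen for $High below are invented for the example:

```python
def trapezoid(x, a, b, c, d):
    """Membership degree of x in the trapezoidal distribution (a, b, c, d):
    0 outside [a, d], 1 on [b, c], linear on both slopes."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def feq_approx(x, v, margin):
    """Degree to which x is 'approximately equal' to v (the #v constant),
    modeled here as a triangular distribution of half-width margin."""
    return max(0.0, 1.0 - abs(x - v) / margin)

# Hypothetical label $High over salaries in euros (parameters assumed,
# with the right side left open via a very large bound).
def high(salary):
    return trapezoid(salary, 1700, 2200, 10**9, 2 * 10**9)

print(high(2000))                 # partial membership on the rising slope
print(feq_approx(2150, 2200, 100))
```

A query like `Salary FEQ $High THOLD 0.5` would then retain only the rows whose computed degree reaches the fulfillment threshold.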
It is necessary to note at this level that the use of these different concepts is governed by several rules. For example, the linguistic labels defined for FTYPE1 attributes must belong to a numerical domain, which permits defining the associated trapezoidal possibility distributions in the FMB.

Example 2: The attribute Salary can be transformed into a fuzzy attribute Type 1. We can also define the labels $Low, $Medium, and $High, for example, as the trapezoidal possibility distributions described in Figure 2. This attribute is quantifiable and has, for example, euros as its unit of measure. An attribute Productivity may not be quantifiable: It can take the values “bad,” “regular,” and “good,” which cannot be represented by trapezoidal functions. For this reason, we cannot transform it into a fuzzy attribute FTYPE1. We will show in the following section that this attribute can be FTYPE3.

Figure 3. FDB architecture
[Classic statements go directly to the RDBMS; flexible statements pass through the FSQL server; the database comprises the DB and the FMB.]
Example 3: Table 5 shows three types of queries (classic and fuzzy) that can be applied to the attribute Salary (FTYPE1) of Example 2. It should be noted that the FSQL queries must be preceded by the creation of the label $High and of the margin for approximate values defined on this attribute in the FMB. Finally, the partial migration allows three extra characteristics that may be very useful for many enterprises:

• Fuzzy degrees (Table 6, row 4): We can add different fuzzy degrees to each table. These attributes were explained in the subsection titled Fuzzy Degrees As Fuzzy Values.
• Fuzzy quantifiers (Table 6, row 5): We could answer questions like “Give me the departments in which most of their employees have a high salary,” and we can add a minimum threshold to the condition and also to the quantifier. Note that in this example, the quantifier “most” is associated with the context of the employee table. FSQL defines four ways to use fuzzy quantifiers. See Galindo et al. (2006) for details about fuzzy quantifiers in FSQL.
• Fuzzy data mining: The definition of fuzzy attributes Type 1 allows using many fuzzy data mining methods. In this volume, there is a chapter by Feil and Abonyi summarizing these methods. In particular, FSQL is a powerful tool for these purposes (Carrasco, Vila, & Galindo, 2003; Galindo et al., 2006), as shown in the chapter by Carrasco et al. in this handbook.
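To illustrate how such a quantified condition might be evaluated, here is a sketch based on Zadeh's sigma-count approach. The definition of "most" below is an assumption for the example, not FSQL's stored definition:

```python
def most(p):
    """Hypothetical relative quantifier 'most': unfulfilled below a
    proportion of 0.3, fully fulfilled above 0.8, linear in between."""
    return min(1.0, max(0.0, (p - 0.3) / 0.5))

def department_degree(high_salary_degrees):
    """Degree to which 'most employees have a high salary' holds for one
    department: apply the quantifier to the sigma-count proportion
    (sum of membership degrees divided by the cardinality)."""
    p = sum(high_salary_degrees) / len(high_salary_degrees)
    return most(p)

# Membership of each employee's salary in $High for one department.
print(department_degree([1.0, 0.9, 0.8, 0.4, 0.0]))  # proportion 0.62 -> ~0.64
```

A threshold on the quantifier would then discard departments whose resulting degree is too low.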
The FSQL Server At the level of the DBMS, the FSQL server is placed on top of the DBMS to process the flexible statements (queries, deletions, updates, etc.) written in the FSQL language. Figure 3 shows the DBMS architecture with the FSQL server.
Total Migration This strategy offers, in addition to flexible querying, the possibility to store imprecise data at the level of the fuzzy attributes. Therefore, it will be a total migration towards the FRDB. Contrary to the previous case, the attributes likely to store imprecise data are of types FTYPE2, FTYPE3, and FTYPE4, and besides, some degrees may be included (types 5-8). The modification concerns only the tables that define or reference these attributes. This strategy comprises three main steps: (Step 1) schemas conversion, (Step 2) data conversion, and (Step 3) programs conversion. Note that this type of decomposition has already been used by Henrard et al. (2002); Henrard, Cleve, and Hainaut (2004); Cleve (2004); and Cleve, Henrard, and Hainaut (2005) to integrate some systems at the time of the reengineering of a database. Figure 4 shows, in the left part, the main parts of the legacy system, comprising programs that interact with the legacy data through the legacy DBMS and through the legacy schema. The right part shows the state of the new system after the legacy DBMS has been extended with the FSQL server (fuzzy DBMS) and the database with the FMB. The new database comprises the converted schema and data that have been transformed and migrated according to the new schema. Legacy programs have been transformed in such a way that they now access the data through the API of the new technology and through the new schema. When the converted system is deployed, new programs can be developed that use the database through the native interface of the new fuzzy DBMS. Later on, if and when needed, the legacy programs could be rewritten according to the new technology.

Figure 4. System conversion
[Left: legacy programs access the legacy data (crisp) through the legacy DBMS and the legacy schema. Right: legacy and new programs access the new data (crisp + fuzzy) and the FMB through the fuzzy DBMS (FSQL server + RDBMS) and the new schema.]
Step 1: Database Schemas Migration The schema conversion is the translation of the legacy database structure, or schema, into an equivalent database structure expressed in the new technology (Henrard et al., 2002). In our context, it consists in modifying the table schemas with fuzzy attributes which are going to store fuzzy values. Moreover, the FMB, which stores the fuzzy
Figure 5. Physical schema conversion
[The source DDL script (SQL) is analyzed to extract the source physical schema (SPS); following Algorithm 1, the fuzzy attributes are modified and the FMB tables are updated, yielding the fuzzy physical schema (FPS), which is coded into the target DDL script (FSQL).]
attributes information (linguistic labels, similarity relations, etc.), is created. In fact, the FMB is joined to the data dictionary in order to know the data types, the fuzzy objects defined over the fuzzy attributes, and so forth. Hence, the FMB extends the data dictionary in order to treat the new fuzzy attributes and their objects. During this process, the source physical schema (SPS) of the RDB is extracted and then transformed into a corresponding physical schema in the fuzzy DBMS. The new physical schema is used to produce the DDL (data definition language) code of the new FRDB. In this section, we present two strategies of transformation (physical schema conversion and conceptual schema conversion).
Physical Schema Conversion The physical schema conversion consists of analyzing the DDL code of the source RDB in order to find its physical schema. The DDL code may be obtained from the data dictionary, and it expresses the source physical schema. This relational schema will be converted into the fuzzy DBMS by modifying attributes and adding the FMB. Table 6 shows the information stored in the FMB for each fuzzy attribute. Figure 5 illustrates the process of this transformation: The source DDL script written in SQL is parsed to extract the physical schema. This schema includes all the data structures and constraints explicitly declared in the DDL code, which will be converted to the target “fuzzy” physical schema (FPS) according to Algorithm 1. This conversion returns the target DDL script, the new schema, which is easy to create using SQL. The latter is then executed to generate the FRDB. During this conversion, some classic attributes are transformed into fuzzy ones. Algorithm 1 presents general ideas about this transformation at the level of the database (modification of the structure of tables with fuzzy attributes) and at the level of the FMB (updating tables with fuzzy metadata: labels, similarity relations, quantifiers, etc.). Note that this strategy can use simple tools only, such as a DDL parser (to extract the SPS), an elementary schema converter (to transform the SPS into the FPS), and a DDL generator.
Table 6. Fuzzy metadata stored in the FMB

1. Information concerning fuzzy attributes (see Fuzzy Attributes section):
   • Fuzzy type, unit of measurement, comments, etc.
2. Fuzzy attributes FTYPE1 and FTYPE2:
   • Fuzzy objects defined on these attributes: trapezoidal labels, qualifiers, quantifiers, etc.
   • Margin for approximate values (the meaning of #n in the context of each attribute).
   • MUCH value: The minimum distance in order to consider two values as very separated; this is necessary for the comparators MGT, NMGT, MLT, and NMLT (see Table 4).
   • n value: Maximum number of points in the possibility distributions of the new types 11 and 12 of Table 2.
3. Fuzzy attributes FTYPE3 and FTYPE4:
   • Fuzzy objects defined on these attributes: linguistic labels, qualifiers, quantifiers, etc.
   • n value: Maximum number of elements in the possibility distributions (see Table 3).
   • Compatible attributes.
   • Similarity measures between labels, only for FTYPE3.
4. Fuzzy degrees:
   • Meanings of these fuzzy degrees (importance, membership, etc.).
   • Association: table, column, set of columns, or without association.
5. Quantifiers associated with tables or general quantifiers for the whole system.
Algorithm 1. Physical schema transformation

Input: SPS (RDB)
Output: FPS (FRDB)
Begin
  Create the FMB tables.
  for each attribute A of SPS do
    if A remains classic then
      no modification in its definition;
    else { /* This treatment is divided in two sub-treatments: in the database and in the FMB */
      switch (type of A) { /* Modify the table structures according to each fuzzy type */
        case FTYPE1:
          A remains unchanged.
        case FTYPE2:
          create at least 5 attributes with the following names:
            concatenate the first one with the letter 'T': AT;
            concatenate the other ones respectively with 1, 2, 3, 4, ..., 2n: A1, A2, A3, A4, ...;
            /* Note that 2n must be at least 4. Then the minimum value for n is 2. */
        case FTYPE3 and FTYPE4:
          create 2n+1 attributes with the following names:
            concatenate the first one with the letter 'T' (AT);
            for each i = 1, ..., n do { /* n = max. number of elements in the pos. distributions */
              concatenate with Pi and with i; /* (AP1, A1, AP2, A2, ..., APn, An) */
            }
      }
      Update the FMB tables with the metadata about all fuzzy attributes: see Table 6.
    }
End.
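The column-renaming scheme of Algorithm 1 can be sketched as a small helper function. This is an illustration of the naming rules above, not part of any FSQL tool:

```python
def fuzzy_columns(name, ftype, n=2):
    """Return the column names Algorithm 1 creates for one attribute.
    FTYPE1 keeps the column unchanged; FTYPE2 gets AT plus 2n value
    columns (with 2n >= 4); FTYPE3/FTYPE4 get AT plus n (degree, value)
    pairs AP1, A1, ..., APn, An."""
    if ftype == 1:
        return [name]
    if ftype == 2:
        return [name + "T"] + [f"{name}{i}" for i in range(1, max(2 * n, 4) + 1)]
    if ftype in (3, 4):
        cols = [name + "T"]
        for i in range(1, n + 1):
            cols += [f"{name}P{i}", f"{name}{i}"]
        return cols
    raise ValueError(f"unknown fuzzy type: {ftype}")

print(fuzzy_columns("Age", 2))                # ['AgeT', 'Age1', 'Age2', 'Age3', 'Age4']
print(fuzzy_columns("Productivity", 3, n=1))  # ['ProductivityT', 'ProductivityP1', 'Productivity1']
```

The two printed results match the converted EMPLOYEE schema of Figure 6.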
Figure 6. Example of physical schema conversion at the level of the database

Relational physical schema (RDB):
  EMPLOYEE (Matriculate, Name, Salary, Age, Productivity)
  PROJECT (Num_project, Name_project, Budget)
  WORKS_ON (Matriculate, Num_project, Nb_hours)

Fuzzy relational physical schema (FRDB):
  EMPLOYEE (Matriculate, Name, Salary, AgeT, Age1, Age2, Age3, Age4, ProductivityT, ProductivityP1, Productivity1)
  PROJECT (Num_project, Name_project, BudgetT, Budget1, Budget2, Budget3, Budget4)
  WORKS_ON (Matriculate, Num_project, Nb_hours)
Algorithm 1 modifies the database schema for each fuzzy attribute. Note that we do not use the DDL statements of FSQL, because we want to show the inner schema in this conversion.

Example 4: Suppose the physical schema is constituted of the tables EMPLOYEE, PROJECT, and WORKS_ON. Figure 6 shows the modifications done at the level of the fuzzy attributes:

• Salary: FTYPE1 with the linguistic terms low, medium, and high.
• Age: FTYPE2 with the linguistic terms young, adult, and old.
• Budget: FTYPE2 with the linguistic terms small, medium, and big.
• Productivity: FTYPE3 with the linguistic terms bad, regular, and good.
Note that, for example, the attribute Budget is now represented with five classic attributes, and the Productivity attribute is transformed into three attributes. Figure 7 shows the modifications done
Figure 7. Example of physical schema conversion at the level of the FMB
Figure 8. Conceptual schema conversion
[DBRE path: the source DDL script (SQL) is analyzed to extract the SPS; schema refinement and conceptualization yield the source conceptual schema (SCS). FDB design path: schema conversion (fuzzification of some attributes) produces the fuzzy conceptual schema (FCS), which is converted to the fuzzy physical schema (FPS) and coded into the target DDL script (FSQL).]
Figure 9. Examples of conceptual schema conversion
[(a) Legacy conceptual schema with entities EMPLOYEE (Matriculate, Name, Salary, Age, Productivity), DEPARTMENT (Num_dep, Name_dep), and PROJECT (Num_project, Name_project, Budget), linked by the relationships Works_For, Controls, and Works_On with their (min, max) cardinalities. (b) FuzzyEER schema: Salary becomes T1 {low, medium, high}, Age becomes T2 {young, adult, old}, Productivity becomes T3 {bad, regular, good}, and Budget becomes T2 {small, medium, big}; the relationship Works_On carries the fuzzy (min, max) constraints (0, approx 20 [0.25, 0.75]) on the EMPLOYEE side and (approx 2, approx 30 [0.25]) on the PROJECT side.]
Figure 10. Data migration
[(1) Data extraction from the RDB; (2) data conversion (fuzzification of some data), driven by the conceptual schema conversion, new needs, and an expert; (3) data storage in the FRDB.]
in the FMB tables. We include here a brief explanation of these tables: Table FCL stores all the fuzzy attributes (type, unit of measurement, etc.). Table FAM stores the margin for approximate values and the minimum distance for two values to be considered very separated, for all FTYPE1 and FTYPE2 attributes. Table FOL includes all fuzzy objects belonging to all fuzzy attribute types (only labels in our example). Table FLD includes the definition of trapezoidal labels for FTYPE1 and FTYPE2 attributes (the four basic values, as in Figure 1). Table FND stores the similarity relations between the labels of each FTYPE3 attribute. Of course, the FMB needs more tables with different information (Galindo et al., 2006).
Conceptual Schema Conversion The conceptual schema conversion consists in extracting the physical schema of the legacy database (SPS) and transforming it into its corresponding conceptual schema through a database reverse engineering (DBRE)3 process. Figure 8 describes this process. First of all, the source DDL script written in SQL is parsed in order to extract the source physical schema (SPS). The latter is refined through an in-depth inspection of the way the programs use and manage the data. The final DBRE step is the conceptualization, which interprets the physical schema into the source conceptual schema (SCS). Then, an FDB design phase performs a schema conversion, introducing new concepts (fuzzy constraints, fuzzy attributes, etc.) (Galindo et al., 2004a, 2004c, 2006; Urrutia et al., 2002). This transformation produces the fuzzy conceptual schema (FCS), and then the fuzzy physical schema (FPS), which is coded in the final DDL script of the FRDB. At this level, we have two options:
Algorithm 2. Data transformation

Input: Source data (RDB)
Output: Target data (FRDB)
Begin
  for each attribute A of the DML script do
    for each value v in A (for each row in the database table) do
      switch (type of A) {
        case Classic and FTYPE1:
          insert v in its reserved field (no change);
        case FTYPE2:
          if v = NULL then AT=2 and A1=A2=A3=A4=NULL;
          else /* v is a crisp value */ AT=3, A1=v and A2=A3=A4=NULL;
        case FTYPE3 and FTYPE4:
          if v = NULL then AT=2 and AP1=A1=NULL;
          else /* v is a crisp value as text-label */ AT=3, AP1=1 and A1=v;
      }
End.
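Algorithm 2's case analysis can be sketched as a small conversion function. This is a minimal illustration of the algorithm above (the AT codes 2 = NULL and 3 = crisp follow Table 2 as used in the text; `None` stands for NULL):

```python
def convert_value(ftype, v):
    """Map one legacy value to the tuple stored in the new fuzzy
    columns, following Algorithm 2."""
    if ftype in ("classic", 1):
        return (v,)                          # value kept unchanged
    if ftype == 2:                           # columns (AT, A1, A2, A3, A4)
        if v is None:
            return (2, None, None, None, None)
        return (3, v, None, None, None)      # crisp value
    if ftype in (3, 4):                      # columns (AT, AP1, A1)
        if v is None:
            return (2, None, None)
        return (3, 1, v)                     # crisp text label, degree 1
    raise ValueError(f"unknown fuzzy type: {ftype}")

print(convert_value(2, 50))    # Habib's age in Table 8: (3, 50, None, None, None)
print(convert_value(3, "good"))
```

Applying this function to every row of every fuzzified column yields exactly the automatic part of the data conversion; the expert-driven fuzzification described later replaces some of these tuples by labels or distributions.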
Table 7. Table EMPLOYEE source

Matriculate | Name        | Salary | Age  | ···
----------- | ----------- | ------ | ---- | ---
005201      | Habib       | 2000   | 50   | ···
005202      | Mohamed Ali | 2000   | 45   | ···
005203      | José        | 2100   | 40   | ···
005204      | Amel        | 2200   | NULL | ···

1. To CREATE new tables, copy the values from the old tables and, optionally, drop the old tables and rename the new ones (the best option).
2. To ALTER the old tables and preserve the current information: We must store the old attribute values (which will be converted into fuzzy attributes) in temporary tables, modify the structure of these tables, convert the old attribute values to fuzzy attribute values, and copy them according to the new fuzzy attribute structure (Tables 2 and 3).
It is necessary to note that during the database design step, the choice of the most suitable fuzzy attribute type is a delicate task. The presence of an expert in FRDB design is strongly advised due to the complexity of assimilating the different fuzzy concepts (Ben Hassine, Touzi, & Ounelli, 2007).

Example 5: Figure 9 shows an example of conceptual schema conversion from a legacy conceptual schema to a FuzzyEER schema. Attribute Salary is transformed to FTYPE1, Age and Budget to FTYPE2, and Productivity to FTYPE3. The other attributes and primary keys remain unchanged. The constraints of the source ER schema can also be transformed to fuzzy constraints using the fuzzy (min, max) notation, as presented in the relationship Works_On. For example, on the PROJECT side, the (min, max) constraint indicates that in each project a minimum of approximately two employees must work (with a minimum degree of 0.5). At the same time, the number of employees in each project must not exceed approximately 30 (with a minimum degree of 0.25).
Step 2: Data Migration The data conversion consists in transforming the data of the RDB (crisp) to the format of the data defined by the fuzzy schema. It involves data transformations that materialize the schema transformations described above. These transformations include three stages, as shown in Figure 10. The first step consists in extracting the data from the database. This extraction takes into account the different kinds of constraints, data coherence, security, and so forth. The second step consists in converting these data in such a way that their structures coincide with the new format of the target FRDB schema. This transformation is made automatically using Algorithm 2, or manually using expert knowledge, as we will explain later. Finally, the data will be stored in the FRDB. However, during this transformation, there is a modification in the data representation for fuzzy attributes at the level of the database tables (see Tables 2 and 3 and the example in Figure 6) while introducing the data values according to their types
Table 8. Example of data conversion of the table EMPLOYEE for attribute FTYPE2 Age

Matriculate | Name        | Salary | AgeT | Age1              | Age2 | Age3 | Age4 | ···
----------- | ----------- | ------ | ---- | ----------------- | ---- | ---- | ---- | ---
005201      | Habib       | 2000   | 3    | 50                | NULL | NULL | NULL | ···
005202      | Mohamed Ali | 2000   | 4    | 1 (Id. for Adult) | NULL | NULL | NULL | ···
005203      | José        | 2100   | 6    | 40                | 30   | 50   | 10   | ···
005204      | Amel        | 2200   | 0    | NULL              | NULL | NULL | NULL | ···
(label: 4, approximate values: 6/8, unknown: 0, possibility distribution: 9-12, etc.), and at the level of the FMB tables while introducing their parameters (four parameters for trapezoidal labels, etc.). It should be noted that if we want to “fuzzify” some previous data, then the transformation is not automated. Fuzzy information may be more realistic than crisp information. For this reason, the intervention of an expert in FRDB design and in the database domain is strongly advised in order to choose the most suitable type among the different types of fuzzy values mentioned previously (Ben Hassine et al., 2007). Sometimes, the crisp data can be kept. In other cases, they will be transformed, using some standard rules, into linguistic terms, intervals, approximate values, and so forth. In some contexts, especially, NULL values may be transformed into UNKNOWN values. Algorithm 2 presents the automatic modification. This step shows the advantages of the migration towards FRDBs in terms of imprecise data modeling. It is important to note that in the legacy database, all attributes are crisp: The translation is then very easy; we have to choose between two cases, NULL or crisp. Another easy transformation is, for example, to store approximate values for FTYPE2 attributes: AT=6, A1=v, A2=v−margin, A3=v+margin, and A4=margin (the values in A2, A3, and A4 are stored only for efficiency).

Example 6: Suppose the relation Employee described in Table 7, with the schema in Figure 6, where we assigned FTYPE1 to the attribute Salary and FTYPE2 to Age. Since the attribute Salary does not undergo any transformation at the level of the database, only the attribute Age is transformed, as shown in Table 8. All fuzzy data are represented by a number. Also, every fuzzy object has an identifier in the FMB. In our example, the attribute AgeT stores the kind of data stored about the Age (see Table 2). The parameters of these data are stored in the remaining attributes.
• The employee Habib keeps the crisp value 50 years. This value (50) and its type (3) are stored respectively in the fields Age1 and AgeT.
• The age of Mohamed Ali receives the linguistic label adult, which must first be created (with the command CREATE LABEL) and stored in the FMB, as shown in Figure 7. The identifier of this label will be stored in the field Age1, while its type (4) is specified in the field AgeT.
• The employee José stores an approximate value of 40 years old with a margin of 10. We store 40 in the field Age1 and the type of this kind of value (6) in AgeT.
• The age of Amel stores the NULL value, but in this case we can translate it to UNKNOWN (because we know that every person has some age).
Step 3: Programs Migration The modification of the structure of the database requires, in the majority of cases, propagation to the level of the related programs. Since this is not the primary focus of this work, we present only an overview of programs conversion. In fact, if we want to use flexible queries and to store imprecise data, the communication between programs and the database must go through the FSQL server. The programs must be modified according to the representation, interrogation, and storage of the new data. Moreover, we must decide what to do with fuzzy values in each program. Note that migrating these programs not only means converting DBMS calls in programs, but also requires the reengineering of imperative programs to accept fuzzy values and surely the reconstruction of user interfaces. We draw our inspiration from the strategies of programs conversion proposed by Henrard et al. (2002, 2004) and Cleve (2004). One of these strategies relies on wrappers that encapsulate the FRDB. This strategy allows interaction of the application programs with the legacy data access logic through these wrappers instead of the legacy DBMS. These wrappers are in the form of a software layer permitting the translation of the previous statements written in SQL to the FSQL language (fuzzification in the statement, not in the data). The FSQL server translates in turn these queries to SQL language (defuzzification in the statement). The returned fuzzy answers will be defuzzified in order to be treated by the application programs. This process is depicted in Figure 11.

Figure 11. Programs migration based on wrappers
[Legacy programs issue standard crisp statements; the wrapper fuzzifies them (SQL ⇒ FSQL); the FSQL server defuzzifies each statement (FSQL ⇒ SQL) for the RDBMS, which carries it out against the FRDB + FMB; the resulting data are defuzzified before being returned to the legacy programs.]

The second strategy consists in rewriting the access statements in order to make them process the new data through the new fuzzy DBMS-DML. Each DML statement must be located, its parameters must be identified, and the new fuzzy DML statement sequence must be defined and inserted in the code. This task may be complex, because the legacy program will manage imprecise data instead of the legacy crisp data. The third strategy generalizes the problems met in the second strategy. In fact, the change of paradigm when moving from standard crisp data in the RDB to imprecise data in the FRDB induces problems such as whether the user now wants to use fuzzy information in FSQL statements, and the manipulation of the imprecise data returned by these statements. The program is rewritten in order to use the new fuzzy DBMS-DML at its full power and take advantage of the new data system
features. This strategy is much more complex than the previous one, since every part of the program may be influenced by the schema transformation. The most obvious steps consist of:

1. Identifying the statements and the data objects that depend on these access statements,
2. Deciding whether each statement will now be fuzzy or not,
3. Rewriting these statements and redefining their data objects, and
4. Treating the possible fuzziness in the returned answers.
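A wrapper of the first strategy could, for instance, rewrite crisp predicates into fuzzy ones before forwarding them to the FSQL server. The sketch below is a deliberately naive illustration (a real wrapper would parse the statement rather than use a regular expression; the set of fuzzy attributes and the FEQ rewriting rule are assumptions for the example):

```python
import re

def fuzzify_statement(sql, fuzzy_attrs):
    """Rewrite crisp equality comparisons on known fuzzy attributes into
    FSQL approximate comparisons (FEQ #value); other predicates are
    left untouched."""
    def repl(m):
        attr, value = m.group(1), m.group(2)
        if attr in fuzzy_attrs:
            return f"{attr} FEQ #{value}"
        return m.group(0)
    return re.sub(r"(\w+)\s*=\s*(\d+)", repl, sql)

print(fuzzify_statement("SELECT * FROM Employee WHERE Salary = 2200", {"Salary"}))
# SELECT * FROM Employee WHERE Salary FEQ #2200
```

The rewritten statement is then handed to the FSQL server, which performs the defuzzification into plain SQL for the underlying RDBMS, as described above.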
Easy Total Migration As we have shown, the migration of programs may be a very hard task in this process, but it is mandatory and essential in the total migration. With the easy total migration, we achieve a total fuzzy database (storing fuzzy values, fuzzy querying, and fuzzy data mining on fuzzy data) and also keep the existing data and applications with the minimum required modifications. The basic idea is to mix partial and total migration; that is, fuzzy attributes with fuzzy values are duplicated: one copy with fuzzy values and the other with only crisp values. In this process, we use the three
steps of the total migration with some modifications: (Step 1) schemas conversion, (Step 2) data conversion, and (Step 3) programs conversion. Steps 1 and 2 are the same, but now we preserve the old attributes. For example, if the Age attribute is fuzzified and converted to FTYPE2, then we preserve the existing Age attribute and add the new attributes AgeT, Age1, Age2, Age3, and Age4 (as in Table 8). The program conversion is now easier, but we must manage the new fuzzy attributes in some DML statements in order to keep the legacy programs running exactly as before the migration:

1. SELECT: No modifications required (except if the SELECT uses the asterisk, *, because it represents all the attributes and in the new FRDB there are more attributes).
2. DELETE: No modifications required.
3. INSERT: The values of the fuzzified attributes must also be inserted in the same row in the corresponding new fuzzy attributes (Algorithm 2).
4. UPDATE: The values of the fuzzified attributes must also be updated in the same row in the corresponding new fuzzy attributes (Algorithm 2).

The main drawback of this migration strategy is the redundancy in the fuzzified attributes (except for FTYPE1). The main advantage is the easy program migration. In some situations, this is the best option, using this strategy as an intermediate and temporary step towards a total migration.
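The INSERT/UPDATE duplication could even be pushed into the database itself. The following sketch (a hypothetical schema and trigger, using SQLite purely for illustration; the chapter does not prescribe this mechanism) keeps the duplicated fuzzy Age columns synchronized so that a legacy INSERT keeps working unchanged:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Employee (
  Matriculate TEXT PRIMARY KEY, Name TEXT, Age INTEGER,
  AgeT INTEGER, Age1, Age2, Age3, Age4   -- duplicated fuzzy columns
);
-- After a legacy INSERT, fill the fuzzy columns following Algorithm 2:
-- AT=2 for NULL, AT=3 plus the crisp value otherwise. A similar
-- AFTER UPDATE OF Age trigger would cover legacy UPDATE statements.
CREATE TRIGGER emp_age_fuzzy AFTER INSERT ON Employee BEGIN
  UPDATE Employee
  SET AgeT = CASE WHEN NEW.Age IS NULL THEN 2 ELSE 3 END,
      Age1 = NEW.Age
  WHERE Matriculate = NEW.Matriculate;
END;
""")
con.execute("INSERT INTO Employee (Matriculate, Name, Age) "
            "VALUES ('005201', 'Habib', 50)")
row = con.execute("SELECT AgeT, Age1 FROM Employee").fetchone()
print(row)  # (3, 50)
```

With such a mechanism, legacy programs never see the new columns, while flexible queries through the FSQL server can use them immediately.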
Conclusion Several real applications need to manage imprecise data and to let their users benefit from flexible querying (Bosc et al., 1998). Several theoretical solutions have been proposed (Bosc & Pivert, 1995; Buckles & Petry, 1982; Galindo, 1999; Medina et al., 1994; Umano, 1982; Zemankova-Leech & Kandel, 1985). Unfortunately, the repercussions
of these works on the practical level are negligible, even with the existence of some prototypes such as the FSQL server (Galindo et al., 1998, 2006). In this chapter, we proposed a migration approach from RDBs towards the generation of FRDBs. This approach is addressed mainly to database administrators and enterprises interested in such a fuzzy migration. We proposed three possible strategies for this migration. The first strategy, called partial migration, enjoys some of the advantages of FRDBs while preserving the existing schema, data, and applications. The second strategy, named total migration, consists in benefiting from all FRDB advantages (imprecise data storage, flexibility in queries, etc.). It involves a mapping of the existing data, schemas, and programs, while integrating the different fuzzy concepts. This strategy, based on the Henrard et al. (2002) approach, has three levels of conversion: conversion of the physical schema, which generally can be preceded by a conceptual schema conversion; data conversion; and applications conversion. We studied the first two levels in detail. The applications conversion, or migration of programs, may be a very hard task in this process, but it is mandatory and essential in the total migration and must be carried out with experts in fuzzy databases and in the database domain. The third strategy, called easy total migration, is a mixture of the previous two strategies, generally used as a temporary step. This option is easier and faster than the total migration and may be a good option in order to carry out the migration of programs slowly but steadily. As for perspectives on the future of this work, we mention (1) the automation of the conversion of the schema, data, and applications, following the algorithms defined in this chapter, and (2) the addition of an expert system to help the designer choose the appropriate fuzzy attribute types and other fuzzy objects (labels, quantifiers, etc.).
This last point would also make the use of FRDBs easier in real applications. In fact, one of the problems encountered during FRDB design is to determine the attributes (columns) likely to store fuzzy
How to Achieve Fuzzy Relational Databases Managing Fuzzy Data and Metadata
data and to choose their respective data types. The type assignment to these attributes is not an obvious task. This choice requires the designer, on the one hand, to know in detail the properties of every fuzzy attribute type and, on the other hand, to properly characterize the attribute in order to assign it the most suitable of the four fuzzy attribute types mentioned previously. We started working to solve this problem, and we have implemented an expert system to choose the suitable attribute type. It can also easily handle other fuzzy models by enriching its knowledge base. We are now working to integrate this expert system so that the migration process will be automated. We think that the automation of this migration will be a useful starting point for generalizing FDBs in the real database world.
Acknowledgment This work has been partially supported by the “Ministry of Education and Science” of Spain (projects TIN2006-14285 and TIN2006-07262) and the Spanish “Consejería de Innovación Ciencia y Empresa de Andalucía” under research project TIC-1570.
References Behm, A., Geppert, A., & Dittrich, K. R. (1997). On the migration of relational schemas and data to object-oriented database systems. In J. Gyrks, M. Krisper, & H. C. Mayr (Eds.), Proceedings of the 5th International Conference on Re-Technologies for Information Systems (pp. 13-33). Klagenfurt, Austria: Oesterreichische Computer Gesellschaft. Bellman, R. E., & Zadeh, L. A. (1970). Decision-making in a fuzzy environment. Management Science, 17(4), 141-175. Ben Hassine, M. A., Ounelli, H., Touzi, A. G., & Galindo, J. (2007, July). A migration approach from
crisp databases to fuzzy databases. In Proceedings of the IEEE International Conference FUZZ-IEEE 2007, London, UK (pp. 1872-1879). Ben Hassine, M. A., Touzi, A. G., & Ounelli, H. (2007). About the choice of data type in a fuzzy relational database. In B. Gupta (Ed.), Proceedings of the 22nd International Conference on Computers and Their Applications (CATA-2007) (pp. 231-238). Bosc, P. (1999). Fuzzy databases. In J. Bezdek (Ed.), Fuzzy sets in approximate reasoning and information systems (pp. 403-468). Boston: Kluwer Academic Publishers. Bosc, P., Liétard, L., & Pivert, O. (1998). Bases de données et flexibilité: Les requêtes graduelles. Techniques et Sciences Informatiques, 17(3), 355-378. Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17. Bosc, P., & Pivert, O. (1997). Fuzzy queries against regular and fuzzy databases. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems. Dordrecht: Kluwer Academic Publishers. Bosc, P., & Pivert, O. (2000). SQLf query functionality on top of a regular relational database management. In O. Pons, M. A. Vila, & J. Kacprzyk (Eds.), Knowledge management in fuzzy databases (pp. 171-190). Heidelberg: Physica-Verlag. Buckles, B. P., & Petry, F. E. (1982). A fuzzy representation of data for relational databases. Fuzzy Sets and Systems, 7, 213-226. Carrasco, R., Vila, M. A., & Galindo, J. (2003). FSQL: A flexible query language for data mining. Enterprise Information Systems, IV, 68-74. Hingham, MA: Kluwer Academic Publishers. Cleve, A. (2004). Data centered applications conversion using program transformations. Unpublished doctoral dissertation, Namur.
Cleve, A., Henrard, J., & Hainaut, J.-L. (2005). Co-transformations in information system reengineering. In Proceedings of the 2nd International Workshop on Meta-Models, Schemas and Grammars for Reverse Engineering. Electronic Notes in Theoretical Computer Science, 137(3), 5-15. De Caluwe, R., & De Tré, G. (Eds.). (2007). Preface to the special issue on advances in fuzzy database technology. International Journal of Intelligent Systems, 22(7), 662-663. Dubois, D., & Prade, H. (1997). Using fuzzy sets in flexible querying: Why and how? In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems (pp. 45-60). Kluwer Academic Publishers. Elmasri, R., & Navathe, S. B. (2006). Fundamentals of database systems (5th ed.). Addison-Wesley. Galindo, J. (1999). Tratamiento de la imprecisión en bases de datos relacionales: Extensión del modelo y adaptación de los SGBD actuales. Doctoral dissertation, University of Granada, Spain. Retrieved February 9, 2008, from http://www.lcc.uma.es Galindo, J. (2005). New characteristics in FSQL: A fuzzy SQL for fuzzy databases. WSEAS Transactions on Information Science and Applications, 2(2), 161-169. Galindo, J. (2007). FSQL (fuzzy SQL): A fuzzy query language. Retrieved February 9, 2008, from http://www.lcc.uma.es/~ppgg/FSQL Galindo, J., Aranda, M. C., Caro, J. L., Guevara, A., & Aguayo, A. (2002). Applying fuzzy databases and FSQL to the management of rural accommodation. Tourist Management Journal, 23(6), 623-629. Galindo, J., Medina, J. M., & Aranda, M. C. (1999). Querying fuzzy relational databases through fuzzy domain calculus. International Journal of Intelligent Systems, 14, 375-411. Galindo, J., Medina, M., Pons, O., & Cubero, J. C. (1998). A server for fuzzy SQL queries. In T.
Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Lecture Notes in Artificial Intelligence (Vol. 1495): Flexible query answering systems (pp. 164-174). Springer. Galindo, J., Urrutia, A., Carrasco, R., & Piattini, M. (2004c). Relaxing constraints in enhanced entity-relationship models using fuzzy quantifiers. IEEE Transactions on Fuzzy Systems, 12(6), 780-796. Galindo, J., Urrutia, A., & Piattini, M. (2004a). Fuzzy aggregations and fuzzy specializations in fuzzy EER model. In K. Siau (Ed.), Advanced topics in database research (pp. 105-126). Hershey, PA: Idea Group Publishing. Galindo, J., Urrutia, A., & Piattini, M. (2004b). Representation of fuzzy knowledge in relational databases. In Proceedings of the 15th International Workshop on Database and Expert Systems Applications (pp. 917-921). IEEE Computer Society. Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group Publishing. Goncalves, M., & Tineo, L. (2006). SQLf vs. Skyline: Expressivity and performance. In Proceedings of the 15th IEEE International Conference on Fuzzy Systems Fuzz-IEEE 2006, Vancouver, Canada (pp. 2062-2067). Henrard, J., Cleve, A., & Hainaut, J.-L. (2004). Inverse wrappers for legacy information systems migration. In T. Philippe & V. D. H. Willem-Jan (Eds.), Proceedings of the 1st International Workshop on Wrapper Techniques for Legacy Systems (WRAP'04) (pp. 30-43). Technische Universiteit Eindhoven. Henrard, J., Hick, J. M., Thiran, P., & Hainaut, J.-L. (2002). Strategies for data reengineering. In Proceedings of the 9th Working Conference on Reverse Engineering (pp. 211-220). IEEE Computer Society Press. Jahnke, J. H., & Wadsack, J. P. (1999). Varlet: Human-centered tool support for database reengineering. In J. Ebert, B. Kullbach, & F. Lehner
(Eds.), Proceedings of the Workshop on Software-Reengineering, Bad Honnef, Germany. Jeusfeld, M., & Johnen, U. A. (1994). An executable meta model for re-engineering of database schemas. In Proceedings of the Conference on the Entity-Relationship Approach (pp. 533-547). Manchester, UK: Springer-Verlag. Kacprzyk, J., & Zadrożny, S. (1995). FQUERY for Access: Fuzzy querying for Windows-based DBMS. In P. Bosc & J. Kacprzyk (Eds.), Fuzziness in database management systems (pp. 415-433). Heidelberg, Germany: Physica-Verlag. Kacprzyk, J., & Zadrożny, S. (1999). The paradigm of computing with words in intelligent database querying. In L. A. Zadeh & J. Kacprzyk (Eds.), Computing with words in information intelligent systems (Part 1. Foundations, Part 2. Applications, pp. 382-398). Heidelberg/New York: Springer-Verlag. Kacprzyk, J., & Zadrożny, S. (2000). On a fuzzy querying and data mining interface. Kybernetika, 36, 657-670. Kacprzyk, J., & Zadrożny, S. (2001). Computing with words in intelligent database querying: Standalone and Internet-based applications. Information Sciences, 134, 71-109. Maraboli, R., & Abarzua, J. (2006). FSQL-f representación y consulta por medio del lenguaje PL/PGSQL de información imperfecta. Degree thesis, Universidad Católica del Maule, Ingeniero en Computación e Informática, Chile. Medina, J. M., Pons, O., & Vila, M. A. (1994a). GEFRED: A generalized model of fuzzy relational databases. Information Sciences, 76(1-2), 87-109. Medina, J. M., Pons, O., & Vila, M. A. (1994b). An elemental processor of fuzzy SQL. Mathware and Soft Computing, 3, 285-290. Medina, J. M., Pons, O., & Vila, M. A. (1995). FIRST: A fuzzy interface for relational systems. In Proceedings of the 6th International Fuzzy Systems Association World Congress, Brazil.
Menhoudj, K., & Ou-Halima, M. (1996). Migrating data-oriented applications to a relational database management system. In Proceedings of the 3rd International Workshop on Advances in Databases and Information Systems (ADBIS 1996) (pp. 102-108), Moscow. Pedrycz, W., & Gomide, F. (1998). An introduction to fuzzy sets: Analysis and design. MIT Press. ISBN 0-262-16171-0. Petry, F. E. (1996). Fuzzy databases: Principles and applications (International Series in Intelligent Technologies). Kluwer Academic Publishers (KAP). Prade, H., & Testmale, C. (1987). Fuzzy relational databases: Representational issues and reduction using similarity measures. Journal of the American Society of Information Sciences, 38(2), 118-126. Silberschatz, A., Korth, H. F., & Sudarshan, S. (2006). Database systems concepts (5th ed.). McGraw-Hill. Tahani, V. (1977). A conceptual framework for fuzzy query processing: A step toward very intelligent database systems. Information Processing and Management, 13, 289-303. Takahashi, Y. (1993). Fuzzy database query languages and their relational completeness theorem. IEEE Transactions on Knowledge and Data Engineering, 5(1), 122-125. Thiran, P., & Hainaut, J.-L. (2001). Wrapper development for legacy data reuse. In Proceedings of the 8th Working Conference on Reverse Engineering (pp. 198-207). Washington, DC: IEEE Computer Society Press. Tilley, S. R., & Smith, D. B. (1996). Perspectives on legacy system reengineering. Carnegie Mellon University: Software Engineering Institute. Tineo, L. (2000). Extending RDBMS for allowing fuzzy quantified queries. In M. Revell (Ed.), Lecture Notes in Computer Science, 1873, 407-416. Berlin: Springer-Verlag.
Umano, M. (1982). Freedom-O: A fuzzy database system. In M. Gupta & E. Sanchez (Eds.), Fuzzy information and decision processes (pp. 339-347). Amsterdam: North-Holland. Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy relational model of fuzzy data. Journal of Intelligent Information Systems, 3, 7-27. Urrutia, A., Galindo, J., & Piattini, M. (2002). Modeling data using fuzzy attributes. In Proceedings of the 22nd International Conference of the Chilean Computer Science Society (SCCC’02) (pp. 117-123). Chile: Computer Science Society. Wu, B., Lawless, D., Bisbal, J., Richardson, R., Grimson, J., Wade, V., & O’Sullivan, D. (1997). The butterfly methodology: A gateway-free approach for migrating legacy information systems. In Proceedings of the 3rd IEEE Conference on Engineering of Complex Computer Systems (ICECCS97) (pp. 200-205). Italy: IEEE Computer Society. Retrieved February 9, 2008, from https://www.cs.tcd.ie/publications/tech-reports/reports.99/TCD-CS-1999-15.pdf Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8(3), 338-353. Zadeh, L. A. (1975). The concept of a linguistic variable and its application to approximate reasoning (parts I, II, and III). Information Sciences, 8, 199-251, 301-357 ; 9, 43-80. Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1), 3-28. Zadeh, L. A. (1983). A computational approach to fuzzy quantifiers in natural languages. Computational Mathematics Applications, 9, 149-184. Zadeh, L. A. (1992). Knowledge representation in fuzzy logic. In An introduction to fuzzy logic applications in intelligent systems. Kluwer Academic. Zemankova-Leech, M., & Kandel, A. (1985). Implementing imprecision in information systems. Information Sciences, 37, 107-141.
Key Terms CDEG Function: Function defined in FSQL to compute the Compatibility DEGree of each row. This value is the fulfillment degree of each row with respect to the fuzzy condition included in the WHERE clause of a SELECT statement in the FSQL language. This function may be used with an attribute as argument, in which case it computes the fulfillment degree for that specific attribute. If the argument is the asterisk symbol, *, then it computes the fulfillment degree using the whole condition, even when it includes fuzzy conditions on different attributes. Fuzzy Attribute: In a database context, a fuzzy attribute is an attribute of a row or object in a database, which allows querying by fuzzy information and/or storing this kind of information. Fuzzy Database: If a regular or classical database is a structured collection of records or data stored in a computer, a fuzzy database is a database which is able to deal with uncertain or incomplete information using fuzzy logic. There are many forms of adding flexibility in fuzzy databases. The simplest technique is to add a fuzzy membership degree to each record, that is, an attribute in the range [0,1]. However, there are other kinds of databases allowing fuzzy values to be stored in fuzzy attributes using fuzzy sets, possibility distributions, or fuzzy degrees associated with some attributes and with different meanings (membership degree, importance degree, fulfillment degree, etc.). Of course, fuzzy databases should allow fuzzy queries using fuzzy or nonfuzzy data, and there are some languages that allow these kinds of queries, like FSQL or SQLf. In summary, the research in fuzzy databases includes the following four areas: flexible querying in classical or fuzzy databases, extending classical data models in order to achieve fuzzy databases (fuzzy relational databases, fuzzy object-oriented databases, etc.), fuzzy data mining techniques, and applications of these advances in real databases.
FSQL (Fuzzy SQL): Extension of the popular language SQL that allows the management of
fuzzy relational databases using fuzzy logic. Basically, FSQL defines new extensions for fuzzy queries, extending the SELECT statement, but it also defines other statements. One of these fuzzy items is the definition of fuzzy comparators, based mainly on possibility and necessity theory. Besides, FSQL allows the definition of linguistic labels (like hot, cold, tall, short, etc.) and fuzzy quantifiers (most, approximately 5, near the half, etc.). The most recent publication about FSQL is the book Fuzzy Databases: Modeling, Design and Implementation by Galindo et al. (2006). Fuzzy Comparators: Different techniques to compare two values using fuzzy logic. FSQL defines fuzzy comparators like FEQ (fuzzy equal), NFEQ (necessarily fuzzy equal), FGT (fuzzy greater than), NFGT (necessarily fuzzy greater than), and so forth. Fuzzy Metaknowledge Base (FMB): In a fuzzy database, the FMB is the extension of the data dictionary in order to store the fuzzy metadata, that is, information about fuzzy objects: the fuzzy data type of each fuzzy attribute, the definition of labels, the margin for approximate values, the minimum distance for very separated values, fuzzy quantifiers, and so forth. Fuzzy Migration: Migration from crisp databases towards fuzzy databases in order to introduce imprecise/fuzzy information in current information systems. This fuzzy migration consists in deriving a new database from a legacy database and in adapting data, metadata, and the software components accordingly. It does not only constitute the adoption of a new technology, but also the adoption of a new paradigm. Fuzzy Query: Query with imprecision in the preferences about the desired items. These preferences may usually be set using fuzzy conditions in the queries.
These fuzzy conditions include many possible forms like fuzzy preferences (e.g., I prefer bigger than cheaper), fuzzy labels (e.g., hot and cold), fuzzy comparators (e.g., approximately greater or equal than), fuzzy quantifiers (e.g., most
or approximately the half), and so forth. One basic target in a fuzzy query is to rank the resulting items according to their fulfillment degree (usually a number between 0 and 1). FuzzyEER Model: Conceptual modeling tool, which extends the Enhanced Entity Relationship (EER) model with fuzzy semantics and fuzzy notations to represent imprecision and uncertainty in the entities, attributes, and relationships. The basic concepts introduced in this model are fuzzy attributes, fuzzy entities, fuzzy relations, fuzzy degrees, fuzzy degrees in specializations, and fuzzy constraints. A complete definition of this model is published in the book Fuzzy Databases: Modeling, Design and Implementation (Galindo et al., 2006). Legacy System: Existing system in a specific context, for example, an existing database. SQL (Structured Query Language): A computer language used to create, retrieve, update, and delete data from relational database management systems. SQL has been standardized by both ANSI and ISO. It includes DML (Data Manipulation Language) and DDL (Data Definition Language). The statement for querying is the SELECT command.
Endnotes
1. Oracle is possibly the most powerful database system. The latest versions are object-relational databases designed for grid computing. Some distributions are free but with some limits (such as storing up to 4 GB of user data). It began three decades ago with only one relational database and currently runs on all major operating systems, including Linux, UNIX (AIX, HP-UX, Mac OS X, Solaris, Tru64), and Windows. Official web page: http://www.oracle.com
2. PostgreSQL is a powerful, open source relational database system. It has more than 15 years of active development and a proven architecture that has earned it a strong reputation for reliability, data integrity, and correctness. It runs on all major operating systems, including Linux, UNIX (AIX, BSD, HP-UX, SGI IRIX, Mac OS X, Solaris, Tru64), and Windows. Official web page: http://www.postgresql.org
3. A DBRE is a technology used to recover the conceptual schema that expresses the semantics of the source data structure.
Chapter XV
A Tool for Fuzzy Reasoning and Querying Geraldo Xexéo Universidade Federal do Rio de Janeiro, Brazil André Braga IBM Brazil, Brazil
Abstract We present CLOUDS, which stands for C++ Library Organizing Uncertainty in Database Systems, a tool that allows the creation of fuzzy reasoning systems over classic, nonfuzzy, relational databases. CLOUDS can be used in three flavors: CLOUDS API, a C++ API; CLOUDS-L, a compiled language; and CLOUDSQL, a fuzzy extension to SQL queries (ANSI, 1992). It was developed using the object-oriented paradigm and has an extensible architecture based on a main control system that manages different models and runs queries and commands defined in them. As a test, it was incorporated into a geographic information system and used to analyze epidemiological data.
CLOUDS: Tools for Fuzzy Reasoning and Querying This chapter describes CLOUDS (C++ Library Organizing Uncertainty in Database Systems), a set of tools that allows a programmer to create or extend a database-based system with a fuzzy query engine and fuzzy reasoning capabilities. It also describes its first real-life application, the extension of an epidemiological geographic information system, GISEpi (Nobre, Braga, Pinheiro, & Lopes, 1997). CLOUDS is open source and can be downloaded from SourceForge.1
Real world data are seldom as correct, exact, well defined, or well understood as our relational databases lead us to believe. Typically, we use approximation or intervals to deal with information uncertainty, often in a natural and unconscious way and at times with a clear loss of semantics. Fuzzy sets and fuzzy logic are well-established theories used to represent uncertainty in control systems (Klir & Yuan, 1995). Database researchers have also used them to model different forms of information uncertainty, creating fuzzy databases. Galindo, Urrutia, and Piattini (2006) provide a good review of different fuzzy database models.
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
One of the most promising applications of fuzzy databases is representing uncertainty in geographic information systems (GIS) (Bosc, Kraft, & Petry, 2005), which are plagued by different imperfections in data, like imprecise representation of terrain or statistical errors in data acquisition (Bolstad, 2005). GIS are an important decision-making tool for governmental, nongovernmental, and private organizations (Jankowski & Nyerges, 2001). To fulfill this role, they can be extended with a decision support system, possibly with a rule-based decision system built on knowledge collected from human experience. Again, fuzzy systems have established themselves as a good implementation strategy for rule-based decision systems, due to their capability of implementing human reasoning, including characteristics such as imprecision in data evaluation or simultaneously applying different rules (Pedrycz & Gomide, 1998). Motivated by the faulty data found in a real-world health-care GIS application (Nobre et al., 1997), due to quality problems in data gathering and compilation, and by the need to use imprecise judgments when deciding whether to implement governmental health-care policies, we decided to extend it with a fuzzy module supporting data analysis and decision making. The result is CLOUDS, a portable library that can be easily used in different database applications. The library, developed in C++, contains tools for processing fuzzy SQL queries, for describing different fuzzy models over a relational database, and for defining rules used in a fuzzy inference engine. In the next section, we give a short review of the main topics discussed in this chapter. The third section introduces CLOUDS. The fourth section describes CLOUDS-L in detail. The fifth section describes its use in GISEpi. The sixth section presents the conclusions.
Basic Concepts We assume that the reader is aware of the main developments in fuzzy sets and fuzzy logic. However, we would like to review a few basic concepts that are the basis of our proposal.
Fuzzy Systems and Fuzzy Reasoning “Fuzzy systems”2 is a general term encompassing all kinds of systems that use, in some part of their architecture, a mechanism based on fuzzy logic or fuzzy set theory. The traditional implementation strategy requires a main fuzzy engine that is isolated from the nonfuzzy (crisp) part of the system by crisp to fuzzy and fuzzy to crisp converters. Crisp to fuzzy conversion is known as fuzzification or fuzzy encoding. Fuzzy to crisp conversion is called defuzzification or fuzzy decoding (Klir & Yuan, 1995). Fuzzy propositions are logical statements that assume a fuzzy value. They can be conditional, qualified, or both, as well as simple. A simple, unconditional, and unqualified proposition states that a variable element belongs to a fuzzy set, as in “the age of x is old.” For a particular element, the degree of truth of the proposition is interpreted as the degree of membership of this element to the fuzzy set. In this way, any fuzzy proposition can be interpreted as a possibility distribution function that is equal to the membership function of the fuzzy set. A simple proposition p has the forms (Klir & Yuan, 1995) “p : V is F” or “p : V(i) is F” if it is important to discuss the individual element referred to by the proposition, as in “The age of John is old.” A qualified fuzzy proposition is a simple fuzzy proposition modified by a fuzzy truth qualifier or a fuzzy probability qualifier. Conditional propositions discuss the implication of one fuzzy proposition from another proposition, as in “If age is old, then strength is feeble.” Conditional propositions are equivalent to fuzzy implications. Among other options, it is common for a fuzzy system to use rules, or fuzzy implications, to formally represent knowledge. Although a single type of rule does not exist, the if-then representation is a standard. 
A basic if-then rule can be written in the form: IF a1 AND a2 … AND an THEN b, where ai, 1≤ i ≤ n, and b are simple fuzzy propositions. Like in the standard expert system approach,
each rule represents some reasonable assumption about the actions (output values) that should be taken in the case of that state of the system (input values). Both input and output values are described by fuzzy sets. All rules work in parallel. A fuzzy reasoning engine can be seen as performing three steps:
1. Deciding which rules to fire, based on the degree of truth of the antecedent,
2. Calculating the value of the consequent for each fired rule, and
3. Calculating a consolidated result.
This is known as the compositional inference engine (Zadeh, 1973) and is the simplest and most common deductive process used in fuzzy systems. To execute the second step, it is necessary to have a way to calculate the value of the consequent based on the value of the antecedent; that is, we need a function to represent the fuzzy implication. This function can be deduced using generalized modus ponens (Klir & Yuan, 1995). There are two main methods of inference in compositional fuzzy systems: the min-max (Mamdani) and the fuzzy additive method (Cox, 1994). In the min-max method, used by default in CLOUDS, the consequent membership function is restricted to the minimum of the predicate truth, and the compound result is the maximum of all these fuzzy sets, which is not compatible with fuzzy logic in the narrow sense, but achieves good practical results.
The Linguistic Approach to Fuzzy Systems The linguistic approach is Zadeh’s original idea for developing fuzzy systems. It is based on two main concepts: the linguistic variable and the linguistic term. A linguistic variable represents a concept that is measurable in some way, either objectively or subjectively, like “temperature” or “desire.” Linguistic variables are properties of an object or situation. Linguistic terms subjectively
rate the characteristic denoted by a linguistic variable. A linguistic term is a fuzzy set, and the linguistic variable defines its domain. For example, if “water temperature” is a linguistic variable, its values could be the linguistic terms “freezing,” “cold,” “warm,” “hot,” and “boiling.” Each linguistic term should have a membership function mapping measurable temperatures, such as 0 °C to 100 °C, to membership degrees. Every adequate representation of a fuzzy set involves the basic understanding of five related conceptual symbols as defined by Turksen (1991):
• the set of elements θ∈Θ, such as a “person” from a “group of friends”;
• the linguistic variable V, which is a label for one of the attributes of the elements θ∈Θ, such as the “age” of the “person”;
• the linguistic term A, which is an adjective or adverb describing the linguistic variable, chosen among the set of linguistic terms, such as “young,” which describes the “age”;
• a referential set X ⊂ [-∞,∞], which is a measurable numerical assignment interval for a particular attribute V of a set of elements θ∈Θ, such as “[0,120] years” to “age”; and
• a subjective numeric attribution µA(θ) of the membership value, which is the membership degree of the element identified by the linguistic term A when labeled by the linguistic variable V. For example, for age 40, the membership value in set “young” could be 0.3.
Therefore, for a linguistic variable V, there will be a measurement process resulting, for each element θ∈Θ, in a measured value mv ∈ [α,β], where α,β are, respectively, the greatest lower bound and least upper bound of the domain. To interpret this measurement, we define subjective notions as the linguistic terms A0, A1, A2, ..., An and their membership functions µ0(x), µ1(x), µ2(x), ..., µn(x), with domain [α,β] and range [0,1]. Applying µi to mv, we obtain the membership degree for the element θ in the set Ai, which represents the degree of accomplishment of the linguistic term Ai when it is used to express a subjective measure of V.
Fuzzy Databases and Fuzzy Extensions to SQL In this section, we briefly discuss the two main fuzzy models used to extend relational databases and describe the main characteristics associated with fuzzy SQL or other fuzzy querying languages.
Extending Relational Databases with Fuzzy Theory There are several proposals for fuzzy database systems (Galindo et al., 2006). The main lines of work, described by Petry (1996), extend the relational model with one of two fuzzy models: possibility-based or similarity-based. Similarity-based models generalize the concept of relation to the concept of fuzzy relation, working with similarity tables to define the similarity, the degree of redundancy, and the uniqueness of a tuple. For example, a basic similarity model could fuzzify a column describing a qualitative opinion of movies with domain (“excellent,” “good,” “average,” or “bad”), saying that a good movie is 80% excellent. A fuzzy database will use this knowledge to also retrieve “good” movies (at 80% membership value) when queried for “excellent” movies. Possibility-based models use possibility distributions to represent ill-defined concepts and incomplete information within a tuple. Each tuple, in whole or in part, is paired with a membership value, which describes the relevance of the tuple to the relation. In this way, the fuzzy relational model uses (new) relations: Rf : D1 × D2 × ... × Dn → [0,1] where D1, D2, …, Dn are the domains of the original relation. It is possible to select and eliminate redundant tuples: given two tuples ti and tj; ti = (di1,
di2, …, din), with membership value µi; and tj = (dj1, dj2, …, djn), with membership value µj; if dik = djk, ∀k, k = 1, …, n, and i ≠ j, then ti = tj and the merged tuple takes the membership value min(µi, µj). The possibility-based model represents a standard fuzzy set extension over the definition of relational tables as relations, that is, sets of tuples, requiring each tuple to have a membership value to the relation. These membership values can be assigned different semantic meanings, such as to indicate precision, timeliness, confidence in the value of data, and so forth. It is also possible to define other models that merge characteristics of these two models, such as the GEFRED model by Medina, Pons, and Vila (1994). Actually, both models are mostly orthogonal, since one is built upon the values themselves, while the other is built upon the membership of the tuple (and its values) to the relation. Considering the fuzzification of domains, one can see that fuzzy relational databases are not necessarily in first normal form, since their fuzzified domain values will not necessarily be singletons. For example, if a traditional (crisp) database allows the values of “size” as a domain to be “high,” “medium,” and “low,” a fuzzy database should allow something like “(high(0.4) and low(0.3))”. This can be implemented as one tuple or, to keep the first normal form, as multiple tuples, each one representing one of these values. For a complete overview of different fuzzy database models, the reader is directed to the second chapter of Galindo et al. (2006, pp. 45-57).
Fuzzy Queries
The importance of relational databases for any organization rests on two main concepts: highly structured, application-independent data stored in tables, and a language that allows ad-hoc queries: SQL (ANSI, 1992). SQL, which stands for Structured Query Language, is probably the most widely used computer language in organizations. Almost every program, regardless of the language in which it is implemented, has to push data to and pull data from a database management system (DBMS) through an
A Tool for Fuzzy Reasoning and Querying
Table 1. The 18 fuzzy comparators for FSQL (fuzzy SQL): 16 in the possibility/necessity family and 2 in the inclusion family (Galindo et al., 2006)

Possibility       | Necessity            | Meaning
FEQ or F=         | NFEQ or NF=          | Possibly/Necessarily Fuzzy Equal to…
FDIF, F!= or F<>  | NFDIF, NF!= or NF<>  | Possibly/Necessarily Fuzzy Different from…
FGT or F>         | NFGT or NF>          | Possibly/Necessarily Fuzzy Greater Than…
FGEQ or F>=       | NFGEQ or NF>=        | Possibly/Necessarily Fuzzy Greater than or Equal to…
FLT or F<         | NFLT or NF<          | Possibly/Necessarily Fuzzy Less Than…
FLEQ or F<=       | NFLEQ or NF<=        | Possibly/Necessarily Fuzzy Less than or Equal to…
MGT or F>>        | NMGT or NF>>         | Possibly/Necessarily Much Greater Than…
MLT or F<<        | NMLT or NF<<         | Possibly/Necessarily Much Less Than…
FINCL             | INCL                 | Fuzzy Included in… / Included in…
SQL interface. Any fuzzy database model is much more useful if enhanced by a fuzzy extension to SQL or another query language. Actually, fuzzy querying can be developed over traditional crisp databases as well, providing a powerful mechanism for describing imprecise questions. SQL is divided into two subsets (ANSI, 1992): the Data Definition Language (DDL) and the Data Manipulation Language (DML). DML has four basic commands: SELECT, UPDATE, INSERT, and DELETE. Most fuzzy SQL extensions, including CLOUDS, deal only with the SELECT statement, which retrieves data from the database according to a set of restrictions. Although they may seem to play a minor role, SELECT statements are probably the most important ones, since they allow for the dynamic creation of tables and, consequently, reports. Bosc and Pivert (1997) describe three approaches for imprecise querying:

• Separating the precise part of the query from the imprecise part and using a ranking mechanism to unify the answer,
• Translating the imprecise query to a precise query using ranges and then fuzzifying the result, and
• Directly implementing the fuzzy query.
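The second approach above can be sketched concretely: translate the imprecise predicate into a crisp range query over the term's support, then attach a membership degree to each returned row. The table, column, and the triangular term "about 50" (with the [44, 56] support used earlier in the chapter) are illustrative assumptions, not definitions from CLOUDS or FSQL.

```python
# Sketch of approach 2: crisp range query + fuzzification of the result.
import sqlite3

def about(x, center=50.0, spread=6.0):
    """Triangular membership for 'about <center>' (support [44, 56])."""
    return max(0.0, 1.0 - abs(x - center) / spread)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (store TEXT, copies INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("s1", 50), ("s2", 47), ("s3", 70)])

# Crisp range = support of the fuzzy term (membership > 0)
lo, hi = 50 - 6, 50 + 6
rows = conn.execute(
    "SELECT store, copies FROM orders WHERE copies BETWEEN ? AND ?",
    (lo, hi)).fetchall()

# Fuzzify the crisp answer: attach a membership degree and rank
ranked = sorted(((about(c), s) for s, c in rows), reverse=True)
# ranked → [(1.0, 's1'), (0.5, 's2')]; 's3' falls outside the support
```

Only a front-end to the DBMS is needed: the database itself answers an ordinary range query, and the ranking happens outside it.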
These three mechanisms lead to different implementations. In the first two, it is only necessary to build a front-end to the database, and one can rely heavily on translating the fuzzy query to SQL. The third mechanism implies access to database internals. One of the authors has previously implemented such an approach using an open-source object-oriented database engine (Boullosa, Cruz, & Xexéo, 1999). Several authors propose fuzzy query languages, most of them extending SQL and some of them using a Prolog-like syntax, such as Liu and Li (1990). Although the existence and use of a fuzzy query language is not tied to an underlying fuzzy database model, it does imply the definition of a fuzzy model over the database, with fuzzification and defuzzification rules. Most of the proposals extend SQL with fuzzy manipulation operators, such as the 18 fuzzy comparators proposed in FSQL (Galindo et al., 2006) and described in Table 1. The first requirement is to allow the use of fuzzy linguistic terms, or fuzzy numbers, in the queries. In addition, it must be possible to indicate the minimum membership value acceptable in a retrieved tuple. Li and Liu (1990), and other authors, use the basic syntax described in Program 1.
Program 1. The basic fuzzy extension to SQL

1. SELECT
The use of a fuzzy SQL language can be mapped to a fuzzy relational calculus or algebra (Galindo, Medina, & Aranda, 1999; Umano & Fukami, 1994). Other solutions avoid this formal approach and directly translate the SQL statement into fuzzy model definitions and formulae. It is also possible to map the fuzzy SQL expression into SQL and then reprocess the output given by the database to obtain a fuzzy answer. It is necessary to redefine the arithmetic and relational operators used in the WHERE clause to deal with fuzzy concepts. The arithmetic operators are easily redefined through the extension principle (Klir & Yuan, 1995; Pedrycz & Gomide, 1998). The classical relational operators return true (1) or false (0), while their fuzzy extensions return membership values. Different extensions have been suggested to express equality between fuzzy sets, such as functions based on inclusion, similarity, semantic distance, and compatibility. SQLf (Bosc & Pivert, 1995) is an SQL extension supporting fuzzy values and a great variety of fuzzy queries. Its basic principle is the introduction of fuzziness at two levels of the WHERE clause of the SELECT statement: in the predicates themselves and in the way they are combined. We would like to direct the reader to two recent proposals that have something in common with CLOUDS:
FRIL++ and FSQL. Cao and Rossiter (2003) have proposed FRIL++, a programming language for deductive probabilistic and fuzzy object-oriented databases. Their approach, however, does not use a fuzzy reasoning engine. FRIL++ uses a Prolog-like syntax, while CLOUDS is similar to SQL. FSQL (Galindo et al., 2006) is a very complete fuzzy extension to SQL. Although extensive, and also modifiable via an "ALTER FSQL" command (p. 254), it is not extensible. CLOUDS, although limited in scope, is based on an extensible architecture, built over abstract classes and open-source code. In this book, the reader can find a chapter by Urrutia, Tineo, and Gonzalez comparing the SQLf and FSQL languages.
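To make the possibility-style comparators of Table 1 more tangible, the sketch below implements one common reading of a fuzzy-equality comparator in the spirit of FEQ: the possibility that two fuzzy values are equal, computed as the height of the intersection, sup_x min(µA(x), µB(x)). This is an assumption of one textbook definition, not necessarily FSQL's exact semantics; the triangular shapes are invented for illustration.

```python
# Sketch: fuzzy equality as the height of the intersection of two fuzzy sets.
def triangular(a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    def mu(x):
        if a <= x <= b and b != a:
            return (x - a) / (b - a)
        if b <= x <= c and c != b:
            return (c - x) / (c - b)
        return 1.0 if x == b else 0.0
    return mu

def feq(mu_a, mu_b, domain):
    """Possibility of equality: sup over the domain of min(muA, muB)."""
    return max(min(mu_a(x), mu_b(x)) for x in domain)

about_50 = triangular(45, 50, 55)
about_53 = triangular(48, 53, 58)
domain = [x / 10.0 for x in range(400, 700)]   # 40.0 .. 69.9 in 0.1 steps
# feq(about_50, about_53, domain) ≈ 0.7: the two terms overlap substantially
```

A necessity-style comparator (NFEQ) would instead use inf_x max(1 − µA(x), µB(x)) over the same domain.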
Overview of CLOUDS
CLOUDS is a library that acts as a fuzzy front-end to relational databases. It is based on the linguistic approach to fuzzy systems. CLOUDS can be used in three different ways:

• CLOUDS API, a set of C++ classes and methods that implements all functions;
• CLOUDSQL, which implements a fuzzy extension to SQL SELECT statements; and
• CLOUDS Language (CLOUDS-L), which accepts CLOUDSQL statements, and a compiler that translates CLOUDS-L directly to API commands.
A typical CLOUDS-L session involves three phases:

• Describing the fuzzy model (input and fuzzification),
• Executing the model (fuzzy engine), and
• Providing output (defuzzification and output).
To give the reader a taste of CLOUDS, a CLOUDS-L example is shown in Program 2. In Program 2, we process the following steps:

• The "DATA" declaration (line 1) opens the table "Malaria1." The chosen table is in the SIGEPE format and is, by default, a file named "Malaria1."
• The "VARINT" declarations (lines 2 and 3) create two linguistic variables, with integer domains, over columns LAMEXA (number of examined patients) and LAMPOSTOT
Program 2. A simple program in CLOUDS Language (numbers used only for reference)

1.  DATA Malaria1 TYPE SIGEPIDADOS
2.  VARINT LAMEXA IN Malaria1 AS LAMEXA AUTOMATIC
3.  VARINT LAMPOSTOT IN Malaria1 AS LAMPOSTOT AUTOMATIC
4.  VARSOLREAL Relevance IN Malaria1 AS Relevance FROM 0 TO 100
5.  TERM Low IN LAMPOSTOT CURVE TRIANGULAR IN 0 0.0 0.5 ALPHA 0.2
6.  TERM Average IN LAMPOSTOT CURVE TRIANGULAR IN 0 0.5 1.0 ALPHA 0.2
7.  TERM High IN LAMPOSTOT CURVE TRIANGULAR IN 0.5 1.0 1.0 ALPHA 0.2
8.  TERM Average IN LAMEXA
9.  TERM High IN LAMEXA
10. HEDGE Very IN Low CONCENTRATOR FACTOR 2
11. TERM Very-Low IN LAMEXA
12. RULE R1 IF [{LAMEXA=High} AND {LAMPOSTOT=High}] THEN Relevance High
13. RULE R2 IF [{LAMEXA=High} AND {LAMPOSTOT=Average}] THEN Relevance Average
14. RULE R3 IF [{LAMEXA=Average} AND {LAMPOSTOT=Average}] THEN Relevance Average
15. RULE R4 IF [LAMEXA=Very-Low] THEN Relevance Low
16. MODULE Aval1 IN Malaria1 EVALUATION OVER LAMEXA
17. MODULE Aval2 IN Malaria1 EVALUATION OVER LAMPOSTOT
18. MODULE Aval3 IN Malaria1 EVALUATION OVER Relevance
19. OUTPUT LAMEXA IN Aval1 TYPE TABELASIGEPI Malaria1
20. OUTPUT LAMPOSTOT IN Aval2 TYPE TABELASIGEPI Malaria1
21. OUTPUT Relev IN Aval3 TYPE TABELASIGEPI Malaria1
(number of positive patients). These linguistic variables receive the same names as their crisp counterparts. The domain limits are automatically calculated.
• The "VARSOLREAL" declaration (line 4) creates a solution variable, named "Relevance," over a real domain that varies from 0 to 100.
• The first three "TERM" declarations (lines 5 to 7) create three linguistic terms, "Low," "Average," and "High."
  o These terms are then bound to the linguistic variable LAMPOSTOT.
  o Finally, their membership functions are described as triangular and defined over the normalized domain [0,1]. For "Average," this triangle, described as (0, 0.5, 1), starts at (0, 0), peaks at (0.5, 1), and ends at (1, 0). The other two are right triangles. The alpha-cut limits the membership to values greater than 20%. The result is that the "triangle" is rather a "triangle over a box."
• The fourth and fifth "TERM" declarations (lines 8 and 9) reuse the "Average" and "High" linguistic terms, binding them to the linguistic variable LAMEXA.
  o At this point, it is important to notice that, internally, the system will create copies of these terms and their domains, with the same shape relative to the normalized domain but probably different limits.
• The HEDGE declaration (line 10) creates the hedge "Very" over the term "Low," using the formula of a concentrator with a factor of 2. This will raise the membership value to the second power.
• The TERM declaration in line 11 defines a modified term, Very-Low, applied to LAMEXA.
• Lines 12 to 15 create rules that allow for the evaluation of the solution variable "Relevance."
• Lines 16 to 18 create results by evaluating what was previously defined (using the command MODULE). These results can be used later in different output commands.
• Lines 19 to 21 write the results of evaluating the fuzzy values of LAMPOSTOT, LAMEXA, and Relevance to a GISEpi table called "Malaria1" (actually inserting three new columns with those names in that table).
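What lines 5-11 of Program 2 set up can be sketched outside CLOUDS-L: a triangular term over the normalized domain, an alpha-cut, and a "Very" hedge implemented as a concentrator with factor 2. This is an illustrative Python sketch, and the zeroing interpretation of the alpha-cut is one plausible reading of "limits the membership to values greater than 20%."

```python
# Sketch: triangular linguistic term with an alpha-cut, plus a concentrator
# hedge, mirroring "TERM Low ... ALPHA 0.2" and "HEDGE Very ... FACTOR 2".
def triangular_term(a, peak, c, alpha=0.0):
    def mu(x):
        if a <= x <= peak and peak != a:
            m = (x - a) / (peak - a)
        elif peak <= x <= c and c != peak:
            m = (c - x) / (c - peak)
        else:
            m = 1.0 if x == peak else 0.0
        return m if m > alpha else 0.0    # alpha-cut: drop low memberships
    return mu

def very(term, factor=2):
    """Concentrator hedge: raise the membership to a power."""
    return lambda x: term(x) ** factor

low = triangular_term(0.0, 0.0, 0.5, alpha=0.2)   # TERM Low ... ALPHA 0.2
very_low = very(low)                              # HEDGE Very ... FACTOR 2
# low(0.25) == 0.5; very_low(0.25) == 0.25; low(0.45) is cut to 0.0
```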
Metamodel Overview
A CLOUDS system is divided into models. These models are independent and do not interact with each other. There is, however, a global model, a singleton that can be accessed by all models and can be used to exchange information between them. Tables are equivalent to relational tables and are imported from existing databases (or text files). The main data abstraction in a model is the linguistic variable. Each linguistic variable is described by the set of linguistic terms assigned to it. Most linguistic variables are fuzzy interpretations of crisp attributes in a relational table. Each linguistic term is a fuzzy set describing one subjective evaluation of the linguistic variable according to a membership function. Hedges are modifiers that can be attached to linguistic terms. Linguistic variables are built upon table columns. The lower and upper limits of the domain of a linguistic variable can be derived automatically from the values in the table or set programmatically by the user. Linguistic terms are defined over the normalized domain. Linguistic variables and linguistic terms are used to compose rules and queries. Rules evaluate solution variables, which calculate the membership values of their terms through those rules. Queries are used in the relational sense, with fuzzy extensions, and can refer to table columns or linguistic variables interchangeably. The description above closely follows the linguistic approach found in most fuzzy systems and databases. Its main advantage is that it joins
Figure 1. A conceptual view of a fuzzy model in CLOUDS and its relation with the data stored in a DBMS Fuzzy Model Querys and commands
Fuzzy Schema Logic Propositions
rules
-
O p era to rs L in gu istic T e rm s N u m eric V alue s F uz z y V ariab le s
C risp V ariable s
D ata R eq uis ition
Ling uistic V ariable s
S o lutio n V a ria ble
the concepts of fuzzy rules and fuzzy queries in a single system, a characteristic we have not seen in other systems. Figure 1 presents a conceptual view of a fuzzy model in CLOUDS. Queries and commands (such as OUTPUT) manipulate logic propositions that can be written using operators, linguistic terms, numeric values, or fuzzy variables. Rules also use logic propositions, to create solution variables. Data can be inserted directly into a logic proposition, but they usually come from a DBMS in the form of crisp or linguistic variables. Some crisp inputs can also be converted to fuzzy numbers. The "data interface" in the model represents the need to describe to a model the (physical) origin of the data.
System Overview
The main control system (Figure 2) manages the fuzzy models, which interact with the database, reading data and producing fuzzy or crisp results. It also interprets script files defining the fuzzy
models. There are two types of fuzzy models: a global model and a group of independent ones. Each independent model possesses its own data interfaces, queries, variables, and rules, but it cannot access components from other models. The global model is suitable for comparing and building rules from several models, and exchanging information among them.
Fuzzy Models
Each model is made up of three groups of objects: data interfaces, fuzzy schema, and fuzzy queries (see Figure 1). Data interfaces are virtual classes that provide the services necessary to fuzzify data stored in a database or a text file. These virtual classes should be implemented according to the chosen database, allowing the system to communicate with nonconventional databases such as spatial or object-oriented ones. Currently, CLOUDS provides concrete classes with direct access to comma-separated files, ".dbf" files, GISEpi (Nobre et al., 1997) files, and generic SQL databases (using ODBC).
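The data-interface abstraction can be sketched as an abstract base class offering the services the fuzzifier needs, with concrete subclasses per storage backend. The real CLOUDS classes are C++; the method names (`column`, `limits`) and the in-memory backend below are invented for illustration.

```python
# Sketch of the data-interface abstraction: an abstract class plus one
# concrete in-memory implementation (a CSV or ODBC backend would subclass
# DataInterface the same way).
from abc import ABC, abstractmethod

class DataInterface(ABC):
    @abstractmethod
    def column(self, name):
        """Return all values of a column as floats."""

    def limits(self, name):
        """Domain limits used to normalize a linguistic variable."""
        values = self.column(name)
        return min(values), max(values)

class ListInterface(DataInterface):
    """Concrete backend over in-memory columns (name -> list of values)."""
    def __init__(self, columns):
        self._columns = columns

    def column(self, name):
        return [float(v) for v in self._columns[name]]

data = ListInterface({"LAMEXA": [12, 40, 25]})
# data.limits("LAMEXA") == (12.0, 40.0)
```

New storage backends only implement `column`; normalization logic such as `limits` stays in the abstract class, which is the extensibility point the text describes.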
Figure 2. The architecture of CLOUDS. [The figure shows scripts, queries, and commands feeding the CLOUDS main control system, which manages the main fuzzy model (holding general variables, rules, and interfaces) and fuzzy models 1 to n, all connected to a database.]
Figure 3. A subset of the UML class diagram of CLOUDS showing the linguistic variable and linguistic term hierarchies
Linguistic variables, linguistic terms, rules, and operators compose the fuzzy schema. Each one of these components has a unique identifier, so we can use the same component in different locations within a model, ensuring consistency. We may, for example, change the curve parameters of the term “high” that is being used by three different linguistic variables, simultaneously affecting them. Arguably, this feature can be problematic when one needs to modify the behavior of just one variable, but it is always possible to define a new term to be used specifically in that variable. Rules allow the use of fuzzy inference rules, which calculate the value of solution variables. Queries
may be directly invoked from the control system or be predefined inside a model.
Linguistic Variables
In CLOUDS, linguistic variables fuzzify chosen database attributes. They are implemented as an abstract class, linguistic variable, with some basic concrete specializations: real linguistic variable, integer linguistic variable, fuzzy number, crisp variable, fuzzy variable, and solution variable. Figure 3 displays the Unified Modeling Language (UML) diagram for these classes.
Each linguistic variable is linked to an attribute (column) and to a list of linguistic terms. As an object in the model, its basic function is to get an attribute value from the database and to convert it into different fuzzy values, one for each linguistic term listed. A linguistic variable also normalizes the data. To do that, it first recovers the minimum and maximum for that attribute. The linguistic terms then evaluate the membership value according to some predefined function over a normalized domain. For example, a linguistic variable "AgeF" can be created from the column "Age" in the table "Student." When used, "AgeF" will dynamically recover the minimum and maximum values for "Age." Suppose that, for this database, the limits are 2 and 18. The age 2 will correspond to 0 in the normalized domain; the age 18 will correspond to 1. If we have defined two triangular linguistic terms, "Young" (0,0,1) and "Old" (0,1,1), a student with an Age value of 10 will have AgeF (fuzzy) values of Young(0.5) and Old(0.5). Real and integer linguistic variables are standard linguistic variables that deal with real and integer domains, respectively. Fuzzy numbers are a special kind of linguistic term representing approximate numbers like "about 3." To build a fuzzy number, the user indicates the domain and the granularity, that is, how many fuzzy numbers will divide the domain, and the system then automatically creates and assigns linguistic terms to the linguistic variable. Crisp variables are used when we want to map an attribute exactly as it is. Fuzzy variables are used when the database is a true fuzzy database and the attribute values are fuzzy values like "35% high" or "88% cold." The last kind of linguistic variable is the solution variable. Solution variables do not map to the database but are built from inference rules over other fuzzy variables, as explained later.
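The "AgeF" example above can be sketched as follows: a linguistic variable that recovers the column limits, normalizes each crisp value, and evaluates its terms over the normalized domain. The class and term names follow the running example but are otherwise invented.

```python
# Sketch of a linguistic variable with automatic domain limits and
# normalization, reproducing the AgeF example from the text.
class LinguisticVariable:
    def __init__(self, values, terms):
        self.lo, self.hi = min(values), max(values)   # recovered from the data
        self.terms = terms                            # name -> mu over [0,1]

    def fuzzify(self, crisp):
        x = (crisp - self.lo) / (self.hi - self.lo)   # normalize to [0,1]
        return {name: mu(x) for name, mu in self.terms.items()}

young = lambda x: 1.0 - x   # triangular (0,0,1) over the normalized domain
old = lambda x: x           # triangular (0,1,1)

age_f = LinguisticVariable(values=[2, 7, 10, 15, 18],
                           terms={"Young": young, "Old": old})
# age_f.fuzzify(10) → {'Young': 0.5, 'Old': 0.5}, matching the text
```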
Linguistic Terms
To define a linguistic term, it is necessary to select a function over the normalized domain (0.0 – 1.0) and to set its shape. This normalization allows for
a linguistic term to be used in different linguistic variables. CLOUDS provides four standard functions: triangular, trapezoidal, Gaussian, and customized. Others can easily be implemented by specialization of the "linguistic term" class. The first three are defined through a few parameters. The custom term, on the other hand, must be defined through a vector. Each linguistic term must be created first and then bound to the context of a linguistic variable. Terms actually serve as templates that can be used over any domain. They can be absolute or parametric. Absolute linguistic terms, used by default, are always positioned in the same way relative to the normalized domain. Parametric linguistic terms have an additional parameter, referring to the non-normalized domain, that defines where they should be positioned in the normalized domain; this is useful when defining concepts such as fuzzy numbers or propositions like "Around(X)." All fuzzy variables have a specialized linguistic term, NoInfo, that deals with null values. It has three specializations: UNDECIDED, UNDEFINED, and NULL. The term UNDECIDED is generated when a crisp value does not belong to the support of any linguistic term. The term UNDEFINED is generated when the crisp value is outside the domain of the linguistic variable. The term NULL is generated for invalid values that cannot be qualified as either UNDEFINED or UNDECIDED (including a database NULL). Null values can only be intentionally defined by a rule. This approach is operational and was born from the necessity of dealing with results derived by rules. Galindo et al. (2006, pp. 51-59) offer a good review of fuzzy models, including different meanings for null values. As a comparison, FSQL (Galindo et al., 2006, p. 187) uses UNKNOWN, which is similar to our UNDECIDED; UNDEFINED, representing attributes that are not applicable or are meaningless, which has some similarity to CLOUDS UNDEFINED; and NULL, representing total ignorance.
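The three NoInfo specializations just described can be sketched as a small classifier: given a crisp value, a variable's domain, and its terms' supports, decide whether the value fuzzifies normally or falls into UNDECIDED, UNDEFINED, or NULL. This is an operational reading of the text, not CLOUDS source code; the domain and supports are invented.

```python
# Sketch: classifying a crisp value against a variable's domain and the
# supports of its linguistic terms, following the NoInfo rules in the text.
def classify(value, domain, supports):
    lo, hi = domain
    if value is None or not isinstance(value, (int, float)):
        return "NULL"                    # invalid value (e.g., database NULL)
    if not (lo <= value <= hi):
        return "UNDEFINED"               # outside the variable's domain
    if not any(a <= value <= b for a, b in supports):
        return "UNDECIDED"               # in the domain, in no term's support
    return "OK"                          # fuzzifies normally

domain = (0.0, 100.0)
supports = [(0.0, 40.0), (60.0, 100.0)]  # a gap between the terms
# classify(50.0, ...) → 'UNDECIDED'; classify(120.0, ...) → 'UNDEFINED'
```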
Table 2. Hedge operators

Hedge  | Operator        | Function f(µ)
Very   | CONCENTRATION   | µ^k
Few    | DILATION        | µ^(1/k)
Almost | INTENSIFICATION | 2µ^k for 0 ≤ µ ≤ 0.5; 1 − 2(1 − µ)^k for 0.5 < µ ≤ 1
Hedges
Linguistic hedges, or simply hedges, modify linguistic terms. They are the fuzzy equivalents of adverbs, built as fuzzy mappings from [0,1] into [0,1], and are implemented following the decorator design pattern (Gamma, Helm, Johnson, & Vlissides, 1994). CLOUDS provides the most widely agreed upon linguistic hedges and their corresponding functions, shown in Table 2. Each linguistic term can be associated with a linked list of hedges. Hedges act like filters that modify the membership values of the linguistic term according to each hedge's type. The membership evaluation function calls the first hedge of the list, the outermost hedge, which recursively calls the next one, until a linguistic term is reached and provides a base value to be used in the calculations.
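The decorator-style hedge chain can be sketched as follows: each hedge wraps the next element of the list, and the outermost hedge is called first, exactly as the evaluation order described above. The class names are illustrative, not the CLOUDS C++ classes.

```python
# Sketch: hedges as decorators around a linguistic term. Evaluation starts
# at the outermost hedge and recurses inward to the base term.
class Term:
    def __init__(self, mu):
        self.mu = mu
    def membership(self, x):
        return self.mu(x)

class Concentration:
    """'Very': raises the inner membership to the power k (Table 2)."""
    def __init__(self, inner, k=2):
        self.inner, self.k = inner, k
    def membership(self, x):
        return self.inner.membership(x) ** self.k

class Dilation:
    """'Few': raises the inner membership to the power 1/k (Table 2)."""
    def __init__(self, inner, k=2):
        self.inner, self.k = inner, k
    def membership(self, x):
        return self.inner.membership(x) ** (1.0 / self.k)

tall = Term(lambda x: x)            # toy membership over [0,1]
very_very_tall = Concentration(Concentration(tall))
# very_very_tall.membership(0.9) ≈ 0.9 ** 4: hedges compose by nesting
```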
Rules
A rule is a statement such as "IF
also provide the OR operator, implemented as a t-conorm, and the unary NOT operator. Leaves of the tree represent individual conditions, and the inner nodes are fuzzy logic operators. Fuzzy logic operators can be calculated by many different functions. In CLOUDS, users can select which type of operators they will use. An abstract class implements the operators, and there is a specialization for each operator type. This construction allows us to build any proposed operator. A rule is evaluated through correlation, and, if necessary, the final fuzzy value can be defuzzified into a crisp value. Two defuzzification methods can be used: max-value (also known as the "height" method) and center of gravity (Klir & Yuan, 1995). It is also easy to extend CLOUDS, programmatically, with any other desired method. The resulting values, defuzzified or not, can be assigned to a data attribute of the output data interface.
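Rule evaluation and defuzzification can be sketched with the standard operators (min for AND, max for OR, 1 − x for NOT) and a center-of-gravity computation over a sampled output domain. Rule R1 of Program 2 is used as the running example; the input membership values and the shape of the output term "High" are invented assumptions.

```python
# Sketch: evaluating one fuzzy rule with standard operators, then
# defuzzifying the clipped output term by center of gravity.
AND, OR, NOT = min, max, lambda a: 1.0 - a

def cog(samples):
    """Center of gravity of (x, mu) samples."""
    num = sum(x * mu for x, mu in samples)
    den = sum(mu for _, mu in samples)
    return num / den if den else 0.0

# IF [LAMEXA=High AND LAMPOSTOT=High] THEN Relevance High  (rule R1)
lamexa_high, lampostot_high = 0.8, 0.6      # invented input memberships
firing = AND(lamexa_high, lampostot_high)   # rule firing strength: 0.6

# Clip the output term "High" (here a ramp mu(x) = x/100) at the firing level
high = lambda x: x / 100.0
samples = [(x, min(high(x), firing)) for x in range(0, 101, 5)]
relevance = cog(samples)   # a crisp Relevance value, weighted toward 100
```

Swapping `cog` for `max`-based selection of the peak would give the max-value ("height") method mentioned above.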
Queries
Our extension to SQL allows dealing with fuzzy variables and fuzzy terms. Models can define or directly execute several queries. When we define a query with a name, it is stored to be executed later. When we define a query without a name, it is immediately executed. CLOUDS actually implements a simple SQL parser and engine, which does not allow subqueries. A query is stored as a list of simple commands based on relational algebra. The translation made by the parser is very simple, although it is possible to implement optimization methods. To compute the membership degree of each row in the result set, CLOUDS uses, by default, the standard t-norm, t-conorm, and negation applied to the WHERE part of the SELECT statement. A CLOUDS query has the form: SELECT
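The default membership computation just described can be sketched as follows: the standard t-norm (min) combines the fuzzy predicates of the WHERE clause into a per-row membership degree, optionally filtered by a threshold. The predicate names and the threshold mechanism are illustrative, not the CLOUDS parser.

```python
# Sketch: per-row membership in the result set as the min (standard t-norm)
# of the WHERE clause's fuzzy predicates, with a minimum-membership filter.
def evaluate_where(row, predicates, threshold=0.0):
    """AND all fuzzy predicates with min; keep rows at or above threshold."""
    mu = min(p(row) for p in predicates)
    return mu if mu >= threshold else None

rows = [{"copies": 50}, {"copies": 40}, {"copies": 80}]
about_50 = lambda r: max(0.0, 1.0 - abs(r["copies"] - 50) / 20.0)
under_60 = lambda r: 1.0 if r["copies"] < 60 else 0.0

result = [(r["copies"], evaluate_where(r, [about_50, under_60], 0.3))
          for r in rows]
# → [(50, 1.0), (40, 0.5), (80, None)]: the last row falls below threshold
```

An OR in the WHERE clause would combine with max (the standard t-conorm), and NOT with 1 − µ, matching the defaults stated above.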